public inbox for linux-pm@vger.kernel.org
* RFC -- updated Documentation/power/devices.txt
@ 2006-07-10 22:25 David Brownell
  2006-07-11  5:56 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: David Brownell @ 2006-07-10 22:25 UTC (permalink / raw)
  To: linux-pm; +Cc: Andrew Morton, Linus Torvalds

When I did those PM_EVENT_PRETHAW updates I did the Right Thing and
started to update its in-tree documentation too.  Seems I'm the first
person to do that in a long time ... anyway, once I tossed in 

	- PM_EVENT_PRETHAW
	- Linus' suspend_late/resume_early and suspend_prepare
	- ... and class suspend/resume
	- Wakeup events
	- Fixes to the existing docs (confusions, obsolete stuff, ...)

This turned into more of a rewrite.  So here's my current version,
which I'm circulating for comments; I'll send it as a patch after
a few days, for merging when those patches from MM go upstream.

The focus here was just to have the current ("in MM") methods, parameters,
and models presented, with enough context to help readers understand the
"why" as well as the "what" and "when" (and the "huh?").

Yes, seeing it written out this way does beg some questions.  Like
what can be done with that (here-deprecated) sysfs attribute thing,
since it's starting to look even more obviously broken.

- Dave


Most of the code in Linux is device drivers, so much of the Linux power
management code is also driver-specific.  Some drivers will do very little;
others will do a lot.

This writeup gives an overview of how drivers interact with system-wide
power management goals, emphasizing the models and interfaces that are
shared by everything that hooks up to the driver model core.  Read it as
background for the domain-specific work you'd do with any specific driver.


Two Models for Device Power Management
======================================
Drivers will use one of these models to put devices into low-power states:

 * As part of entering a system-wide low-power state like "suspend-to-ram",
   or (mostly for systems with disks) "hibernate" (suspend-to-disk).

   This is something that device, bus, and class drivers collaborate on
   by implementing various role-specific suspend and resume methods to
   cleanly power down hardware and software subsystems, then reactivate
   them without loss of data.

   Some drivers can manage hardware wakeup events, which make the system
   leave that low-power state.  This feature may be disabled using the
   device's power/wakeup file; enabling it may cost some power usage,
   but lets the whole system enter low power states more often.

 * While the system is running, independently of any other power management
   activity.  Upstream drivers will normally not know (or care) if the device
   is in some low power state when issuing requests.

   This doesn't have, or need, much infrastructure; it's just something you
   should do when writing your drivers.  For example, clk_disable() unused
   clocks as part of minimizing power drain for currently-unused hardware.
   Of course, sometimes clusters of drivers will collaborate with each
   other, which could involve task-specific power management.

There's not a lot to be said about those low power states except that they
are very system-specific, and often device-specific.

Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no
more data read or written, and requests from upstream drivers are no longer
accepted.  A given bus or platform may have different requirements though.

Examples of hardware wakeup events include an alarm from a real time clock,
network wake-on-LAN packets, keyboard or mouse activity, and media insertion
or removal (for PCMCIA, MMC/SD, USB, and so on).
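
As an illustration of the second model above (runtime power savings done
by the driver itself, with no core infrastructure), here is a minimal
userspace sketch.  It is not kernel code:  struct toy_clk, struct toy_port,
toy_port_open() and toy_port_close() are hypothetical names, and the
"clock" is just a flag standing in for whatever clk_disable() would gate.

```c
/* Simulated clock gate; a real driver would use its platform's clock API. */
struct toy_clk {
	int running;
};

/* Hypothetical device that only needs its clock while someone uses it. */
struct toy_port {
	struct toy_clk *clk;
	unsigned int users;	/* open handles currently using the port */
};

void toy_port_open(struct toy_port *port)
{
	if (port->users++ == 0)
		port->clk->running = 1;	/* first user: ungate the clock */
}

void toy_port_close(struct toy_port *port)
{
	if (port->users == 0)
		return;			/* unbalanced close: ignore it */
	if (--port->users == 0)
		port->clk->running = 0;	/* last user gone: gate it off */
}
```

The point is only that the clock follows actual use; upstream callers
never see whether the hardware was gated off in between.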


Bus driver methods
------------------
The core methods to suspend and resume devices reside in struct bus_type.
These are mostly of interest to people writing infrastructure for busses
like PCI or USB, or because they define the primitives that device drivers
may need to apply in domain-specific ways to their devices:

struct bus_type {
       ...
       int  (*suspend_prepare)(struct device * dev, pm_message_t state);
       int  (*suspend)(struct device * dev, pm_message_t state);
       int  (*suspend_late)(struct device * dev, pm_message_t state);

       int  (*resume)(struct device * dev);
       int  (*resume_early)(struct device * dev);
};

Bus drivers implement those methods as appropriate for the hardware and
the drivers using it; PCI works differently from USB, and so on.  Not many
people write bus drivers; most driver code is a "device driver" that
builds on top of bus-specific framework code.

For more information on these driver calls, see the description later;
they are called in phases for every device, respecting the parent-child
sequencing in the driver model tree.  Note that at this writing, not
many bus drivers leverage all of those phases, or pass them down to
lower driver levels.


EXAMPLE:  PCI device driver methods
-----------------------------------
PCI framework software calls these methods when the pci device driver bound
to a device has provided them:

struct pci_driver {
       ...
       int  (*suspend_prepare)(struct pci_dev *pdev, pm_message_t state);
       int  (*suspend)(struct pci_dev *pdev, pm_message_t state);
       int  (*suspend_late)(struct pci_dev *pdev, pm_message_t state);

       int  (*resume)(struct pci_dev *pdev);
       int  (*resume_early)(struct pci_dev *pdev);
};

Drivers will implement those methods, and call PCI-specific procedures
like pci_set_power_state() and pci_enable_wake() to manage PCI-specific
mechanisms.  Devices are suspended before their bridges enter low power
states, and likewise bridges resume before their devices.


Upper layers of driver stacks
-----------------------------
Device drivers generally have at least two interfaces, and the methods
sketched above are the ones which apply to the lower level (nearer PCI, USB,
or other bus hardware).  The network and block layers are examples of upper
level interfaces, as is a character device talking to userspace.

Power management requests normally need to flow through those upper levels,
which often work in domain-oriented requests like "blank that screen".  In
some cases those upper levels will have power management intelligence that
relates to end-user activity, or other devices that work in cooperation.

When those interfaces are structured using class interfaces, there is a
standard way to have the upper layer stop issuing requests to a given
class device (and restart later):

struct class {
       ...
       int  (*suspend)(struct device * dev, pm_message_t state);
       int  (*resume)(struct device * dev);
};

Those calls are issued in specific phases of the process by which the
system enters a low power "suspend" state, or resumes from it.


Driver calls to enter System Sleep States
=========================================
When the system enters a low power state, each device's driver is asked
to suspend the device by putting it into a state compatible with the target
system state.  That's usually some version of "off", but the details are
system-specific.  Also, wakeup-enabled devices will usually stay partly
functional in order to wake the system.

When the system leaves that low power state, the device's driver is asked
to resume it.  The suspend and resume operations always go together, and
both are multi-phase operations.

For simple drivers, suspend might quiesce the device using the class code
and then turn its hardware as "off" as possible with suspend_late.  The
matching resume calls would then completely reinitialize the hardware
before reactivating its class I/O queues.

More power-aware drivers will use more than one device low power
state, either at runtime or during system sleep states, and might trigger
system wakeup events.


Call Sequence Guarantees
--------------------------------------------
To ensure that bridges and similar links needed to talk to a device are
available when the device is suspended or resumed, the device tree is
walked in a bottom-up order to suspend devices.  A top-down order is
used to resume those devices.

The ordering of the device tree is defined by the order in which devices
get registered:  a child can never be registered/probed or resumed before
its parent, or removed/suspended after that parent.

The policy is that the device tree should match hardware bus topology.
(Or at least the control bus, for devices which use multiple busses.)
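
The ordering rule above can be modeled in a few lines of userspace C.
This is only a sketch, not kernel code:  struct toy_device, toy_register(),
toy_suspend_all(), toy_resume_all() and toy_log_is() are hypothetical
names, and a plain array in registration order stands in for the driver
model's device list.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_DEVICES 8

/* Hypothetical stand-in for a device node; not the kernel's struct device. */
struct toy_device {
	const char *name;
	struct toy_device *parent;	/* NULL for a root device */
};

/* Devices in registration order: a parent always precedes its children. */
static struct toy_device *toy_devices[TOY_MAX_DEVICES];
static size_t toy_count;

/* Record of the calls made, in order, so the walk can be inspected. */
static char toy_log[256];
static size_t toy_log_len;

static void toy_note(const char *what, const struct toy_device *dev)
{
	size_t room = sizeof(toy_log) - toy_log_len;
	int n = snprintf(toy_log + toy_log_len, room, "%s:%s ", what, dev->name);

	if (n > 0 && (size_t)n < room)
		toy_log_len += (size_t)n;
}

/* Returns 0 on success, -1 if the table is full or the parent is unknown. */
int toy_register(struct toy_device *dev)
{
	size_t i;

	if (toy_count == TOY_MAX_DEVICES)
		return -1;
	if (dev->parent) {
		for (i = 0; i < toy_count; i++)
			if (toy_devices[i] == dev->parent)
				break;
		if (i == toy_count)
			return -1;	/* child before parent: refused */
	}
	toy_devices[toy_count++] = dev;
	return 0;
}

/* Suspend walks the list backwards, so children go before parents. */
void toy_suspend_all(void)
{
	size_t i;

	for (i = toy_count; i-- > 0; )
		toy_note("suspend", toy_devices[i]);
}

/* Resume walks the list forwards, so parents come back first. */
void toy_resume_all(void)
{
	size_t i;

	for (i = 0; i < toy_count; i++)
		toy_note("resume", toy_devices[i]);
}

/* Nonzero if the recorded call sequence matches the expected string. */
int toy_log_is(const char *expected)
{
	return strcmp(toy_log, expected) == 0;
}
```

Because a child cannot be registered before its parent, walking the same
list in opposite directions is enough to keep every bridge available
whenever one of its children is being suspended or resumed.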


suspending devices
------------------
Suspending a given device is done in several phases.  Each phase will be
omitted if it's not relevant for that device.  Other devices will often
be suspending at the same time, so each phase will normally be done for
all devices before the next phase begins.

In order, the phases are:

   *	bus.suspend_prepare(dev) is called early, with all tasks active.
   	This method may sleep, and is often morphed into a device
   	driver call with bus-specific parameters.

	Drivers that need to synchronize their suspension with various
	management activities could synchronize here.  This may be a good
	point to arrange that predictable actions, like removing media from
	suspended systems, will not cause trouble on resume.

   *	class.suspend(dev) is called after tasks are frozen, for devices
	associated with a class that has such a method.

	Since I/O activity usually comes from such higher layers, this is
	a good place to quiesce all drivers of a given type (and keep such
	code out of those drivers).

   *	bus.suspend(dev) is called next.  This method may sleep, and is often
        morphed into a device driver call with bus-specific parameters.

	This call should handle parts of device suspend logic that require
	sleeping.  It probably does work to quiesce the device which hasn't
	been abstracted into class.suspend() or bus.suspend_late().

   *	bus.suspend_late(dev) is called with IRQs disabled, and with only
	one CPU active.  This may be morphed into a driver call
	with bus-specific parameters.

	This call might save low level hardware state that might otherwise
	be lost in the upcoming low power state, and actually put the
	device into a low-power state ... so that in some cases the device
	may stay partly usable until this late.  This "late" call may also
	help when coping with hardware that behaves badly.

At the end of those phases, drivers should normally have stopped all I/O
transactions (DMA, IRQs), saved enough state that they can re-initialize
or restore previous state (as needed by the hardware), and placed the
device into a low-power state.  On many platforms they will also use
clk_disable() to gate off one or more clock sources; sometimes they will
also switch off power supplies, or reduce voltages.  A pm_message_t
parameter is currently used to nuance those semantics (described later).

When any driver sees that its device_can_wakeup(dev), it should make sure
to use the relevant hardware signals to trigger a system wakeup event.
For example, enable_irq_wake() might identify GPIO signals hooked up to
a switch or other external hardware, and pci_enable_wake() does something
similar for PCI's PME# signal.
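
The phase rules above (each phase is finished for all devices before the
next begins, and a phase is skipped for devices that have no method for it)
can be sketched in userspace C.  Nothing here is kernel API:  enum
toy_phase, struct toy_dev, toy_quiesce(), toy_suspend_phases() and
toy_trace_is() are hypothetical names, and stopping at the first error is
an assumption of this sketch rather than something stated above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One slot per phase, in the order the phases are listed above. */
enum toy_phase {
	TOY_SUSPEND_PREPARE,
	TOY_CLASS_SUSPEND,
	TOY_BUS_SUSPEND,
	TOY_SUSPEND_LATE,
	TOY_NR_PHASES
};

struct toy_dev {
	const char *name;
	/* A NULL entry means the phase is not relevant for this device. */
	int (*method[TOY_NR_PHASES])(struct toy_dev *dev);
	int calls;	/* how many phase methods ran for this device */
};

/* Trace of "<phase><name> " entries, in the order methods were called. */
static char toy_trace[128];
static size_t toy_trace_len;

/* Sample phase method: just counts that it ran. */
int toy_quiesce(struct toy_dev *dev)
{
	dev->calls++;
	return 0;
}

/*
 * devs[] is in registration order (parents first).  Each phase walks it
 * backwards and completes for every device before the next phase starts.
 */
int toy_suspend_phases(struct toy_dev **devs, size_t n)
{
	int phase;
	size_t i;

	for (phase = 0; phase < TOY_NR_PHASES; phase++) {
		for (i = n; i-- > 0; ) {
			size_t room = sizeof(toy_trace) - toy_trace_len;
			int err, len;

			if (!devs[i]->method[phase])
				continue;	/* phase omitted */
			len = snprintf(toy_trace + toy_trace_len, room,
				       "%d%s ", phase, devs[i]->name);
			if (len > 0 && (size_t)len < room)
				toy_trace_len += (size_t)len;
			err = devs[i]->method[phase](devs[i]);
			if (err)
				return err;	/* assumption: abort */
		}
	}
	return 0;
}

/* Nonzero if the recorded trace matches the expected string. */
int toy_trace_is(const char *expected)
{
	return strcmp(toy_trace, expected) == 0;
}
```

With a device "a" that implements all four phases and a child "b" that
only implements the bus suspend phase, "b" is touched exactly once, and
only after "a" has finished both earlier phases.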

Low Power (suspend) States
--------------------------
Device low-power states aren't very standard.  One device might only handle
"on" and "off", while another might support a dozen different versions of
"on" (how many engines are active?), plus a state that gets back to "on"
faster than from a full "off".

Some busses define rules about what different suspend states mean.  PCI
gives one example:  after the suspend sequence completes, a non-legacy
PCI device may not perform DMA or issue IRQs, and any wakeup events it
issues would be issued through the PME# bus signal.  Plus, there are
several PCI-standard device states, some of which are optional.

In contrast, integrated system-on-chip processors often use irqs as the
wakeup event sources (so drivers would call enable_irq_wake) and might
be able to treat DMA completion as a wakeup event (sometimes DMA can stay
active too, it'd only be the CPU and some peripherals that sleep).

Some details here may be platform-specific.  Systems may have devices that
may be fully active in certain sleep states, such as an LCD display that's
refreshed using DMA while most of the system is sleeping lightly ... and
its framebuffer might even be updated by a DSP or other non-Linux CPU while
the Linux control processor stays idle.

Moreover, the specific actions taken may depend on the target system state.
One target system state might allow a given device to be very operational;
another might require a hard shut down with re-initialization on resume.


pm_message_t meaning
--------------------
Parameters to suspend calls include the device affected and a message of
type pm_message_t, which has one field:  the event.  If the driver does not
recognize the event code, suspend calls may abort the request and return
a negative errno.  However, most drivers will be fine if they implement
PM_EVENT_SUSPEND semantics for all messages.

The event codes are used to nuance the goal of suspending the device,
and mostly matter when creating or resuming suspend-to-disk snapshots:

PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power
    state.  When used with system sleep states like "suspend-to-RAM" or
    "standby", the upcoming resume() call will often be able to rely on
    state kept in hardware, or issue system wakeup events.  When used
    instead with suspend-to-disk, few devices support this capability;
    most are completely powered off.

PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into
    any low power mode.  The driver's resume() will often be called soon.
    Wakeup events are not allowed.

PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume()
    will restore a suspend-to-disk snapshot from a different kernel image.
    Drivers that are smart enough to look at their hardware state during
    resume() processing need that state to be correct ... a PRETHAW could
    be used to invalidate that state (by resetting the device).  Other
    drivers might handle this the same way as PM_EVENT_FREEZE.

There's also PM_EVENT_ON, a value which never appears as a suspend event
but is sometimes used to record the "not suspended" device state.
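
A driver's handling of those event codes might look roughly like the
following userspace sketch.  The TOY_EVENT_* names mirror the PM_EVENT_*
codes above but are local to this example, struct toy_hw and toy_suspend()
are hypothetical, and returning -EINVAL for an unrecognized code is an
assumption (the text only says "a negative errno").

```c
#include <errno.h>

/* Local mirror of the event codes described above; values are arbitrary. */
enum toy_event {
	TOY_EVENT_ON,
	TOY_EVENT_FREEZE,
	TOY_EVENT_SUSPEND,
	TOY_EVENT_PRETHAW
};

/* Simulated driver-visible hardware state. */
struct toy_hw {
	int can_wakeup;	/* device is allowed to wake the system */
	int quiesced;	/* I/O has been stopped */
	int low_power;	/* hardware was put into a low-power state */
	int wake_armed;	/* wakeup signalling is enabled */
	int was_reset;	/* hardware state was deliberately invalidated */
};

int toy_suspend(struct toy_hw *hw, enum toy_event event)
{
	switch (event) {
	case TOY_EVENT_SUSPEND:
		/* Quiesce, enter low power, and arm wakeup if permitted. */
		hw->quiesced = 1;
		hw->low_power = 1;
		hw->wake_armed = hw->can_wakeup;
		return 0;
	case TOY_EVENT_FREEZE:
		/* Quiesce only; wakeup events are not allowed. */
		hw->quiesced = 1;
		hw->wake_armed = 0;
		return 0;
	case TOY_EVENT_PRETHAW:
		/* Like FREEZE, but reset so resume() sees no stale state. */
		hw->quiesced = 1;
		hw->wake_armed = 0;
		hw->was_reset = 1;
		return 0;
	default:
		/* Includes TOY_EVENT_ON, which is never a suspend event. */
		return -EINVAL;
	}
}
```

A simpler driver could instead treat every message as TOY_EVENT_SUSPEND,
which is the fallback the text above says most drivers will be fine with.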


Resuming Devices
----------------
Resuming is similarly done in multiple phases:

   *	bus.resume_early(dev) is called with IRQs disabled, and with
   	only one CPU active.

	This reverses the effects of bus.suspend_late().

   *	bus.resume(dev) is called next.  This may be morphed into a device
   	driver call with bus-specific parameters.

	This reverses the effects of bus.suspend().

   *	class.resume(dev) is called for devices associated with a class
	that has such a method.

	This reverses the effects of class.suspend(), and would usually
	reactivate the device's I/O queue.

Drivers need to be able to handle hardware which has been reset since
the suspend methods were called, for example by complete reinitialization.
(This is the hardest part, and the one most protected by NDA'd documents
and chip errata.) At the end of those phases, drivers should normally be
as functional as they were before suspending:  I/O can be performed using DMA
and IRQs, and the relevant clocks are gated on.

However, the details here may again be platform-specific.  For example,
some systems support multiple "run" states, and the mode in effect at
the end of resume() might not be the one which preceded suspension.
That means availability of certain clocks or power supplies changed,
which could easily affect how a driver works.
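
One way to cope with hardware that may have been reset while suspended
is for resume to ignore whatever the registers currently hold and
reprogram everything from a copy saved at suspend time.  The sketch below
shows that idea in userspace C; the register layout, TOY_CTRL_ENABLE,
struct toy_driver, toy_suspend() and toy_resume() are all hypothetical.

```c
#include <stdint.h>

#define TOY_CTRL_ENABLE 0x1u	/* hypothetical "engine running" bit */

/* Simulated device registers. */
struct toy_regs {
	uint32_t ctrl;
	uint32_t irq_mask;
};

struct toy_driver {
	struct toy_regs *regs;	/* the (simulated) hardware */
	struct toy_regs saved;	/* copy taken at suspend time */
};

void toy_suspend(struct toy_driver *drv)
{
	drv->saved = *drv->regs;		/* save what may be lost */
	drv->regs->irq_mask = 0;		/* no more IRQs */
	drv->regs->ctrl &= ~TOY_CTRL_ENABLE;	/* stop I/O */
}

void toy_resume(struct toy_driver *drv)
{
	/* Do not trust the current register contents: reprogram it all. */
	drv->regs->ctrl = drv->saved.ctrl & ~TOY_CTRL_ENABLE;
	drv->regs->irq_mask = drv->saved.irq_mask;
	drv->regs->ctrl = drv->saved.ctrl;	/* re-enable last */
}
```

The same resume path then works whether the hardware kept its state or
came back from a full power-off.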


System devices
--------------
System devices follow a slightly different API, which can be found in

	include/linux/sysdev.h
	drivers/base/sys.c

System devices will only be suspended with interrupts disabled, and
after all other devices have been suspended. On resume, they will be
resumed before any other devices, and also with interrupts disabled.

That is, sysdev_driver.suspend() is called right after the suspend_late()
phase; sysdev_driver.resume() is called before the resume_early() phase.


Runtime Power Management
========================
Many devices are able to dynamically power down while the system is still
running. This feature is useful for devices that are not being used, and
can offer significant power savings on a running system; these devices
often support a range of runtime power states.  Those states will in some
cases (like PCI) be constrained by a bus the device uses, and will usually
be states that are also used in system sleep states.

Normally runtime power management is handled by the drivers without specific
userspace intervention, by techniques that include disabling unused clocks
and switching off unused power supplies.  For now you can also test some of
this functionality through sysfs.

	DEPRECATED:  USE "power/state" ONLY FOR DRIVER TESTING.
	IT WILL BE REMOVED, AND REPLACED WITH SOMETHING WHICH
	GIVES MORE APPROPRIATE ACCESS TO THE RANGE OF POWER STATES
	EACH TYPE OF DEVICE AND BUS SUPPORTS THROUGH ITS DRIVER.

In each device's directory, there is a 'power' directory, which
contains at least a 'state' file. Reading from this file displays what
power state the device is currently in. Writing to this file initiates
a transition to the specified power state, which must currently be a
number: '0' for 'on', or either '2' or '3' for 'suspended' (which
sometimes also implies reset, especially in conjunction with deeper
system sleep states).

The PM core will call the ->suspend() method in the bus_type object
that the device belongs to if the specified state is not 0, or
->resume() if it is.

When using this interface to suspend, the PM core does not call the
bus.suspend_prepare() or bus.suspend_late() methods.  When using it
to resume, the PM core does not call the bus.resume_early() method.

Nothing will happen if the specified state is the same state the
device is currently in.

The driver is responsible for saving the working state of the device
and putting it into the low-power state specified. If this was
successful, it returns 0, and the device's power_state field is
updated.

The drivers must also take care not to suspend a device that is
currently in use. It is their responsibility to provide their own
exclusion mechanisms.

There is currently no way to know what states a device or driver
supports a priori. This may change in the future.
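
The write semantics described above can be summarized in a short
userspace sketch.  It is not the PM core's code:  struct toy_dev,
toy_bus_suspend(), toy_bus_resume() and toy_state_store() are hypothetical
names, and the -EINVAL and -EBUSY return values are assumptions made for
the example.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_dev {
	int power_state;	/* 0 means "on" */
	int busy;		/* device currently in use */
	int suspend_calls;
	int resume_calls;
};

/* Stand-in for the bus suspend method; refuses while in use. */
static int toy_bus_suspend(struct toy_dev *dev)
{
	if (dev->busy)
		return -EBUSY;	/* the driver's own exclusion check */
	dev->suspend_calls++;
	return 0;
}

/* Stand-in for the bus resume method. */
static int toy_bus_resume(struct toy_dev *dev)
{
	dev->resume_calls++;
	return 0;
}

/* Handle a write of "0", "2" or "3" to the simulated state file. */
int toy_state_store(struct toy_dev *dev, const char *buf)
{
	char *end;
	long state = strtol(buf, &end, 10);
	int err;

	if (end == buf || (state != 0 && state != 2 && state != 3))
		return -EINVAL;
	if (state == dev->power_state)
		return 0;	/* already in that state: nothing happens */

	/* Only suspend() or resume() runs; no prepare/late/early phases. */
	err = state ? toy_bus_suspend(dev) : toy_bus_resume(dev);
	if (err)
		return err;
	dev->power_state = (int)state;	/* updated only on success */
	return 0;
}
```

Note that a failed suspend leaves power_state untouched, matching the rule
that the field is only updated when the driver returns 0.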


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-10 22:25 RFC -- updated Documentation/power/devices.txt David Brownell
@ 2006-07-11  5:56 ` Andrew Morton
  2006-07-11 16:38   ` David Brownell
  2006-07-11 21:57   ` David Brownell
  2006-07-11 14:40 ` Pavel Machek
  2006-07-11 21:28 ` Pavel Machek
  2 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2006-07-11  5:56 UTC (permalink / raw)
  To: David Brownell; +Cc: torvalds, linux-pm

On Mon, 10 Jul 2006 15:25:43 -0700
David Brownell <david-b@pacbell.net> wrote:

> Two Models for Device Power Management

Is useful, thanks.  Later on it'd be good to include some pointers to
drivers which do everything right, and which are well-maintained, for people
to crib from.  (Not "skeleton" drivers, IMO - they tend to go out-of-date).

It's all very suspend-the-whole-machine centric.  We don't presently help
that NIC driver to put itself into a low-power state if there's been no net
activity for 100 milliseconds.  I guess that comes later.


* Re: RFC -- updated Documentation/power/devices.txt
@ 2006-07-11  7:56 Woodruff, Richard
  2006-07-11 16:51 ` David Brownell
  0 siblings, 1 reply; 28+ messages in thread
From: Woodruff, Richard @ 2006-07-11  7:56 UTC (permalink / raw)
  To: David Brownell, linux-pm

One trivial comment would be to use number bullets instead of asterisks
as ordering is important.

> suspending devices
> ------------------
> Suspending a given device is done in several phases.  Each phase will be
> omitted if it's not relevant for that device.  Other devices will often
> be suspending at the same time, so each phase will normally be done for
> all devices before the next phase begins.
> 
> In order, the phases are:
> 
>    *	bus.suspend_prepare(dev) is called early, with all tasks active.
>    	This method may sleep, and is often morphed into a device
>    	driver call with bus-specific parameters.
> 
> 	Drivers that need to synchronize their suspension with various
> 	management activities could synchronize here.  This may be a good
> 	point to arrange that predictable actions, like removing media from
> 	suspended systems, will not cause trouble on resume.
> 
>    *	class.suspend(dev) is called after tasks are frozen, for devices
> 	associated with a class that has such a method.

Should/is class.suspend be called before bus.suspend_prepare? If it is
it should be documented first.  Stopping class level things before bus
level which services it seems more natural.  The resume bullets are
documented this way.

> Low Power (suspend) States
> --------------------------
...
> Some details here may be platform-specific.  Systems may have devices that
> may be fully active in certain sleep states, such as an LCD display that's
> refreshed using DMA while most of the system is sleeping lightly ... and
> its framebuffer might even be updated by a DSP or other non-Linux CPU while
> the Linux control processor stays idle.

Partial system idle states generally involve several devices.  There
doesn't seem to be a grouping mechanism to define these relationships.
To take device relationships into account we currently look to scripted
user space groupings and notification accesses via extended driver sysfs
endpoints.

> Runtime Power Management
> ========================
...
> In each device's directory, there is a 'power' directory, which
> contains at least a 'state' file. Reading from this file displays what
> power state the device is currently in. Writing to this file initiates
> a transition to the specified power state, which must currently be a
> number: '0' for 'on', or either '2' or '3' for 'suspended' (which
> sometimes also implies reset, especially in conjunction with deeper
> system sleep states).

It would be nice if some of the previous 'on-ness' discussions would
result in some movement here.

Regards,
Richard W.
 


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-10 22:25 RFC -- updated Documentation/power/devices.txt David Brownell
  2006-07-11  5:56 ` Andrew Morton
@ 2006-07-11 14:40 ` Pavel Machek
  2006-07-11 21:28 ` Pavel Machek
  2 siblings, 0 replies; 28+ messages in thread
From: Pavel Machek @ 2006-07-11 14:40 UTC (permalink / raw)
  To: David Brownell; +Cc: Andrew Morton, Linus Torvalds, linux-pm

Hi!

> When I did those PM_EVENT_PRETHAW updates I did the Right Thing and
> started to update its in-tree documentation too.  Seems I'm the first
> person to do that in a long time ... anyway, once I tossed in 
> 
> 	- PM_EVENT_PRETHAW
> 	- Linus' suspend_late/resume_early and suspend_prepare
> 	- ... and class suspend/resume
> 	- Wakeup events
> 	- Fixes to the existing docs (confusions, obsolete stuff, ...)
> 
> This turned into more of a rewrite.  So here's my current version,
> which I'm circulating for comments; I'll send it as a patch after
> a few days, for merging when those patches from MM go upstream.
> 
> The focus here was just to have the current ("in MM") methods, parameters,
> and models presented, with enough context to help readers understand the
> "why" as well as the "what" and "when" (and the "huh?").
> 
> Yes, seeing it written out this way does beg some questions.  Like
> what can be done with that (here-deprecated) sysfs attribute thing,
> since it's starting to look even more obviously broken.

Few minor comments:

> Bus driver methods
> ------------------
> The core methods to suspend and resume devices reside in struct bus_type.
> These are mostly of interest to people writing infrastructure for busses
> like PCI or USB, or because they define the primitives that device drivers

", or to people defining the primitives" ?

> struct bus_type {
>        ...
>        int  (*suspend_prepare)(struct device * dev, pm_message_t state);
>        int  (*suspend)(struct device * dev, pm_message_t state);
>        int  (*suspend_late)(struct device * dev, pm_message_t state);
> 
>        int  (*resume)(struct device * dev);
>        int  (*resume_early)(struct device * dev);

Swap last two lines, and order will be nicely chronological.

It would be nice to kill space between * and dev to stay consistent.

> For more information on these driver calls, see the description later;
> they are called in phases for every device, respecting the parent-child
> sequencing in the driver model tree.  Note that at this writing,
> not

"at time of this writing"?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-11  5:56 ` Andrew Morton
@ 2006-07-11 16:38   ` David Brownell
  2006-07-11 21:57   ` David Brownell
  1 sibling, 0 replies; 28+ messages in thread
From: David Brownell @ 2006-07-11 16:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, linux-pm

> > Two Models for Device Power Management
> 
> Is useful, thanks.  Later on it'd be good to include some pointers to
> drivers which do everything right, and which are well-maintained, for people
> to crib from.  (Not "skeleton" drivers, IMO - they tend to go out-of-date).

We should have more drivers like that, yes!  I'm not sure I'd want
links to them in that text file though ... it's hard enough to keep
writeups current.


> It's all very suspend-the-whole-machine centric. 

Which reflects the current state of the APIs, and also the fact that
the "driver doesn't needlessly waste power" cases don't need much
interaction with other PM code.

We do have clk_disable() for hardware that needs it, and there's some
discussion of a similar power supply API for switching supplies on/off.
Those APIs would also be used in suspend-the-whole-thing cases.


> We don't presently help 
> that NIC driver to put itself into a low-power state if there's been no net
> activity for 100 milliseconds.  I guess that comes later.

Maybe WLAN firmware can do that stuff, but most Ethernet hardware I've
had occasion to look at can't do that.  It needs e.g. the 50 MHz clock
for the PHY so there's no point in a clk_disable(), and DMA active, etc.

There's some notion that the kernel could help with autosuspending though,
and some work to make it happen for USB.

- Dave


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-11  7:56 Woodruff, Richard
@ 2006-07-11 16:51 ` David Brownell
  0 siblings, 0 replies; 28+ messages in thread
From: David Brownell @ 2006-07-11 16:51 UTC (permalink / raw)
  To: Woodruff, Richard; +Cc: linux-pm

On Tuesday 11 July 2006 12:56 am, Woodruff, Richard wrote:
> One trivial comment would be to use number bullets instead of asterisks
> as ordering is important.

OK ... though with ASCII text, that means manual re-numbering when
things change.  It'd be nice if <ol> worked.  :)


> >    *	class.suspend(dev) is called after tasks are frozen, for devices
> > 	associated with a class that has such a method.
> 
> Should/is class.suspend be called before bus.suspend_prepare? If it is
> it should be documented first.  Stopping class level things before bus
> level which services it seems more natural. 

The suspend_prepare() is new, and it's called at that point; there's no
generic class.suspend_prepare().

However class.suspend() is called before bus.suspend(), and bus.suspend()
is what drivers currently provide.  ISTR the notion is that we'll be moving
some code out of driver suspend() methods into reusable class methods,
say for networking, and the suspend_prepare() solves much earlier issues
like needing to handshake with userspace.


> The resume bullets are documented this way.

Bug, now fixed; thanks.  Driver has early and normal opportunities to
resume, then class gets a chance to reactivate after the hardware works.


> > Low Power (suspend) States
> > --------------------------
> > ...
> > Some details here may be platform-specific.  Systems may have devices
> > that may be fully active in certain sleep states, ...
> 
> Partial system idle states generally involve several devices.  There
> doesn't seem to be a grouping mechanism to define these relationships.

Not in the standard infrastructure, no; but then those groupings exist
regardless of whether they're used in a "partial system idle" state.


> To take device relationships into account we currently look to scripted
> user space groupings and notification accesses via extended driver sysfs
> endpoints.

Those things are out-of-scope for this writeup, since we don't have
such grouping APIs.

From my perspective, the groupings are part of a product-specific definition
of the different system states.  Those can be "run" states (operating points)
or "sleep" states ... where of course the line between a partially-off "run"
state and a partially-on "sleep" state can be very fuzzy!


> > Runtime Power Management
> > ========================
> > ...
> > In each device's directory, there is a 'power' directory, which
> > contains at least a 'state' file. Reading from this file displays what
> > power state the device is currently in. Writing to this file initiates
> > a transition to the specified power state, which must currently be a
> > number: '0' for 'on', or either '2' or '3' for 'suspended' (which
> > sometimes also implies reset, especially in conjunction with deeper
> > system sleep states).
> 
> It would be nice if some of the previous 'on-ness' discussions would
> result in some movement here.

True; that's also in line with Andrew's comment, and the deprecation warning
I stuck in for those /sys/devices/.../power/state files.  But all that would
be new behavior, and at this point I was just aiming to correctly describe
the stuff that's there now (and have it make some sense).

If those discussions bear fruit, that section will need updating.

- Dave


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-10 22:25 RFC -- updated Documentation/power/devices.txt David Brownell
  2006-07-11  5:56 ` Andrew Morton
  2006-07-11 14:40 ` Pavel Machek
@ 2006-07-11 21:28 ` Pavel Machek
  2 siblings, 0 replies; 28+ messages in thread
From: Pavel Machek @ 2006-07-11 21:28 UTC (permalink / raw)
  To: David Brownell; +Cc: Andrew Morton, Linus Torvalds, linux-pm

Hi!

Looks good to me. Few more nits (sorry for two separate mails, I was
in a hurry).

> Call Sequence Guarantees
> --------------------------------------------

This probably needs few less '-'s.


> suspending devices
> ------------------

Capitalize?

> pm_message_t meaning
> --------------------

Capitalize 'meaning'?

> The event codes are used to nuance the goal of suspending the device,
> and mostly matter when creating or resuming suspend-to-disk snapshots:
> 
> PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power
>     state.  When used with system sleep states like "suspend-to-RAM" or
>     "standby", the upcoming resume() call will often be able to rely on
>     state kept in hardware, or issue system wakeup events.  When used
>     instead with suspend-to-disk, few devices support this capability;
>     most are completely powered off.

Previous paragraphs make it sound as if running DMA is allowed while
suspended. That's okay for PM_EVENT_SUSPEND...

> PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into
>     any low power mode.  The driver's resume() will often be called soon.
>     Wakeup events are not allowed.
> 
> PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume()
>     will restore a suspend-to-disk snapshot from a different kernel image.
>     Drivers that are smart enough to look at their hardware state during
>     resume() processing need that state to be correct ... a PRETHAW could
>     be used to invalidate that state (by resetting the device).  Other
>     drivers might handle this the same way as PM_EVENT_FREEZE.

...but not for PM_EVENT_FREEZE/PRETHAW. I guess it should be stated
here...
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-11  5:56 ` Andrew Morton
  2006-07-11 16:38   ` David Brownell
@ 2006-07-11 21:57   ` David Brownell
  2006-07-12 12:25     ` Pavel Machek
  2006-07-12 14:04     ` Alan Stern
  1 sibling, 2 replies; 28+ messages in thread
From: David Brownell @ 2006-07-11 21:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, linux-pm

On Monday 10 July 2006 10:56 pm, Andrew Morton wrote:

> It's all very suspend-the-whole-machine centric.  We don't presently help
> that NIC driver to put itself into a low-power state if there's been no net
> activity for 100 milliseconds.  I guess that comes later.

Though on more thought, there actually are a bunch of mechanisms that
can kick in at runtime, and some useful things that can be said there.

Here's an updated version that covers a bit more of the runtime issues,
albeit without a network driver example.  (As well as addressing the
issues noted by Pavel and Richard.)

It adds examples of runtime suspend mechanisms:  the system timer and
dynamic tick, USB host controllers entering low power modes, and one of
the ways those can interact to make it easier to use x86 C3 states ...

- Dave



Most of the code in Linux is device drivers, so most of the Linux power
management code is also driver-specific.  Most drivers will do very little;
others, especially for platforms with small batteries (like cell phones),
will do a lot.

This writeup gives an overview of how drivers interact with system-wide
power management goals, emphasizing the models and interfaces that are
shared by everything that hooks up to the driver model core.  Read it as
background for the domain-specific work you'd do with any specific driver.


Two Models for Device Power Management
======================================
Drivers will use one of these models to put devices into low-power states:

    1	As part of entering system-wide low-power states like "suspend-to-ram",
	or (mostly for systems with disks) "hibernate" (suspend-to-disk).

	This is something that device, bus, and class drivers collaborate on
	by implementing various role-specific suspend and resume methods to
	cleanly power down hardware and software subsystems, then reactivate
	them without loss of data.

	Some drivers can manage hardware wakeup events, which make the system
	leave that low-power state.  This feature may be disabled using the
	device's power/wakeup file; enabling it may cost some power usage,
	but let the whole system enter low power states more often.

    2	While the system is running, independently of other power management
	activity.  Upstream drivers will normally not know (or care) if the
	device is in some low power state when issuing requests.

	This doesn't have, or need, much infrastructure; it's just something you
	should do when writing your drivers.  For example, clk_disable() unused
	clocks as part of minimizing power drain for currently-unused hardware.
	Of course, sometimes clusters of drivers will collaborate with each
	other, which could involve task-specific power management.

There's not a lot to be said about those low power states except that they
are very system-specific, and often device-specific.  Also, that if enough
drivers put themselves into low power states (model #2), the effect may be
the same as entering some system-wide low-power states (model #1) ... and
that synergies exist, so that several drivers using model #2 can put the
system into a state where even deeper power saving options are available.

Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no
more data read or written, and requests from upstream drivers are no longer
accepted.  A given bus or platform may have different requirements though.

Examples of hardware wakeup events include an alarm from a real time clock,
network wake-on-LAN packets, keyboard or mouse activity, and media insertion
or removal (for PCMCIA, MMC/SD, USB, and so on).


Interfaces for Entering System Suspend States
=============================================
Most of the programming interfaces a device driver needs to know about
relate to that first model:  entering a system-wide low power state,
rather than just minimizing power consumption by one device.


Bus driver methods
------------------
The core methods to suspend and resume devices reside in struct bus_type.
These are mostly of interest to people writing infrastructure for busses
like PCI or USB, or because they define the primitives that device drivers
may need to apply in domain-specific ways to their devices:

struct bus_type {
	...
	int  (*suspend_prepare)(struct device *dev, pm_message_t state);
	int  (*suspend)(struct device *dev, pm_message_t state);
	int  (*suspend_late)(struct device *dev, pm_message_t state);

	int  (*resume_early)(struct device *dev);
	int  (*resume)(struct device *dev);
};

Bus drivers implement those methods as appropriate for the hardware and
the drivers using it; PCI works differently from USB, and so on.  Not many
people write bus drivers; most driver code is a "device driver" that
builds on top of bus-specific framework code.

For more information on these driver calls, see the description later;
they are called in phases for every device, respecting the parent-child
sequencing in the driver model tree.  Note that as this is being written,
only the suspend() and resume() are widely available; not many bus drivers
leverage all of those phases, or pass them down to lower driver levels.
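
As a rough illustration of how a bus-level method gets "morphed" into a
driver call with bus-specific parameters, here is a minimal userspace
sketch.  Everything in it is a stand-in:  the struct layouts, the toy_*
names, and the "slot" parameter are invented for this example and are not
the kernel's definitions.  Returning 0 when no driver method exists is
also just an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types named above; NOT the real ones. */
typedef struct { int event; } pm_message_t;

struct device;

/* Hypothetical bus-specific driver type with its own method signatures. */
struct toy_driver {
	int (*suspend)(struct device *dev, pm_message_t state, int slot);
	int (*resume)(struct device *dev, int slot);
};

struct device {
	struct toy_driver *driver;	/* NULL when no driver is bound */
	int slot;			/* made-up bus-specific parameter */
	int suspended;
};

/* Bus-level suspend(): forward to the bound driver, if it has the method. */
static int toy_bus_suspend(struct device *dev, pm_message_t state)
{
	assert(dev != NULL);
	if (!dev->driver || !dev->driver->suspend)
		return 0;	/* assumption: nothing to do is not an error */
	return dev->driver->suspend(dev, state, dev->slot);
}

/* Bus-level resume(): same forwarding, in the other direction. */
static int toy_bus_resume(struct device *dev)
{
	assert(dev != NULL);
	if (!dev->driver || !dev->driver->resume)
		return 0;
	return dev->driver->resume(dev, dev->slot);
}

/* A trivial driver that only tracks whether it has been suspended. */
static int toy_drv_suspend(struct device *dev, pm_message_t state, int slot)
{
	(void)state;
	(void)slot;
	dev->suspended = 1;
	return 0;
}

static int toy_drv_resume(struct device *dev, int slot)
{
	(void)slot;
	dev->suspended = 0;
	return 0;
}

static struct toy_driver toy_drv = { toy_drv_suspend, toy_drv_resume };
```

The point is only the shape:  the driver model core sees one signature,
and the bus code adapts it to whatever its drivers expect.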


EXAMPLE:  PCI device driver methods
-----------------------------------
PCI framework software calls these methods when the pci device driver bound
to a device has provided them:

struct pci_driver {
	...
	int  (*suspend_prepare)(struct pci_dev *pdev, pm_message_t state);
	int  (*suspend)(struct pci_dev *pdev, pm_message_t state);
	int  (*suspend_late)(struct pci_dev *pdev, pm_message_t state);

	int  (*resume_early)(struct pci_dev *pdev);
	int  (*resume)(struct pci_dev *pdev);
};

Drivers will implement those methods, and call PCI-specific procedures
like pci_set_power_state() and pci_enable_wake() to manage PCI-specific
mechanisms.  Devices are suspended before their bridges enter low power
states, and likewise bridges resume before their devices.


Upper layers of driver stacks
-----------------------------
Device drivers generally have at least two interfaces, and the methods
sketched above are the ones which apply to the lower level (nearer PCI, USB,
or other bus hardware).  The network and block layers are examples of upper
level interfaces, as is a character device talking to userspace.

Power management requests normally need to flow through those upper levels,
which often use domain-oriented requests like "blank that screen".  In
some cases those upper levels will have power management intelligence that
relates to end-user activity, or other devices that work in cooperation.

When those interfaces are structured using class interfaces, there is a
standard way to have the upper layer stop issuing requests to a given
class device (and restart later):

struct class {
	...
	int  (*suspend)(struct device *dev, pm_message_t state);
	int  (*resume)(struct device *dev);
};

Those calls are issued in specific phases of the process by which the
system enters a low power "suspend" state, or resumes from it.


Calling Drivers to enter System Sleep States
============================================
When the system enters a low power state, each device's driver is asked
to suspend the device by putting it into state compatible with the target
system state.  That's usually some version of "off", but the details are
system-specific.  Also, wakeup-enabled devices will usually stay partly
functional in order to wake the system.

When the system leaves that low power state, the device's driver is asked
to resume it.  The suspend and resume operations always go together, and
both are multi-phase operations.

For simple drivers, suspend might quiesce the device using the class code
and then turn its hardware as "off" as possible with suspend_late.  The
matching resume calls would then completely reinitialize the hardware
before reactivating its class I/O queues.

More power-aware drivers will use more than one device low power
state, either at runtime or during system sleep states, and might trigger
system wakeup events.


Call Sequence Guarantees
------------------------
To ensure that bridges and similar links needed to talk to a device are
available when the device is suspended or resumed, the device tree is
walked in a bottom-up order to suspend devices.  A top-down order is
used to resume those devices.

The ordering of the device tree is defined by the order in which devices
get registered:  a child can never be registered/probed or resumed before
its parent, or removed/suspended after that parent.

The policy is that the device tree should match hardware bus topology.
(Or at least the control bus, for devices which use multiple busses.)
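
The ordering rule can be sketched with a plain list kept in registration
order.  This is a userspace model, not the driver core:  toy_dev,
toy_register(), and the call log are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 8

struct toy_dev {
	const char *name;
	int suspended;
};

/* Devices in registration order: parents always precede their children. */
static struct toy_dev *dev_list[MAX_DEVS];
static size_t dev_count;

/* Order in which devices were visited, kept so the order can be inspected. */
static const char *call_log[4 * MAX_DEVS];
static size_t call_log_len;

static void toy_log(const char *name)
{
	if (call_log_len < sizeof(call_log) / sizeof(call_log[0]))
		call_log[call_log_len++] = name;
}

static int toy_register(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev_count == MAX_DEVS)
		return -1;
	dev_list[dev_count++] = dev;
	return 0;
}

/* Walk the list backwards: children suspend before their parents. */
static void toy_suspend_all(void)
{
	for (size_t i = dev_count; i-- > 0; ) {
		dev_list[i]->suspended = 1;
		toy_log(dev_list[i]->name);
	}
}

/* Walk the list forwards: parents resume before their children. */
static void toy_resume_all(void)
{
	for (size_t i = 0; i < dev_count; i++) {
		dev_list[i]->suspended = 0;
		toy_log(dev_list[i]->name);
	}
}
```

With a bridge registered before the NIC behind it, the NIC is suspended
first and resumed last, so the bridge is always usable while the NIC's
methods run.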


Suspending Devices
------------------
Suspending a given device is done in several phases.  Each phase will be
omitted if it's not relevant for that device.  Other devices will often
be suspending at the same time, so each phase will normally be done for
all devices before the next phase begins.

In order, the phases are:

   1	bus.suspend_prepare(dev) is called early, with all tasks active.
   	This method may sleep, and is often morphed into a device
   	driver call with bus-specific parameters.

	Drivers that need to synchronize their suspension with various
	management activities could synchronize here.  This may be a good
	point to arrange that predictable actions, like removing media from
	suspended systems, will not cause trouble on resume.

   2	class.suspend(dev) is called after tasks are frozen, for devices
	associated with a class that has such a method.

	Since I/O activity usually comes from such higher layers, this is
	a good place to quiesce all drivers of a given type (and keep such
	code out of those drivers).

   3	bus.suspend(dev) is called next.  This method may sleep, and is often
        morphed into a device driver call with bus-specific parameters.

	This call should handle parts of device suspend logic that require
	sleeping.  It probably does work to quiesce the device which hasn't
	been abstracted into class.suspend() or bus.suspend_late().

   4	bus.suspend_late(dev) is called with IRQs disabled, and with only
	one CPU active.  This may be morphed into a driver call
	with bus-specific parameters.

	This call might save low level hardware state that might otherwise
	be lost in the upcoming low power state, and actually put the
	device into a low power state ... so that in some cases the device
	may stay partly usable until this late.  This "late" call may also
	help when coping with hardware that behaves badly.

At the end of those phases, drivers should normally have stopped all I/O
transactions (DMA, IRQs), saved enough state that they can re-initialize
or restore previous state (as needed by the hardware), and placed the
device into a low-power state.  On many platforms they will also use
clk_disable() to gate off one or more clock sources; sometimes they will
also switch off power supplies, or reduce voltages.  A pm_message_t
parameter is currently used to nuance those semantics (described later).
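
One way to picture "each phase is done for all devices before the next
phase begins" is the loop nesting below.  It is a simplified userspace
model:  the toy_* names and the phase enum are made up, and stopping at
the first error is an assumption, since recovery from a failed phase is
not covered here.

```c
#include <assert.h>
#include <stddef.h>

enum toy_phase {
	PH_SUSPEND_PREPARE,	/* bus.suspend_prepare() */
	PH_CLASS_SUSPEND,	/* class.suspend() */
	PH_BUS_SUSPEND,		/* bus.suspend() */
	PH_SUSPEND_LATE,	/* bus.suspend_late() */
	PH_COUNT
};

struct toy_dev {
	/* One optional callback per phase; NULL means "not relevant here". */
	int (*phase_fn[PH_COUNT])(struct toy_dev *dev);
	/* Global sequence number at which each phase ran; 0 if it never ran. */
	int ran_at[PH_COUNT];
};

static int toy_seq;

/* Example callback that always succeeds. */
static int toy_ok(struct toy_dev *dev)
{
	(void)dev;
	return 0;
}

/*
 * devs[] is in registration order (parents first).  The outer loop is the
 * phase, the inner loop walks devices children-first, so every device
 * finishes one phase before any device starts the next one.
 */
static int toy_suspend_devices(struct toy_dev **devs, size_t n)
{
	assert(devs != NULL);
	for (int ph = 0; ph < PH_COUNT; ph++) {
		for (size_t i = n; i-- > 0; ) {
			struct toy_dev *dev = devs[i];
			int err;

			if (!dev->phase_fn[ph])
				continue;	/* phase omitted for this device */
			err = dev->phase_fn[ph](dev);
			if (err)
				return err;	/* assumption: just stop here */
			dev->ran_at[ph] = ++toy_seq;
		}
	}
	return 0;
}
```

A device that only provides a bus-level suspend simply drops out of the
other three passes.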

When any driver sees that its device_can_wakeup(dev), it should make sure
to use the relevant hardware signals to trigger a system wakeup event.
For example, enable_irq_wake() might identify GPIO signals hooked up to
a switch or other external hardware, and pci_enable_wake() does something
similar for PCI's PME# signal.


Low Power (suspend) States
--------------------------
Device low-power states aren't very standard.  One device might only handle
"on" and "off", while another might support a dozen different versions of
"on" (how many engines are active?), plus a state that gets back to "on"
faster than from a full "off".

Some busses define rules about what different suspend states mean.  PCI
gives one example:  after the suspend sequence completes, a non-legacy
PCI device may not perform DMA or issue IRQs, and any wakeup events it
issues would be issued through the PME# bus signal.  Plus, there are
several PCI-standard device states, some of which are optional.

In contrast, integrated system-on-chip processors often use irqs as the
wakeup event sources (so drivers would call enable_irq_wake) and might
be able to treat DMA completion as a wakeup event (sometimes DMA can stay
active too, it'd only be the CPU and some peripherals that sleep).

Some details here may be platform-specific.  Systems may have devices that
can be fully active in certain sleep states, such as an LCD display that's
refreshed using DMA while most of the system is sleeping lightly ... and
its framebuffer might even be updated by a DSP or other non-Linux CPU while
the Linux control processor stays idle.

Moreover, the specific actions taken may depend on the target system state.
One target system state might allow a given device to be very operational;
another might require a hard shut down with re-initialization on resume.
And two different target systems might use the same device in different
ways; the aforementioned LCD might be active in one product's "standby",
but a different product using the same SOC might work differently.


Meaning of pm_message_t.event
-----------------------------
Parameters to suspend calls include the device affected and a message of
type pm_message_t, which has one field:  the event.  If a driver does not
recognize the event code, suspend calls may abort the request and return
a negative errno.  However, most drivers will be fine if they implement
PM_EVENT_SUSPEND semantics for all messages.

The event codes are used to nuance the goal of suspending the device,
and mostly matter when creating or resuming suspend-to-disk snapshots:

    PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power
	state.  When used with system sleep states like "suspend-to-RAM" or
	"standby", the upcoming resume() call will often be able to rely on
	state kept in hardware, or issue system wakeup events.  When used
	instead with suspend-to-disk, few devices support this capability;
	most are completely powered off.

    PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into
	any low power mode.  The driver's resume() will often be called soon.
	Neither wakeup events nor DMA are allowed.

    PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume()
	will restore a suspend-to-disk snapshot from a different kernel image.
	Drivers that are smart enough to look at their hardware state during
	resume() processing need that state to be correct ... a PRETHAW could
	be used to invalidate that state (by resetting the device).  Other
	drivers might handle this the same way as PM_EVENT_FREEZE.

There's also PM_EVENT_ON, a value which never appears as a suspend event
but is sometimes used to record the "not suspended" device state.
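
A driver's suspend method might dispatch on the event code roughly like
this.  The sketch runs in userspace:  the enum values are placeholders
rather than the kernel's numbers, struct toy_hw is invented, and -EINVAL
for an unrecognized event is only one plausible choice of negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-ins; the numeric values are placeholders, not the kernel's. */
enum { PM_EVENT_ON, PM_EVENT_FREEZE, PM_EVENT_SUSPEND, PM_EVENT_PRETHAW };
typedef struct { int event; } pm_message_t;

/* Made-up model of the device state a driver would be managing. */
struct toy_hw {
	int dma_running;
	int wakeup_armed;
	int low_power;
	int was_reset;
	int can_wakeup;
};

static int toy_suspend(struct toy_hw *hw, pm_message_t state)
{
	assert(hw != NULL);

	switch (state.event) {
	case PM_EVENT_SUSPEND:
		/* Quiesce, arm wakeup if the device supports it, power down. */
		hw->dma_running = 0;
		hw->wakeup_armed = hw->can_wakeup;
		hw->low_power = 1;
		return 0;
	case PM_EVENT_PRETHAW:
		/* Invalidate hardware state so resume() won't trust it. */
		hw->was_reset = 1;
		/* fall through */
	case PM_EVENT_FREEZE:
		/* Quiesce only: no DMA, no wakeup events, no power change. */
		hw->dma_running = 0;
		hw->wakeup_armed = 0;
		return 0;
	default:
		return -EINVAL;	/* unrecognized event: abort the request */
	}
}
```

A driver that doesn't care about the distinction could instead treat
every event like PM_EVENT_SUSPEND, as noted above.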


Resuming Devices
----------------
Resuming is similarly done in multiple phases:

   1	bus.resume_early(dev) is called with IRQs disabled, and with
   	only one CPU active.

	This reverses the effects of bus.suspend_late().

   2	bus.resume(dev) is called next.  This may be morphed into a device
   	driver call with bus-specific parameters.

	This reverses the effects of bus.suspend().

   3	class.resume(dev) is called for devices associated with a class
	that has such a method.

	This reverses the effects of class.suspend(), and would usually
	reactivate the device's I/O queue.

Drivers need to be able to handle hardware which has been reset since
the suspend methods were called, for example by complete reinitialization.
(This is the hardest part, and the one most protected by NDA'd documents
and chip errata.) At the end of those phases, drivers should normally be
as functional as they were before suspending:  I/O can be performed using DMA
and IRQs, and the relevant clocks are gated on.

However, the details here may again be platform-specific.  For example,
some systems support multiple "run" states, and the mode in effect at
the end of resume() might not be the one which preceded suspension.
That means availability of certain clocks or power supplies changed,
which could easily affect how a driver works.
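
Coping with hardware that was reset during the sleep state often comes
down to "save on suspend, check and reprogram on resume".  A minimal
userspace sketch follows; the register layout, the TOY_CTRL_ENABLE bit,
and the idea that a cleared enable bit indicates a reset are all
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up bit: set whenever the driver has configured the device. */
#define TOY_CTRL_ENABLE 0x1u

struct toy_regs {
	unsigned int ctrl;
	unsigned int irq_mask;
};

struct toy_dev {
	struct toy_regs hw;	/* stands in for the device's registers */
	struct toy_regs saved;	/* copy kept in RAM across the sleep state */
	int reinit_count;	/* how often resume had to reprogram */
};

/* Save enough state to rebuild the register contents later. */
static int toy_suspend(struct toy_dev *dev)
{
	assert(dev != NULL);
	dev->saved = dev->hw;
	return 0;
}

/* If the registers look like they were reset, reprogram all of them. */
static int toy_resume(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (!(dev->hw.ctrl & TOY_CTRL_ENABLE)) {
		dev->hw = dev->saved;
		dev->reinit_count++;
	}
	return 0;
}
```

Real hardware rarely makes the "was I reset?" test this easy, which is
part of why this is the hard part.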


System Devices
--------------
System devices follow a slightly different API, which can be found in

	include/linux/sysdev.h
	drivers/base/sys.c

System devices will only be suspended with interrupts disabled, and
after all other devices have been suspended. On resume, they will be
resumed before any other devices, and also with interrupts disabled.

That is, sysdev_driver.suspend() is called right after the suspend_late()
phase; sysdev_driver.resume() is called before the resume_early() phase.


Runtime Power Management
========================
Many devices are able to dynamically power down while the system is still
running. This feature is useful for devices that are not being used, and
can offer significant power savings on a running system; these devices
often support a range of runtime power states.  Those states will in some
cases (like PCI) be constrained by a bus the device uses, and will usually
be hardware states that are also used in system sleep states.

However, note that if a driver puts a device into a runtime low power state
and the system then goes into a system-wide sleep state, it normally ought
to resume into that runtime low power state rather than "full on".  That
distinction would be part of the driver-internal state machine for that
hardware.
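
A driver-internal state machine with that property could look something
like the following userspace sketch.  The states and toy_* functions are
invented here; the only point is that resume() returns to whatever state
the device was in before the system went to sleep.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_ON, TOY_RUNTIME_IDLE, TOY_SYSTEM_SLEEP };

struct toy_dev {
	enum toy_state state;
	enum toy_state before_sleep;	/* state to return to on resume */
};

/* Runtime PM: the driver powers the device down when it has no work. */
static void toy_runtime_idle(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev->state == TOY_ON)
		dev->state = TOY_RUNTIME_IDLE;
}

/* Runtime PM: a new request arrived, so bring the device back up. */
static void toy_runtime_wake(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev->state == TOY_RUNTIME_IDLE)
		dev->state = TOY_ON;
}

/* System sleep: remember where we were before going down. */
static int toy_suspend(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev->state == TOY_SYSTEM_SLEEP)
		return 0;
	dev->before_sleep = dev->state;
	dev->state = TOY_SYSTEM_SLEEP;
	return 0;
}

/* System resume: go back to the remembered state, not blindly to "on". */
static int toy_resume(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev->state != TOY_SYSTEM_SLEEP)
		return 0;
	dev->state = dev->before_sleep;
	return 0;
}
```

So a device that was idle at runtime before the system slept is still
idle afterwards, and only goes to "on" when a request actually arrives.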


Power Saving Techniques
-----------------------
Normally runtime power management is handled by the drivers without specific
userspace or kernel intervention, by device-aware use of techniques like:

    Using fewer CPU cycles
	- shortening "hot" code paths
	- using DMA instead of PIO
	- (sometimes) offloading work to device firmware
	- removing timers, or making them lower frequency
	- eliminating cache misses

    Reducing other resource costs
	- disabling unused clocks in software
	- letting hardware gate off unused clocks
	- switching off unused power supplies
	- eliminating (or delaying/merging) IRQs
	- avoiding needless DMA transfers
	- tuning DMA to avoid byte transfers (word and/or burst modes)
	- using lower voltages
	- switching off unused intermediate busses

    Using device-specific low power states

Read your hardware documentation carefully to see the opportunities that
may be available.  If you can, measure the actual power usage and check
it against the budget established for your project.


Examples:  USB hosts, system timer, system CPU
----------------------------------------------
USB host controllers make interesting, if complex, examples.  In many cases
these have no work to do:  no USB devices are connected, or all of them are
in the USB "suspend" state.  Linux host controller drivers can then disable
periodic DMA transfers that would otherwise be a constant power drain on the
memory subsystem, and enter a suspend state.  In power-aware controllers,
entering that suspend state may disable the clock used with USB signaling
(12 or 480 MHz), saving a certain amount of power.

The controller will be woken from that state (with an IRQ) by changes to the
signal state on the data lines of a given port, for example by an existing
peripheral requestin "remote wakeup" or by plugging a new peripheral.  The
same wakeup mechanism usually works from "standby" sleep states, and on some
systems also from "suspend to RAM" (or even "suspend to disk") states.

System devices like clocks and CPUs may have special roles in the platform
power management scheme.  For example, system timers using a "dynamic tick"
approach don't just save CPU cycles (by eliminating needless timer IRQs),
but they may also open the door to using lower power CPU "idle" states that
cost more than a jiffy to enter or exit.  On x86 systems these are states
like "C3"; note that periodic DMA transfers from a USB host controller will
also prevent entry to a C3 state, much like a periodic timer IRQ.  That kind
of interaction among runtime mechanisms is common.

If the CPU has "cpufreq" support, there also may be opportunities to shift
to lower voltage settings, which reduces the power of executing a given
number of instructions.


/sys/devices/.../power/state files
----------------------------------
For now you can also test some of this functionality using sysfs.

	DEPRECATED:  USE "power/state" ONLY FOR DRIVER TESTING.
	IT WILL BE REMOVED, AND REPLACED WITH SOMETHING WHICH
	GIVES MORE APPROPRIATE ACCESS TO THE RANGE OF POWER STATES
	EACH TYPE OF DEVICE AND BUS SUPPORTS THROUGH ITS DRIVER.

In each device's directory, there is a 'power' directory, which
contains at least a 'state' file. Reading from this file displays what
power state the device is currently in. Writing to this file initiates
a transition to the specified power state, which must currently be a
number: '0' for 'on', or either '2' or '3' for 'suspended' (which
sometimes also implies reset, especially in conjunction with deeper
system sleep states).

The PM core will call the ->suspend() method in the bus_type object
that the device belongs to if the specified state is not 0, or
->resume() if it is.

When using this interface to suspend, the PM core does not call the
bus.suspend_prepare() or bus.suspend_late() methods.  When using it
to resume, the PM core does not call the bus.resume_early() method.

Nothing will happen if the specified state is the same state the
device is currently in.

The driver is responsible for saving the working state of the device
and putting it into the low-power state specified. If this was
successful, it returns 0, and the device's power_state field is
updated.

The drivers must also take care not to suspend a device that is
currently in use. It is their responsibility to provide their own
exclusion mechanisms.

There is currently no way to know what states a device or driver
supports a priori. This may change in the future.


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-11 21:57   ` David Brownell
@ 2006-07-12 12:25     ` Pavel Machek
  2006-07-12 14:04     ` Alan Stern
  1 sibling, 0 replies; 28+ messages in thread
From: Pavel Machek @ 2006-07-12 12:25 UTC (permalink / raw)
  To: David Brownell; +Cc: Andrew Morton, torvalds, linux-pm

Hi!

Not sure if I'm commenting on latest version...

>     PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into
> 	any low power mode.  The driver's resume() will often be called soon.
> 	Neither wakeup events nor DMA are allowed.
> 
>     PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume()
> 	will restore a suspend-to-disk snapshot from a different kernel image.
> 	Drivers that are smart enough to look at their hardware state during
> 	resume() processing need that state to be correct ... a PRETHAW could
> 	be used to invalidate that state (by resetting the device).  Other
> 	drivers might handle this the same way as PM_EVENT_FREEZE.

"Neither wakeup events nor DMA are allowed." should be added here,
too.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-11 21:57   ` David Brownell
  2006-07-12 12:25     ` Pavel Machek
@ 2006-07-12 14:04     ` Alan Stern
  2006-07-12 15:45       ` David Brownell
  1 sibling, 1 reply; 28+ messages in thread
From: Alan Stern @ 2006-07-12 14:04 UTC (permalink / raw)
  To: David Brownell; +Cc: Linux-pm mailing list

Just a few comments...

On Tue, 11 Jul 2006, David Brownell wrote:

> Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no
> more data read or written, and requests from upstream drivers are no longer
> accepted.  A given bus or platform may have different requirements though.

Mention that a driver may decide to leave the upstream drivers active and 
automatically resume the device whenever a request arrives from upstream.

>    1	bus.suspend_prepare(dev) is called early, with all tasks active.
>    	This method may sleep, and is often morphed into a device
>    	driver call with bus-specific parameters.
> 
> 	Drivers that need to synchronize their suspension with various
> 	management activities could synchronize here.  This may be a good
> 	point to arrange that predictable actions, like removing media from
> 	suspended systems, will not cause trouble on resume.

It's worth mentioning another example: arranging that firmware images
needed for resuming are available, because the userspace helpers that 
provide the firmware won't be running later on.

> also switch off power supplies, or reduce voltages.  A pm_message_t
> parameter is currently used to nuance those semantics (described later).

Please don't use "nuance" as a verb!  (Occurs in several places)

> Resuming Devices
> ----------------
> Resuming is similarly done in multiple phases:
> 
>    1	bus.resume_early(dev) is called with IRQs disabled, and with
>    	only one CPU active.
> 
> 	This reverses the effects of bus.suspend_late().
> 
>    2	bus.resume(dev) is called next.  This may be morphed into a device
>    	driver call with bus-specific parameters.
> 
> 	This reverses the effects of bus.suspend().
> 
>    3	class.resume(dev) is called for devices associated with a class
> 	that has such a method.
> 
> 	This reverses the effects of class.suspend(), and would usually
> 	reactivate the device's I/O queue.

Note that there is no reversal of bus.suspend_early().

> Runtime Power Management
> ========================
> Many devices are able to dynamically power down while the system is still
> running. This feature is useful for devices that are not being used, and
> can offer significant power savings on a running system; these devices
> often support a range of runtime power states.  Those states will in some
> cases (like PCI) be constrained by a bus the device uses, and will usually
> be hardware states that are also used in system sleep states.
> 
> However, note that if a driver puts a device into a runtime low power state
> and the system then goes into a system-wide sleep state, it normally ought
> to resume into that runtime low power state rather than "full on".  That
> distinction would be part of the driver-internal state machine for that
> hardware.

Is that last bit really right?  The PM core records each device's power 
state when beginning a system sleep, and it tries to arrange to leave the 
device in that same state when the sleep ends.  This may involve a 
transition through the fully-on state...  I'm not clear on the details.

> The controller will be woken from that state (with an IRQ) by changes to the
> signal state on the data lines of a given port, for example by an existing
> peripheral requestin "remote wakeup" or by plugging a new peripheral.  The
----------------------^ typo

> /sys/devices/.../power/state files
> ----------------------------------

Add a paragraph explaining /sys/devices/.../power/wakeup also, even though 
it may not exist in the vanilla kernel yet.

> For now you can also test some of this functionality using sysfs.
> 
> 	DEPRECATED:  USE "power/state" ONLY FOR DRIVER TESTING.
> 	IT WILL BE REMOVED, AND REPLACED WITH SOMETHING WHICH
> 	GIVES MORE APPROPRIATE ACCESS TO THE RANGE OF POWER STATES
> 	EACH TYPE OF DEVICE AND BUS SUPPORTS THROUGH ITS DRIVER.

Once people have decided on what that "something" should be!  :-)

Alan Stern


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-12 14:04     ` Alan Stern
@ 2006-07-12 15:45       ` David Brownell
  2006-07-12 16:03         ` Alan Stern
  0 siblings, 1 reply; 28+ messages in thread
From: David Brownell @ 2006-07-12 15:45 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list

On Wednesday 12 July 2006 7:04 am, Alan Stern wrote:
> Just a few comments...
> 
> On Tue, 11 Jul 2006, David Brownell wrote:
> 
> > Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no
> > more data read or written, and requests from upstream drivers are no longer
> > accepted.  A given bus or platform may have different requirements though.
> 
> Mention that a driver may decide to leave the upstream drivers active and 
> automatically resume the device whenever a request arrives from upstream.

I added similar text a para earlier, right after where the #2 model says
that upstream drivers won't know or care about this state.  Right there, I
think such nuances would be confusing ... that whole section is just
background and frame-setting.


> >    1	bus.suspend_prepare(dev) is called early, with all tasks active.
> >    	This method may sleep, and is often morphed into a device
> >    	driver call with bus-specific parameters.
> > 
> > 	Drivers that need to synchronize their suspension with various
> > 	management activities could synchronize here.  This may be a good
> > 	point to arrange that predictable actions, like removing media from
> > 	suspended systems, will not cause trouble on resume.
> 
> It's worth mentioning another example: arranging that firmware images
> needed for resuming are available, because the userspace helpers that 
> provide the firmware won't be running later on.

Hmm, agreed that firmware reload is an issue, but I'm not totally sure
that's the right place to address that ... I'll add "needing to reload
firmware" as another predictable action though.


> > also switch off power supplies, or reduce voltages.  A pm_message_t
> > parameter is currently used to nuance those semantics (described later).
> 
> Please don't use "nuance" as a verb!  (Occurs in several places)

Erm, OK; "refine" then ... the fact that something can be nuanced
evidently does not imply that something has nuanced it.  ;)


> Note that there is no reversal of bus.suspend_early().

Self-evident yes?  Maybe we _should_ have a resume_late() though...


> > Runtime Power Management
> > ========================
> > Many devices are able to dynamically power down while the system is still
> > running. This feature is useful for devices that are not being used, and
> > can offer significant power savings on a running system; these devices
> > often support a range of runtime power states.  Those states will in some
> > cases (like PCI) be constrained by a bus the device uses, and will usually
> > be hardware states that are also used in system sleep states.
> > 
> > However, note that if a driver puts a device into a runtime low power state
> > and the system then goes into a system-wide sleep state, it normally ought
> > to resume into that runtime low power state rather than "full on".  That
> > distinction would be part of the driver-internal state machine for that
> > hardware.
> 
> Is that last bit really right?  The PM core records each device's power 
> state when beginning a system sleep, and it tries to arrange to leave the 
> device in that same state when the sleep ends.  This may involve a 
> transition through the fully-on state...  I'm not clear on the details.

I think the PM core should stop recording _its_ notion of the power state,
but that's a different discussion.

Yes that bit is right, it's just emphasizing that there _is_ a state machine
in the driver (or should be!), and that with runtime power management it's
not going to be in lock-step with system suspend/resume transitions.  That
is after all the whole point of runtime device PM:  to decouple from those
system-wide transitions so that "system on" needn't imply "device on".


> > The controller will be woken from that state (with an IRQ) by changes to the
> > signal state on the data lines of a given port, for example by an existing
> > peripheral requestin "remote wakeup" or by plugging a new peripheral.  The
> ----------------------^ typo

Fixed.

> > /sys/devices/.../power/state files
> > ----------------------------------
> 
> Add a paragraph explaining /sys/devices/.../power/wakeup also, even though 
> it may not exist in the vanilla kernel yet.

It does exist in vanilla kernels, for several releases now!  And mentioned
in several places there already.  It's not just for runtime PM either; I'll
add a section earlier on.

 
> > For now you can also test some of this functionality using sysfs.
> > 
> > 	DEPRECATED:  USE "power/state" ONLY FOR DRIVER TESTING.
> > 	IT WILL BE REMOVED, AND REPLACED WITH SOMETHING WHICH
> > 	GIVES MORE APPROPRIATE ACCESS TO THE RANGE OF POWER STATES
> > 	EACH TYPE OF DEVICE AND BUS SUPPORTS THROUGH ITS DRIVER.
> 
> Once people have decided on what that "something" should be!  :-)

Certainly.  But there seems to be no disagreement that power/state is
broken for a variety of reasons.

- Dave



> Alan Stern
> 


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-12 15:45       ` David Brownell
@ 2006-07-12 16:03         ` Alan Stern
  2006-07-23  1:37           ` David Brownell
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2006-07-12 16:03 UTC (permalink / raw)
  To: David Brownell; +Cc: Linux-pm mailing list

On Wed, 12 Jul 2006, David Brownell wrote:

> > Is that last bit really right?  The PM core records each device's power 
> > state when beginning a system sleep, and it tries to arrange to leave the 
> > device in that same state when the sleep ends.  This may involve a 
> > transition through the fully-on state...  I'm not clear on the details.
> 
> I think the PM core should stop recording _its_ notion of the power state,
> but that's a different discussion.
> 
> Yes that bit is right, it's just emphasizing that there _is_ a state machine
> in the driver (or should be!), and that with runtime power management it's
> not going to be in lock-step with system suspend/resume transitions.  That
> is after all the whole point of runtime device PM:  to decouple from those
> system-wide transitions so that "system on" needn't imply "device on".

One oddity worth mentioning somewhere is that the PM core will never ask a
driver to make a transition between two states unless one of them is ON.  
This can sometimes lead to unexpected effects, some of which might show up 
during a system resume.

Alan Stern


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-12 16:03         ` Alan Stern
@ 2006-07-23  1:37           ` David Brownell
  2006-07-23  3:59             ` Alan Stern
  0 siblings, 1 reply; 28+ messages in thread
From: David Brownell @ 2006-07-23  1:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list

On Wednesday 12 July 2006 9:03 am, Alan Stern wrote:

> One oddity worth mentioning somewhere is that the PM core will never ask a
> driver to make a transition between two states unless one of them is ON.

That's not exactly true ... in the senses that

 (a) dev->power.power_state.event is not always updated.  Some drivers do;
     as do sysfs update codepaths (clobbering updates from those drivers).

 (b) suspend_device() checks the recorded event, but resume_device() doesn't.
     My guess is that's to defend against sysfs updates, since clearly
     it'll never matter as part of a system suspend or resume sequence.

 (c) dpm_runtime_{suspend,resume} check those recorded event codes, before
     deciding to call {suspend,resume}_device().  This is in support of
     that sysfs update glitch.

 (d) ON is really an event not a state (confusingly enough it can also
     be a "message" holding that event).  OK, maybe that's a nit, but
     it does reflect conceptual confusion in the current API.

In short that's kind of a mess.  IMO the correct approach involves removing
the dev->power.power_state thing entirely, along with the sysfs thing, but
we can't do that quite yet.


> This can sometimes lead to unexpected effects, some of which might show up 
> during a system resume.

Everything related to dev->power.power_state can cause confusion.  I'd
rather not try to cast the confusion in stone by documenting it, and the
current writeup mostly ignores the issue ... although I did update it
(see the appended text) to clarify what's going on in the deprecated
sysfs stuff.  (It also assumes the patch that rejects the sysfs ops
if suspend_{prepare,late} or resume_early are needed will be merged.)

- Dave


Most of the code in Linux is device drivers, so most of the Linux power
management code is also driver-specific.  Most drivers will do very little;
others, especially for platforms with small batteries (like cell phones),
will do a lot.

This writeup gives an overview of how drivers interact with system-wide
power management goals, emphasizing the models and interfaces that are
shared by everything that hooks up to the driver model core.  Read it as
background for the domain-specific work you'd do with any specific driver.


Two Models for Device Power Management
======================================
Drivers will use one or both of these models to put devices into low-power
states:

    1	As part of entering system-wide low-power states like "suspend-to-ram",
	or (mostly for systems with disks) "hibernate" (suspend-to-disk).

	This is something that device, bus, and class drivers collaborate on
	by implementing various role-specific suspend and resume methods to
	cleanly power down hardware and software subsystems, then reactivate
	them without loss of data.

	Some drivers can manage hardware wakeup events, which make the system
	leave that low-power state.  This feature may be disabled using the
	relevant /sys/devices/.../power/wakeup file; enabling it may cost some
	power, but lets the whole system enter low power states more often.

    2	While the system is running, independently of other power management
	activity.  Upstream drivers will normally not know (or care) if the
	device is in some low power state when issuing requests; the driver
	will auto-resume anything that's needed when it gets a request.

	This doesn't have, or need, much infrastructure; it's just something you
	should do when writing your drivers.  For example, clk_disable() unused
	clocks as part of minimizing power drain for currently-unused hardware.
	Of course, sometimes clusters of drivers will collaborate with each
	other, which could involve task-specific power management.

There's not a lot to be said about those low power states except that they
are very system-specific, and often device-specific.  Also, if enough
drivers put themselves into low power states (model #2), the effect may be
the same as entering some system-wide low-power state (model #1) ... and
synergies exist, so that several drivers using model #2 can put the
system into a state where even deeper power saving options are available.

Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no
more data read or written, and requests from upstream drivers are no longer
accepted.  A given bus or platform may have different requirements though.

Examples of hardware wakeup events include an alarm from a real time clock,
network wake-on-LAN packets, keyboard or mouse activity, and media insertion
or removal (for PCMCIA, MMC/SD, USB, and so on).


Interfaces for Entering System Sleep States
===========================================
Most of the programming interfaces a device driver needs to know about
relate to that first model:  entering a system-wide low power state,
rather than just minimizing power consumption by one device.


Bus Driver Methods
------------------
The core methods to suspend and resume devices reside in struct bus_type.
These are mostly of interest to people writing infrastructure for busses
like PCI or USB, or because they define the primitives that device drivers
may need to apply in domain-specific ways to their devices:

struct bus_type {
	...
	int  (*suspend_prepare)(struct device *dev, pm_message_t state);
	int  (*suspend)(struct device *dev, pm_message_t state);
	int  (*suspend_late)(struct device *dev, pm_message_t state);

	int  (*resume_early)(struct device *dev);
	int  (*resume)(struct device *dev);
};

Bus drivers implement those methods as appropriate for the hardware and
the drivers using it; PCI works differently from USB, and so on.  Not many
people write bus drivers; most driver code is a "device driver" that
builds on top of bus-specific framework code.

For more information on these driver calls, see the description later;
they are called in phases for every device, respecting the parent-child
sequencing in the driver model tree.  Note that as this is being written,
only the suspend() and resume() are widely available; not many bus drivers
leverage all of those phases, or pass them down to lower driver levels.


/sys/devices/.../power/wakeup files
-----------------------------------
All devices in the driver model have two flags to control handling of
wakeup events, which are hardware signals that can force the device and/or
system out of a low power state.  These are initialized by bus or device
driver code using device_init_wakeup(dev,can_wakeup).

The "can_wakeup" flag just records whether the device (and its driver) can
physically support wakeup events.  When that flag is clear, the sysfs
"wakeup" file is empty, and device_may_wakeup() returns false.

For devices that can issue wakeup events, a separate flag controls whether
that device should try to use its wakeup mechanism.  The initial value of
device_may_wakeup() will be true, so that the device's "wakeup" file holds
the value "enabled".  Userspace can change that to "disabled" so that
device_may_wakeup() returns false; or change it back to "enabled" (so that
it returns true again).
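
As a rough illustration of how those two flags interact, here is a small
userspace model.  It is only a sketch:  "struct model_wakeup" and every
model_*() function are hypothetical stand-ins written for this example, not
the driver model's own code, and the -1 error return is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the two per-device wakeup flags. */
struct model_wakeup {
	bool can_wakeup;	/* device and driver can issue wakeup events */
	bool should_wakeup;	/* policy flag that userspace may change */
};

/* Models device_init_wakeup(dev, can_wakeup):  as described above, a
 * wakeup-capable device starts out with wakeup enabled. */
static void model_init_wakeup(struct model_wakeup *w, bool can_wakeup)
{
	assert(w != NULL);
	w->can_wakeup = can_wakeup;
	w->should_wakeup = can_wakeup;
}

/* Models device_may_wakeup(dev):  true only if capable AND enabled. */
static bool model_may_wakeup(const struct model_wakeup *w)
{
	return w->can_wakeup && w->should_wakeup;
}

/* Models reading the sysfs "wakeup" file:  empty when not capable. */
static const char *model_wakeup_show(const struct model_wakeup *w)
{
	if (!w->can_wakeup)
		return "";
	return w->should_wakeup ? "enabled" : "disabled";
}

/* Models a userspace write to the "wakeup" file.  Returns 0 on success,
 * or -1 (arbitrary for this sketch) if the request can't be honored. */
static int model_wakeup_store(struct model_wakeup *w, const char *buf)
{
	if (!w->can_wakeup)
		return -1;
	if (strcmp(buf, "enabled") == 0)
		w->should_wakeup = true;
	else if (strcmp(buf, "disabled") == 0)
		w->should_wakeup = false;
	else
		return -1;
	return 0;
}
```

The point of keeping two flags is that capability is a property of the
hardware and driver, while the enable setting is policy that userspace owns.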


EXAMPLE:  PCI Device Driver Methods
-----------------------------------
PCI framework software calls these methods when the PCI device driver bound
to a device has provided them:

struct pci_driver {
	...
	int  (*suspend_prepare)(struct pci_dev *pdev, pm_message_t state);
	int  (*suspend)(struct pci_dev *pdev, pm_message_t state);
	int  (*suspend_late)(struct pci_dev *pdev, pm_message_t state);

	int  (*resume_early)(struct pci_dev *pdev);
	int  (*resume)(struct pci_dev *pdev);
};

Drivers will implement those methods, and call PCI-specific procedures
like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and
pci_restore_state() to manage PCI-specific mechanisms.  (PCI config space
could be saved during driver probe, if it weren't for the fact that some
systems rely on userspace tweaking using setpci.)  Devices are suspended
before their bridges enter low power states, and likewise bridges resume
before their devices.


Upper Layers of Driver Stacks
-----------------------------
Device drivers generally have at least two interfaces, and the methods
sketched above are the ones which apply to the lower level (nearer PCI, USB,
or other bus hardware).  The network and block layers are examples of upper
level interfaces, as is a character device talking to userspace.

Power management requests normally need to flow through those upper levels,
which often use domain-oriented requests like "blank that screen".  In
some cases those upper levels will have power management intelligence that
relates to end-user activity, or other devices that work in cooperation.

When those interfaces are structured using class interfaces, there is a
standard way to have the upper layer stop issuing requests to a given
class device (and restart later):

struct class {
	...
	int  (*suspend)(struct device *dev, pm_message_t state);
	int  (*resume)(struct device *dev);
};

Those calls are issued in specific phases of the process by which the
system enters a low power "suspend" state, or resumes from it.


Calling Drivers to Enter System Sleep States
============================================
When the system enters a low power state, each device's driver is asked
to suspend the device by putting it into a state compatible with the target
system state.  That's usually some version of "off", but the details are
system-specific.  Also, wakeup-enabled devices will usually stay partly
functional in order to wake the system.

When the system leaves that low power state, the device's driver is asked
to resume it.  The suspend and resume operations always go together, and
both are multi-phase operations.

For simple drivers, suspend might quiesce the device using the class code
and then turn its hardware as "off" as possible with suspend_late.  The
matching resume calls would then completely reinitialize the hardware
before reactivating its class I/O queues.

More power-aware drivers will use more than one device low power
state, either at runtime or during system sleep states, and might trigger
system wakeup events.


Call Sequence Guarantees
------------------------
To ensure that bridges and similar links needed to talk to a device are
available when the device is suspended or resumed, the device tree is
walked in a bottom-up order to suspend devices.  A top-down order is
used to resume those devices.

The ordering of the device tree is defined by the order in which devices
get registered:  a child can never be registered, probed or resumed before
its parent; and can't be removed or suspended after that parent.

The policy is that the device tree should match hardware bus topology.
(Or at least the control bus, for devices which use multiple busses.)


Suspending Devices
------------------
Suspending a given device is done in several phases.  Suspending the
system always includes every phase, executing calls for every device
before the next phase begins.  Not all busses or classes support all
these callbacks; and not all drivers use all the callbacks.

In order, the phases are:

   1	bus.suspend_prepare(dev, message) is called early, with all tasks
	active.  This method may sleep, and is often morphed into a device
   	driver call with bus-specific parameters.

	Drivers that need to synchronize their suspension with various
	management activities could synchronize here.  This may be a good
	point to arrange that predictable actions, like removing media from
	suspended systems or needing to reload firmware, will not cause
	trouble on resume.

   2	class.suspend(dev, message) is called after tasks are frozen, for
	devices associated with a class that has such a method.  This
	method may sleep.

	Since I/O activity usually comes from such higher layers, this is
	a good place to quiesce all drivers of a given type (and keep such
	code out of those drivers).

   3	bus.suspend(dev, message) is called next.  This method may sleep,
	and is often morphed into a device driver call with bus-specific
	parameters and/or rules.

	This call should handle parts of device suspend logic that require
	sleeping.  It probably does work to quiesce the device which hasn't
	been abstracted into class.suspend() or bus.suspend_late().

   4	bus.suspend_late(dev, message) is called with IRQs disabled, and
	with only one CPU active.  Until the bus.resume_early() phase
	completes (see later), IRQs are not enabled again.  This method
	won't be exposed by all busses; for message based busses like USB,
	I2C, or SPI, device interactions normally require IRQs.  This bus
	call may be morphed into a driver call with bus-specific parameters.

	This call might save low level hardware state that might otherwise
	be lost in the upcoming low power state, and actually put the
	device into a low power state ... so that in some cases the device
	may stay partly usable until this late.  This "late" call may also
	help when coping with hardware that behaves badly.

At the end of those phases, drivers should normally have stopped all I/O
transactions (DMA, IRQs), saved enough state that they can re-initialize
or restore previous state (as needed by the hardware), and placed the
device into a low-power state.  On many platforms they will also use
clk_disable() to gate off one or more clock sources; sometimes they will
also switch off power supplies, or reduce voltages.  A pm_message_t
parameter is currently used to refine those semantics (described later).

When any driver sees that its device_can_wakeup(dev), it should make sure
to use the relevant hardware signals to trigger a system wakeup event.
For example, enable_irq_wake() might identify GPIO signals hooked up to
a switch or other external hardware, and pci_enable_wake() does something
similar for PCI's PME# signal.

If a driver (or bus, or class) fails its suspend method, the system won't
enter the desired low power state; it will resume all the devices it's
suspended so far.
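
A minimal sketch of that phase-by-phase walk, including the unwind on
failure, might look like the following.  It is a simplified userspace model:
the PH_* phase names, "struct model_dev", and the model_*() functions are
all hypothetical, and the unwind simply marks every device as resumed
rather than calling per-phase resume methods.

```c
#include <assert.h>
#include <stddef.h>

/* The four suspend phases, in the order listed above. */
enum model_phase { PH_PREPARE, PH_CLASS, PH_BUS, PH_LATE, PH_COUNT };

struct model_dev {
	int fail_phase;	/* phase whose method fails, or -1 for none */
	int depth;	/* number of suspend phases currently in effect */
};

/* Undo whatever has been done so far (greatly simplified). */
static void model_unwind(struct model_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		devs[i].depth = 0;
}

/* Run each phase for every device before starting the next phase.
 * devs[] is in registration order, so each phase walks it backwards.
 * Returns 0 on success, or -1 after unwinding if any method fails. */
static int model_suspend_system(struct model_dev *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	for (int ph = 0; ph < PH_COUNT; ph++) {
		for (size_t i = n; i-- > 0; ) {
			if (devs[i].fail_phase == ph) {
				model_unwind(devs, n);
				return -1;
			}
			devs[i].depth = ph + 1;
		}
	}
	return 0;
}
```

Note the loop nesting:  phases are the outer loop and devices the inner one,
which is what "every device before the next phase begins" implies.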

Note that drivers may need to perform different actions based on the target
system lowpower/sleep state.  At this writing, there are only platform
specific APIs through which drivers could determine those target states.


Device Low Power (suspend) States
---------------------------------
Device low-power states aren't very standard.  One device might only handle
"on" and "off", while another might support a dozen different versions of
"on" (how many engines are active?), plus a state that gets back to "on"
faster than from a full "off".

Some busses define rules about what different suspend states mean.  PCI
gives one example:  after the suspend sequence completes, a non-legacy
PCI device may not perform DMA or issue IRQs, and any wakeup events it
issues would be issued through the PME# bus signal.  Plus, there are
several PCI-standard device states, some of which are optional.

In contrast, integrated system-on-chip processors often use irqs as the
wakeup event sources (so drivers would call enable_irq_wake) and might
be able to treat DMA completion as a wakeup event (sometimes DMA can stay
active too, it'd only be the CPU and some peripherals that sleep).

Some details here may be platform-specific.  Systems may have devices that
can be fully active in certain sleep states, such as an LCD display that's
refreshed using DMA while most of the system is sleeping lightly ... and
its frame buffer might even be updated by a DSP or other non-Linux CPU while
the Linux control processor stays idle.

Moreover, the specific actions taken may depend on the target system state.
One target system state might allow a given device to be very operational;
another might require a hard shut down with re-initialization on resume.
And two different target systems might use the same device in different
ways; the aforementioned LCD might be active in one product's "standby",
but a different product using the same SOC might work differently.


Meaning of pm_message_t.event
-----------------------------
Parameters to suspend calls include the device affected and a message of
type pm_message_t, which has one field:  the event.  If a driver does not
recognize the event code, suspend calls may abort the request and return
a negative errno.  However, most drivers will be fine if they implement
PM_EVENT_SUSPEND semantics for all messages.

The event codes are used to refine the goal of suspending the device, and
mostly matter when creating or resuming system memory image snapshots, as
used with suspend-to-disk:

    PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power
	state.  When used with system sleep states like "suspend-to-RAM" or
	"standby", the upcoming resume() call will often be able to rely on
	state kept in hardware, or issue system wakeup events.  When used
	instead with suspend-to-disk, few devices support this capability;
	most are completely powered off.

    PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into
	any low power mode.  A system snapshot is about to be taken, often
	followed by a call to the driver's resume() method.  Neither wakeup
	events nor DMA are allowed.

    PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume()
	will restore a suspend-to-disk snapshot from a different kernel image.
	Drivers that are smart enough to look at their hardware state during
	resume() processing need that state to be correct ... a PRETHAW could
	be used to invalidate that state (by resetting the device), like a
	shutdown() invocation would before a kexec() or system halt.  Other
	drivers might handle this the same way as PM_EVENT_FREEZE.  Neither
	wakeup events nor DMA are allowed.

To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or
the similarly named APM states, only PM_EVENT_SUSPEND is used; for "Suspend
to Disk" (STD, hibernate, ACPI S4), all of those event codes are used.

There's also PM_EVENT_ON, a value which never appears as a suspend event
but is sometimes used to record the "not suspended" device state.
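
To make the event semantics concrete, here is a sketch of how a driver's
suspend method might branch on the event code.  Everything in it is a
userspace stand-in:  the MODEL_EVENT_* constants and model_pm_message_t
mirror the names above but are defined locally with illustrative values,
and "struct model_hw" is a hypothetical record of what the driver did.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local stand-ins for the event codes; the values are illustrative. */
enum model_pm_event {
	MODEL_EVENT_ON,
	MODEL_EVENT_FREEZE,
	MODEL_EVENT_SUSPEND,
	MODEL_EVENT_PRETHAW,
};

typedef struct { int event; } model_pm_message_t;

/* Hypothetical record of the actions a driver took. */
struct model_hw {
	int quiesced;		/* I/O stopped */
	int low_power;		/* hardware put into a low-power state */
	int reset;		/* hardware state invalidated */
	int wakeup_armed;	/* wakeup signal enabled */
};

static int model_suspend(struct model_hw *hw, model_pm_message_t msg,
			 int may_wakeup)
{
	assert(hw != NULL);
	switch (msg.event) {
	case MODEL_EVENT_FREEZE:
		/* Quiesce only; no wakeup events, no DMA. */
		hw->quiesced = 1;
		return 0;
	case MODEL_EVENT_PRETHAW:
		/* Quiesce, then reset so that a different kernel image
		 * won't trust stale hardware state on resume. */
		hw->quiesced = 1;
		hw->reset = 1;
		return 0;
	case MODEL_EVENT_SUSPEND:
		/* Quiesce, enter low power, arm wakeup if allowed. */
		hw->quiesced = 1;
		hw->low_power = 1;
		hw->wakeup_armed = may_wakeup;
		return 0;
	default:
		/* Unrecognized event:  abort with a negative errno. */
		return -EINVAL;
	}
}
```

A driver that wants the simplest behavior could instead treat every
message like MODEL_EVENT_SUSPEND, as the text above suggests.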


Resuming Devices
----------------
Resuming is done in multiple phases, much like suspending, with all
devices processing each phase's calls before the next phase begins:

   1	bus.resume_early(dev) is called with IRQs disabled, and with
   	only one CPU active.  As with bus.suspend_late(), this method
	won't be supported on busses that require IRQs in order to
	interact with devices.

	This reverses the effects of bus.suspend_late().

   2	bus.resume(dev) is called next.  This may be morphed into a device
   	driver call with bus-specific parameters; implementations may sleep.

	This reverses the effects of bus.suspend().

   3	class.resume(dev) is called for devices associated with a class
	that has such a method.  Implementations may sleep.

	This reverses the effects of class.suspend(), and would usually
	reactivate the device's I/O queue.

At the end of those phases, drivers should normally be as functional as
they were before suspending:  I/O can be performed using DMA and IRQs, and
the relevant clocks are gated on.

However, the details here may again be platform-specific.  For example,
some systems support multiple "run" states, and the mode in effect at
the end of resume() might not be the one which preceded suspension.
That means availability of certain clocks or power supplies changed,
which could easily affect how a driver works.


Drivers need to be able to handle hardware which has been reset since the
suspend methods were called, for example by complete reinitialization.
This may be the hardest part, and the one most protected by NDA'd documents
and chip errata.  It's simplest if the hardware state hasn't changed since
the suspend() was called, but that can't always be guaranteed.

Drivers must also be prepared to notice that the device has been removed
while the system was powered off, whenever that's physically possible.
PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
where common Linux platforms will see such removal.  Details of how drivers
will notice and handle such removals are currently bus-specific, and often
involve a separate thread.


System Devices
--------------
System devices follow a slightly different API, which can be found in

	include/linux/sysdev.h
	drivers/base/sys.c

System devices will only be suspended with interrupts disabled, and
after all other devices have been suspended.  On resume, they will be
resumed before any other devices, and also with interrupts disabled.

That is, IRQs are disabled, the suspend_late() phase begins, then the
sysdev_driver.suspend() phase, and the system enters a sleep state.
Then the sysdev_driver.resume() phase begins, followed by the
resume_early() phase, after which IRQs are enabled.


Runtime Power Management
========================
Many devices are able to dynamically power down while the system is still
running. This feature is useful for devices that are not being used, and
can offer significant power savings on a running system; these devices
often support a range of runtime power states.  Those states will in some
cases (like PCI) be constrained by a bus the device uses, and will usually
be hardware states that are also used in system sleep states.

However, note that if a driver puts a device into a runtime low power state
and the system then goes into a system-wide sleep state, it normally ought
to resume into that runtime low power state rather than "full on".  Such
distinctions would be part of the driver-internal state machine for that
hardware; the whole point of runtime power management is to be sure that
drivers are decoupled in that way from the state machine governing phases
of the system-wide power/sleep state transitions.
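
One way to picture that driver-internal state machine is the sketch below.
It is a hypothetical userspace model, not an existing kernel interface:
the MODEL_* states and model_*() functions are invented for illustration,
and a real driver would have more states and real hardware accesses.

```c
#include <assert.h>
#include <stddef.h>

enum model_state {
	MODEL_FULL_ON,
	MODEL_RUNTIME_LOW,		/* runtime low power state */
	MODEL_SYSTEM_SUSPENDED,		/* suspended for system sleep */
};

struct model_drv {
	enum model_state state;
	enum model_state resume_to;	/* state to restore on system resume */
};

/* The driver decides on its own that the device is idle. */
static void model_runtime_idle(struct model_drv *d)
{
	assert(d != NULL);
	if (d->state == MODEL_FULL_ON)
		d->state = MODEL_RUNTIME_LOW;
}

/* A request from an upstream driver auto-resumes the device. */
static void model_request(struct model_drv *d)
{
	if (d->state == MODEL_RUNTIME_LOW)
		d->state = MODEL_FULL_ON;
}

/* System suspend remembers the runtime state it interrupted. */
static void model_system_suspend(struct model_drv *d)
{
	if (d->state != MODEL_SYSTEM_SUSPENDED) {
		d->resume_to = d->state;
		d->state = MODEL_SYSTEM_SUSPENDED;
	}
}

/* System resume returns to that runtime state, not always "full on". */
static void model_system_resume(struct model_drv *d)
{
	if (d->state == MODEL_SYSTEM_SUSPENDED)
		d->state = d->resume_to;
}
```

The remembered state lives in the driver, which is the decoupling being
described:  "system on" after resume doesn't force "device on".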


Power Saving Techniques
-----------------------
Normally runtime power management is handled by the drivers without specific
userspace or kernel intervention, by device-aware use of techniques like:

    Using information provided by other system layers
	- stay "off" except between open() and close()
	- application protocols may include power commands or hints
	- transceiver/PHY may indicate "nobody connected"

    Using fewer CPU cycles
	- using DMA instead of PIO
	- removing timers, or making them lower frequency
	- shortening "hot" code paths
	- eliminating cache misses
	- (sometimes) offloading work to device firmware

    Reducing other resource costs
	- gating off unused clocks in software (or hardware)
	- switching off unused power supplies
	- eliminating (or delaying/merging) IRQs
	- tuning DMA to use word and/or burst modes

    Using device-specific low power states
	- using lower voltages
	- avoiding needless DMA transfers

Read your hardware documentation carefully to see the opportunities that
may be available.  If you can, measure the actual power usage and check
it against the budget established for your project.


Examples:  USB hosts, system timer, system CPU
----------------------------------------------
USB host controllers make interesting, if complex, examples.  In many cases
these have no work to do:  no USB devices are connected, or all of them are
in the USB "suspend" state.  Linux host controller drivers can then disable
periodic DMA transfers that would otherwise be a constant power drain on the
memory subsystem, and enter a suspend state.  In power-aware controllers,
entering that suspend state may disable the clock used with USB signaling,
saving a certain amount of power.

The controller will be woken from that state (with an IRQ) by changes to the
signal state on the data lines of a given port, for example by an existing
peripheral requesting "remote wakeup" or by plugging a new peripheral.  The
same wakeup mechanism usually works from "standby" sleep states, and on some
systems also from "suspend to RAM" (or even "suspend to disk") states.
(Except that ACPI may be involved instead of normal IRQs, on some hardware.)

System devices like timers and CPUs may have special roles in the platform
power management scheme.  For example, system timers using a "dynamic tick"
approach don't just save CPU cycles (by eliminating needless timer IRQs),
but they may also open the door to using lower power CPU "idle" states that
cost more than a jiffy to enter and exit.  On x86 systems these are states
like "C3"; note that periodic DMA transfers from a USB host controller will
also prevent entry to a C3 state, much like a periodic timer IRQ.

That kind of runtime mechanism interaction is common.  "System On Chip" (SOC)
processors often have low power idle modes that can't be entered unless
certain medium-speed clocks (often 12 or 48 MHz) are gated off.  When the
drivers gate those clocks effectively, then the system idle task may be able
to use the lower power idle modes and thereby increase battery life.

If the CPU can have a "cpufreq" driver, there also may be opportunities
to shift to lower voltage settings and reduce the power cost of executing
a given number of instructions.  (Without voltage adjustment, it's rare
for cpufreq to save much power; the cost-per-instruction must go down.)


/sys/devices/.../power/state files
----------------------------------
For now you can also test some of this functionality using sysfs.

	DEPRECATED:  USE "power/state" ONLY FOR DRIVER TESTING.
	IT WILL BE REMOVED, AND REPLACED WITH SOMETHING WHICH
	GIVES MORE APPROPRIATE ACCESS TO THE RANGE OF POWER STATES
	EACH TYPE OF DEVICE AND BUS SUPPORTS THROUGH ITS DRIVER.

In each device's directory, there is a 'power' directory, which contains
at least a 'state' file.  The value of this field is effectively boolean,
PM_EVENT_ON or PM_EVENT_SUSPEND.

   *	Reading from this file displays a value corresponding to
	the power.power_state.event field.  All nonzero values are
	displayed as "2", corresponding to a low power state; zero
	is displayed as "0", corresponding to normal operation.

   *	Writing to this file initiates a transition using the
   	specified event code number; only '0', '2', and '3' are
	accepted (without a newline); '2' and '3' are both
	mapped to PM_EVENT_SUSPEND.

On writes, the PM core relies on that recorded event code and the device/bus
capabilities to determine whether it uses a partial suspend() or resume()
sequence to change things so that the recorded event corresponds to the
numeric parameter.

   -	If the bus requires the suspend_prepare() phase, or the
	irqs-disabled suspend_late()/resume_early() phases, writes
	fail because those operations are not supported here.

   -	If the recorded value is the expected value, nothing is done.
   
   -	If the recorded value is nonzero, the device is partially resumed,
	using the bus.resume() and/or class.resume() methods.

   -	If the target value is nonzero, the device is partially suspended,
	using the class.suspend() and/or bus.suspend() methods and the
	PM_EVENT_SUSPEND message.

Drivers have no way to tell whether their suspend() and resume() calls
have come through the sysfs power/state file or as part of entering a
system sleep state, except that when accessed through sysfs the normal
parent/child sequencing rules are ignored.  Drivers (such as bus, bridge,
or hub drivers) which expose child devices may need to enforce those rules
on their own.
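
The read and write rules for the "state" file can be sketched as follows.
This is a userspace model of the rules as listed above, not the sysfs code
itself:  "struct model_dev", the model_*() functions, the call counters,
and the choice of -EIO for rejected writes are all assumptions made for
this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_EVENT_ON		0
#define MODEL_EVENT_SUSPEND	2	/* illustrative value */

struct model_dev {
	int recorded;		/* models power.power_state.event */
	int needs_extra_phases;	/* bus needs prepare or irqs-off phases */
	int suspend_calls;	/* partial suspends issued so far */
	int resume_calls;	/* partial resumes issued so far */
};

/* Reading:  any nonzero recorded event is displayed as '2'. */
static char model_state_show(const struct model_dev *d)
{
	return d->recorded ? '2' : '0';
}

/* Writing:  only '0', '2', and '3' are accepted, and both '2' and '3'
 * map to the SUSPEND event. */
static int model_state_store(struct model_dev *d, char c)
{
	int target;

	assert(d != NULL);
	if (c == '0')
		target = MODEL_EVENT_ON;
	else if (c == '2' || c == '3')
		target = MODEL_EVENT_SUSPEND;
	else
		return -EINVAL;

	/* Phases that can't run here make the write fail (the errno
	 * is an arbitrary choice for this sketch). */
	if (d->needs_extra_phases)
		return -EIO;

	if (d->recorded == target)
		return 0;		/* already there; nothing to do */

	if (d->recorded != 0) {		/* partial resume first */
		d->resume_calls++;
		d->recorded = MODEL_EVENT_ON;
	}
	if (target != 0) {		/* then partial suspend */
		d->suspend_calls++;
		d->recorded = target;
	}
	return 0;
}
```

Note that the decision depends only on the recorded event, not on what
state the hardware is really in, which is one reason this interface is
fragile.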


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-23  1:37           ` David Brownell
@ 2006-07-23  3:59             ` Alan Stern
  2006-07-23 10:50               ` Rafael J. Wysocki
  2006-07-23 16:22               ` RFC -- " David Brownell
  0 siblings, 2 replies; 28+ messages in thread
From: Alan Stern @ 2006-07-23  3:59 UTC (permalink / raw)
  To: David Brownell; +Cc: Linux-pm mailing list

On Sat, 22 Jul 2006, David Brownell wrote:

> On Wednesday 12 July 2006 9:03 am, Alan Stern wrote:
> 
> > One oddity worth mentioning somewhere is that the PM core will never ask a
> > driver to make a transition between two states unless one of them is ON.
> 
> That's not exactly true ... in the senses that
> 
>  (a) dev->power.power_state.event is not always updated.  Some drivers do;
>      as do sysfs update codepaths (clobbering updates from those drivers).

Right.  The decision whether to invoke the driver's method depends on the 
power_state.event value, not the device's actual state.

>  (b) suspend_device() checks the recorded event, but resume_device() doesn't.
>      My guess is that's to defend against sysfs updates, since clearly
>      it'll never matter as part of a system suspend or resume sequence.

True again.  This behavior is consistent with what I said, since
resume_device() asks the driver to change to the ON state.  The assumption
appears to be that during a system resume sequence, a device will never be
resumed before its resume_device() call -- or if it is, the redundant
resume method call won't matter.

>  (c) dpm_runtime_{suspend,resume} check those recorded event codes, before
>      deciding to call {suspend,resume}_device().  This is in support of
>      that sysfs update glitch.

Well, not so much in support of that glitch as simply recognizing that the
device might already be in the desired target state.  Either as a result
of earlier calls to dpm_runtime_{suspend,resume} or as a result of
unilateral action by the driver... the reason doesn't matter.

>  (d) ON is really an event not a state (confusingly enough it can also
>      be a "message" holding that event).  OK, maybe that's a nit, but
>      it does reflect conceptual confusion in the current API.

Agreed.

> In short that's kind of a mess.  IMO the correct approach involves removing
> the dev->power.power_state thing entirely, along with the sysfs thing, but
> we can't do that quite yet.

Then what _can_ we do now?  Or better yet, what _should_ we aim towards
doing?  I'm perfectly happy to have those things removed, but what (if
anything) should take their place?

Some simple questions may help start the ball rolling.  During a system
resume, should all devices be powered on full, or should they be restored
to the state they were in before the suspend?  Or should there be a third
possibility -- maybe some devices always on, others the way they were?  
And who decides?  The driver?

For that matter, to what extent does the PM core need to be involved in
runtime power management?  As far as I can see, all the core can do is
provide centralized routines that would be widely useful.  But apart from
something resembling the current sysfs interface, I can't see what those
routines might do.

Alan Stern


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-23  3:59             ` Alan Stern
@ 2006-07-23 10:50               ` Rafael J. Wysocki
  2006-07-23 13:03                 ` Alan Stern
  2006-07-23 16:22               ` RFC -- " David Brownell
  1 sibling, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2006-07-23 10:50 UTC (permalink / raw)
  To: linux-pm; +Cc: David Brownell

On Sunday 23 July 2006 05:59, Alan Stern wrote:
> On Sat, 22 Jul 2006, David Brownell wrote:
]--snip--[
> Some simple questions may help start the ball rolling.  During a system
> resume, should all devices be powered on full,

IMHO that would be wasteful.

> or should they be restored to the state they were in before the suspend?
> Or should there be a third possibility -- maybe some devices always on,
> others the way they were?

I think the devices that were on before the suspend should stay on, because
that's what the user will expect to happen.  For the same reason the devices
that had been switched off explicitly ("by the user") before the suspend
should stay off.  For the others, the rule of thumb may be "off".

> And who decides?  The driver?

I don't think the driver should decide.  It just may have too little
information to make a proper decision.

Greetings,
Rafael


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-23 10:50               ` Rafael J. Wysocki
@ 2006-07-23 13:03                 ` Alan Stern
  2006-07-23 22:45                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2006-07-23 13:03 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: David Brownell, linux-pm

On Sun, 23 Jul 2006, Rafael J. Wysocki wrote:

> On Sunday 23 July 2006 05:59, Alan Stern wrote:
> > On Sat, 22 Jul 2006, David Brownell wrote:
> ]--snip--[
> > Some simple questions may help start the ball rolling.  During a system
> > resume, should all devices be powered on full,
> 
> IMHO that would be wasteful.
> 
> > or should they be restored to the state they were in before the suspend?
> > Or should there be a third possibility -- maybe some devices always on,
> > others the way they were?
> 
> I think the devices that were on before the suspend should stay on, because
> that's what the user will expect to happen.  For the same reason the devices
> that had been switched off explicitly ("by the user") before the suspend
> should stay off.  For the others, the rule of thumb may be "off".

To put it another way: Devices that were ON get restored to ON, and
devices that were OFF (either because the user turned them off explicitly
or because the driver decided to do its own runtime power management) 
get restored to OFF.  In other words, devices get restored to the state 
they were in before suspend.

I don't see anything wrong with that.  It does raise some more questions.

	Who should be responsible for remembering the device's state
	from before the system suspend: the PM core or the driver?
	(Note that the driver has a better idea of what that state
	actually was.)

	Upon resuming from a system suspend, is it better to tell the
	driver to "go directly to the saved state" or to turn the
	device back fully ON and then change to the saved state (if it 
	wasn't fully ON before the system suspend)?

Alan Stern


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-23  3:59             ` Alan Stern
  2006-07-23 10:50               ` Rafael J. Wysocki
@ 2006-07-23 16:22               ` David Brownell
  1 sibling, 0 replies; 28+ messages in thread
From: David Brownell @ 2006-07-23 16:22 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list

On Saturday 22 July 2006 8:59 pm, Alan Stern wrote:
> On Sat, 22 Jul 2006, David Brownell wrote:
> 
> > In short that's kind of a mess.  IMO the correct approach involves removing
> > the dev->power.power_state thing entirely, along with the sysfs thing, but
> > we can't do that quite yet.
> 
> Then what _can_ we do now?  Or better yet, what _should_ we aim towards
> doing?  I'm perfectly happy to have those things removed, but what (if
> anything) should take their place?

Remove both, replace with nothing generic ... my $US 0.02.  You will
have noticed the patch I sent to add a config option to remove the
/sys/devices/.../power/state files; that can start phasing out soon.
Removing power_state can be done over time.

Some busses could provide bus-specific replacements ... PCI and USB,
not I2C or SPI, as examples.  I can't really argue any reason to make
such a replacement though, other than for testing.


> Some simple questions may help start the ball rolling.  During a system
> resume, should all devices be powered on full, or should they be restored
> to the state they were in before the suspend? 

I'd say the answer is bus- or driver-specific, but lean towards the latter.
Though it's not clear how the PM core could tell about runtime states, since
I also think those should be driver-internal ... so how could anyone tell
the difference?

And for that matter, what is a "system resume" on systems that aren't
as simple as PCs?  E.g. when there are multiple run modes, there's
no reason to expect the post-resume mode to be the same as the pre-suspend
one and thus have e.g. the same clocks and voltages available ... neither
"all on full" nor "all on as before suspend" makes sense everywhere.


> Or should there be a third 
> possibility -- maybe some devices always on, others the way they were?  
> And who decides?  The driver?

A given system should be able to provide the answer appropriate for
its applications.  Example, if it's woken up by a given device, maybe
that's the only non-system device that _needs_ to be activated ...

 
> For that matter, to what extent does the PM core need to be involved in
> runtime power management?

Hardly at all, in my book.  As I wrote in that revised devices.txt...
see that for more info.  (That's written to reflect the status quo.)
Different problem domains can have their own hooks ... there's not a
lot of really generic stuff, since the problem domains are so varied.


> As far as I can see, all the core can do is 
> provide centralized routines that would be widely useful.  But apart from
> something resembling the current sysfs interface, I can't see what those
> routines might do.

See above ... I consider the current /sys/devices/.../power/state interface
irredeemably broken.  Which leaves nothing generic enough for the core, at
least in terms of mechanisms needed/used by Linux today. 

- Dave


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-23 13:03                 ` Alan Stern
@ 2006-07-23 22:45                   ` Rafael J. Wysocki
  2006-07-24  3:22                     ` David Brownell
  0 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2006-07-23 22:45 UTC (permalink / raw)
  To: Alan Stern; +Cc: David Brownell, linux-pm

On Sunday 23 July 2006 15:03, Alan Stern wrote:
> On Sun, 23 Jul 2006, Rafael J. Wysocki wrote:
> 
> > On Sunday 23 July 2006 05:59, Alan Stern wrote:
> > > On Sat, 22 Jul 2006, David Brownell wrote:
> > ]--snip--[
> > > Some simple questions may help start the ball rolling.  During a system
> > > resume, should all devices be powered on full,
> > 
> > IMHO that would be wasteful.
> > 
> > > or should they be restored to the state they were in before the suspend?
> > > Or should there be a third possibility -- maybe some devices always on,
> > > others the way they were?
> > 
> > I think the devices that were on before the suspend should stay on, because
> > that's what the user will expect to happen.  For the same reason the devices
> > that had been switched off explicitly ("by the user") before the suspend
> > should stay off.  For the others, the rule of thumb may be "off".
> 
> To put it another way: Devices that were ON get restored to ON, and
> devices that were OFF (either because the user turned them off explicitly
> or because the driver decided to do its own runtime power management) 
> get restored to OFF.  In other words, devices get restored to the state 
> they were in before suspend.

Yes.

> I don't see anything wrong with that.  It does raise some more questions.
> 
> 	Who should be responsible for remembering the device's state
> 	from before the system suspend: the PM core or the driver?
> 	(Note that the driver has a better idea of what that state
> 	actually was.)

It seems to be easier to leave it to the driver.  However, I think the core
needs to have at least some control over it.   Otherwise, for example,
the detection of misbehaving drivers would be quite difficult, IMHO.

> 	Upon resuming from a system suspend, is it better to tell the
> 	driver to "go directly to the saved state" or to turn the
> 	device back fully ON and then change to the saved state (if it 
> 	wasn't fully ON before the system suspend)?

I think the rule should be "go directly to the saved state", if possible.
As far as the STR is concerned, the devices that were OFF before the
suspend are likely to be OFF after the resume, so it doesn't seem to be
reasonable to turn them ON just to switch them back OFF again.

Greetings,
Rafael


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-23 22:45                   ` Rafael J. Wysocki
@ 2006-07-24  3:22                     ` David Brownell
  2006-07-24  9:46                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: David Brownell @ 2006-07-24  3:22 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm

On Sunday 23 July 2006 3:45 pm, Rafael J. Wysocki wrote:
> On Sunday 23 July 2006 15:03, Alan Stern wrote:
> > On Sun, 23 Jul 2006, Rafael J. Wysocki wrote:
> > ... In other words, devices get restored to the state 
> > they were in before suspend.
> 
> Yes.
>
> > I don't see anything wrong with that.  It does raise some more questions.
> > 
> > 	Who should be responsible for remembering the device's state
> > 	from before the system suspend: the PM core or the driver?
> > 	(Note that the driver has a better idea of what that state
> > 	actually was.)
> 
> It seems to be easier to leave it to the driver.

Just the way any true runtime pm operation would be, yes?


> However, I think the core 
> needs to have at least some control over it.   Otherwise, for example,
> the detection of misbehaving drivers would be quite difficult, IMHO.

Remember that the definition of runtime PM under discussion is:

> > >    2   While the system is running, independently of other power management
> > >        activity.  Upstream drivers will normally not know (or care) if the
> > >        device is in some low power state when issuing requests; the driver
> > >        will auto-resume anything that's needed when it gets a request.

That excludes the core having any such "control", or the ability to "detect"
such misbehavior.  (And why would it be misbehavior, when the difference is
irrelevant to upstream drivers issuing requests to that device?)  No matter
what state the device resumes into, the only architectural requirement is
that it respond as if it were fully "on".  Because runtime low power states
are exactly equivalent to "fully operational" except they use less power.
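
To make that concrete, here is a toy model in user-space C.  Every name
in it (struct rt_dev, rt_idle(), rt_submit()) is hypothetical and nothing
here is a kernel interface; it only shows that a caller submitting
requests gets the same result whether or not the driver had quietly
idled the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver-private state; not a kernel structure. */
struct rt_dev {
	bool low_power;	/* driver-internal runtime state */
	int wakeups;	/* times the driver woke the hardware */
	int completed;	/* requests serviced */
};

/* The driver decides on its own to idle the hardware. */
void rt_idle(struct rt_dev *dev)
{
	assert(dev != NULL);
	dev->low_power = true;
}

/* Entry point for upstream drivers; the power state is never exposed. */
int rt_submit(struct rt_dev *dev)
{
	assert(dev != NULL);
	if (dev->low_power) {
		dev->low_power = false;	/* auto-resume before doing work */
		dev->wakeups++;
	}
	dev->completed++;
	return 0;
}
```

The only way to see the difference from outside is the wakeups counter,
which stands in for the power measurement a real test setup would need.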


> > 	Upon resuming from a system suspend, is it better to tell the
> > 	driver to "go directly to the saved state" or to turn the
> > 	device back fully ON and then change to the saved state (if it 
> > 	wasn't fully ON before the system suspend)?
> 
> I think the rule should be "go directly to the saved state", if possible.

Agreed, though as I described above it's not something that can be
detected outside of that driver without actually measuring the power
drain that driver imposes on the system.

- Dave


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24  3:22                     ` David Brownell
@ 2006-07-24  9:46                       ` Rafael J. Wysocki
  2006-07-24 14:51                         ` Alan Stern
  0 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2006-07-24  9:46 UTC (permalink / raw)
  To: David Brownell; +Cc: linux-pm

On Monday 24 July 2006 05:22, David Brownell wrote:
> On Sunday 23 July 2006 3:45 pm, Rafael J. Wysocki wrote:
> > On Sunday 23 July 2006 15:03, Alan Stern wrote:
> > > On Sun, 23 Jul 2006, Rafael J. Wysocki wrote:
> > > ... In other words, devices get restored to the state 
> > > they were in before suspend.
> > 
> > Yes.
> >
> > > I don't see anything wrong with that.  It does raise some more questions.
> > > 
> > > 	Who should be responsible for remembering the device's state
> > > 	from before the system suspend: the PM core or the driver?
> > > 	(Note that the driver has a better idea of what that state
> > > 	actually was.)
> > 
> > It seems to be easier to leave it to the driver.
> 
> Just the way any true runtime pm operation would be, yes?

I think so.

> > However, I think the core 
> > needs to have at least some control over it.   Otherwise, for example,
> > the detection of misbehaving drivers would be quite difficult, IMHO.
> 
> Remember that the definition of runtime PM under discussion is:
> 
> > > >    2   While the system is running, independently of other power management
> > > >        activity.  Upstream drivers will normally not know (or care) if the
> > > >        device is in some low power state when issuing requests; the driver
> > > >        will auto-resume anything that's needed when it gets a request.
> 
> That excludes the core having any such "control", or the ability to "detect"
> such misbehavior.  (And why would it be misbehavior, when the difference is
> irrelevant to upstream drivers issuing requests to that device?)  No matter
> what state the device resumes into, the only architectural requirement is
> that it respond as if it were fully "on".  Because runtime low power states
> are exactly equivalent to "fully operational" except they use less power.
>
> > > 	Upon resuming from a system suspend, is it better to tell the
> > > 	driver to "go directly to the saved state" or to turn the
> > > 	device back fully ON and then change to the saved state (if it 
> > > 	wasn't fully ON before the system suspend)?
> > 
> > I think the rule should be "go directly to the saved state", if possible.
> 
> Agreed, though as I described above it's not something that can be
> detected outside of that driver without actually measuring the power
> drain that driver imposes on the system.

And that would be difficult if not impossible.

So, it looks like the core will only need to tell the driver to "resume"
with the implicit "go directly to the saved state" meaning and the
driver will actually decide what to do.

Greetings,
Rafael


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24  9:46                       ` Rafael J. Wysocki
@ 2006-07-24 14:51                         ` Alan Stern
  2006-07-24 15:15                           ` David Brownell
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2006-07-24 14:51 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: David Brownell, linux-pm

On Mon, 24 Jul 2006, Rafael J. Wysocki wrote:

> > > > 	Upon resuming from a system suspend, is it better to tell the
> > > > 	driver to "go directly to the saved state" or to turn the
> > > > 	device back fully ON and then change to the saved state (if it 
> > > > 	wasn't fully ON before the system suspend)?
> > > 
> > > I think the rule should be "go directly to the saved state", if possible.
> > 
> > Agreed, though as I described above it's not something that can be
> > detected outside of that driver without actually measuring the power
> > drain that driver imposes on the system.
> 
> And that would be difficult if not impossible.
> 
> So, it looks like the core will only need to tell the driver to "resume"
> with the implicit "go directly to the saved state" meaning and the
> driver will actually decide what to do.

And this implies that the meanings of the suspend() and resume() method
calls are different from what people might normally think.  suspend()  
doesn't really mean "Put your device into a low-power state"; it means
"The system is going into a suspend, so remember the device's current
power state and take whatever actions are appropriate".  For example, if
the device is already in a low-power state then it might be appropriate to
do nothing at all.

Likewise, resume() doesn't mean "Change your device to fully ON"; it means
"The system is waking up from a suspend, so put your device back into
whatever power state it was in before the suspend occurred".

These meanings may not be entirely consistent with the way the PM core
works now, but to me they make more sense.  To make the core consistent
with this approach wouldn't require much of a change.  Basically we just
have to rip out all the stuff referring to dev->power.power_state, which
needs to be done anyway.
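
A minimal sketch of those meanings, in user-space C with made-up names
(enum drv_power, struct drv_state, drv_suspend(), drv_resume()); it
assumes, purely for illustration, that a device already in a runtime
low-power state needs no further action for a system suspend:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver-internal power states. */
enum drv_power { DRV_FULL_ON, DRV_RUNTIME_LOW, DRV_SYSTEM_SLEEP };

struct drv_state {
	enum drv_power now;	/* state the device is in right now */
	enum drv_power saved;	/* state remembered across system sleep */
};

/* "The system is going into a suspend": remember, then act if needed. */
int drv_suspend(struct drv_state *st)
{
	assert(st != NULL);
	st->saved = st->now;
	if (st->now == DRV_FULL_ON)
		st->now = DRV_SYSTEM_SLEEP;
	/* already low-power: nothing more to do in this model */
	return 0;
}

/* "The system is waking up": go directly to the remembered state. */
int drv_resume(struct drv_state *st)
{
	assert(st != NULL);
	st->now = st->saved;
	return 0;
}
```

Note that the state is remembered by the driver, not by the core.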

Alan Stern


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24 14:51                         ` Alan Stern
@ 2006-07-24 15:15                           ` David Brownell
  2006-07-24 15:42                             ` Alan Stern
  0 siblings, 1 reply; 28+ messages in thread
From: David Brownell @ 2006-07-24 15:15 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-pm

On Monday 24 July 2006 7:51 am, Alan Stern wrote:
> On Mon, 24 Jul 2006, Rafael J. Wysocki wrote:

> > So, it looks like the core will only need to tell the driver to "resume"
> > with the implicit "go directly to the saved state" meaning and the
> > driver will actually decide what to do.
> 
> And this implies that the meanings of the suspend() and resume() method
> calls are different from what people might normally think.

No ...

> suspend()   
> doesn't really mean "Put your device into a low-power state"; it means
> "The system is going into a suspend, so remember the device's current
> power state and take whatever actions are appropriate".

Which is exactly what it means today.


> For example, if 
> the device is already in a low-power state then it might be appropriate to
> do nothing at all.

Ditto.


> Likewise, resume() doesn't mean "Change your device to fully ON"; it means
> "The system is waking up from a suspend, so put your device back into
> whatever power state it was in before the suspend occurred".

It means "put it back in a fully operational state".

And whether that state is "full on", or one of the "runtime suspend" states,
or whether it goes into "full on" and then automagically into a "runtime suspend",
doesn't matter externally.  And can't even be noticed without test setups
measuring differential power usage between different configurations...


> These meanings may not be entirely consistent with the way the PM core
> works now,

I don't believe any semantic change is being discussed here.


> but to me they make more sense.  To make the core consistent 
> with this approach wouldn't require much of a change.  Basically we just
> have to rip out all the stuff referring to dev->power.power_state, which
> needs to be done anyway.

Considering how few drivers use dev->power.power_state, it's easier to
say the problem is in the ones that use it  ... rather than the ones that
ignore it and act as I've described above! :)

- Dave



> Alan Stern
> 


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24 15:15                           ` David Brownell
@ 2006-07-24 15:42                             ` Alan Stern
  2006-07-24 17:11                               ` David Brownell
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2006-07-24 15:42 UTC (permalink / raw)
  To: David Brownell; +Cc: linux-pm

On Mon, 24 Jul 2006, David Brownell wrote:

> On Monday 24 July 2006 7:51 am, Alan Stern wrote:
> > On Mon, 24 Jul 2006, Rafael J. Wysocki wrote:
> 
> > > So, it looks like the core will only need to tell the driver to "resume"
> > > with the implicit "go directly to the saved state" meaning and the
> > > driver will actually decide what to do.
> > 
> > And this implies that the meanings of the suspend() and resume() method
> > calls are different from what people might normally think.
> 
> No ...
> 
> > suspend()   
> > doesn't really mean "Put your device into a low-power state"; it means
> > "The system is going into a suspend, so remember the device's current
> > power state and take whatever actions are appropriate".
> 
> Which is exactly what it means today.

What I wrote wasn't necessarily intended to be different from what the
methods mean today.  But it _is_ different from what people tend to think.

It's very natural for someone to imagine that suspend() means "Suspend
your device" (i.e., put it in a low-power state) and resume() means
"Resume your device" (i.e., change it to full-power).  But the real
emphasis is different; people need to know that suspend() actually means
"The _system_ is going to suspend" and resume() means "The _system_ is
resuming".

> > Likewise, resume() doesn't mean "Change your device to fully ON"; it means
> > "The system is waking up from a suspend, so put your device back into
> > whatever power state it was in before the suspend occurred".
> 
> It means "put it back in a fully operational state".
> 
> And whether that state is "full on", or one of the "runtime suspend" states,
> or whether it goes into "full on" and then automagically into a "runtime suspend",
> doesn't matter externally.  And can't even be noticed without test setups
> measuring differential power usage between different configurations...

Yes.  But there's a natural tendency to think that resume() means "Turn it 
to full power", just as there's a natural tendency to think that suspend() 
means "Turn it to minimum power".  The documentation should emphasize that 
this is not what those calls really mean.

> > These meanings may not be entirely consistent with the way the PM core
> > works now,
> 
> I don't believe any semantic change is being discussed here.

There is.  Right now suspend_device() won't call the bus's suspend method
if dev->power.power_state.event is non-zero.  But since the purpose of the
call is to inform the driver that the _system_ is going to suspend, the
call should be made regardless of the _device's_ state (which the core
shouldn't be concerned with anyway).
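
Roughly, the difference looks like this.  It is a user-space model with
hypothetical names, not the actual suspend_device() code; the bus's
suspend method is reduced to a counter and recorded_event stands in for
dev->power.power_state.event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for what the core knows about a device. */
struct core_dev {
	int recorded_event;	/* models dev->power.power_state.event */
	int bus_suspend_calls;	/* times the bus suspend method was run */
};

/* Behavior as described above: skip devices with a nonzero event. */
void model_suspend_device_current(struct core_dev *dev)
{
	assert(dev != NULL);
	if (dev->recorded_event != 0)
		return;
	dev->bus_suspend_calls++;
}

/* Suggested behavior: always tell the driver the system is suspending. */
void model_suspend_device_proposed(struct core_dev *dev)
{
	assert(dev != NULL);
	dev->bus_suspend_calls++;
}
```

With a nonzero recorded event, the first variant never tells the driver
that the system is going to sleep; the second always does.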

> Considering how few drivers use dev->power.power_state, it's easier to
> say the problem is in the ones that use it  ... rather than the ones that
> ignore it and act as I've described above! :)

I don't know to what extent there are problems in the drivers.  Not much, 
hopefully.  The real problem lies in the core.

Alan Stern


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24 15:42                             ` Alan Stern
@ 2006-07-24 17:11                               ` David Brownell
  2006-07-24 20:44                                 ` Alan Stern
  0 siblings, 1 reply; 28+ messages in thread
From: David Brownell @ 2006-07-24 17:11 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-pm

On Monday 24 July 2006 8:42 am, Alan Stern wrote:
> 
> What I wrote wasn't necessarily intended to be different from what the
> methods mean today.  But it _is_ different from what people tend to think.
> 
> It's very natural for someone to imagine that suspend() means "Suspend
> your device" (i.e., put it in a low-power state) and resume() means
> "Resume your device" (i.e., change it to full-power).  But the real
> emphasis is different; people need to know that suspend() actually means
> "The _system_ is going to suspend" and resume() means "The _system_ is
> resuming".

Good point.  I'll tweak the text to make that more explicit.  We're in
agreement it seems, and that was already written up ... but not in what
I suspect is the best place for such points.



> > > These meanings may not be entirely consistent with the way the PM core
> > > works now,
> > 
> > I don't believe any semantic change is being discussed here.
> 
> There is.  Right now suspend_device() won't call the bus's suspend method
> if dev->power.power_state.event is non-zero.  But since the purpose of the
> call is to inform the driver that the _system_ is going to suspend, the
> call should be made regardless of the _device's_ state (which the core
> shouldn't be concerned with anyway).

OK, _now_ we're discussing a semantic change.  ;)

I've added dev->power.power_state to the "this is deprecated" text, along
with the sysfs power/state file.  IMO we can't realistically make that change
(removing the "is it nonzero" test) so long as the sysfs mechanism exists.


> > Considering how few drivers use dev->power.power_state, it's easier to
> > say the problem is in the ones that use it  ... rather than the ones that
> > ignore it and act as I've described above! :)
> 
> I don't know to what extent there are problems in the drivers.  Not much, 
> hopefully.  The real problem lies in the core.

Yes the core has the problem, but drivers referencing power_state is the
workaround for that problem.

Once the sysfs mechanism goes away, there won't be much need for the mechanism.
Only callers to dpm_runtime_*() would trigger any of the troublesome paths.
The two callers are USB and PCMCIA, and I'm not sure they really need the extra
lock that's grabbed by the dpm_runtime_*() calls if there's no need to protect
against that sysfs mechanism.

- Dave


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24 17:11                               ` David Brownell
@ 2006-07-24 20:44                                 ` Alan Stern
  2006-07-24 21:19                                   ` David Brownell
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2006-07-24 20:44 UTC (permalink / raw)
  To: David Brownell; +Cc: linux-pm

On Mon, 24 Jul 2006, David Brownell wrote:

> OK, _now_ we're discussing a semantic change.  ;)
> 
> I've added dev->power.power_state to the "this is deprecated" text, along
> with the sysfs power/state file.  IMO we can't realistically make that change
> (removing the "is it nonzero" test) so long as the sysfs mechanism exists.

All right, I can accept that.  A fundamental problem is that

	echo -n 2 >/sys/devices/.../power/state

calls the driver's suspend() method, thereby telling it that the system as 
a whole is going to be suspended.  It's an unfortunate overloading of 
meanings.

Is it too early to start thinking about replacements for .../power/state
in individual subsystems?  Is that attribute file currently used anywhere
outside of PCMCIA (other than for testing)?  The sooner a safe replacement
is provided for such uses, the better.

> Once the sysfs mechanism goes away, there won't be much need for the mechanism.
> Only callers to dpm_runtime_*() would trigger any of the troublesome paths.
> The two callers are USB and PCMCIA, and I'm not sure they really need the extra
> lock that's grabbed by the dpm_runtime_*() calls if there's no need to protect
> against that sysfs mechanism.

My autosuspend patches remove the dpm_runtime_* stuff from USB, leaving 
only PCMCIA.

The "extra lock" you refer to is dpm_sem?  I'm not quite sure why it's
there at all...  Apparently all it does is attempt to prevent runtime PM
requests from being made while a system suspend/resume transition is in
progress.  But it's futile to try doing this from within the core, if 
drivers are going to be managing device states internally.  (Not to 
mention that a task waiting on dpm_sem will cause the "freeze processes" 
procedure to fail.)

And it's not clear that we _do_ want to prevent runtime PM changes from 
occurring during a system sleep transition.  If a remote-wakeup request 
arrives while the system is suspending, it ought to be legal for the 
request to succeed and thereby abort the transition.

Alan Stern


* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24 20:44                                 ` Alan Stern
@ 2006-07-24 21:19                                   ` David Brownell
  2006-07-25 15:42                                     ` Alan Stern
  2006-08-10 23:38                                     ` [patch 2.6.18-rc] " David Brownell
  0 siblings, 2 replies; 28+ messages in thread
From: David Brownell @ 2006-07-24 21:19 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-pm

On Monday 24 July 2006 1:44 pm, Alan Stern wrote:

> Is it too early to start thinking about replacements for .../power/state
> in individual subsystems? 

I don't think so.  You had one for USB, but Greg didn't want to merge it yet.

So far as I know, there is no current valid use of the .../power/state files.
If I'm missing something, that patch to make it into a (deprecated) config
option will help us learn some interesting things.


> Is that attribute file currently used anywhere 
> outside of PCMCIA (other than for testing)?  The sooner a safe replacement
> is provided for such uses, the better.

I think PCMCIA switched over to /sys/class/pcmcia_socket/.../card_pm_state
a while back, though I'm not sure about the userspace tools.  They could be
behind the times by a bit.


> > Once the sysfs mechanism goes away, there won't be much need for the mechanism.
> > Only callers to dpm_runtime_*() would trigger any of the troublesome paths.
> > The two callers are USB and PCMCIA, and I'm not sure they really need the extra
> > lock that's grabbed by the dpm_runtime_*() calls if there's no need to protect
> > against that sysfs mechanism.
> 
> My autosuspend patches remove the dpm_runtime_* stuff from USB, leaving 
> only PCMCIA.

Good, I was going to ask about that, except that I figured reading the
patches would answer the question in a better way ... ;)

I have a patch that deprecates the dpm_runtime_*() calls; it's only triggered
by USB and PCMCIA.  Not worth submitting IMO; fix the code first, then just
remove those calls since /sys/.../power/state will be the only remaining caller.

 
> The "extra lock" you refer to is dpm_sem?  I'm not quite sure why it's
> there at all...  Apparently all it does is attempt to prevent runtime PM
> requests from being made while a system suspend/resume transition is in
> progress.  But it's futile to try doing this from within the core, if 
> drivers are going to be managing device states internally.  (Not to 
> mention that a task waiting on dpm_sem will cause the "freeze processes" 
> procedure to fail.)

I thought dpm_sem was to protect the list manipulations that are currently
substituting for a more dynamic tree traversal mechanism.  (But haven't
actually looked recently...)


> And it's not clear that we _do_ want to prevent runtime PM changes from 
> occurring during a system sleep transition. 

Real runtime PM changes, of the type the core doesn't see or care about?
Of course not.  But the fake ones via sysfs?  Yes, I think we do.  We
don't want those happening in any case.


> If a remote-wakeup request  
> arrives while the system is suspending, it ought to be legal for the 
> request to succeed and thereby abort the transition.

Agreed.  Not that we'd always want to do it quite that way (surely some
hardware will cause pain!), but it should certainly be a works-well option.

- Dave

* Re: RFC -- updated Documentation/power/devices.txt
  2006-07-24 21:19                                   ` David Brownell
@ 2006-07-25 15:42                                     ` Alan Stern
  2006-08-10 23:38                                     ` [patch 2.6.18-rc] " David Brownell
  1 sibling, 0 replies; 28+ messages in thread
From: Alan Stern @ 2006-07-25 15:42 UTC (permalink / raw)
  To: Dominik Brodowski, David Brownell; +Cc: Linux-pm mailing list

On Mon, 24 Jul 2006, David Brownell wrote:

> I think PCMCIA switched over to /sys/class/pcmcia_socket/.../card_pm_state
> a while back, though I'm not sure about the userspace tools.  They could be
> behind the times by a bit.

> > My autosuspend patches remove the dpm_runtime_* stuff from USB, leaving 
> > only PCMCIA.
> 
> Good, I was going to ask about that, except that I figured reading the
> patches would answer the question in a better way ... ;)
> 
> I have a patch that deprecates the dpm_runtime_*() calls; it's only triggered
> by USB and PCMCIA.  Not worth submitting IMO; fix the code first, then just
> remove those calls since /sys/.../power/state will be the only remaining caller.

Dominik, have you been following this?  Now's a good time to remove all
references to dpm_runtime_*() from the PCMCIA code.


> I thought dpm_sem was to protect the list manipulations that are currently
> substituting for a more dynamic tree traversal mechanism.  (But haven't
> actually looked recently...)

No, you're thinking of dpm_list_sem.

> > And it's not clear that we _do_ want to prevent runtime PM changes from 
> > occurring during a system sleep transition. 
> 
> Real runtime PM changes, of the type the core doesn't see or care about?
> Of course not.  But the fake ones via sysfs?  Yes, I think we do.  We
> don't want those happening in any case.

While userspace is frozen, requests can't arrive through sysfs.  So the 
only time dpm_sem might matter would be when someone on PPC uses the 
soon-to-be-deprecated sysfs mechanism during suspend-to-RAM!

Alan Stern

* [patch 2.6.18-rc] updated Documentation/power/devices.txt
  2006-07-24 21:19                                   ` David Brownell
  2006-07-25 15:42                                     ` Alan Stern
@ 2006-08-10 23:38                                     ` David Brownell
  1 sibling, 0 replies; 28+ messages in thread
From: David Brownell @ 2006-08-10 23:38 UTC (permalink / raw)
  To: linux-pm

Here's the updated power/devices text, matching the updates now
in the MM tree (queued for 2.6.19) plus a few pending patches
(including "remove unusable/incomplete prepare_suspend").

Greg, please merge.

- Dave

This turned into a rewrite of Documentation/power/devices.txt:

 - Provide more of the "big picture"

 - Fixup some of the horribly ancient/obsolete description of device suspend()
   semantics; lots of text just got deleted.

 - Add a decent description of PM_EVENT_* codes, including the new PRETHAW code
   needed in some swsusp scenarios.

 - Describe the new PM factorization from Linus:
     * class suspend, current suspend, then suspend_late
     * NOT suspend_prepare, it wasn't really usable
     * resume_early, current resume, class resume.

 - Update power/state docs to be correct, and deprecate its usage except for
   driver testing.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>

Index: g26/Documentation/power/devices.txt
===================================================================
--- g26.orig/Documentation/power/devices.txt	2006-08-04 10:23:41.000000000 -0700
+++ g26/Documentation/power/devices.txt	2006-08-04 10:33:04.000000000 -0700
@@ -1,208 +1,553 @@
+Most of the code in Linux is device drivers, so most of the Linux power
+management code is also driver-specific.  Most drivers will do very little;
+others, especially for platforms with small batteries (like cell phones),
+will do a lot.
+
+This writeup gives an overview of how drivers interact with system-wide
+power management goals, emphasizing the models and interfaces that are
+shared by everything that hooks up to the driver model core.  Read it as
+background for the domain-specific work you'd do with any specific driver.
+
+
+Two Models for Device Power Management
+======================================
+Drivers will use one or both of these models to put devices into low-power
+states:
+
+    System Sleep model:
+	Drivers can enter low power states as part of entering system-wide
+	low-power states like "suspend-to-ram", or (mostly for systems with
+	disks) "hibernate" (suspend-to-disk).
+
+	This is something that device, bus, and class drivers collaborate on
+	by implementing various role-specific suspend and resume methods to
+	cleanly power down hardware and software subsystems, then reactivate
+	them without loss of data.
+
+	Some drivers can manage hardware wakeup events, which make the system
+	leave that low-power state.  This feature may be disabled using the
+	relevant /sys/devices/.../power/wakeup file; enabling it may cost some
+	power usage, but let the whole system enter low power states more often.
+
+    Runtime Power Management model:
+	Drivers may also enter low power states while the system is running,
+	independently of other power management activity.  Upstream drivers
+	will normally not know (or care) if the device is in some low power
+	state when issuing requests; the driver will auto-resume anything
+	that's needed when it gets a request.
+
+	This doesn't have, or need, much infrastructure; it's just something you
+	should do when writing your drivers.  For example, clk_disable() unused
+	clocks as part of minimizing power drain for currently-unused hardware.
+	Of course, sometimes clusters of drivers will collaborate with each
+	other, which could involve task-specific power management.
+
+There's not a lot to be said about those low power states except that they
+are very system-specific, and often device-specific.  Also, that if enough
+drivers put themselves into low power states (at "runtime"), the effect may be
+the same as entering some system-wide low-power state (system sleep) ... and
+that synergies exist, so that several drivers using runtime pm might put the
+system into a state where even deeper power saving options are available.
+
+Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no
+more data read or written, and requests from upstream drivers are no longer
+accepted.  A given bus or platform may have different requirements though.
+
+Examples of hardware wakeup events include an alarm from a real time clock,
+network wake-on-LAN packets, keyboard or mouse activity, and media insertion
+or removal (for PCMCIA, MMC/SD, USB, and so on).
+
+
+Interfaces for Entering System Sleep States
+===========================================
+Most of the programming interfaces a device driver needs to know about
+relate to that first model:  entering a system-wide low power state,
+rather than just minimizing power consumption by one device.
+
+
+Bus Driver Methods
+------------------
+The core methods to suspend and resume devices reside in struct bus_type.
+These are mostly of interest to people writing infrastructure for busses
+like PCI or USB, or because they define the primitives that device drivers
+may need to apply in domain-specific ways to their devices:
 
-Device Power Management
+struct bus_type {
+	...
+	int  (*suspend)(struct device *dev, pm_message_t state);
+	int  (*suspend_late)(struct device *dev, pm_message_t state);
+
+	int  (*resume_early)(struct device *dev);
+	int  (*resume)(struct device *dev);
+};
 
+Bus drivers implement those methods as appropriate for the hardware and
+the drivers using it; PCI works differently from USB, and so on.  Not many
+people write bus drivers; most driver code is a "device driver" that
+builds on top of bus-specific framework code.
+
+For more information on these driver calls, see the description later;
+they are called in phases for every device, respecting the parent-child
+sequencing in the driver model tree.  Note that as this is being written,
+only the suspend() and resume() are widely available; not many bus drivers
+leverage all of those phases, or pass them down to lower driver levels.
+
+
+/sys/devices/.../power/wakeup files
+-----------------------------------
+All devices in the driver model have two flags to control handling of
+wakeup events, which are hardware signals that can force the device and/or
+system out of a low power state.  These are initialized by bus or device
+driver code using device_init_wakeup(dev,can_wakeup).
+
+The "can_wakeup" flag just records whether the device (and its driver) can
+physically support wakeup events.  When that flag is clear, the sysfs
+"wakeup" file is empty, and device_may_wakeup() returns false.
+
+For devices that can issue wakeup events, a separate flag controls whether
+that device should try to use its wakeup mechanism.  The initial value of
+device_may_wakeup() will be true, so that the device's "wakeup" file holds
+the value "enabled".  Userspace can change that to "disabled" so that
+device_may_wakeup() returns false; or change it back to "enabled" (so that
+it returns true again).
+
+
+EXAMPLE:  PCI Device Driver Methods
+-----------------------------------
+PCI framework software calls these methods when the PCI device driver bound
+to a device has provided them:
+
+struct pci_driver {
+	...
+	int  (*suspend)(struct pci_dev *pdev, pm_message_t state);
+	int  (*suspend_late)(struct pci_dev *pdev, pm_message_t state);
 
-Device power management encompasses two areas - the ability to save
-state and transition a device to a low-power state when the system is
-entering a low-power state; and the ability to transition a device to
-a low-power state while the system is running (and independently of
-any other power management activity). 
+	int  (*resume_early)(struct pci_dev *pdev);
+	int  (*resume)(struct pci_dev *pdev);
+};
 
+Drivers will implement those methods, and call PCI-specific procedures
+like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and
+pci_restore_state() to manage PCI-specific mechanisms.  (PCI config space
+could be saved during driver probe, if it weren't for the fact that some
+systems rely on userspace tweaking using setpci.)  Devices are suspended
+before their bridges enter low power states, and likewise bridges resume
+before their devices.
+
+
+Upper Layers of Driver Stacks
+-----------------------------
+Device drivers generally have at least two interfaces, and the methods
+sketched above are the ones which apply to the lower level (nearer PCI, USB,
+or other bus hardware).  The network and block layers are examples of upper
+level interfaces, as is a character device talking to userspace.
+
+Power management requests normally need to flow through those upper levels,
+which often use domain-oriented requests like "blank that screen".  In
+some cases those upper levels will have power management intelligence that
+relates to end-user activity, or other devices that work in cooperation.
+
+When those interfaces are structured using class interfaces, there is a
+standard way to have the upper layer stop issuing requests to a given
+class device (and restart later):
+
+struct class {
+	...
+	int  (*suspend)(struct device *dev, pm_message_t state);
+	int  (*resume)(struct device *dev);
+};
 
-Methods
+Those calls are issued in specific phases of the process by which the
+system enters a low power "suspend" state, or resumes from it.
 
-The methods to suspend and resume devices reside in struct bus_type: 
 
-struct bus_type {
-       ...
-       int             (*suspend)(struct device * dev, pm_message_t state);
-       int             (*resume)(struct device * dev);
-};
+Calling Drivers to Enter System Sleep States
+============================================
+When the system enters a low power state, each device's driver is asked
+to suspend the device by putting it into a state compatible with the target
+system state.  That's usually some version of "off", but the details are
+system-specific.  Also, wakeup-enabled devices will usually stay partly
+functional in order to wake the system.
+
+When the system leaves that low power state, the device's driver is asked
+to resume it.  The suspend and resume operations always go together, and
+both are multi-phase operations.
+
+For simple drivers, suspend might quiesce the device using the class code
+and then turn its hardware as "off" as possible with suspend_late.  The
+matching resume calls would then completely reinitialize the hardware
+before reactivating its class I/O queues.
+
+More power-aware drivers will use more than one device low power
+state, either at runtime or during system sleep states, and might trigger
+system wakeup events.
+
+
+Call Sequence Guarantees
+------------------------
+To ensure that bridges and similar links needed to talk to a device are
+available when the device is suspended or resumed, the device tree is
+walked in a bottom-up order to suspend devices.  A top-down order is
+used to resume those devices.
+
+The ordering of the device tree is defined by the order in which devices
+get registered:  a child can never be registered, probed or resumed before
+its parent; and can't be removed or suspended after that parent.
+
+The policy is that the device tree should match hardware bus topology.
+(Or at least the control bus, for devices which use multiple busses.)
+
+
+Suspending Devices
+------------------
+Suspending a given device is done in several phases.  Suspending the
+system always includes every phase, executing calls for every device
+before the next phase begins.  Not all busses or classes support all
+these callbacks; and not all drivers use all the callbacks.
+
+The phases are seen by driver notifications issued in this order:
+
+   1	class.suspend(dev, message) is called after tasks are frozen, for
+	devices associated with a class that has such a method.  This
+	method may sleep.
+
+	Since I/O activity usually comes from such higher layers, this is
+	a good place to quiesce all drivers of a given type (and keep such
+	code out of those drivers).
+
+   2	bus.suspend(dev, message) is called next.  This method may sleep,
+	and is often morphed into a device driver call with bus-specific
+	parameters and/or rules.
+
+	This call should handle parts of device suspend logic that require
+	sleeping.  It probably does work to quiesce the device which hasn't
+	been abstracted into class.suspend() or bus.suspend_late().
+
+   3	bus.suspend_late(dev, message) is called with IRQs disabled, and
+	with only one CPU active.  Until the bus.resume_early() phase
+	completes (see later), IRQs are not enabled again.  This method
+	won't be exposed by all busses; for message based busses like USB,
+	I2C, or SPI, device interactions normally require IRQs.  This bus
+	call may be morphed into a driver call with bus-specific parameters.
+
+	This call might save low level hardware state that might otherwise
+	be lost in the upcoming low power state, and actually put the
+	device into a low power state ... so that in some cases the device
+	may stay partly usable until this late.  This "late" call may also
+	help when coping with hardware that behaves badly.
+
+The pm_message_t parameter is currently used to refine those semantics
+(described later).
+
+At the end of those phases, drivers should normally have stopped all I/O
+transactions (DMA, IRQs), saved enough state that they can re-initialize
+or restore previous state (as needed by the hardware), and placed the
+device into a low-power state.  On many platforms they will also use
+clk_disable() to gate off one or more clock sources; sometimes they will
+also switch off power supplies, or reduce voltages.  Drivers which have
+runtime PM support may already have performed some or all of the steps
+needed to prepare for the upcoming system sleep state.
+
+When any driver sees that device_may_wakeup(dev) is true, it should make sure
+to use the relevant hardware signals to trigger a system wakeup event.
+For example, enable_irq_wake() might identify GPIO signals hooked up to
+a switch or other external hardware, and pci_enable_wake() does something
+similar for PCI's PME# signal.
+
+If a driver (or bus, or class) fails its suspend method, the system won't
+enter the desired low power state; it will resume all the devices it's
+suspended so far.
+
+Note that drivers may need to perform different actions based on the target
+system lowpower/sleep state.  At this writing, there are only platform
+specific APIs through which drivers could determine those target states.
+
+
+Device Low Power (suspend) States
+---------------------------------
+Device low-power states aren't very standard.  One device might only handle
+"on" and "off", while another might support a dozen different versions of
+"on" (how many engines are active?), plus a state that gets back to "on"
+faster than from a full "off".
+
+Some busses define rules about what different suspend states mean.  PCI
+gives one example:  after the suspend sequence completes, a non-legacy
+PCI device may not perform DMA or issue IRQs, and any wakeup events it
+issues would be issued through the PME# bus signal.  Plus, there are
+several PCI-standard device states, some of which are optional.
+
+In contrast, integrated system-on-chip processors often use irqs as the
+wakeup event sources (so drivers would call enable_irq_wake) and might
+be able to treat DMA completion as a wakeup event (sometimes DMA can stay
+active too, it'd only be the CPU and some peripherals that sleep).
+
+Some details here may be platform-specific.  Systems may have devices that
+can be fully active in certain sleep states, such as an LCD display that's
+refreshed using DMA while most of the system is sleeping lightly ... and
+its frame buffer might even be updated by a DSP or other non-Linux CPU while
+the Linux control processor stays idle.
+
+Moreover, the specific actions taken may depend on the target system state.
+One target system state might allow a given device to be very operational;
+another might require a hard shut down with re-initialization on resume.
+And two different target systems might use the same device in different
+ways; the aforementioned LCD might be active in one product's "standby",
+but a different product using the same SOC might work differently.
+
+
+Meaning of pm_message_t.event
+-----------------------------
+Parameters to suspend calls include the device affected and a message of
+type pm_message_t, which has one field:  the event.  If a driver does not
+recognize the event code, suspend calls may abort the request and return
+a negative errno.  However, most drivers will be fine if they implement
+PM_EVENT_SUSPEND semantics for all messages.
+
+The event codes are used to refine the goal of suspending the device, and
+mostly matter when creating or resuming system memory image snapshots, as
+used with suspend-to-disk:
+
+    PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power
+	state.  When used with system sleep states like "suspend-to-RAM" or
+	"standby", the upcoming resume() call will often be able to rely on
+	state kept in hardware, or issue system wakeup events.  When used
+	instead with suspend-to-disk, few devices support this capability;
+	most are completely powered off.
+
+    PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into
+	any low power mode.  A system snapshot is about to be taken, often
+	followed by a call to the driver's resume() method.  Neither wakeup
+	events nor DMA are allowed.
+
+    PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume()
+	will restore a suspend-to-disk snapshot from a different kernel image.
+	Drivers that are smart enough to look at their hardware state during
+	resume() processing need that state to be correct ... a PRETHAW could
+	be used to invalidate that state (by resetting the device), like a
+	shutdown() invocation would before a kexec() or system halt.  Other
+	drivers might handle this the same way as PM_EVENT_FREEZE.  Neither
+	wakeup events nor DMA are allowed.
+
+To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or
+the similarly named APM states, only PM_EVENT_SUSPEND is used; for "Suspend
+to Disk" (STD, hibernate, ACPI S4), all of those event codes are used.
+
+There's also PM_EVENT_ON, a value which never appears as a suspend event
+but is sometimes used to record the "not suspended" device state.
+
+
+Resuming Devices
+----------------
+Resuming is done in multiple phases, much like suspending, with all
+devices processing each phase's calls before the next phase begins.
+
+The phases are seen by driver notifications issued in this order:
+
+   1	bus.resume_early(dev) is called with IRQs disabled, and with
+   	only one CPU active.  As with bus.suspend_late(), this method
+	won't be supported on busses that require IRQs in order to
+	interact with devices.
+
+	This reverses the effects of bus.suspend_late().
+
+   2	bus.resume(dev) is called next.  This may be morphed into a device
+   	driver call with bus-specific parameters; implementations may sleep.
+
+	This reverses the effects of bus.suspend().
+
+   3	class.resume(dev) is called for devices associated with a class
+	that has such a method.  Implementations may sleep.
+
+	This reverses the effects of class.suspend(), and would usually
+	reactivate the device's I/O queue.
+
+At the end of those phases, drivers should normally be as functional as
+they were before suspending:  I/O can be performed using DMA and IRQs, and
+the relevant clocks are gated on.  The device need not be "fully on"; it
+might be in a runtime lowpower/suspend state that acts as if it were.
+
+However, the details here may again be platform-specific.  For example,
+some systems support multiple "run" states, and the mode in effect at
+the end of resume() might not be the one which preceded suspension.
+That means availability of certain clocks or power supplies changed,
+which could easily affect how a driver works.
+
+
+Drivers need to be able to handle hardware which has been reset since the
+suspend methods were called, for example by complete reinitialization.
+This may be the hardest part, and the one most protected by NDA'd documents
+and chip errata.  It's simplest if the hardware state hasn't changed since
+the suspend() was called, but that can't always be guaranteed.
+
+Drivers must also be prepared to notice that the device has been removed
+while the system was powered off, whenever that's physically possible.
+PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
+where common Linux platforms will see such removal.  Details of how drivers
+will notice and handle such removals are currently bus-specific, and often
+involve a separate thread.
+
+
+Note that the bus-specific runtime PM wakeup mechanism can exist, and might
+be defined to share some of the same driver code as for system wakeup.  For
+example, a bus-specific device driver's resume() method might be used there,
+so it wouldn't only be called from bus.resume() during system-wide wakeup.
+See bus-specific information about how runtime wakeup events are handled.
 
-Each bus driver is responsible implementing these methods, translating
-the call into a bus-specific request and forwarding the call to the
-bus-specific drivers. For example, PCI drivers implement suspend() and
-resume() methods in struct pci_driver. The PCI core is simply
-responsible for translating the pointers to PCI-specific ones and
-calling the low-level driver.
-
-This is done to a) ease transition to the new power management methods
-and leverage the existing PM code in various bus drivers; b) allow
-buses to implement generic and default PM routines for devices, and c)
-make the flow of execution obvious to the reader. 
-
-
-System Power Management
-
-When the system enters a low-power state, the device tree is walked in
-a depth-first fashion to transition each device into a low-power
-state. The ordering of the device tree is guaranteed by the order in
-which devices get registered - children are never registered before
-their ancestors, and devices are placed at the back of the list when
-registered. By walking the list in reverse order, we are guaranteed to
-suspend devices in the proper order. 
-
-Devices are suspended once with interrupts enabled. Drivers are
-expected to stop I/O transactions, save device state, and place the
-device into a low-power state. Drivers may sleep, allocate memory,
-etc. at will. 
-
-Some devices are broken and will inevitably have problems powering
-down or disabling themselves with interrupts enabled. For these
-special cases, they may return -EAGAIN. This will put the device on a
-list to be taken care of later. When interrupts are disabled, before
-we enter the low-power state, their drivers are called again to put
-their device to sleep. 
-
-On resume, the devices that returned -EAGAIN will be called to power
-themselves back on with interrupts disabled. Once interrupts have been
-re-enabled, the rest of the drivers will be called to resume their
-devices. On resume, a driver is responsible for powering back on each
-device, restoring state, and re-enabling I/O transactions for that
-device. 
 
+System Devices
+--------------
 System devices follow a slightly different API, which can be found in
 
 	include/linux/sysdev.h
 	drivers/base/sys.c
 
-System devices will only be suspended with interrupts disabled, and
-after all other devices have been suspended. On resume, they will be
-resumed before any other devices, and also with interrupts disabled.
+System devices will only be suspended with interrupts disabled, and after
+all other devices have been suspended.  On resume, they will be resumed
+before any other devices, and also with interrupts disabled.
+
+That is, IRQs are disabled, the suspend_late() phase begins, then the
+sysdev_driver.suspend() phase, and the system enters a sleep state.  Then
+the sysdev_driver.resume() phase begins, followed by the resume_early()
+phase, after which IRQs are enabled.
+
+Code to actually enter and exit the system-wide low power state sometimes
+involves hardware details that are only known to the boot firmware, and
+may leave a CPU running software (from SRAM or flash memory) that monitors
+the system and manages its wakeup sequence.
 
 
 Runtime Power Management
-
-Many devices are able to dynamically power down while the system is
-still running. This feature is useful for devices that are not being
-used, and can offer significant power savings on a running system. 
-
-In each device's directory, there is a 'power' directory, which
-contains at least a 'state' file. Reading from this file displays what
-power state the device is currently in. Writing to this file initiates
-a transition to the specified power state, which must be a decimal in
-the range 1-3, inclusive; or 0 for 'On'.
-
-The PM core will call the ->suspend() method in the bus_type object
-that the device belongs to if the specified state is not 0, or
-->resume() if it is. 
-
-Nothing will happen if the specified state is the same state the
-device is currently in. 
-
-If the device is already in a low-power state, and the specified state
-is another, but different, low-power state, the ->resume() method will
-first be called to power the device back on, then ->suspend() will be
-called again with the new state. 
-
-The driver is responsible for saving the working state of the device
-and putting it into the low-power state specified. If this was
-successful, it returns 0, and the device's power_state field is
-updated. 
-
-The driver must take care to know whether or not it is able to
-properly resume the device, including all step of reinitialization
-necessary. (This is the hardest part, and the one most protected by
-NDA'd documents). 
-
-The driver must also take care not to suspend a device that is
-currently in use. It is their responsibility to provide their own
-exclusion mechanisms.
-
-The runtime power transition happens with interrupts enabled. If a
-device cannot support being powered down with interrupts, it may
-return -EAGAIN (as it would during a system power management
-transition),  but it will _not_ be called again, and the transaction
-will fail.
-
-There is currently no way to know what states a device or driver
-supports a priori. This will change in the future. 
-
-pm_message_t meaning
-
-pm_message_t has two fields. event ("major"), and flags.  If driver
-does not know event code, it aborts the request, returning error. Some
-drivers may need to deal with special cases based on the actual type
-of suspend operation being done at the system level. This is why
-there are flags.
-
-Event codes are:
-
-ON -- no need to do anything except special cases like broken
-HW.
-
-# NOTIFICATION -- pretty much same as ON?
-
-FREEZE -- stop DMA and interrupts, and be prepared to reinit HW from
-scratch. That probably means stop accepting upstream requests, the
-actual policy of what to do with them being specific to a given
-driver. It's acceptable for a network driver to just drop packets
-while a block driver is expected to block the queue so no request is
-lost. (Use IDE as an example on how to do that). FREEZE requires no
-power state change, and it's expected for drivers to be able to
-quickly transition back to operating state.
-
-SUSPEND -- like FREEZE, but also put hardware into low-power state. If
-there's need to distinguish several levels of sleep, additional flag
-is probably best way to do that.
-
-Transitions are only from a resumed state to a suspended state, never
-between 2 suspended states. (ON -> FREEZE or ON -> SUSPEND can happen,
-FREEZE -> SUSPEND or SUSPEND -> FREEZE can not).
-
-All events are:
-
-[NOTE NOTE NOTE: If you are driver author, you should not care; you
-should only look at event, and ignore flags.]
-
-#Prepare for suspend -- userland is still running but we are going to
-#enter suspend state. This gives drivers chance to load firmware from
-#disk and store it in memory, or do other activities taht require
-#operating userland, ability to kmalloc GFP_KERNEL, etc... All of these
-#are forbiden once the suspend dance is started.. event = ON, flags =
-#PREPARE_TO_SUSPEND
-
-Apm standby -- prepare for APM event. Quiesce devices to make life
-easier for APM BIOS. event = FREEZE, flags = APM_STANDBY
-
-Apm suspend -- same as APM_STANDBY, but it we should probably avoid
-spinning down disks. event = FREEZE, flags = APM_SUSPEND
-
-System halt, reboot -- quiesce devices to make life easier for BIOS. event
-= FREEZE, flags = SYSTEM_HALT or SYSTEM_REBOOT
-
-System shutdown -- at least disks need to be spun down, or data may be
-lost. Quiesce devices, just to make life easier for BIOS. event =
-FREEZE, flags = SYSTEM_SHUTDOWN
-
-Kexec    -- turn off DMAs and put hardware into some state where new
-kernel can take over. event = FREEZE, flags = KEXEC
-
-Powerdown at end of swsusp -- very similar to SYSTEM_SHUTDOWN, except wake
-may need to be enabled on some devices. This actually has at least 3
-subtypes, system can reboot, enter S4 and enter S5 at the end of
-swsusp. event = FREEZE, flags = SWSUSP and one of SYSTEM_REBOOT,
-SYSTEM_SHUTDOWN, SYSTEM_S4
-
-Suspend to ram  -- put devices into low power state. event = SUSPEND,
-flags = SUSPEND_TO_RAM
-
-Freeze for swsusp snapshot -- stop DMA and interrupts. No need to put
-devices into low power mode, but you must be able to reinitialize
-device from scratch in resume method. This has two flavors, its done
-once on suspending kernel, once on resuming kernel. event = FREEZE,
-flags = DURING_SUSPEND or DURING_RESUME
-
-Device detach requested from /sys -- deinitialize device; proably same as
-SYSTEM_SHUTDOWN, I do not understand this one too much. probably event
-= FREEZE, flags = DEV_DETACH.
-
-#These are not really events sent:
-#
-#System fully on -- device is working normally; this is probably never
-#passed to suspend() method... event = ON, flags = 0
-#
-#Ready after resume -- userland is now running, again. Time to free any
-#memory you ate during prepare to suspend... event = ON, flags =
-#READY_AFTER_RESUME
-#
+========================
+Many devices are able to dynamically power down while the system is still
+running. This feature is useful for devices that are not being used, and
+can offer significant power savings on a running system.  These devices
+often support a range of runtime power states, which might use names such
+as "off", "sleep", "idle", "active", and so on.  Those states will in some
+cases (like PCI) be partially constrained by a bus the device uses, and will
+usually include hardware states that are also used in system sleep states.
+
+However, note that if a driver puts a device into a runtime low power state
+and the system then goes into a system-wide sleep state, the device normally
+ought to resume into that runtime low power state rather than "full on".  Such
+distinctions would be part of the driver-internal state machine for that
+hardware; the whole point of runtime power management is to be sure that
+drivers are decoupled in that way from the state machine governing phases
+of the system-wide power/sleep state transitions.
+
+
+Power Saving Techniques
+-----------------------
+Normally runtime power management is handled by the drivers without specific
+userspace or kernel intervention, by device-aware use of techniques like:
+
+    Using information provided by other system layers
+	- stay deeply "off" except between open() and close()
+	- if transceiver/PHY indicates "nobody connected", stay "off"
+	- application protocols may include power commands or hints
+
+    Using fewer CPU cycles
+	- using DMA instead of PIO
+	- removing timers, or making them lower frequency
+	- shortening "hot" code paths
+	- eliminating cache misses
+	- (sometimes) offloading work to device firmware
+
+    Reducing other resource costs
+	- gating off unused clocks in software (or hardware)
+	- switching off unused power supplies
+	- eliminating (or delaying/merging) IRQs
+	- tuning DMA to use word and/or burst modes
+
+    Using device-specific low power states
+	- using lower voltages
+	- avoiding needless DMA transfers
+
+Read your hardware documentation carefully to see the opportunities that
+may be available.  If you can, measure the actual power usage and check
+it against the budget established for your project.
+
+
+Examples:  USB hosts, system timer, system CPU
+----------------------------------------------
+USB host controllers make interesting, if complex, examples.  In many cases
+these have no work to do:  no USB devices are connected, or all of them are
+in the USB "suspend" state.  Linux host controller drivers can then disable
+periodic DMA transfers that would otherwise be a constant power drain on the
+memory subsystem, and enter a suspend state.  In power-aware controllers,
+entering that suspend state may disable the clock used with USB signaling,
+saving a certain amount of power.
+
+The controller will be woken from that state (with an IRQ) by changes to the
+signal state on the data lines of a given port, for example by an existing
+peripheral requesting "remote wakeup" or by plugging in a new peripheral.
+The same wakeup mechanism usually works from "standby" sleep states, and on
+some systems also from "suspend to RAM" (or even "suspend to disk") states.
+(Except that ACPI may be involved instead of normal IRQs, on some hardware.)
+
+System devices like timers and CPUs may have special roles in the platform
+power management scheme.  For example, system timers using a "dynamic tick"
+approach don't just save CPU cycles (by eliminating needless timer IRQs),
+but they may also open the door to using lower power CPU "idle" states that
+cost more than a jiffy to enter and exit.  On x86 systems these are states
+like "C3"; note that periodic DMA transfers from a USB host controller will
+also prevent entry to a C3 state, much like a periodic timer IRQ.
+
+That kind of runtime mechanism interaction is common.  "System On Chip" (SOC)
+processors often have low power idle modes that can't be entered unless
+certain medium-speed clocks (often 12 or 48 MHz) are gated off.  When the
+drivers gate those clocks effectively, then the system idle task may be able
+to use the lower power idle modes and thereby increase battery life.
+
+If the CPU can have a "cpufreq" driver, there also may be opportunities
+to shift to lower voltage settings and reduce the power cost of executing
+a given number of instructions.  (Without voltage adjustment, it's rare
+for cpufreq to save much power; the cost-per-instruction must go down.)
+
+
+/sys/devices/.../power/state files
+==================================
+For now you can also test some of this functionality using sysfs.
+
+	DEPRECATED:  USE "power/state" ONLY FOR DRIVER TESTING, AND
+	AVOID USING dev->power.power_state IN DRIVERS.
+
+	THESE WILL BE REMOVED.  IF THE "power/state" FILE GETS REPLACED,
+	IT WILL BECOME SOMETHING COUPLED TO THE BUS OR DRIVER.
+
+In each device's directory, there is a 'power' directory, which contains
+at least a 'state' file.  The value of this file is effectively boolean,
+PM_EVENT_ON or PM_EVENT_SUSPEND.
+
+   *	Reading from this file displays a value corresponding to
+	the power.power_state.event field.  All nonzero values are
+	displayed as "2", corresponding to a low power state; zero
+	is displayed as "0", corresponding to normal operation.
+
+   *	Writing to this file initiates a transition using the
+   	specified event code number; only '0', '2', and '3' are
+	accepted (without a newline); '2' and '3' are both
+	mapped to PM_EVENT_SUSPEND.
+
+On writes, the PM core relies on that recorded event code and the device/bus
+capabilities to determine whether it uses a partial suspend() or resume()
+sequence to change things so that the recorded event corresponds to the
+numeric parameter.
+
+   -	If the bus requires the irqs-disabled suspend_late()/resume_early()
+	phases, writes fail because those operations are not supported here.
+
+   -	If the recorded value is the expected value, nothing is done.
+   
+   -	If the recorded value is nonzero, the device is partially resumed,
+	using the bus.resume() and/or class.resume() methods.
+
+   -	If the target value is nonzero, the device is partially suspended,
+	using the class.suspend() and/or bus.suspend() methods and the
+	PM_EVENT_SUSPEND message.
+
+Drivers have no way to tell whether their suspend() and resume() calls
+have come through the sysfs power/state file or as part of entering a
+system sleep state, except that when accessed through sysfs the normal
+parent/child sequencing rules are ignored.  Drivers (such as bus, bridge,
+or hub drivers) which expose child devices may need to enforce those rules
+on their own.
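
As an illustration of the power/state write rules listed above, here is a
minimal userspace sketch in C.  It is not the driver core code: the names
model_dev, model_state_write(), model_state_read() and the ACT_* flags are
hypothetical, and the numeric event values are assumptions chosen to match
the "0" and "2" strings described above.  It only models which partial
resume and suspend steps the text says a write would trigger.

```c
#include <stddef.h>

/* Assumed values, chosen to match the "0" and "2" strings shown on reads. */
enum { PM_EVENT_ON = 0, PM_EVENT_SUSPEND = 2 };

/* Hypothetical result flags; -1 means the write was rejected. */
enum { ACT_REJECT = -1, ACT_NONE = 0, ACT_RESUME = 1, ACT_SUSPEND = 2 };

/* Hypothetical stand-in for the per-device state the text refers to. */
struct model_dev {
	int recorded_event;		/* models power.power_state.event */
	int bus_needs_late_phases;	/* bus uses suspend_late()/resume_early() */
};

/* Model a write of len bytes from buf to the power/state file. */
static int model_state_write(struct model_dev *dev, const char *buf, size_t len)
{
	int target;
	int actions = ACT_NONE;

	/* Incomplete sequences are refused if irqs-disabled phases are needed. */
	if (dev->bus_needs_late_phases)
		return ACT_REJECT;

	/* Only a single character '0', '2' or '3' is accepted, no newline. */
	if (len != 1)
		return ACT_REJECT;
	switch (buf[0]) {
	case '0':
		target = PM_EVENT_ON;
		break;
	case '2':
	case '3':
		target = PM_EVENT_SUSPEND;	/* both map to the same event */
		break;
	default:
		return ACT_REJECT;
	}

	/* Recorded value already matches the target: nothing is done. */
	if (dev->recorded_event == target)
		return ACT_NONE;

	/* Nonzero recorded value: partial resume (bus and/or class resume). */
	if (dev->recorded_event != PM_EVENT_ON)
		actions |= ACT_RESUME;

	/* Nonzero target: partial suspend with the PM_EVENT_SUSPEND message. */
	if (target != PM_EVENT_ON)
		actions |= ACT_SUSPEND;

	dev->recorded_event = target;
	return actions;
}

/* Model a read: every nonzero recorded event is displayed as "2". */
static const char *model_state_read(const struct model_dev *dev)
{
	return dev->recorded_event != PM_EVENT_ON ? "2" : "0";
}
```

In this model, writing "3" to a device recorded as "on" returns ACT_SUSPEND
and a later read shows "2"; writing "2" again returns ACT_NONE, and writing
"0" returns ACT_RESUME.  A write of "2\n" is rejected because of the newline.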
