Power Management framework proposal

* Power Management framework proposal
@ 2007-07-22  6:49 david
  2007-07-22  7:57 ` [linux-pm] " Igor Stoppa
  2007-07-22 17:26 ` Arjan van de Ven
  0 siblings, 2 replies; 40+ messages in thread
From: david @ 2007-07-22  6:49 UTC (permalink / raw)
  To: LKML, linux-pm

I'm deliberately breaking the threading on this so that people who have 
tuned out the hibernation thread can take a look at this.

below is the proposal that I made at the bottom of one of the posts on the 
hibernation thread.

the idea is that, rather than approaching power management from the point 
of view of the current desktop standard (ACPI), this approaches it from 
the point of view of 'what does a tool need to know to do the job', no 
matter what mechanism is actually used to implement the different modes.

one thing that I didn't put in the original post is that the framework 
that I mention below should work with other types of devices besides ACPI 
ones.

CPU clock settings could fit the same API for example (with modeID=0 being 
to hot-unplug the cpu and modeID=1 to initialize the cpu)

here's the original post:

in fact, a better abstraction would be something like

report_power_modes
    which would return a series of modes (sorted only by modeID)
    modeID, %power_used_in_this_mode, %capability_in_this_mode
    (I would make mode 0 always be complete power off, and mode 1 always be
    full capacity)

report_power_mode_speed
    which would return a matrix giving how long it takes to transition from
    any mode to any other mode. this should be a relative number, not an
    absolute number since it will be different at different clock speeds.

set_operational_mode(modeID)
    which would take you from whatever mode you are in now to the requested
    mode.

most devices would report the simple list of modes

0,0,0
1,100,100

with a mode_speed matrix of

    | 0 1
  --+----
  0 | 0 1
  1 | 1 0
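
A minimal sketch of how the three calls might look for the simple on/off 
device above. It is written in Python purely for brevity; the names 
OnOffDevice, PowerMode and current_mode are illustrative assumptions and 
not an existing kernel interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    mode_id: int
    power_pct: int        # % of full power drawn in this mode
    capability_pct: int   # % of full capability available in this mode


class OnOffDevice:
    """Hypothetical driver for a device with only 'off' and 'full' modes."""

    def __init__(self):
        self.current_mode = 1  # assume the device starts at full capacity

    def report_power_modes(self):
        # mode 0 is always complete power off, mode 1 always full capacity
        return [PowerMode(0, 0, 0), PowerMode(1, 100, 100)]

    def report_power_mode_speed(self):
        # speed[a][b] is the relative cost of going from mode a to mode b
        return [[0, 1],
                [1, 0]]

    def set_operational_mode(self, mode_id):
        valid = {m.mode_id for m in self.report_power_modes()}
        if mode_id not in valid:
            raise ValueError(f"unknown mode {mode_id}")
        self.current_mode = mode_id
```

The caller never needs to know how the transition is carried out, only 
which modes exist and what moving between them costs.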

it may be that more info is needed for the power management engine to 
decide what modes it wants to put things into. if so, identify what type of 
info you need and add another column to the modes list.
for example:
   you may want to add a flag for 'does this mode allow downstream devices to 
operate?'
   you may want to make a mode for 'this mode doesn't allow any new requests, 
but continues to process pending requests' and have a flag that indicates this
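
One way the extra column could be sketched is as a set of per-mode flags. 
The flag names, the third 'drain' mode and its 60% power figure are all 
made up for illustration; the post only says that such a column could be 
added.

```python
from dataclasses import dataclass
from enum import Flag, auto


class ModeFlags(Flag):
    DOWNSTREAM_OK = auto()  # downstream devices can keep operating
    DRAIN_ONLY = auto()     # no new requests, pending ones still complete


@dataclass(frozen=True)
class PowerMode:
    mode_id: int
    power_pct: int
    capability_pct: int
    flags: ModeFlags = ModeFlags(0)  # no flags set by default


# Hypothetical mode list: off, full capacity, and a 'drain' mode.
MODES = [
    PowerMode(0, 0, 0),
    PowerMode(1, 100, 100, ModeFlags.DOWNSTREAM_OK),
    PowerMode(2, 60, 0, ModeFlags.DOWNSTREAM_OK | ModeFlags.DRAIN_ONLY),
]


def modes_safe_for_downstream(modes):
    """Return the IDs of modes that leave downstream devices usable."""
    return [m.mode_id for m in modes if ModeFlags.DOWNSTREAM_OK in m.flags]
```

A management tool that does not understand a given flag can ignore it, 
which is what keeps the list extensible.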

currently it looks like there's no way to find out what modes are available, 
and you have to know what mode something is in currently before you can request 
it change to a different mode. both of these prevent effective power management 
without encoding intimate knowledge of the capability of the particular 
hardware in your management tool.

some of this may be discoverable via the ACPI interface (it's not talked about 
much in the devices.txt file), but the mode setting is still wrong.

note that in the example above it's acceptable for a driver to cache what mode 
it thinks the device is in, but it needs to properly set the new mode even if 
its cached data is incorrect.
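
The caching rule can be sketched as follows: the driver may remember the 
mode, but it programs the hardware unconditionally instead of skipping the 
write when the cache says "already there". The dict standing in for device 
registers and the class name are assumptions for illustration.

```python
class CachingDriver:
    """Hypothetical driver that caches the mode but never trusts the cache."""

    def __init__(self, hardware):
        self.hardware = hardware             # dict standing in for registers
        self.cached_mode = hardware["mode"]  # what the driver believes

    def set_operational_mode(self, mode_id):
        # Always write the hardware, even if cached_mode == mode_id:
        # the device may have changed state behind the driver's back.
        self.hardware["mode"] = mode_id
        self.cached_mode = mode_id
```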

this approach would allow the transition of ALL drivers to the new mode of 
operation in one fell swoop, and then adding additional power management 
features is just adding to the existing list rather than implementing new 
functions.

David Lang

* Re: [linux-pm] Power Management framework proposal
  2007-07-22  6:49 Power Management framework proposal david
@ 2007-07-22  7:57 ` Igor Stoppa
  2007-07-22  8:58   ` david
  2007-07-22 17:26 ` Arjan van de Ven
  1 sibling, 1 reply; 40+ messages in thread
From: Igor Stoppa @ 2007-07-22  7:57 UTC (permalink / raw)
  To: ext linux-pm-bounces@lists.linux-foundation.org; +Cc: LKML, linux-pm

Hi,
On Sat, 2007-07-21 at 23:49 -0700, ext
linux-pm-bounces@lists.linux-foundation.org wrote:
> I'm deliberately breaking the threading on this so that people who have 
> tuned out the hibernation thread can take a look at this.
> 
> below is the proposal that I made at the bottom of one of the posts on the 
> hibernation thread.

I have the impression that you are trying to describe a mix of the clock
and latency frameworks.

Could you elaborate on how your proposal is incompatible with enhancing
the clock framework? 

It looks like you are proposing a brand new shiny thing that frankly I
would be happy to leave alone, unless it is crystal clear that the clock
fw cannot be improved.

The clock fw is used for OMAP and other architectures (including SH,
iirc) and so far it has provided very good support for our power
management needs (Nokia 770 and N800).

Currently we are working on DVFS for OMAP2 (see slides presented at the
linux-pm summit for OLS 2007 http://tinyurl.com/28tact ) and even if the
current prototype is not actively involving the clock fw, our final goal
is to make it capable of supporting atomic transactions for changing the
core parameters.

OMAP3 will require suspend to ram implementation where the content of
system memory is retained, while parts or all the SoC are switched off.
The plan is still to have a clock fw based implementation (plus
interaction with the power rails, of course).

I think these are good examples of the non-ACPI systems you are
mentioning.

To make any proposal that has some chance of being accepted, you have to
compare it against the existing solution, explaining:

-what it is bringing in terms of new functionalities
-how it is different
-why the current implementation cannot simply be enhanced

You can refer to the linux-pm archives for examples of failed attempts
over the last year or so, just search for "framework" in the subject.

-- 
Cheers, Igor

Igor Stoppa <igor.stoppa@nokia.com>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)

* Re: [linux-pm] Power Management framework proposal
  2007-07-22  7:57 ` [linux-pm] " Igor Stoppa
@ 2007-07-22  8:58   ` david
  2007-07-22 12:05     ` Igor Stoppa
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-22  8:58 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: ext linux-pm-bounces@lists.linux-foundation.org, LKML, linux-pm

On Sun, 22 Jul 2007, Igor Stoppa wrote:

> Hi,
> On Sat, 2007-07-21 at 23:49 -0700, ext
> linux-pm-bounces@lists.linux-foundation.org wrote:
>> I'm deliberately breaking the threading on this so that people who have
>> tuned out the hibernation thread can take a look at this.
>>
>> below is the proposal that I made at the bottom of one of the posts on the
>> hibernation thread.
>
> I have the impression that you are trying to describe a mix of the clock
> and latency frameworks.
>
> Could you elaborate on how your proposal is incompatible with enhancing
> the clock framework?

It's not that I think it's incompatible with any existing powersaving 
tools (in fact I hope it's not)

it's that I think that this (or something similar) could be made to cover 
all the various power options instead of CPUs having one interface, ACPI 
capable drivers having another, embedded devices presenting a third, etc

this was triggered by the mess of different function calls for different 
purposes that are used for the suspend functions where you have a bunch of 
different functions that are each supposed to be called at a specific time 
from a specific mode during the suspend process. with all these different 
functions driver writers tend to not bother implementing any of them, and 
it seems like there is a fairly steady stream of new functions that end up 
being needed. the initial intent was to just change this into a generic 
set of calls that every driver writer would implement the minimum set of, 
and make it trivially extensible to future capabilities of hardware.

one other effect of this is that driver writers would see the mode 
interface from day one rather than just completely ignoring it. right now 
device driver authors tend to think "why worry about figuring out how to 
implement 'prepare to suspend', 'late suspend', 'suspend', 'quiesce but 
don't suspend', etc" if they aren't really interested in working on 
suspend, it's not really clear what each of these should do even after 
reading the docs on it. however listing the power modes that a device can 
be in, documenting the cost of switching between them, and implementing 
the transition is something very straightforward for the device driver 
author to do (and they don't have to worry about the details of how and 
when the various modes get used, that's up to the suspend/powersaving 
software to figure out). as such I expect the driver support for 
powersaving modes to improve. in fact, I expect that some driver writers 
will implement a whole bunch of modes, just to show off the features of 
the hardware. and even if nothing uses the modes right now at least they 
are implemented and documented for future use (and it should be trivial to 
have a test routine that just runs every driver you have hardware for 
through every mode transition to make sure that they all work, so the less 
commonly used modes shouldn't bitrot too badly)
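
The test routine mentioned above could be as small as the following 
sketch. The device class is a stand-in with a made-up third mode, and 
checking current_mode after each switch is an assumption about how a 
driver would expose its state.

```python
from itertools import permutations


class ThreeModeDevice:
    """Stand-in driver with a hypothetical intermediate mode 2."""

    def __init__(self):
        self.current_mode = 1

    def report_power_modes(self):
        # (modeID, %power, %capability)
        return [(0, 0, 0), (1, 100, 100), (2, 40, 50)]

    def set_operational_mode(self, mode_id):
        self.current_mode = mode_id


def exercise_all_transitions(dev):
    """Drive dev through every ordered pair of modes and collect failures."""
    mode_ids = [mode_id for mode_id, _power, _cap in dev.report_power_modes()]
    failures = []
    for src, dst in permutations(mode_ids, 2):
        try:
            dev.set_operational_mode(src)
            dev.set_operational_mode(dst)
        except Exception as exc:
            failures.append((src, dst, repr(exc)))
            continue
        if dev.current_mode != dst:
            failures.append((src, dst, "mode not reached"))
    return failures
```

Because the routine only relies on the generic calls, the same loop would 
work for any driver that implements them.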

while I was describing the issues to my roommates over dinner I realized 
that the same type of functions are needed for the CPU clocks.

if you have an accepted framework in place there that can do what I 
described, please consider extending it to cover other types of devices 
and drivers.

I want sanity and functionality far more than credit :-)

David Lang

> It looks like you are proposing a brand new shiny thing that frankly I
> would be happy to leave alone, unless it is crystal clear that the clock
> fw cannot be improved.
>
> The clock fw is used for OMAP and other architectures (including SH,
> iirc) and so far it has provided very good support for our power
> management needs (Nokia 770 and N800).
>
> Currently we are working on DVFS for OMAP2 (see slides presented at the
> linux-pm summit for OLS 2007 http://tinyurl.com/28tact ) and even if the
> current prototype is not actively involving the clock fw, our final goal
> is to make it capable of supporting atomic transactions for changing the
> core parameters.

thanks for the link. I've read through it, and it looks like there is a 
lot of the same ideas in your proposal. I think you are passing too much 
info up the chain to the part making the decision (that part doesn't need 
to know the details of the voltage/freq choices, the %power/%capability 
numbers I suggested are in many ways more what they are making decisions 
on anyway)

in the slideshow, the sequence for changing the cpu speed includes pre and 
post notification of drivers. what exactly are the drivers expected to do with 
the notification? are you asking them to pause and then re-initialize for 
the new power level?

> OMAP3 will require suspend to ram implementation where the content of
> system memory is retained, while parts or all the SoC are switched off.
> The plan is still to have a clock fw based implementation (plus
> interaction with the power rails, of course).
>
> I think these are good examples of the non-ACPI systems you are
> mentioning.

yes, I think they are.

> To make any proposal that has some chance of being accepted, you have to
> compare it against the existing solution, explaining:
>
> -what it is bringing in terms of new functionalities
> -how it is different

it unifies all power/performance trade-offs (including power on/off) into 
a single API, but decouples that API from the implementation details of 
exactly what the technical details of the different modes are and how the 
changes are made.

for some subsystems this would be little more than renaming existing 
functions, for others it would be converting several independent functions 
into one, discoverable api

> -why the current implementation cannot simply be enhanced

which current implementation should be enhanced? and with the massive 
broadening of functionality should it retain the same name, or should it 
get renamed to something more generic?

the cpufreq implementation is very close to what I'm proposing, it would 
need to get broadened to cover other devices (like disk drives, wireless 
cards, etc), is this really the right thing to do or should the more 
generic API go in for external use and then the existing cpufreq be called 
from the set_mode() call?

> You can refer to the linux-pm archives for examples of failed attempts
> over the last year or so, just search for "framework" in the subject.

I'll try to do some digging.

David Lang

* Re: [linux-pm] Power Management framework proposal
  2007-07-22  8:58   ` david
@ 2007-07-22 12:05     ` Igor Stoppa
  2007-07-22 21:21       ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Igor Stoppa @ 2007-07-22 12:05 UTC (permalink / raw)
  To: ext david@lang.hm
  Cc: ext linux-pm-bounces@lists.linux-foundation.org, LKML, linux-pm

On Sun, 2007-07-22 at 01:58 -0700, ext david@lang.hm wrote:
> On Sun, 22 Jul 2007, Igor Stoppa wrote:

[snip]

> > Could you elaborate on how your proposal is incompatible with enhancing
> > the clock framework?
> 
> It's not that I think it's incompatible with any existing powersaving 
> tools (in fact I hope it's not)
> 
> it's that I think that this (or something similar) could be made to cover 
> all the various power options instead of CPUs having one interface, ACPI 
> capable drivers having another, embedded devices presenting a third, etc
> 
> this was triggered by the mess of different function calls for different 
> purposes that are used for the suspend functions where you have a bunch of 
> different functions that are each supposed to be called at a specific time 
> from a specific mode during the suspend process. with all these different 
> functions driver writers tend to not bother implementing any of them, and 
> it seems like there is a fairly steady stream of new functions that end up 
> being needed. the initial intent was to just change this into a generic 
> set of calls that every driver writer would implement the minimum set of, 
> and make it trivially extensible to future capabilities of hardware.

Every now and then there is some attempt to find One solution to bind
them all: x86, SoC, ACPI ... you name it.

Unfortunately, while it's true that there are significant similarities,
there are also notable differences; as far as i know the USB subsystem
is the one that gets closer to what we have in the embedded arena, since
it can have complex cases of parent-child powering and wakeup.

> one other effect of this is that driver writers would see the mode 
> interface from day one rather than just completely ignoring it. right now 
> device driver authors tend to think "why worry about figuring out how to 
> implement 'prepare to suspend', 'late suspend', 'suspend', 'quiesce but 
> don't suspend', etc" if they aren't really interested in working on 
> suspend, it's not really clear what each of these should do even after 
> reading the docs on it. however listing the power modes that a device can 
> be in, documenting the cost of switching between them, and implementing 
> the transition is something very straightforward for the device driver 
> author to do (and they don't have to worry about the details of how and 
> when the various modes get used, that's up to the suspend/powersaving 
> software to figure out). as such I expect that the driver support for 
> powersaving modes to improve. in fact, I expect that some driver writers 
> will implement a whole bunch of modes, just to show off the features of 
> the hardware. and even if nothing uses the modes right now at least they 
> are implemented and documented for future use (and it should be trivial to 
> have a test routine that just runs every driver you have hardware for 
> through every mode transition to make sure that they all work, so the less 
> commonly used modes shouldn't bitrot too badly)

What you are saying can be summarised as making the driver model more
expressive.

> while I was describing the issues to my roommates over dinner I realized 
> that the same type of functions are needed for the CPU clocks.
> 
> if you have an accepted framework in place there that can do what I 
> described, please consider extending it to cover other types of devices 
> and drivers.

That is not part of the fw: the fw simply expresses parent-child clock
distribution and keeps usecounts so that unused clocks are automatically
gated.
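
As a rough model of that behaviour (not the actual clock fw code), each 
clock keeps a usecount, enabling a clock first enables its parent, and a 
clock whose count drops back to zero is gated again. The class and 
attribute names are assumptions for illustration.

```python
class Clock:
    """Toy model of a clock node with parent-child usecounting."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.usecount = 0

    @property
    def gated(self):
        # A clock nobody is using is considered gated (switched off).
        return self.usecount == 0

    def enable(self):
        # The first user of a clock also takes a reference on its parent.
        if self.usecount == 0 and self.parent is not None:
            self.parent.enable()
        self.usecount += 1

    def disable(self):
        if self.usecount == 0:
            raise RuntimeError(f"{self.name}: unbalanced disable")
        self.usecount -= 1
        # The last user going away releases the parent as well.
        if self.usecount == 0 and self.parent is not None:
            self.parent.disable()
```

With two child clocks sharing one oscillator, the oscillator stays on 
until both children have been disabled.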

The actual clock tree description is platform/arch/board specific and
doesn't affect the framework. You can just roll your own version for x86
by providing a description of the methods used to switch on/off every
individual clock on your board.

So what you are asking for is that somebody writes an x86 version of the
clock fw.

As for latencies, well, only a few clocks really have significant impact.
Most notably the main system oscillator. Everything else has 0 latency
since it ends up in opening/closing a clock gate.

Powering device on/off will certainly introduce more latency, but either
the powering is supported by the hw, to make it quick or it has to go
through most, if not all of the usual initialisation sequence; in that
case it probably makes sense to avoid controlling it from kernelspace,
since it will be slow and won't require decisions made with microsecond
precision.

> I want sanity and functionality far more than credit :-)

I want to avoid redesigning the wheel: the current version is not round
yet, but re-starting from a triangle every time is far less appealing.

> thanks for the link. I've read through it, and it looks like there is a 
> lot of the same ideas in your proposal. 

Unless some new hw/technology shows up, I'm afraid the available set of
ideas is very limited :-)

> I think you are passing too much 
> info up the chain to the part making the decision (that part doesn't need 
> to know the details of the voltage/freq choices, the %power/%capability 
> numbers I suggested are in many ways more what they are making decisions 
> on anyway)

I don't think you have got it right: the only info being passed is the
standard cpufreq list of frequencies; everything else is part of the
cpufreq driver.

> in the slideshow you list in the sequence of changing the cpu speed to pre 
> and post notify drivers. what exactly are the drivers expected to do with 
> the notification? are you asking them to pause and then re-initialize for 
> the new power level?

It's just a notification. The drivers are supposed to know how to deal
with it.
In OMAP2 the major concern is that the external memory cannot be
accessed since it is on a bus that is being re-clocked:
- the dma controllers must be paused
- the other cores (dsp) must not access sdram
- the onenand driver needs to adjust its timing parameters
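
The pre/post sequence can be pictured with the sketch below: every 
listener is told before the bus is re-clocked and again afterwards. The 
class names, the pre_change/post_change method names and the callback used 
to set the hardware frequency are hypothetical; the real cpufreq notifier 
interface is not reproduced here.

```python
class DmaController:
    """Pauses transfers while the memory bus is being re-clocked."""

    def __init__(self):
        self.paused = False

    def pre_change(self, new_khz):
        self.paused = True

    def post_change(self, new_khz):
        self.paused = False


class OneNandDriver:
    """Re-derives its timing parameters once the new clock is in place."""

    def __init__(self):
        self.timing_khz = None

    def pre_change(self, new_khz):
        pass  # nothing to do before the change in this sketch

    def post_change(self, new_khz):
        self.timing_khz = new_khz


def change_bus_frequency(listeners, set_hw_khz, new_khz):
    for listener in listeners:
        listener.pre_change(new_khz)
    set_hw_khz(new_khz)  # external memory is unreachable during this call
    for listener in reversed(listeners):
        listener.post_change(new_khz)
```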

[snip]

> > To make any proposal that has some chance of being accepted, you have to
> > compare it against the existing solution, explaining:
> >
> > -what it is bringing in terms of new functionalities
> > -how it is different
> 
> it unifies all power/performance trade-offs (including power on/off) into 
> a single API, but decouples that API from the implementation details of 
> exactly what the technical details of the different modes are and how the 
> changes are made.

It always looks great at this level of abstraction, but then usually
what is discovered later is that _a lot_ of extra complexity is
introduced, in order to cover every case on every platform that is
intended to be supported.

> for some subsystems this would be little more than renaming existing 
> functions, for others it would be converting several independent functions 
> into one, discoverable api

if you check cpufreq, you will find out that it already covers the
multiple cores case (but nothing prevents using the same logic on
something that is not really a cpu) and also has some simple concept of
latency for frequency transition, concept that could be enhanced to
handle latencies that are depending on the current operating point and
target operating point.

> > -why the current implementation cannot simply be enhanced
> 
> which current implementation should be enhanced? and with the massive 
> broadening of functionality should it retain the same name, or should it 
> get renamed to something more generic?

cpufreq could be renamed to anything that makes sense, but i see _no_
massive broadening of functionality.

> the cpufreq implementation is very close to what I'm proposing, it would 
> need to get broadened to cover other devices (like disk drives, wireless 
> cards, etc), is this really the right thing to do or should the more 
> generic API go in for external use and then the existing cpufreq be called 
> from the set_mode() call?

No, that doesn't make sense, as general approach.
You want to manage from kernel only those parts of the system where the
latency is so low that userspace wouldn't be able to keep up.

Your examples (wireless, disk drive) can be easily controlled from
userspace, with a timeout.

In both cases there are significant delays (change of rotation speed /
sync with the access point).

All this is hand waving unless it is backed up by numbers.
Real cases are required in order to establish a list of priorities for
latency/power consumption.

Afterward, a valid solution that can address such cases can be sketched.

-- 
Cheers, Igor

Igor Stoppa <igor.stoppa@nokia.com>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)

* Re: Power Management framework proposal
  2007-07-22  6:49 Power Management framework proposal david
  2007-07-22  7:57 ` [linux-pm] " Igor Stoppa
@ 2007-07-22 17:26 ` Arjan van de Ven
  2007-07-22 18:56   ` david
  2007-07-23 22:23   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 40+ messages in thread
From: Arjan van de Ven @ 2007-07-22 17:26 UTC (permalink / raw)
  To: david; +Cc: LKML, linux-pm

On Sat, 2007-07-21 at 23:49 -0700, david@lang.hm wrote:

> this approach would allow the transition of ALL drivers to the new mode of 
> operation in one fell swoop, and then adding additional power management 
> features is just adding to the existing list rather than implementing new 
> functions.

I have a concern with this approach though. It seems to assume that
there is one global thing somewhere that sets the system state; in my
experience that is the wrong approach; in fact there is a very definite
evidence that there are many decisions on power that are to be made
local at a high frequency. An example of this is the processor speed;
the ondemand governor does exactly this for the cpus that can switch
speeds fast; it's just impossible to beat such a local, fast decision
with anything on a global scale.

On the other hand, some things (the high level goals and constraints)
are obviously global.

However, your design seems to want to put the low level settings in a
global thing, and that is just a mistake.

* Re: Power Management framework proposal
  2007-07-22 17:26 ` Arjan van de Ven
@ 2007-07-22 18:56   ` david
  2007-07-22 22:27     ` Arjan van de Ven
  2007-07-23 22:23   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 40+ messages in thread
From: david @ 2007-07-22 18:56 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: LKML, linux-pm

On Sun, 22 Jul 2007, Arjan van de Ven wrote:

>> this approach would allow the transition of ALL drivers to the new mode of
>> operation in one fell swoop, and then adding additional power management
>> features is just adding to the existing list rather than implementing new
>> functions.
>
>
> I have a concern with this approach though. It seems to assume that
> there is one global thing somewhere that sets the system state; in my
> experience that is the wrong approach; in fact there is a very definite
> evidence that there are many decisions on power that are to be made
> local at a high frequency. An example of this is the processor speed;
> the ondemand governor does exactly this for the cpus that can switch
> speeds fast; it's just impossible to beat such a local, fast decision
> with anything on a global scale.

the intent was not to have one global call that sets the mode on all 
devices, but rather have one call for each device/subsystem, just the same 
call in each case.

there's also nothing that says that there can only be one thing setting 
the mode (although that does mean a fourth call 'report_current_mode()' or 
similar is needed). and if you choose to have two pieces of software 
managing the same device things could get 'interesting'.

as for the speed that such decisions need to be made.

this API is not saying anything about the speed of the decisions.
it's also not saying anything about whether the decision making is being 
done by kernelspace or userspace. it's just providing a common way for 
whatever software is doing the decision making to find out its options and 
set the modes.

one way to represent this in sysfs (as a possible userspace API) would be
power/mode (read to see what mode the device is in, write to set the mode)
power/modelist
power/switching_delay (outputs the delay matrix showing the cost of switching from mode to mode)
although I'm not sure how you would allow the system to report an error 
this way.
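
To make the layout concrete, here is a sketch of how a userspace tool 
might parse those files. The post only names the files; the contents 
assumed here (one 'modeID,%power,%capability' triple per line for 
modelist, one whitespace-separated row per line for switching_delay) are 
a guess for illustration.

```python
def parse_modelist(text):
    """Parse 'modeID,%power,%capability' lines into a dict keyed by modeID."""
    modes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        mode_id, power, capability = (int(field) for field in line.split(","))
        modes[mode_id] = (power, capability)
    return modes


def parse_switching_delay(text):
    """Parse whitespace-separated rows into a square matrix of delays."""
    rows = [[int(field) for field in line.split()]
            for line in text.splitlines() if line.strip()]
    if any(len(row) != len(rows) for row in rows):
        raise ValueError("switching_delay matrix is not square")
    return rows
```

For the simple on/off device, parse_modelist("0,0,0\n1,100,100\n") gives 
{0: (0, 0), 1: (100, 100)}.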

if you have the ondemand governor running, it uses these APIs to make 
its changes, but if you didn't want to run the ondemand governor you 
could just do coarse speed settings through sysfs.

> On the other hand, some things (the high level goals and constraints)
> are obviously global.
>
> However, your design seems to want to put the low level settings in a
> global thing, and that is just a mistake.

I'm proposing a single function name per device, not a single function for 
the entire system.

David Lang

* Re: [linux-pm] Power Management framework proposal
  2007-07-22 12:05     ` Igor Stoppa
@ 2007-07-22 21:21       ` david
  2007-07-22 23:09         ` Arjan van de Ven
  2007-07-23 10:48         ` Igor Stoppa
  0 siblings, 2 replies; 40+ messages in thread
From: david @ 2007-07-22 21:21 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: ext linux-pm-bounces@lists.linux-foundation.org, LKML, linux-pm

On Sun, 22 Jul 2007, Igor Stoppa wrote:

> On Sun, 2007-07-22 at 01:58 -0700, ext david@lang.hm wrote:
>> On Sun, 22 Jul 2007, Igor Stoppa wrote:
>
> [snip]
>
>>> Could you elaborate on how your proposal is incompatible with enhancing
>>> the clock framework?
>>
>> It's not that I think it's incompatible with any existing powersaving
>> tools (in fact I hope it's not)
>>
>> it's that I think that this (or something similar) could be made to cover
>> all the various power options instead of CPUs having one interface, ACPI
>> capable drivers having another, embedded devices presenting a third, etc
>>
>> this was triggered by the mess of different function calls for different
>> purposes that are used for the suspend functions where you have a bunch of
>> different functions that are each supposed to be called at a specific time
>> from a specific mode during the suspend process. with all these different
>> functions driver writers tend to not bother implementing any of them, and
>> it seems like there is a fairly steady stream of new functions that end up
>> being needed. the initial intent was to just change this into a generic
>> set of calls that every driver writer would implement the minimum set of,
>> and make it trivially extensible to future capabilities of hardware.
>
> Every now and then there is some attempt to find One solution to bind
> them all: x86, SoC, ACPI ... you name it.

this is another one. I'd be happy to get pointers to prior ones to learn 
from.

> Unfortunately, while it's true that there are significant similarities,
> there are also notable differences; as far as i know the USB subsystem
> is the one that gets closer to what we have in the embedded arena, since
> it can have complex cases of parent-child powering and wakeup.

this API is not trying to represent the parent-child hierarchy. as far as 
I know that's documented in sysfs (or is supposed to be). this is just an 
attempt to make it so that as you are going through the hierarchy you 
don't have to use vastly different API's to control the different 
functions.

I suspect that most (if not all) of the previous One Solutions have tried 
to completely handle all the details of their original case, and then 
branch out to the other cases.

this attempt is working from the other direction. the user of this API 
doesn't care how something is done, it just wants to know what's possible 
and how to tell the system to switch modes.

other than just me searching through the lists, do you have a pointer to 
some of the differences between the different types that are seen as being 
so large that they can't be unified?

>> while I was describing the issues to my roommates over dinner I realized
>> that the same type of functions are needed for the CPU clocks.
>>
>> if you have an accepted framework in place there that can do what I
>> described, please consider extending it to cover other types of devices
>> and drivers.
>
> That is not part of the fw: the fw simply expresses parent-child clock
> distribution and keeps usecounts so that unused clocks are automatically
> gated.
>
> The actual clock tree description is platform/arch/board specific and
> doesn't affect the framework. You can just roll your own version for x86
> by providing a description of the methods used to switch on/off every
> individual clock on your board.
>
> So what you are asking for is that somebody writes an x86 version of the
> clock fw.

this is more than just setting the clocks on everything (although setting 
clocks seems like it fits well into the model) because some power modes 
are not easily represented just as clocks.

> As for latencies, well, only a few clocks really have significant impact.
> Most notably the main system oscillator. Everything else has 0 latency
> since it ends up in opening/closing a clock gate.
>
> Powering device on/off will certainly introduce more latency, but either
> the powering is supported by the hw, to make it quick or it has to go
> through most, if not all of the usual initialisation sequence; in that
> case it probably makes sense to avoid controlling it from kernelspace,
> since it will be slow and won't require decisions made with microsecond
> precision.

and many devices support both a quick almost-off mode and a slow 
almost-off mode (as well as a completely off mode), with the slow mode 
eating less power, but taking longer to wake up from. that's the reason 
for providing the matrix to let the program making the decision decide if 
it's worth the time delays to get the power savings
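
A sketch of how a decision maker might use the matrix: given the current 
mode and a budget for transition cost (in the same relative units as the 
matrix), pick the lowest-power mode whose round trip still fits. The 
function name, the budget parameter and the assumption that mode IDs index 
the matrix directly are illustrative, not part of the proposal.

```python
def pick_idle_mode(modes, speed, current, budget):
    """Return the ID of the lowest-power mode reachable within budget.

    modes:  list of (modeID, %power, %capability) tuples
    speed:  speed[a][b] is the relative cost of switching from a to b
    budget: largest acceptable cost of entering a mode and coming back
    """
    power_of = {mode_id: power for mode_id, power, _cap in modes}
    best, best_power = current, power_of[current]
    for mode_id, power, _cap in modes:
        round_trip = speed[current][mode_id] + speed[mode_id][current]
        if round_trip <= budget and power < best_power:
            best, best_power = mode_id, power
    return best


# Hypothetical device: off, full, quick almost-off, slow almost-off.
EXAMPLE_MODES = [(0, 0, 0), (1, 100, 100), (2, 10, 0), (3, 2, 0)]
EXAMPLE_SPEED = [
    [0, 50, 50, 50],
    [20, 0, 1, 5],
    [20, 2, 0, 5],
    [20, 10, 5, 0],
]
```

With these made-up numbers and the device in mode 1, a budget of 3 selects 
the quick almost-off mode 2, while a budget of 15 is enough to justify the 
slower but cheaper mode 3.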

as I note in another message, this API isn't intended to be strictly 
kernelspace or strictly userspace. for the ondemand speed governor you are 
changing the settings quickly and so probably want to do so in the kernel, 
however some people may be satisfied with slower controls and so could 
have them in userspace (an extreme example of this would be turning off 
wireless cards that aren't in use to save power and improve security)

>> I think you are passing too much
>> info up the chain to the part making the decision (that part doesn't need
>> to know the details of the voltage/freq choices, the %power/%capability
>> numbers I suggested are in many ways more what they are making decisions
>> on anyway)
>
> I don't think you have got it right: the only info being passed is the
> standard cpufreq list of frequencies; everything else is part of the
> cpufreq driver.

to make the decisions the software making the decision needs to know how 
much power would be used at each freq setting.

>> in the slideshow you list in the sequence of changing the cpu speed to pre
>> and post notify drivers. what exactly are the drivers expected to do with
>> the notification? are you asking them to pause and then re-initialize for
>> the new power level?
>
> It's just a notification. The drivers are supposed to know how to deal
> with it.
> In OMAP2 the major concern is that the external memory cannot be
> accessed since it is on a bus that is being re-clocked:
> - the dma controllers must be paused
> - the other cores (dsp) must not access sdram
> - the onenand driver needs to adjust its timing parameters

in my proposal this would require one or more 'pause' modes (more than
one if you need to pause at different power settings for some reason) for
the first notification, and then you would set them to the mode you want
them in at the second notification point (which is probably going to be
the mode they were in before)

> [snip]
>
>>> To make any proposal that has some chance of being accepted, you have to
>>> compare it against the existing solution, explaining:
>>>
>>> -what it is bringing in terms of new functionalities
>>> -how it is different
>>
>> it unifies all power/performance trade-offs (including power on/off) into
>> a single API, but decouples that API from the implementation details of
>> exactly what the technical details of the different modes are and how the
>> changes are made.
>
> It always looks great at this level of abstraction, but then usually
> what is discovered later is that _a lot_ of extra complexity is
> introduced, in order to cover every case on every platform that is
> intended to be supported.

which is why I posted this for comments.

what are the cases that require extra info? can that extra info be as
simple as a set of flags for the mode (or possibly for the transition
matrix)?

for your clock example you need a flag that says 'this requires everything 
connected to this be paused'

for suspend and other low power modes you need to be able to say 'contents
of things below this point will be lost when you go into this mode' so that
the decision making software knows that it needs to save the contents of
memory before switching to a mode that stops the dram refresh. I don't
have any idea at the moment for how to provide a common interface for
actually saving or restoring the contents, that is outside the scope of
this API

the ACPI people will need a flag for 'this device can generate wakeup 
signals in this mode'

but this API would just provide this info to the decision making code,
that code would have to actually enforce the limits

>> for some subsystems this would be little more than renaming existing
>> functions, for others it would be converting several independent functions
>> into one, discoverable api
>
> if you check cpufreq, you will find out that it already covers the
> multiple cores case (but nothing prevents from using the same logic on
> something that is not really a cpu) and also has some simple concept of
> latency for frequency transition, concept that could be enhanced to
> handle latencies that are depending on the current operating point and
> target operating point.

does it provide a full matrix of latencies, or just mode 1->mode2=x, 
mode2->mode3=y so mode1->mode3=x+y?

>>> -why the current implementation cannot simply be enhanced
>>
>> which current implementation should be enhanced? and with the massive
>> broadening of functionality should it retain the same name, or should it
>> get renamed to something more generic?
>
> cpufreq could be renamed to anything that makes sense, but i see _no_
> massive broadening of functionality.

what I'm talking about would provide an API to devices that you are
ignoring because they should be managed from userspace.

>> the cpufreq implementation is very close to what I'm proposing, it would
>> need to get broadened to cover other devices (like disk drives, wireless
>> cards, etc), is this really the right thing to do or should the more
>> generic API go in for external use and then the existing cpufreq be called
>> from the set_mode() call?
>
> No, that doesn't make sense, as general approach.
> You want to manage from kernel only those parts of the system where the
> latency is so low that userspace wouldn't be able to keep up.
>
> Your examples (wireless, disk drive) can be easily controlled from
> userspace, with a timeout.

absolutely, and they should be (at least most of the time). this was not
intended as a kernelspace only api. it is intended to be available to both 
kernelspace and userspace.

> In both cases there are significant delays (change of rotation speed /
> sync with the access point).

correct, and these delays should be reflected in the transition cost 
matrix

> All this is hand waving unless it is backed up by numbers.
> Real cases are required in order to establish a list of priorities for
> latency/power consumption.

this isn't attempting to establish a list of priorities, simply to give the
software that is trying to establish such a list the info to make its
decisions, and the interface to use to issue the resulting instructions.

> Afterward, a valid solution that can address such cases can be sketched.

with this API you should be able to create a very trivial power manager
that can know nothing about the system other than the info found in this
API and the hierarchy of devices, but can transition the system between
three easily explained modes.

A. full power operation
B. off
C. as low a power mode as is available on the hardware without having to 
save the contents of something somewhere else.
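
A rough sketch of such a trivial manager for a single device, assuming the mode list carries a 'contents lost' flag as discussed above. The Mode fields, the DRAM numbers and the policy names are hypothetical and only there to illustrate the idea.

```python
from typing import NamedTuple


class Mode(NamedTuple):
    mode_id: int
    power: int            # % of full power
    capability: int       # % of full capability
    loses_contents: bool  # hypothetical 'contents will be lost' flag


# Made-up mode list for a DRAM-like device.
DRAM = [
    Mode(0, 0, 0, True),       # off
    Mode(1, 100, 100, False),  # full capacity
    Mode(2, 5, 0, False),      # self-refresh, contents kept
    Mode(3, 1, 0, True),       # refresh stopped, contents lost
]


def target_mode(modes, policy):
    """Map one of the three policies (A, B, C above) onto a mode ID."""
    if policy == "full":  # A. full power operation
        return 1
    if policy == "off":   # B. off
        return 0
    if policy == "low":   # C. lowest power without losing contents
        safe = [m for m in modes if not m.loses_contents]
        return min(safe, key=lambda m: m.power).mode_id
    raise ValueError("unknown policy: %r" % (policy,))
```

A full manager would apply target_mode to every device in the hierarchy; that part is left out here.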

Thanks for your time in replying to me on this topic.

David Lang


* Re: Power Management framework proposal
  2007-07-22 18:56   ` david
@ 2007-07-22 22:27     ` Arjan van de Ven
  2007-07-23  3:51       ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2007-07-22 22:27 UTC (permalink / raw)
  To: david; +Cc: LKML, linux-pm

On Sun, 2007-07-22 at 11:56 -0700, david@lang.hm wrote:
> >
> > I have a concern with this approach though. It seems to assume that
> > there is one global thing somewhere that sets the system state; in my
> > experience that is the wrong approach; in fact there is a very definite
> > evidence that there are many decisions on power that are to be made
> > local at a high frequency. An example of this is the processor speed;
> > the ondemand governor does exactly this for the cpus that can switch
> > speeds fast; it's just impossible to beat such a local, fast decision
> > with anything on a global scale.
> 
> the intent was not to have one global call that sets the mode on all 
> devices, but rather have one call for each device/subsystem, just the same 
> call in each case.
> 
> there's also nothing that says that there can only be one thing setting 
> the mode (although that does mean a fourth call 'report_current_mode()' or 
> similar is needed). and if you choose to have two pieces of software 
> managing the same device things could get 'interesting'.
> 
> as for the speed that such decisions need to be made.
> 
> this API is not saying anything about the speed of the decisions.
> it's also not saying anything about if the decision making is being done
> by kernelspace or userspace. it's just providing a common way for whatever
> software is doing the decision making to find out its options and set
> the modes.

but it makes for a layer between the device and the setting of the
modes.. which sort of would defeat the option of having things truly
local.

Settings don't mean much in general (in specific cases, maybe), it's the
requirements that matter. The *intent* matters. Linus forced this into
cpufreq way back, and while I and perhaps others thought he was just
being silly, 6 years later it turns out he was absolutely right.

Maybe something else
A power policy management framework doesn't need a unified framework (I
know this for a fact, I'm hoping to release the code within a few
weeks). A unified interface doesn't even help one single bit: the
semantics of each part is *extremely* different even if you make it look
the same; the sameness is only cosmetic.

The consequences of managing a disk vs managing a cpu vs managing the
LCD brightness via the X server are all very different. The tradeoffs
you need to make are all very different. The things you want to control
are all very different. Trying to force a standard interface makes the
interface for a specific subsystem go away from the *actual* best
interface for that subsystem, for no gain since the thing that manages
the policy needs to have different parts for each *anyway*.

Now I realize that the needs for "hard small embedded" are different
from "PC like", and to be honest, I don't think it's entirely possible
to unify them; I don't think it's even worthwhile to pursue that (look
at where those attempts have gotten us so far)... but I suspect even in
the small embedded space a standard, forced and thus unnatural interface
isn't what is needed.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


* Re: [linux-pm] Power Management framework proposal
  2007-07-22 21:21       ` david
@ 2007-07-22 23:09         ` Arjan van de Ven
  2007-07-23  2:45           ` david
  2007-07-23 10:48         ` Igor Stoppa
  1 sibling, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2007-07-22 23:09 UTC (permalink / raw)
  To: david
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

> >> son anyway)
> >
> > I don't think you have got it right: the only info being passed is the
> > standard cpufreq list of frequencies; everything else is part of the
> > cpufreq driver.
> 
> to make the decisions the software making the decision needs to know how
> much power would be used at each freq setting.

power used at a certain frequency is not a single variable.
On most laptops and other similarly power aware devices, it's
in fact better for power consumption to always go to the maximum
frequency as quickly as possible, so that you can be idle for the
longest possible time after that. Good luck finding a generic way to
represent such things in a (userspace) interface....


* Re: [linux-pm] Power Management framework proposal
  2007-07-22 23:09         ` Arjan van de Ven
@ 2007-07-23  2:45           ` david
  2007-07-23  3:50             ` Arjan van de Ven
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-23  2:45 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Sun, 22 Jul 2007, Arjan van de Ven wrote:

>>>> son anyway)
>>>
>>> I don't think you have got it right: the only info being passed is the
>>> standard cpufreq list of frequencies; everything else is part of the
>>> cpufreq driver.
>>
>> to make the decisions the software making the decision needs to know how
>> much power would be used at each freq setting.
>
> power used at a certain frequency is not a single variable.
> On most laptops and other similarly power aware devices, it's
> in fact better for power consumption to always go to the maximum
> frequency as quickly as possible, so that you can be idle for the
> longest possible time after that. Good luck finding a generic way to
> represent such things in a (userspace) interface....

I disagree with you here. for each frequency setting you can say how much 
power the cpu/system is expected to use (especially as a percentage of the 
full power mode). creating this value requires you to take two things into 
account, the voltage you are running things at (by far the biggest 
effect), and the minor difference that the frequency makes at that voltage 
(possibly small enough to ignore entirely).

the API I proposed has no problem with there being multiple modes that 
have the same %power but with different %capability numbers.

I'm willing to bet that the current cpufreq software just looks at the 
voltage as the value that tells you how much power the thing is going to 
use at that setting

the fact that you want to run at the max frequency for a given voltage is
a reasonable strategy, but it's a power saving _strategy_, not a
capability of the hardware and the API I'm mentioning should be enough to
let you pick the highest performance setting that has the same power
rating as the minimum performance you need (or for that matter to go one
step further and go with the most efficient setting in terms of
performance/power that has a performance number higher than what you need,
which could actually be better)
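
Both selection strategies can be sketched against a mode table. The table below is a made-up example (the numbers and function names are assumptions for illustration, not anything cpufreq exposes today).

```python
# Hypothetical mode table for one CPU: mode_id -> capability and power,
# both as a percentage of full-capacity mode 1 (numbers are made up).
MODES = {
    0: {"capability": 0, "power": 0},
    1: {"capability": 100, "power": 100},
    2: {"capability": 75, "power": 100},
    3: {"capability": 50, "power": 25},
    4: {"capability": 25, "power": 7},
}


def usable(needed):
    """Modes that deliver at least `needed` % capability (0 < needed <= 100)."""
    return {
        mode: v for mode, v in MODES.items()
        if v["capability"] >= needed and v["power"] > 0
    }


def fastest_at_lowest_power(needed):
    """Find the cheapest power rating that meets the need, then take the
    highest-capability mode available at that same power rating."""
    ok = usable(needed)
    lowest = min(v["power"] for v in ok.values())
    same_power = [mode for mode, v in ok.items() if v["power"] == lowest]
    return max(same_power, key=lambda mode: ok[mode]["capability"])


def most_efficient(needed):
    """Among modes that meet the need, take the best capability per power."""
    ok = usable(needed)
    return max(ok, key=lambda mode: ok[mode]["capability"] / ok[mode]["power"])
```

For a need of 60% the first strategy skips mode 2 in favour of mode 1, since both carry the same power rating and mode 1 delivers more.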

the fact that you currently want to use this strategy doesn't mean that 
the other possible modes don't exist, and even if you don't use them now 
they should be available within the API (including the cpufreq api)

this strategy should work well on the normal unpredictable workload that 
most people deal with, but there are some cases where the workload becomes 
pretty predictable (media players for example) where there really is less 
variation, and a need for a constant availability of the cpu, so it may 
actually save a smidge of power to run below the highest freq that the 
voltage allows rather than running faster and being idle more cycles.

David Lang


* Re: [linux-pm] Power Management framework proposal
  2007-07-23  2:45           ` david
@ 2007-07-23  3:50             ` Arjan van de Ven
  2007-07-23  4:04               ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2007-07-23  3:50 UTC (permalink / raw)
  To: david
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

> I disagree with you here. for each frequency setting you can say how much 
> power the cpu/system is expected to use (especially as a percentage of the 
> full power mode). creating this value requires you to take two things into 
> account, the voltage you are running things at (by far the biggest 
> effect), and the minor difference that the frequency makes at that voltage 
> (possibly small enough to ignore entirely).
> 
> the API I proposed has no problem with there being multiple modes that 
> have the same %power but with different %capability numbers.

how do you deal with the "power at idle" vs "power at full load".. you
need both at each level to pick the best one, as well as relative
performance etc.

> 
> I'm willing to bet that the current cpufreq software just looks at the 
> voltage as the value that tells you how much power the thing is going to 
> use at that setting

it doesn't. 
> 
> the fact that you want to run at the max frequency for a given voltage is

no I want to run at the max frequency PERIOD. On just about any PC, it's
more power efficient to go full speed when executing code, and then idle
for as long as you can. (there are some second order effects that make
this a bit more complex, but as first order approach it's a sound
approach). Voltage follows, and that's fine.

> a reasonable strategy, but it's a power saving _strategy_, not a 
> capability of the hardware and the API I'm mentioning should be enough to 
> let you pick the highest performance setting that has the same power 
> rating as the minimum performance you need (or for that matter to go one 
> step further and go with the most efficient setting in terms of
> performance/power that has a performance number higher than what you need,
> which could actually be better)

why would I care about voltage? Most PCs don't expose it, and that's
fine, they can switch to the voltage needed REALLY quickly (single or
double digit microseconds). PCs in fact only expose numbered states (P0
to P7 at most), and some number that you can use to show the user, but
doesn't mean anything beyond that. Some people interpret it as
"frequency", and that's nice, but it doesn't really mean that. You
really don't know anything beyond that....

and that's ok. As I said before, as a general strategy you want "highest
speed when running code" for race-to-idle, with some 2nd order effects
for when you execute code really shortly coming out of idle; in which
case you don't want to do a voltage transition twice (most cpus have the
idle voltage be the lowest-execute voltage as well).

> this strategy should work well on the normal unpredictable workload that 
> most people deal with, but there are some cases where the workload becomes 
> pretty predictable (media players for example) where there really is less 
> variation, and a need for a constant availability of the cpu, so it may 
> actually save a smidge of power to run below the highest freq that the 
> voltage allows rather than running faster and being idle more cycles.

that actually is the example showcase of race-to-idle where you
absolutely want to run at the highest frequency..

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


* Re: Power Management framework proposal
  2007-07-22 22:27     ` Arjan van de Ven
@ 2007-07-23  3:51       ` david
  2007-07-23  4:00         ` Arjan van de Ven
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-23  3:51 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: LKML, linux-pm

On Sun, 22 Jul 2007, Arjan van de Ven wrote:

> On Sun, 2007-07-22 at 11:56 -0700, david@lang.hm wrote:
>>>
>>> I have a concern with this approach though. It seems to assume that
>>> there is one global thing somewhere that sets the system state; in my
>>> experience that is the wrong approach; in fact there is a very definite
>>> evidence that there are many decisions on power that are to be made
>>> local at a high frequency. An example of this is the processor speed;
>>> the ondemand governor does exactly this for the cpus that can switch
>>> speeds fast; it's just impossible to beat such a local, fast decision
>>> with anything on a global scale.
>>
>> the intent was not to have one global call that sets the mode on all
>> devices, but rather have one call for each device/subsystem, just the same
>> call in each case.
>>
>> there's also nothing that says that there can only be one thing setting
>> the mode (although that does mean a fourth call 'report_current_mode()' or
>> similar is needed). and if you choose to have two pieces of software
>> managing the same device things could get 'interesting'.
>>
>> as for the speed that such decisions need to be made.
>>
>> this API is not saying anything about the speed of the decisions.
>> it's also not saying anything about if the decision making is being done
>> by kernelspace or userspace. it's just providing a common way for whatever
>> software is doing the decision making to find out its options and set
>> the modes.
>
>
> but it makes for a layer between the device and the setting of the
> modes.. which sort of would defeat the option of having things truly
> local.
>
> Settings don't mean much in general (in specific cases, maybe), it's the
> requirements that matter. The *intent* matters. Linus forced this into
> cpufreq way back, and while I and perhaps others thought he was just
> being silly, 6 years later it turns out he was absolutely right.

and the more I am seeing of cpufreq the more it looks like what I'm 
proposing, so I'm glad to see that it's a good model :-)

> Maybe something else
> A power policy management framework doesn't need a unified framework (I
> know this for a fact, I'm hoping to release the code within a few
> weeks). A unified interface doesn't even help one single bit: the
> semantics of each part is *extremely* different even if you make it look
> the same; the sameness is only cosmetic.
>
> The consequences of managing a disk vs managing a cpu vs managing the
> LCD brightness via the X server are all very different. The tradeoffs
> you need to make are all very different. The things you want to control
> are all very different. Trying to force a standard interface makes the
> interface for a specific subsystem go away from the *actual* best
> interface for that subsystem, for no gain since the thing that manages
> the policy needs to have different parts for each *anyway*.

Ok, I can see that if things really are different then it's worth doing 
different things to control them.

however, let me go back to my original post on the subject here

right now drivers are supposed to have (forgive me if I get the function 
names wrong)

initialize()
shutdown()
suspend()
suspend_late()
resume()
resume_early()

with suspend taking one of several parameters
PM_EVENT_SUSPEND
PM_EVENT_FREEZE
PM_EVENT_PRETHAW

and the notes say that what is supposed to happen is fairly undefined
because different things can have vastly different capabilities. so to
really control the device you need other, per driver interfaces as well.

this API is driven by the activities that the suspend process is currently
designed to use, and each routine assumes a given existing state; if you
call it when in any other state the results are undefined.

any match to the actual capabilities of the hardware is purely 
coincidental. to have any ability to control the mode of anything at 
runtime requires that the code doing so must have specific knowledge of 
the driver in question.

compare this underdefined mess to the sanity that cpufreq gives you for
controlling different vendors' CPUs with their different capabilities.

with cpufreq you somewhere have a table that goes something along the 
lines of

freq   voltage
2.0GHz  3.0v
1.5GHz  3.0v
1.0GHz  1.5v
500MHz  0.8v

and a function that lets you select the freq you want

if cpufreq were to switch over to the API I'm suggesting the table would
change to

mode capacity power
0      0        0
1    100      100
2     75      100 (or possibly 95, there is some benefit to a slower clock at the same voltage)
3     50       25
4     25        7

so it would be a relatively minor change, probably causing more disruption
than benefit to change in and of itself.

also, other than efficiency arguments, there's nothing that says the modes
must be integers not strings. instead of 0-4 above you could use the
entries from under freq in the first table.

I don't know how cpufreq handles a cpu with logic blocks that can be 
turned off individually but with the type of API I'm talking about you 
could easily have

mode capacity power
0       0        0
1     100      100 (full clock, both blocks on)
2      50       60 (full clock, one block off)
3      50       25 (half clock, half voltage, both blocks on)
4      25       15 (half clock, half voltage, one block off)
5      25        7 (quarter clock, quarter voltage, both blocks on)
6      12        4 (quarter clock, quarter voltage, one block off)
7       0        1 (clock stopped, but chip still energized, faster to wake up from than mode 0)

with the benefits of mode 2 vs 3, 4 vs 5, and 7 vs 0 showing up in the
transition cost matrix where it would show that it's faster to go up to
the high-capacity modes from the first of each set than from the second,
even though there are power saving advantages to the second in each set.

but the idea of adding the cpu control to this API was an afterthought,
the biggest thing was to get something better than the current mess for
other devices, and the fact that cpufreq was initially seen as a waste of
time, but now you are seeing its value could be an argument to do a
similar transition for the power modes of other devices as well.

> Now I realize that the needs for "hard small embedded" are different
> from "PC like", and to be honest, I don't think it's entirely possible
> to unify them; I don't think it's even worthwhile to pursue that (look
> at where those attempts have gotten us so far)... but I suspect even in
> the small embedded space a standard, forced and thus unnatural interface
> isn't what is needed.

I am thinking that a standard way to define the available modes of
operation of a piece of hardware is an advantage for all scales. even if
the generic API doesn't quite cover every possible mode (if you have
enough knobs to twist the combinational explosion of the possible modes
may mean that you don't actually implement all of them) making it
possible for software to discover and set the modes for different devices
without having to know specifics of the drivers would be a good thing.

you mention LCD backlights as an example of something non-standard enough
to create a new interface for. I think it would fit the API I'm proposing
quite nicely

example 1: a laptop screen

mode  capacity power description
0        0        0    off
1      100      100    full brightness
2       70       60    half power to the backlight
3       50       35    quarter power to the backlight
4       30       25    eighth power to the backlight
5        5       10    backlight off.

example 2: a front-panel display on a server (no variable backlight 
control)

mode capacity power description
0       0        0   off
1     100      100   backlight on
2      50       10   backlight off

unless the device had a light sensor with it I wouldn't expect these 
settings to be changed automatically, but this API would make it trivial
for userspace tools to be able to control the brightness of any display 
with no driver-specific code, they would just look for display type 
objects, read the capabilities, and change the modes as the user requests.

currently it would probably take two different software packages to
control the backlights on these two devices, one that understands the 
video display driver (and would probably be pretty specific to that 
driver) and a second one that would understand the front-panel display 
driver.

with the current situation it's practically impossible to create a tool
that allows you to set the power saving modes for everything in a system. 
that tool would need to know the ins and outs of every driver, and keep up 
to date on driver changes.

and the flip side of this is that it's also very hard to get the power 
saving features of a new device handled in an appropriate manner, you not 
only need to write the capabilities into the driver, you have to write a 
utility to control those capabilities, and then try and get similar 
software included in all the system utilities that you would want to use to
control those capabilities.

with the approach I'm proposing creating such a tool would be fairly 
simple, it would walk the sysfs tree to see what hardware is there, read 
what modes it can be set in (including flags that tell you that things 
below it need to be in modes with specific capabilities if appropriate) 
and let you change them.
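
As a rough sketch of what such a tool might look like: the code below assumes each device directory holds a file named power_modes with one "mode capacity power description" line per mode, as in the display examples above. That file name and layout are purely hypothetical; no such sysfs attribute exists today.

```python
import os


def parse_modes(text):
    """Parse 'mode capacity power [description]' lines into tuples."""
    modes = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        # Skip blank lines and anything that is not a numeric mode row
        # (for example a 'mode capacity power description' header).
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        mode_id, capacity, power = (int(f) for f in fields[:3])
        description = fields[3].strip() if len(fields) > 3 else ""
        modes.append((mode_id, capacity, power, description))
    return modes


def find_devices(root, filename="power_modes"):
    """Walk `root` and return {directory: parsed modes} for every
    directory that contains the (hypothetical) mode description file."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            with open(os.path.join(dirpath, filename)) as handle:
                found[dirpath] = parse_modes(handle.read())
    return found
```

Setting a mode would be the matching write side (one write of a mode ID per device), which is left out here since the interface for it is not defined yet.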

if you don't want to make the shift with cpufreq, that's fine. it sounds 
like you are at least 90% of the way there anyway, it's not that big a 
deal, but do you think that there's value in replacing the current ad-hoc 
approach with something more structured (even if it's not this proposal)?

David Lang


* Re: Power Management framework proposal
  2007-07-23  3:51       ` david
@ 2007-07-23  4:00         ` Arjan van de Ven
  2007-07-23  4:09           ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2007-07-23  4:00 UTC (permalink / raw)
  To: david; +Cc: LKML, linux-pm

> example 1: a laptop screen
> 
> mode  capacity power description
> 0        0        0    off
> 1      100      100    full brightness
> 2       70       60    half power to the backlight
> 3       50       35    quarter power to the backlight
> 4       30       25    eighth power to the backlight
> 5        5       10    backlight off.
> 
> example 2: a front-panel display on a server (no variable backlight 
> control)
> 
> mode capacity power description
> 0       0        0   off
> 1     100      100   backlight on
> 2      50       10   backlight off


the problem is: the person who SETS these needs to know what they mean.
And the side that implements these needs to translate them as well...

that's two translations, and information is lost in the abstract number
in the middle that doesn't mean anything

> if you don't want to make the shift with cpufreq, that's fine. it
> sounds 
> like you are at least 90% of the way there anyway, it's not that big
> a 
> deal, but do you think that there's value in replacing the current
> ad-hoc 
> approach with something more structured (even if it's not this
> proposal)?

as someone who wrote (part of) a power policy manager; sorry but you
take away information I need, and in addition the different API's are
absolutely no big deal.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org



* Re: [linux-pm] Power Management framework proposal
  2007-07-23  3:50             ` Arjan van de Ven
@ 2007-07-23  4:04               ` david
  2007-07-23  4:19                 ` Arjan van de Ven
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-23  4:04 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Sun, 22 Jul 2007, Arjan van de Ven wrote:

>> I disagree with you here. for each frequency setting you can say how much
>> power the cpu/system is expected to use (especially as a percentage of the
>> full power mode). creating this value requires you to take two things into
>> account, the voltage you are running things at (by far the biggest
>> effect), and the minor difference that the frequency makes at that voltage
>> (possibly small enough to ignore entirely).
>>
>> the API I proposed has no problem with there being multiple modes that
>> have the same %power but with different %capability numbers.
>
> how do you deal with the "power at idle" vs "power at full load".. you
> need both at each level to pick the best one, as well as relative
> performance etc.

what I was thinking was to use power at full load for the power rating of
each mode.

>> the fact that you want to run at the max frequency for a given voltage is
>
> no I want to run at the max frequency PERIOD. On just about any PC, it's
> more power efficient to go full speed when executing code, and then idle
> for as long as you can. (there are some second order effects that make
> this a bit more complex, but as first order approach it's a sound
> approach). Voltage follows, and that's fine.

this seems to be contradicted by the fact that AMD is listing the ability 
for each core to run at a different clock speed on the new 4-core chips as 
an advantage. if you always want to run at the max frequency PERIOD then 
why bother engineering the ability to do otherwise? (as opposed to just 
shutting down unused cores)

another example is the 80 core demo chip that Intel has been making press
about. it can run at 1Tflop on 25w of power and 2Tflop at 150w of power.
running at max freq for a 1Tflop workload would have you eating ~75w of
power (the numbers may be off, I'm going from memory, but the cost in
power of doubling the speed was _far_ more than double the power
requirements)
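
The ~75w figure follows from a simple duty-cycle estimate, sketched below. It assumes the chip draws nothing while idle (an assumption, real idle power would push the number higher) and uses the from-memory numbers above; the function name is made up.

```python
def average_power(active_watts, peak_rate, needed_rate, idle_watts=0.0):
    """Average power when racing to idle: run at `peak_rate` for just
    long enough to deliver `needed_rate`, and idle the rest of the time."""
    duty = needed_rate / peak_rate  # fraction of time spent running
    return duty * active_watts + (1.0 - duty) * idle_watts


# 1 Tflop workload on the 2 Tflop / 150 W setting: busy half the time.
race_to_idle = average_power(150.0, 2.0, 1.0)

# Same workload on the 1 Tflop / 25 W setting: busy all the time.
run_slow = average_power(25.0, 1.0, 1.0)
```

Under these assumptions race_to_idle comes out at 75 W against 25 W for run_slow.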

>> this strategy should work well on the normal unpredictable workload that
>> most people deal with, but there are some cases where the workload becomes
>> pretty predictable (media players for example) where there really is less
>> variation, and a need for a constant availability of the cpu, so it may
>> actually save a smidge of power to run below the highest freq that the
>> voltage allows rather than running faster and being idle more cycles.
>
> that actually is the example showcase of race-to-idle where you
> absolutely want to run at the highest frequency..

only if the transitions don't cost anything significant, and the 
computation capacity per watt of power is the same at all frequencies. the 
chip performance numbers I've been seeing (which I admit are mostly 
embedded datasheets) indicate that neither of these hold true.

David Lang


* Re: Power Management framework proposal
  2007-07-23  4:00         ` Arjan van de Ven
@ 2007-07-23  4:09           ` david
  2007-07-27 11:46             ` Pavel Machek
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-23  4:09 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: LKML, linux-pm

On Sun, 22 Jul 2007, Arjan van de Ven wrote:

> Date: Sun, 22 Jul 2007 21:00:39 -0700
> From: Arjan van de Ven <arjan@infradead.org>
> To: david@lang.hm
> Cc: LKML <linux-kernel@vger.kernel.org>,
>     linux-pm <linux-pm@lists.linux-foundation.org>
> Subject: Re: Power Management framework proposal
> 
>> example 1: a laptop screen
>>
>> mode  capacity power description
>> 0        0        0    off
>> 1      100      100    full brightness
>> 2       70       60    half power to the backlight
>> 3       50       35    quarter power to the backlight
>> 4       30       25    eighth power to the backlight
>> 5        5       10    backlight off.
>>
>> example 2: a front-panel display on a server (no variable backlight
>> control)
>>
>> mode capacity power description
>> 0       0        0   off
>> 1     100      100   backlight on
>> 2      50       10   backlight off
>
>
> the problem is: the person who SETS these needs to know what they mean.

that's what the description is for. this info can be provided by the 
driver as part of the list_modes() function.
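
As a rough illustration of how a policy tool could consume such a list, here is a minimal sketch. Everything in it is an assumption made for illustration: the PowerMode type, the return shape of list_modes(), and the cheapest_mode() helper are hypothetical and not part of any existing kernel or driver interface. The table is the laptop screen from example 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    mode_id: int
    capacity: int      # percent of full capability
    power: int         # percent of full-power draw
    description: str   # human-readable text supplied by the driver


def list_modes():
    """Hypothetical driver report for the laptop screen in example 1."""
    return [
        PowerMode(0, 0, 0, "off"),
        PowerMode(1, 100, 100, "full brightness"),
        PowerMode(2, 70, 60, "half power to the backlight"),
        PowerMode(3, 50, 35, "quarter power to the backlight"),
        PowerMode(4, 30, 25, "eighth power to the backlight"),
        PowerMode(5, 5, 10, "backlight off"),
    ]


def cheapest_mode(modes, min_capacity):
    """Return the lowest-power mode whose capacity meets the floor."""
    candidates = [m for m in modes if m.capacity >= min_capacity]
    return min(candidates, key=lambda m: m.power)


choice = cheapest_mode(list_modes(), min_capacity=50)
print(choice.mode_id, choice.description)
```

The selection only looks at the two percentages; the description is carried along so that whoever sets the policy can see what each mode means.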

> And the side that implements these needs to translate them as well...
>
> that's two translations, and information is lost in the abstract number
> in the middle that doesn't mean anything

with the current implementations you instead need to know what function to 
call and what the meaning of that function is. that's not documented in 
any system discoverable way, you have to read the driver documentation or 
code to find it.

>> if you don't want to make the shift with cpufreq, that's fine. it sounds
>> like you are at least 90% of the way there anyway, it's not that big a
>> deal, but do you think that there's value in replacing the current ad-hoc
>> approach with something more structured (even if it's not this proposal)?
>
> as someone who wrote (part of) a power policy manager; sorry but you
> take away information I need, and in addition the different API's are
> absolutely no big deal.

assuming that nobody else chimes in to disagree with you I'll accept your 
judgement and drop the issue.

David Lang


* Re: [linux-pm] Power Management framework proposal
  2007-07-23  4:04               ` david
@ 2007-07-23  4:19                 ` Arjan van de Ven
  2007-07-23  5:25                   ` david
  2007-07-23  8:56                   ` Ondrej Zajicek
  0 siblings, 2 replies; 40+ messages in thread
From: Arjan van de Ven @ 2007-07-23  4:19 UTC (permalink / raw)
  To: david
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Sun, 2007-07-22 at 21:04 -0700, david@lang.hm wrote:
> >> the fact that you want to run at the max frequency for a given voltage is
> >
> > no I want to run at the max frequency PERIOD. On just about any PC, it's
> > more power efficient to go full speed when executing code, and then idle
> > for as long as you can. (there are some second order effects that make
> > this a bit more complex, but as first order approach it's a sound
> > approach). Voltage follows, and that's fine.
> 
> this seems to be contradicted by the fact that AMD is listing the ability 
> for each core to run at a different clock speed on the new 4-core chips as 
> an advantage.

that's a marketing thing mostly.. they all still run at the same voltage
anyway.

>  if you always want to run at the max frequency PERIOD then 
> why bother engineering the ability to do otherwise? (as opposed to just 
> shutting down unused cores)

multicore changes the rules a little but not all that much. (the idle
power is higher if not all cores are idle at the same time. Yet... each
core individually trying to be idle as quickly as possible is the best
way to get to the highest "all cores idle" time, unless there is some
really special/weird synchronization)

> >> this strategy should work well on the normal unpredictable workload that
> >> most people deal with, but there are some cases where the workload becomes
> >> pretty predictable (media players for example) where there really is less
> >> variation, and a need for a constant availability of the cpu, so it may
> >> actually save a smidge of power to run below the highest freq that the
> >> voltage allows rather than running faster and being idle more cycles.
> >
> > that actually is the example showcase of race-to-idle where you
> > absolutely want to run at the highest frequency..
> 
> only if the transitions don't cost anything significant, 

these are second order effects though. On a pc, the transition costs are
quite low (as I said, single or low double digit microseconds).
They are not zero, and that is why you see things like ondemand ramp up
only after a little time, as a guesstimate to make sure it's not just a
really short lived code execution.

> and the 
> computation capacity per watt of power is the same at all frequencies. the 
> chip performance numbers I've been seeing (which I admit are mostly 
> embedded datasheets) indicate that neither of these hold true.

let me give you a real world example then, and the numbers I'm using are
ballpark the same as you'll find in a (mobile) core 2 duo datasheet, I
just rounded them a little so that the math works out nice.

power at full speed: 34W
power at half speed: 24W
power at idle: 1W

assume media playback, and a dumb one, that takes half a second to
decode a second of media. (again to make the math simple)

at half speed: Energy for a second is 0.5 * 24 + 0.5 * 1 = 12.5 J
at full speed: Energy for a second is 0.25 * 34 + 0.75 * 1 = 9.25 J

this works for all systems where the idle power is lower than the
power you save by dropping speed... and that is almost all of them in
the PC world.
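
As a quick check of the arithmetic above, here is a minimal sketch. The function name and its parameters are illustrative assumptions, not an existing interface; it also assumes the decode takes 0.5 s at half speed and so 0.25 s at full speed, which is what the two formulas imply.

```python
def energy_per_period(busy_seconds, busy_watts, idle_watts, period_seconds=1.0):
    """Energy in joules for one period: busy for busy_seconds, idle for the rest."""
    idle_seconds = period_seconds - busy_seconds
    return busy_seconds * busy_watts + idle_seconds * idle_watts


# Rounded figures from the example: 34 W full speed, 24 W half speed, 1 W idle.
half_speed = energy_per_period(0.5, 24.0, 1.0)
full_speed = energy_per_period(0.25, 34.0, 1.0)
print(half_speed, full_speed)
```

With these rounded figures, running at full speed and then idling costs 9.25 J per second of media against 12.5 J at half speed. Transition costs are ignored here.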

now you can argue that 0.5 seconds is a really really long time, and
you'd be right. so for really really short stints (say a timer
interrupt) you don't want to change the voltage at all (nor would you
want to change the plls to change frequency for that matter). But once
you start changing those, you might as well go full speed.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


* Re: [linux-pm] Power Management framework proposal
  2007-07-23  4:19                 ` Arjan van de Ven
@ 2007-07-23  5:25                   ` david
  2007-07-23 14:12                     ` Arjan van de Ven
  2007-07-23  8:56                   ` Ondrej Zajicek
  1 sibling, 1 reply; 40+ messages in thread
From: david @ 2007-07-23  5:25 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Sun, 22 Jul 2007, Arjan van de Ven wrote:

> On Sun, 2007-07-22 at 21:04 -0700, david@lang.hm wrote:
>
>>>> this strategy should work well on the normal unpredictable workload that
>>>> most people deal with, but there are some cases where the workload becomes
>>>> pretty predictable (media players for example) where there really is less
>>>> variation, and a need for a constant availability of the cpu, so it may
>>>> actually save a smidge of power to run below the highest freq that the
>>>> voltage allows rather than running faster and being idle more cycles.
>>>
>>> that actually is the example showcase of race-to-idle where you
>>> absolutely want to run at the highest frequency..
>>
>> only if the transitions don't cost anything significant,
>
> these are second order effects though. On a pc, the transition costs are
> quite low (as I said, single or low double digit microseconds).

including pausing all drivers before the transition and unpausing them 
afterwards?

>> and the
>> computation capacity per watt of power is the same at all frequencies. the
>> chip performance numbers I've been seeing (which I admit are mostly
>> embedded datasheets) indicate that neither of these hold true.
>
> let me give you a real world example then, and the numbers I'm using are
> ballpark the same as you'll find in a (mobile) core 2 duo datasheet, I
> just rounded them a little so that the math works out nice.
>
> power at full speed: 34W
> power at half speed: 24W
> power at idle: 1W

are these numbers for the CPU itself or for a larger chunk? I could 
easily see these numbers for the motherboard (including CPU and RAM), but it 
would surprise me if these numbers are for the CPU itself. I'm used to 
seeing datasheets that have a much more linear voltage/freq (and therefore 
a quadratic voltage/power) curve. in some cases the voltage requirements 
drop faster than the frequency.

> assume media playback, and a dumb one, that takes half a second to
> decode a second of media. (again to make the math simple)
>
> at half speed: Energy for a second is 0.5 * 24 + 0.5 * 1 = 12.5 J
> at full speed: Energy for a second is 0.25 * 34 + 0.75 * 1 = 9.25 J
>
> this works for all systems where the idle power is lower than the
> power you save by dropping speed... and that is almost all of them in
> the PC world.

if you can idle the system as a whole I agree with you fully. most PC 
hardware (including the mobile stuff) doesn't change its power 
consumption much with load. at Usenix there was a presentation (I don't 
remember if it was by Amazon or Google) about this subject, showing that 
current PC hardware only goes down to 50% power when idle (short of 
switching power modes) and that they and other big companies were pushing 
vendors to improve their hardware, aiming to get the idle power down to 
10% (again without suspending anything). so there's some chance that this 
will change before too long.

> now you can argue that 0.5 seconds is a really really long time, and
> you'd be right. so for really really short stints (say a timer
> interrupt) you don't want to change the voltage at all (nor would you
> want to change the plls to change frequency for that matter). But once
> you start changing those, you might as well go full speed.

this assumes that you can cache 1 second of video. if you have more 
real-time requirements you have a much harder time (say video conferencing 
where you don't get the frame until just before you need to display it)

David Lang



* Re: [linux-pm] Power Management framework proposal
  2007-07-23  4:19                 ` Arjan van de Ven
  2007-07-23  5:25                   ` david
@ 2007-07-23  8:56                   ` Ondrej Zajicek
  2007-07-23 17:33                     ` david
  2007-07-27 12:04                     ` Pavel Machek
  1 sibling, 2 replies; 40+ messages in thread
From: Ondrej Zajicek @ 2007-07-23  8:56 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: david, Igor Stoppa, LKML, linux-pm

On Sun, Jul 22, 2007 at 09:19:17PM -0700, Arjan van de Ven wrote:
> let me give you a real world example then, and the numbers I'm using are
> ballpark the same as you'll find in a (mobile) core 2 duo datasheet, I
> just rounded them a little so that the math works out nice.
> 
> power at full speed: 34W
> power at half speed: 24W
> power at idle: 1W

I have usually seen different numbers, for example:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/30430.pdf

Although this paper speaks about thermal design power instead of power
consumption, I suppose that it should be roughly equal.

For example Athlon 64 3700 (ADA3700AEP5AR):

2.4 GHz, 1.5 V -> 89 W
2.2 GHz, 1.4 V -> 72 W
2.0 GHz, 1.3 V -> 53 W
1.8 GHz, 1.2 V -> 39 W
1.0 GHz, 1.1 V -> 22 W
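
One way to read this table is as watts per GHz, a rough proxy for energy per unit of work. The sketch below assumes that work scales linearly with clock speed and treats TDP as if it were consumption, as above; the names are illustrative only.

```python
# (GHz, volts, TDP in watts) for the Athlon 64 3700 states listed above.
P_STATES = [
    (2.4, 1.5, 89),
    (2.2, 1.4, 72),
    (2.0, 1.3, 53),
    (1.8, 1.2, 39),
    (1.0, 1.1, 22),
]


def watts_per_ghz(states):
    """Return (GHz, W/GHz) pairs, rounded to one decimal place."""
    return [(ghz, round(watts / ghz, 1)) for ghz, _volts, watts in states]


for ghz, ratio in watts_per_ghz(P_STATES):
    print(f"{ghz:.1f} GHz: {ratio} W/GHz")
```

On these figures the ratio falls from about 37 W/GHz at 2.4 GHz to about 22 W/GHz at the two lowest states. This says nothing about idle time, so it is only one side of the comparison.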


Even my measurement on PC (Athlon X2, VIA K8T890) of complete PC power
consumption shows that it is more efficient to be busy for 2 time units
on 1 GHz than be busy for 1 time unit and be idle for 1 time unit
on 2 GHz.

1 GHz:
both cores idle:	48 W
one core busy:		57 W
two cores busy:		66 W

2 GHz:
both cores idle:	54 W
one core busy:		78 W
two cores busy:		95 W
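
The comparison can be written out as a small sketch. It assumes that a job taking one time unit at 2 GHz takes two at 1 GHz and that switching costs nothing; the names are illustrative, and the result is in watts times time units.

```python
# Whole-system readings in watts, keyed by clock in GHz, from the lists above.
MEASURED = {
    1: {"idle": 48, "one_busy": 57, "two_busy": 66},
    2: {"idle": 54, "one_busy": 78, "two_busy": 95},
}


def slow_and_steady(load):
    """Busy for two time units at 1 GHz."""
    return 2 * MEASURED[1][load]


def race_to_idle(load):
    """Busy for one time unit at 2 GHz, then idle for one unit at 2 GHz."""
    return MEASURED[2][load] + MEASURED[2]["idle"]


for load in ("one_busy", "two_busy"):
    print(load, slow_and_steady(load), race_to_idle(load))
```

With one core busy this gives 114 against 132, and with two cores busy 132 against 149, so on this machine the 1 GHz case comes out lower both times.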

-- 
Elen sila lumenn' omentielvo

Ondrej 'SanTiago' Zajicek (email: santiago@crfreenet.org, jabber: santiago@njs.netlab.cz)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."


* Re: [linux-pm] Power Management framework proposal
  2007-07-22 21:21       ` david
  2007-07-22 23:09         ` Arjan van de Ven
@ 2007-07-23 10:48         ` Igor Stoppa
  2007-07-23 18:14           ` david
  1 sibling, 1 reply; 40+ messages in thread
From: Igor Stoppa @ 2007-07-23 10:48 UTC (permalink / raw)
  To: ext david@lang.hm
  Cc: ext linux-pm-bounces@lists.linux-foundation.org, LKML, linux-pm

On Sun, 2007-07-22 at 14:21 -0700, ext david@lang.hm wrote:

[snip]

> this is another one. I'd be happy to get pointers to prior ones to learn 
> from.

https://lists.linux-foundation.org/pipermail/linux-pm/2007-March/011204.html

This is probably one of the latest. Previously there was some clash
between powerop and oppoints that led to people running away from too
much confusion.

> > Unfortunately, while it's true that there are significant similarities,
> > there are also notable differences; as far as i know the USB subsystem
> > is the one that gets closer to what we have in the embedded arena, since
> > it can have complex cases of parent-child powering and wakeup.
> 
> this API is not trying to represent the parent-child hierarchy. as far as 
> I know that's documented in sysfs (or is supposed to be). this is just an 
> attempt to make it so that as you are going through the hierarchy you 
> don't have to use vastly different API's to control the different 
> functions.

You are going to end up with parent child relationships, or
user-consumer.
Devices don't exist in the void, but are interconnected.

> I suspect that most (if not all) of the previous One Solutions have tried 
> to completely handle all the details of their original case, and then 
> branch out to the other cases.
> 
> this attempt is working from the other direction. the user of this API 
> doesn't care how something is done, it just wants to know what's possible 
> and how to tell the system to switch modes.

True, but you are ending up in the same situation: too much abstraction
makes the governing system clumsy and inefficient.

> other then just me searching through the lists, do you have a pointer to 
> some of the differences between the different types that are seen as being 
> so large that they can't be unified?

I'll be more detailed in further replies to following emails from this
same thread that have already piled up.

> >> while I was describing the issues to my roomates over dinner I realized
> >> that the same type of functions are needed for the CPU clocks.
> >>
> >> if you have an accepted framework in place there that can do what I
> >> described, please consider extending it to cover other types of devices
> >> and drivers.
> >
> > That is not part of the fw: the fw simply expresses parent-child clock
> > distribution and keeps usecounts so that unused clocks are automatically
> > gated.
> >
> > The actual clock tree description is platform/arch/board specific and
> > doesn't affect the framework. You can just roll your own version for x86
> > by providing a description of the methods used to switch on/off every
> > individual clock on your board.
> >
> > So what you are asking for is that somebody writes an x86 version of the
> > clock fw.
> 
> this is more than just setting the clocks on everything (although setting 
> clocks seems like it fits well into the model) because some power modes 
> are not easily represented just as clocks.

The very idea of a power mode is something that can maybe fit some
simple peripherals (simple as in not finely controllable in terms of
what is on and what is off), but it certainly doesn't fit a modern
SoC nicely (see OMAP), since at any given point in time you don't really
know what the power consumption is, because many resources are
automatically gated by HW on an as-needed basis.
And you don't want to switch this feature off.

> > As for latencies, well, only few clocks really have significant impact.
> > Most notably the main system oscillator. Everything else has 0 latency
> > since it ends up in opening/closing a clock gate.
> >
> > Powering device on/off will certainly introduce more latency, but either
> > the powering is supported by the hw, to make it quick or it has to go
> > through most, if not all of the usual initialisation sequence; in that
> > case it probably makes sense to avoid controlling it from kernelspace,
> > since it will be slow and won't require decisions made with us
> > precision.
> 
> and many devices support both a quick almost-off mode and a slow 
> almost-off mode (as well as a completely off mode), with the slow mode 
> eating less power, but taking longer to wake up from. that's the reason 
> for providing the matrix to let the program making the decision decide if 
> it's worth the time delays to get the power savings
> 
> as I note in another message, this API isn't intended to be strictly 
> kernelspace or strictly userspace. for the ondemand speed governor you are 
> changing the settings quickly and so probably want to do so in the kernel, 
> however some people may be satisfied with slower controls and so could 
> have them in userspace (an extreme example of this would be turning off 
> wireless cards that aren't in use to save power and improve security)

So you are going to have 2 APIs: one for kernelspace (evolution of
CPUfreq) and one for userspace, which seems more and more likely to be
an extension to HAL.
Are you informed on OHM and Intel Moblin?
http://ohm.freedesktop.org/wiki/
http://www.moblin.org/index.html
> 
> >> I think you are passing too much
> >> info up the chain to the part making the decision (that part doesn't need
> >> to know the details of the voltage/freq choices, the %power/%capability
> >> numbers I suggested are in many ways more what they are making decisions
> >> on anyway)
> >
> > I don't think you have got it right: the only info being passed is the
> > standard cpufreq list of frequencies; everything else is part of the
> > cpufreq driver.
> 
> to make the decisions the software making the decision needs to know how 
> much power would be used at each freq setting.

And that's the wrong assumption: you don't know for real what the power
consumption is going to be; on decent embedded systems the clock gating
takes care of minimizing power consumption, while the frequency
throttling is intended to provide enough performance.

If you happen to know that for x86 it's because they suck in terms of
idle power consumption, but certainly that's not a good reason to
penalise those that have better design.

And i'm sure things are going to change rapidly now that intel is
stepping into the lightweight mobile arena.

> >> in the slideshow you list in the sequence of changing the cpu speed to pre
> >> and post notify drivers. what exactly are the drivers expected to do with
> >> the notification? are you asking them to pause and then re-initialize for
> >> the new power level?
> >
> > It's just a  notification. The drivers are supposed to know how to deal
> > with it.
> > In OMAP2 the major concern is that the external memory cannot be
> > accessed since it is on a bus that is being re-clocked:
> > - the dma controllers must be paused
> > - the other cores (dsp) must not access sdram
> > - the onenand driver needs to adjust its timing parameters
> 
> in my proposal this would require one or more 'pause' modes (more than 
> one if you need to pause at different power settings for some reason) for 
> the first notification, and then you would set them to the mode you want 
> them in at the second notification point (which is probably going to be 
> the mode they were in before)

pauses are to be avoided as much as possible, or at least made very
short. The change has to be seamless; stopping the system is not a good
start.

> > [snip]
> >
> >>> To make any proposal that has some chance of being accepted, you have to
> >>> compare it against the existing solution, explaining:
> >>>
> >>> -what it is bringing in terms of new functionalities
> >>> -how it is different
> >>
> >> it unifies all power/performance trade-offs (including power on/off) into
> >> a single API, but decouples that API from the implementation details of
> >> exactly what the technical details of the different modes are and how the
> >> changes are made.
> >
> > It always looks great at this level of abstraction, but then usually
> > what is discovered later is that _a lot_ of extra complexity is
> > introduced, in order to cover every case on every platform that is
> > intended to be supported.
> 
> which is why I posted this for comments.
> 
> what are the cases that require extra info? can that extra info be as 
> simple as a set of flags for the mode (or possibly for the transition 
> matrix).
> 
> for your clock example you need a flag that says 'this requires everything 
> connected to this be paused'
> 
> for suspend and other low power modes you need to be able to say 'contents of 
> things below this point will be lost when you go into this mode' so that 
> the decision making software knows that it needs to save the contents of 
> memory before switching to a mode that stops the dram refresh. I don't 
> have any idea at the moment for how to provide a common interface for 
> actually saving or restoring the contents, that is outside the scope of 
> this API

That's wrong: the driver can keep to itself the information about
saving/restoring; if it advertises the time it will take, that's
enough.

> the ACPI people will need a flag for 'this device can generate wakeup 
> signals in this mode'
> 
> but this API would just provide this info to the decision making code, 
> that code would have to actually enforce the limits

Again, you are going into a field that belongs to hw abstraction and
already has a standard tool to deal with it - HAL

> >> for some subsystems this would be little more than renaming existing
> >> functions, for others it would be converting several independent functions
> >> into one, discoverable api
> >
> > if you check cpufreq, you will find out that it already covers the
> > multiple cores case (but nothing prevents from using the same logic on
> > something that is not really a cpu) and also has some simple concept of
> > latency for frequency transition, concept that could be enhanced to
> > handle latencies that are depending on the current operating point and
> > target operating point.
> 
> does it provide a full matrix of latencies, or just mode 1->mode2=x, 
> mode2->mode3=y so mode1->mode3=x+y?

IIRC it's just 1 value

> >>> -why the current implementation cannot simply be enhanced
> >>
> >> which current implementation should be enhanced? and with the massive
> >> broadening of functionality should it retain the same name, or should it
> >> get renamed to something more generic?
> >
> > cpufreq could be renamed to anything that makes sense, but i see _no_
> > massive broadening of functionality.
> 
> what I'm talking about would provide an API to devices that you are 
> ignoring because they should be managed from userspace.

again, HAL / OHM / Moblin

> >> the cpufreq implementation is very close to what I'm proposing, it would
> >> need to get broadened to cover other devices (like disk drives, wireless
> >> cards, etc), is this really the right thing to do or should the more
> >> generic API go in for external use and then the existing cpufreq be called
> >> from the set_mode() call?
> >
> > No, that doesn't make sense, as general approach.
> > You want to manage from kernel only those parts of the system where the
> > latency is so low that userspace wouldn't be able to keep up.
> >
> > Your examples (wireless, disk drive) can be easily controlled from
> > userspace, with a timeout.
> 
> absolutely, and they should be (at least most of the time). this was not 
> intended as a kernelspace only api. it is intended to be available to both 
> kernelspace and userspace.
> 
> > In both cases there are significant delays (change of rotation speed /
> > sync with the access point).
> 
> correct, and these delays should be reflected in the transition cost 
> matrix
> 
> > All this is hand waving unless it is backed up by numbers.
> > Real cases are required in order to establish a list of priorities for
> > latency/power consumption.
> 
> this isn't attempting to establish a list of priorities, simply to give the 
> software that is trying to establish such a list the info to make its 
> decisions, and the interface to use to issue the resulting instructions.

What i'm saying is that sw is implemented to fulfill certain needs. I'd
rather see a detailed description of the need and based on that debate
on the actual API / implementation.

-- 
Cheers, Igor

Igor Stoppa <igor.stoppa@nokia.com>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


* Re: [linux-pm] Power Management framework proposal
  2007-07-23  5:25                   ` david
@ 2007-07-23 14:12                     ` Arjan van de Ven
  2007-07-23 18:19                       ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Arjan van de Ven @ 2007-07-23 14:12 UTC (permalink / raw)
  To: david
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Sun, 2007-07-22 at 22:25 -0700, david@lang.hm wrote:
> >>
> >> only if the transitions don't cost anything significant,
> >
> > these are second order effects though. On a pc, the transition costs are
> > quite low (as I said, single or low double digit microseconds).
> 
> including pausing all drivers before the transition and unpausing them 
> afterwards?

on a PC you don't need to do that.

> 
> >> and the
> >> computation capacity per watt of power is the same at all frequencies. the
> >> chip performance numbers I've been seeing (which I admit are mostly
> >> embedded datasheets) indicate that neither of these hold true.
> >
> > let me give you a real world example then, and the numbers I'm using are
> > ballpark the same as you'll find in a (mobile) core 2 duo datasheet, I
> > just rounded them a little so that the math works out nice.
> >
> > power at full speed: 34W
> > power at half speed: 24W
> > power at idle: 1W
> 
> are these numbers for the CPU itself or for the a larger chunk?

the cpu at full load.

> > this works for all systems where the idle power is lower than the
> > power you save by dropping speed... and that is almost all of them in
> > the PC world.
> 
> if you can idle the system as a whole I agree with you fully. most PC 
> hardware (including the mobile stuff) doesn't change its power 
> consumption much with load.

even if the rest of the PC is unchanging (which it's not), it is just an
offset to both sides of the equation, and the same on both sides at
that.

>  at Usenix there was a presentation (I don't 
> remember if it was by Amazon or Google) about this subject, showing that 
> current PC hardware only goes down to 50% power when idle (short of 
> switching power modes) and that they and other big companies were pushing 
> vendors to improve their hardware, aiming to get the idle power down to 
> 10% (again without suspending anything). so there's some chance that this 
> will change before too long.

on servers and such, there is a huge offset, sure, but still the effect
is there. And it really isn't 50%.

> 
> > now you can argue that 0.5 seconds is a really really long time, and
> > you'd be right. so for really really short stints (say a timer
> > interrupt) you don't want to change the voltage at all (nor would you
> > want to change the plls to change frequency for that matter). But once
> > you start changing those, you might as well go full speed.
> 
> this assumes that you can cache 1 second of video, if you have more 
> real-time requirements you have a much harder time (say video conferencing 
> where you don't get the frame until just before you need to display it)

the same basic math holds for just 1 frame at a fixed rate. At some
point transition costs will get you (and that's where things like
ondemand delayed speedup will save us); but to get back to your
interface, the interface doesn't nearly give the info needed to make
these decisions...

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org



* Re: [linux-pm] Power Management framework proposal
  2007-07-23  8:56                   ` Ondrej Zajicek
@ 2007-07-23 17:33                     ` david
  2007-07-27 12:04                     ` Pavel Machek
  1 sibling, 0 replies; 40+ messages in thread
From: david @ 2007-07-23 17:33 UTC (permalink / raw)
  To: Ondrej Zajicek; +Cc: Arjan van de Ven, Igor Stoppa, LKML, linux-pm

On Mon, 23 Jul 2007, Ondrej Zajicek wrote:

> On Sun, Jul 22, 2007 at 09:19:17PM -0700, Arjan van de Ven wrote:
>> let me give you a real world example then, and the numbers I'm using are
>> ballpark the same as you'll find in a (mobile) core 2 duo datasheet, I
>> just rounded them a little so that the math works out nice.
>>
>> power at full speed: 34W
>> power at half speed: 24W
>> power at idle: 1W
>
> I have usually seen different numbers, for example:
>
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/30430.pdf
>
> Although this paper speaks about thermal design power instead of power
> consumption, i suppose that it should be roughly equal.
>
> For example Athlon 64 3700 (ADA3700AEP5AR):
>
> 2.4 GHz, 1.5 V -> 89 W
> 2.2 GHz, 1.4 V -> 72 W
> 2.0 GHz, 1.3 V -> 53 W
> 1.8 GHz, 1.2 V -> 39 W
> 1.0 GHz, 1.1 V -> 22 W
>
>
> Even my measurement on PC (Athlon X2, VIA K8T890) of complete PC power
> consumption shows that it is more efficient to be busy for 2 time units
> on 1 GHz than be busy for 1 time unit and be idle for 1 time unit
> on 2 GHz.
>
> 1 GHz:
> both cores idle:	48 W
> one core busy:		57 W
> two cores busy:		66 W
>
> 2 GHz:
> both cores idle:	54 W
> one core busy:		78 W
> two cores busy:		95 W

what Arjan is saying is one time unit at 2GHz with both cores busy, one 
time unit at 1GHz with both cores idle (this would be 143W/two time units, 
vs 132W/two time units for staying busy at 1GHz). still a win for running 
at 1GHz, but a smaller one

or better still, one time unit at 2GHz with both cores busy, one time unit 
in sleep mode, in this case if the sleep mode is any good at all it wins.
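
A minimal sketch of that comparison, using the measured two-cores-busy figures quoted above. The constant and function names are illustrative, transition costs are ignored, and the sleep-mode power is left as a parameter because no figure for it was given.

```python
BUSY_1GHZ = 66  # W, both cores busy at 1 GHz
BUSY_2GHZ = 95  # W, both cores busy at 2 GHz
IDLE_1GHZ = 48  # W, both cores idle at 1 GHz

# Two time units busy at 1 GHz.
steady = 2 * BUSY_1GHZ

# One unit busy at 2 GHz, then one unit idle at 1 GHz.
race_then_slow_idle = BUSY_2GHZ + IDLE_1GHZ


def race_then_sleep(sleep_watts):
    """One unit busy at 2 GHz, then one unit in a sleep mode."""
    return BUSY_2GHZ + sleep_watts


# Sleeping beats staying at 1 GHz when the sleep mode draws less than this.
break_even_sleep_watts = steady - BUSY_2GHZ

print(steady, race_then_slow_idle, break_even_sleep_watts)
```

On these numbers a sleep mode only has to bring the whole system under 37 W for the run-fast-then-sleep case to come out ahead.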

David Lang


* Re: [linux-pm] Power Management framework proposal
  2007-07-23 10:48         ` Igor Stoppa
@ 2007-07-23 18:14           ` david
  2007-07-24  8:43             ` Jerome Glisse
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-23 18:14 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: ext linux-pm-bounces@lists.linux-foundation.org, LKML, linux-pm

On Mon, 23 Jul 2007, Igor Stoppa wrote:

> On Sun, 2007-07-22 at 14:21 -0700, ext david@lang.hm wrote:
>
> [snip]
>
>> this is another one. I'd be happy to get pointers to prior ones to learn
>> from.
>
> https://lists.linux-foundation.org/pipermail/linux-pm/2007-March/011204.html
>
> This is probably one of the latest. Previously there was some clash
> between powerop and oppoints that led to people running away from too
> much confusion.

thanks, I'll read through that

>>> Unfortunately, while it's true that there are significant similarities,
>>> there are also notable differences; as far as i know the USB subsystem
>>> is the one that gets closer to what we have in the embedded arena, since
>>> it can have complex cases of parent-child powering and wakeup.
>>
>> this API is not trying to represent the parent-child hierarchy. as far as
>> I know that's documented in sysfs (or is supposed to be). this is just an
>> attempt to make it so that as you are going through the hierarchy you
>> don't have to use vastly different API's to control the different
>> functions.
>
> You are going to end up with parent child relationships, or
> user-consumer.
> Devices don't exist in the void, but are interconnected.

correct, but the interconnections are already documented via sysfs aren't 
they? if they are why should this new API need to worry about that?

>> I suspect that most (if not all) of the previous One Solutions have tried
>> to completely handle all the details of their original case, and then
>> branch out to the other cases.
>>
>> this attempt is working from the other direction. the user of this API
>> doesn't care how something is done, it just wants to know what's possible
>> and how to tell the system to switch modes.
>
> True, but you are ending up in the same situation: too much abstraction
> makes the governing system clumsy and inefficient.

I see it as going the opposite direction. today there is no abstraction, 
you need to know all the details of everything. I proposed an abstraction 
to avoid needing to know all the details. this may end up being just as bad, 
but it's not the same situation :-)

>> other than just me searching through the lists, do you have a pointer to
>> some of the differences between the different types that are seen as being
>> so large that they can't be unified?
>
> I'll be more detailed in further replies to following emails from this
> same thread that have already piled up.

thanks, even though I'm dropping the proposal it's always useful to learn 
more.

>>>> while I was describing the issues to my roommates over dinner I realized
>>>> that the same type of functions are needed for the CPU clocks.
>>>>
>>>> if you have an accepted framework in place there that can do what I
>>>> described, please consider extending it to cover other types of devices
>>>> and drivers.
>>>
>>> That is not part of the fw: the fw simply expresses parent-child clock
>>> distribution and keeps usecounts so that unused clocks are automatically
>>> gated.
>>>
>>> The actual clock tree description is platform/arch/board specific and
>>> doesn't affect the framework. You can just roll your own version for x86
>>> by providing a description of the methods used to switch on/off every
>>> individual clock on your board.
>>>
>>> So what you are asking for is that somebody writes an x86 version of the
>>> clock fw.
>>
>> this is more than just setting the clocks on everything (although setting
>> clocks seems like it fits well into the model) because some power modes
>> are not easily represented just as clocks.
>
> The very same idea of power mode is something that can maybe fit some
> simple peripherals (simple as not fine grained controllable in terms of
> what is on and what is off), but certainly it doesn't fit nicely modern
> SoC (see OMAP) since at a certain point of time you don't really know
> what is the power consumption because many resources are automatically
> gated by HW on an on-need basis.
> And you don't want to switch this feature off.

it seems to me that you can either get some figure of power consumption 
for a mode (even if it's just relative power consumption compared to other 
modes) or you have no way of planning what to do because you have no clue 
what the results of your actions are.

>>> As for latencies, well, only few clocks really have significant impact.
>>> Most notably the main system oscillator. Everything else has 0 latency
>>> since it ends up in opening/closing a clock gate.
>>>
>>> Powering device on/off will certainly introduce more latency, but either
>>> the powering is supported by the hw, to make it quick or it has to go
>>> through most, if not all of the usual initialisation sequence; in that
>>> case it probably makes sense to avoid controlling it from kernelspace,
>>> since it will be slow and won't require decisions made with microsecond
>>> precision.
>>
>> and many devices support both a quick almost-off mode and a slow
>> almost-off mode (as well as a completely off mode), with the slow mode
>> eating less power, but taking longer to wake up from. that's the reason
>> for providing the matrix to let the program making the decision decide if
>> it's worth the time delays to get the power savings
>>
>> as I note in another message, this API isn't intended to be strictly
>> kernelspace or strictly userspace. for the ondemand speed governor you are
>> changing the settings quickly and so probably want to do so in the kernel,
>> however some people may be satisfied with slower controls and so could
>> have them in userspace (an extreme example of this would be turning off
>> wireless cards that aren't in use to save power and improve security)
>
> So you are going to have 2 APIs: one for kernelspace (evolution of
> CPUfreq) and one for userspace, which seems more and more likely to be
> an extension to HAL.
> Are you informed on OHM and Intel Moblin ?
> http://ohm.freedesktop.org/wiki/
> http://www.moblin.org/index.html

no, I'm proposing one API concept for managing modes of individual 
devices. the exact implementation of generating these calls will vary 
between userspace and kernelspace, but this is true of everything that 
provides a sysfs interface to userspace, and this is no different.

>>>> I think you are passing too much
>>>> info up the chain to the part making the decision (that part doesn't need
>>>> to know the details of the voltage/freq choices, the %power/%capability
>>>> numbers I suggested are in many ways more what they are making decisions
>>>> on anyway)
>>>
>>> I don't think you have got it right: the only info being passed is the
>>> standard cpufreq list of frequencies; everything else is part of the
>>> cpufreq driver.
>>
>> to make the decisions the software making the decision needs to know how
>> much power would be used at each freq setting.
>
> And that's the wrong assumption: you don't know for real what the power
> consumption is going to be; on decent embedded systems the clock gating
> takes care of minimizing power consumption, while the frequency
> throttling is intended to provide enough performance.
>
> If you happen to know that for x86 it's because they suck in terms of
> idle power consumption, but certainly that's not a good reason to
> penalise those that have better design.
>
> And i'm sure things are going to change rapidly now that intel is
> stepping into the lightweight mobile arena.

again, if you don't have any idea what the power consumption is for 
various settings then you have no basis for deciding what setting to use. 
somewhere something must have an idea, even if it is only in relation to 
other settings.

>>>> in the slideshow you list in the sequence of changing the cpu speed to pre
>>>> and post notify drivers. what exactly are the drivers expected to do with
>>>> the notification? are you asking them to pause and then re-initialize for
>>>> the new power level?
>>>
>>> It's just a  notification. The drivers are supposed to know how to deal
>>> with it.
>>> In OMAP2 the major concern is that the external memory cannot be
>>> accessed since it is on a bus that is being re-clocked:
>>> - the dma controllers must be paused
>>> - the other cores (dsp) must not access sdram
>>> - the onenand driver needs to adjust its timing parameters
>>
>> in my proposal this would require one or more 'pause' modes (more than
>> one if you need to pause at different power settings for some reason) for
>> the first notification, and then you would set them to the mode you want
>> them in at the second notification point (which is probably going to be
>> the mode they were in before)
>
> pauses are to be avoided as much as possible, or at least made very
> short. The change has to be seamless; stopping the system is not a good
> start.

pauses must be short, yes. but your statement says that they are 
necessary.

>>> [snip]
>>>
>>>>> To make any proposal that has some chance of being accepted, you have to
>>>>> compare it against the existing solution, explaining:
>>>>>
>>>>> -what it is bringing in terms of new functionalities
>>>>> -how it is different
>>>>
>>>> it unifies all power/performance trade-offs (including power on/off) into
>>>> a single API, but decouples that API from the implementation details of
>>>> exactly what the technical details of the different modes are and how the
>>>> changes are made.
>>>
>>> It always looks great at this level of abstraction, but then usually
>>> what is discovered later is that _a lot_ of extra complexity is
>>> introduced, in order to cover every case on every platform that is
>>> intended to be supported.
>>
>> which is why I posted this for comments.
>>
>> what are the cases that require extra info? can that extra info be as
>> simple as a set of flags for the mode (or possibly for the transition
>> matrix)?
>>
>> for your clock example you need a flag that says 'this requires everything
>> connected to this be paused'
>>
>> for suspend and other low power modes you need to be able to say 'contents of
>> things below this point will be lost when you go into this mode' so that
>> the decision making software knows that it needs to save the contents of
>> memory before switching to a mode that stops the dram refresh. I don't
>> have any idea at the moment for how to provide a common interface for
>> actually saving or restoring the contents, that is outside the scope of
>> this API
>
> That's wrong: the driver can keep to itself the information about
> saving/restoring; if it advertises the time it will take, that's
> enough.

that's assuming that the place that the driver puts this information 
when it saves it is always available when it restores it.

for things like 'we are getting ready to stop the dram refresh' the driver 
cannot know what the best place to put the data from ram ends up being.

>> the ACPI people will need a flag for 'this device can generate wakeup
>> signals in this mode'
>>
>> but this API would just provide this info to the decision making code,
>> that code would have to actually enforce the limits
>
> Again, you are going into a field that belongs to hw abstraction and
> already has a standard tool to deal with it - HAL

HAL is far, far heavier than what I'm proposing.

>>>> for some subsystems this would be little more than renaming existing
>>>> functions, for others it would be converting several independent functions
>>>> into one, discoverable api
>>>
>>> if you check cpufreq, you will find out that it already covers the
>>> multiple cores case (but nothing prevents from using the same logic on
>>> something that is not really a cpu) and also has some simple concept of
>>> latency for frequency transition, concept that could be enhanced to
>>> handle latencies that are depending on the current operating point and
>>> target operating point.
>>
>> does it provide a full matrix of latencies, or just mode 1->mode2=x,
>> mode2->mode3=y so mode1->mode3=x+y?
>
> IIRC it's just 1 value

in that case it would need to be enhanced because with other devices the 
cost of switching from mode to mode is different depending on your 
starting and ending mode.
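
To illustrate why a single latency value is not enough, here is a small sketch of a full transition matrix. The device, its three modes, and every cost figure are hypothetical; only the idea of a per-pair relative cost comes from the proposal.

```python
# Hypothetical relative transition costs (no units) for a three-mode device.
# COST[src][dst] is the cost of switching from mode src to mode dst.
# Mode 0 = off, mode 1 = full capacity, mode 2 = standby (illustrative only).
COST = [
    [0, 50, 48],  # from off: waking up is expensive
    [1, 0, 2],    # from full: dropping to a lower mode is cheap
    [1, 5, 0],    # from standby
]

def direct_cost(src, dst):
    """Cost of a single transition, looked up in the matrix."""
    return COST[src][dst]

def chained_cost(path):
    """Cost of walking through a sequence of modes one step at a time."""
    return sum(COST[a][b] for a, b in zip(path, path[1:]))

print(direct_cost(0, 1))        # 50
print(direct_cost(1, 0))        # 1
print(chained_cost([0, 2, 1]))  # 53
```

In this made-up table the cost is asymmetric (off to full is 50, full to off is 1) and going through an intermediate mode does not add up to the direct cost, so neither a single number nor a chain of per-step numbers would describe it.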

>>>>> -why the current implementation cannot simply be enhanced
>>>>
>>>> which current implementation should be enhanced? and with the massive
>>>> broadening of functionality should it retain the same name, or should it
>>>> get renamed to something more generic?
>>>
>>> cpufreq could be renamed to anything that makes sense, but i see _no_
>>> massive broadening of functionality.
>>
>> what I'm talking about would provide an API to devices that you are
>> ignoring because they should be managed from userspace.
>
> again, HAL / OHM / Moblin

I was trying to define the lower level interfaces that these tools need. 
today they can only know what is possible by reading the source code for 
each driver and implementing the driver-specific interfaces necessary to 
set things. I was proposing a common interface that tools like this could 
use instead of requiring all the driver-specific knowledge.

>>>> the cpufreq implementation is very close to what I'm proposing, it would
>>>> need to get broadened to cover other devices (like disk drives, wireless
>>>> cards, etc), is this really the right thing to do or should the more
>>>> generic API go in for external use and then the existing cpufreq be called
>>>> from the set_mode() call?
>>>
>>> No, that doesn't make sense, as general approach.
>>> You want to manage from kernel only those parts of the system where the
>>> latency is so low that userspace wouldn't be able to keep up.
>>>
>>> Your examples (wireless, disk drive) can be easily controlled from
>>> userspace, with a timeout.
>>
>> absolutely, and they should be (at least most of the time). this was not
>> intended as a kernelspace only api. it is intended to be available to both
>> kernelspace and userspace.
>>
>>> In both cases there are significant delays (change of rotation speed /
>>> sync with the access point).
>>
>> correct, and these delays should be reflected in the transition cost
>> matrix
>>
>>> All this is hand waving unless it is backed up by numbers.
>>> Real cases are required in order to establish a list of priorities for
>>> latency/power consumption.
>>
>> this isn't attempting to establish a list of priorities, simply to give the
>> software that is trying to establish such a list the info to make its
>> decisions, and the interface to use to issue the resulting instructions.
>
> What i'm saying is that sw is implemented to fulfill certain needs. I'd
> rather see a detailed description of the need and based on that debate
> on the actual API / implementation.

in a nutshell (and I know this is probably not detailed enough to be acceptable)

1. the software needs to know what the interconnects and dependencies
    between devices are (supposedly this is provided via sysfs)

2. the software needs to know what type of device this is (again,
    supposedly this is provided via sysfs)

3. the software needs to know what modes exist for a driver/piece of
    hardware. to make any decisions this information needs to provide some
    information about the capability of the mode and the power consumed in
    that mode. in addition there will need to be flags to indicate any
    special restrictions of a mode

4. the software needs to know the cost of switching from any mode to any
    other mode. since some transitions will interact with other devices
    there will need to be flags to indicate such requirements for specific
    transitions.

5. the software needs to be able to find out what mode a device is in.

6. the software needs to be able to tell the driver to switch to a
    different mode (I think it would be a very good thing if going to a
    particular mode was always the same command, no matter what mode it is
    currently in)

7. the software needs to figure out the desire of the user.

my proposal was addressing items #3-#6. it isn't trying to decide what to 
do, simply to allow the software that _is_ trying to decide what to do a 
way to find out what it can do.
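
Items #3-#6 could be sketched roughly as follows. This is an illustration under assumptions, not an implementation: `report_power_modes`, `report_power_mode_speed` and `set_operational_mode` are the names from the original proposal, while the `Device` and `Mode` classes, `current_mode`, and the flag strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode:
    mode_id: int
    power_pct: int                 # % power used in this mode
    capability_pct: int            # % capability in this mode
    flags: frozenset = frozenset() # hypothetical, e.g. {"CONTENTS_LOST"}

class Device:
    """Hypothetical driver-side object exposing the proposed calls."""

    def __init__(self, name, modes, transition_cost, current=1):
        self.name = name
        self._modes = {m.mode_id: m for m in modes}
        self._cost = transition_cost   # transition_cost[src][dst], relative
        self._current = current

    def report_power_modes(self):          # item 3
        return sorted(self._modes.values(), key=lambda m: m.mode_id)

    def report_power_mode_speed(self):     # item 4
        return [row[:] for row in self._cost]

    def current_mode(self):                # item 5 (name is an assumption)
        return self._current

    def set_operational_mode(self, mode_id):   # item 6
        # The same call is used no matter which mode the device is in now.
        if mode_id not in self._modes:
            raise ValueError(f"{self.name}: unknown mode {mode_id}")
        self._current = mode_id

# The simple on/off device from the original proposal:
# modes 0,0,0 and 1,100,100 with a symmetric cost of 1.
simple = Device(
    "simple",
    [Mode(0, 0, 0, frozenset({"CONTENTS_LOST"})), Mode(1, 100, 100)],
    [[0, 1], [1, 0]],
)
simple.set_operational_mode(0)
print(simple.current_mode())  # 0
```

Deciding which mode to request (item #7, and the policy around it) stays entirely outside this object, which is the separation the list above is arguing for.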

David Lang

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [linux-pm] Power Management framework proposal
  2007-07-23 14:12                     ` Arjan van de Ven
@ 2007-07-23 18:19                       ` david
  0 siblings, 0 replies; 40+ messages in thread
From: david @ 2007-07-23 18:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Mon, 23 Jul 2007, Arjan van de Ven wrote:

> On Sun, 2007-07-22 at 22:25 -0700, david@lang.hm wrote:
>>>>
>>>> only if the transitions don't cost anything significant,
>>>
>>> these are second order effects though. On a pc, the transition costs are
>>> quite low (as I said, single or low double digit microseconds).
>>
>> including pausing all drivers before the transition and unpausing them
>> afterwards?
>
> on a PC you don't need to do that.

that's not what the OMAP documentation I was told to read said. it 
specifically lists a requirement to pause drivers before the clock change 
and unpause them afterwards.

>>> this works for all systems where the idle power is lower than the
>>> power you save by dropping speed... and that is almost all of them in
>>> the PC world.
>>
>> if you can idle the system as a whole I agree with you fully. most PC
>> hardware (including the mobile stuff) doesn't change its power
>> consumption much with load.
>
> even if the rest of the PC is unchanging (which it's not), it is just an
> offset to both sides of the equation, and the same on both sides at
> that.

but a constant added to both sides makes the relative savings less.
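
Both points can be seen in a small worked example. It reuses the rounded CPU figures Arjan quoted earlier in the thread (34W full speed, 24W half speed, 1W idle); the 30W figure for the rest of the system is purely hypothetical, and the comparison assumes the work takes one time unit at full speed or two at half speed.

```python
# Rounded CPU figures quoted earlier in the thread, in watts.
FULL_W, HALF_W, IDLE_W = 34, 24, 1

def totals(offset_w=0):
    """Energy over two time units for each strategy, with a constant
    offset_w added for the rest of the system in every time unit."""
    race_to_idle = (FULL_W + offset_w) + (IDLE_W + offset_w)
    half_speed = 2 * (HALF_W + offset_w)
    return race_to_idle, half_speed

def absolute_saving(offset_w=0):
    race_to_idle, half_speed = totals(offset_w)
    return half_speed - race_to_idle

def relative_saving(offset_w=0):
    race_to_idle, half_speed = totals(offset_w)
    return (half_speed - race_to_idle) / half_speed

print(absolute_saving(0), absolute_saving(30))   # 13 13
print(relative_saving(0))    # 13/48, about 27%
print(relative_saving(30))   # 13/108, about 12% (30W offset is hypothetical)
```

The absolute saving is unchanged by the offset, which is Arjan's point, while the saving as a fraction of total consumption shrinks, which is the point made in the reply.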

>>  at Usenix there was a presentation (I don't
>> remember if it was by Amazon or Google) about this subject, showing that
>> current PC hardware only goes down to 50% power when idle (short of
>> switching power modes) and that they and other big companies were pushing
>> vendors to improve their hardware, aiming to get the idle power down to
>> 10% (again without suspending anything). so there's some chance that this
>> will change before too long.
>
> on servers and such, there is a huge offset, sure, but still the effect
> is there. And it really isn't 50%.

their measurements and graphs say otherwise.

>>> now you can argue that 0.5 seconds is a really really long time, and
>>> you'd be right. so for really really short stints (say a timer
>>> interrupt) you don't want to change the voltage at all (nor would
>> you
>>> want to change the plls to change frequency for that matter). But
>> once
>>> you start changing those, you might as well go full speed.
>>
>> this assumes that you can cache 1 second of video, if you have more
>> real-time requirements you have a much harder time (say video
>> conferencing
>> where you don't get the frame until just before you need to display
>> it)
>
> the same basic math holds for just 1 frame at a fixed rate. At some
> point transition costs will get you (and that's where things like
> ondemand delayed speedup will save us); but to get back to your
> interface, the interface doesn't nearly give the info needed to make
> these decisions...

what is it missing?

it lets you find out what modes are available and (in relative terms) how 
much capability and power is available in each mode

it lets you find out what the transition costs are from any mode to any 
other mode

David Lang

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Power Management framework proposal
  2007-07-22 17:26 ` Arjan van de Ven
  2007-07-22 18:56   ` david
@ 2007-07-23 22:23   ` Benjamin Herrenschmidt
  2007-07-24 20:14     ` david
  1 sibling, 1 reply; 40+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-23 22:23 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: david, LKML, linux-pm

On Sun, 2007-07-22 at 10:26 -0700, Arjan van de Ven wrote:
> On Sat, 2007-07-21 at 23:49 -0700, david@lang.hm wrote:
> 
> > this approach would allow the transition of ALL drivers to the new mode of 
> > operation in one fell swoop, and then adding additional power management 
> > features is just adding to the existing list rather then implementing new 
> > functions.
> 
> 
> I have a concern with this approach though. It seems to assume that
> there is one global thing somewhere that sets the system state; in my
> experience that is the wrong approach; in fact there is a very definite
> evidence that there are many decisions on power that are to be made
> local at a high frequency. An example of this is the processor speed;
> the ondemand governor does exactly this for the cpus that can switch
> speeds fast; it's just impossible to beat such a local, fast decision
> with anything on a global scale.
> 
> On the other hand, some things (the high level goals and constraints)
> are obviously global.

I think we need a set of constraints that trickle down the power tree
and limit what a given driver can do locally.

Ben.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [linux-pm] Power Management framework proposal
  2007-07-23 18:14           ` david
@ 2007-07-24  8:43             ` Jerome Glisse
  2007-07-24 14:18               ` Igor Stoppa
  2007-07-24 20:06               ` david
  0 siblings, 2 replies; 40+ messages in thread
From: Jerome Glisse @ 2007-07-24  8:43 UTC (permalink / raw)
  To: david@lang.hm
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On 7/23/07, david@lang.hm <david@lang.hm> wrote:
> On Mon, 23 Jul 2007, Igor Stoppa wrote:
> > again, HAL / OHM / Moblin
>
> I was trying to define the lower level interfaces that these tools need.
> today they can only know what is possible by reading the source code for
> each driver and implementing the driver-specific interfaces necessary to
> set things, I was proposing a common interface that tools like this could
> use instead of requiring all the driver-specific knowledge.
>
>
> in a nutshell (and I know this is probably not detailed enough to be acceptable)
>
> 1. the software needs to know what the interconnects and dependencies
>     between devices are (supposedly this is provided via sysfs)
>
> 2. the software needs to know what type of device this is (again,
>     supposedly this is provided via sysfs)
>
> 3. the software needs to know what modes exist for a driver/piece of
>     hardware. to make any decisions this information needs to provide some
>     information about the capability of the mode and the power consumed in
>     that mode. in addition there will need to be flags to indicate any
>     special restrictions of a mode
>
> 4. the software needs to know the cost of switching from any mode to any
>     other mode. since some transitions will interact with other devices
>     there will need to be flags to indicate such requirements for specific
>     transitions.
>
> 5. the software needs to be able to find out what mode a device is in.
>
> 6. the software needs to be able to tell the driver to switch to a
>     different mode (I think it would be a very good thing if going to a
>     particular mode was always the same command, no matter what mode it is
>     currently in)
>
> 7. the software needs to figure out the desire of the user.
>
> my proposal was addressing items #3-#6. it isn't trying to decide what to
> do, simply to allow the software that _is_ trying to decide what to do a
> way to find out what it can do.
>
> David Lang

I believe a central place where user can set/change hw state to save
power or to increase computational power is definitely a goal to pursue.
But i truly think that the OHM approach is the best one ie using plugins
so that one can make a plugin specific for each device. The point is that
i believe there is no way to do an abstract interface for this and trying to
do so will end up with ugly code, and any interface would fail to encompass
all possible tweaks that might exist for all devices.

For instance on graphics card you could do the following (maybe more):
-change GPU clock
-change memory clock
-disable part of engine
-disable unit
i truly don't think you can make a common interface for all this; moreover,
there might be constraints on how you can change things (GPU &
memory clock might need to follow a given ratio). So you definitely
need knowledge in the user space program to handle this.

best,
Jerome Glisse

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [linux-pm] Power Management framework proposal
  2007-07-24  8:43             ` Jerome Glisse
@ 2007-07-24 14:18               ` Igor Stoppa
  2007-07-24 20:13                 ` david
  2007-07-24 20:06               ` david
  1 sibling, 1 reply; 40+ messages in thread
From: Igor Stoppa @ 2007-07-24 14:18 UTC (permalink / raw)
  To: ext Jerome Glisse
  Cc: david@lang.hm, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Tue, 2007-07-24 at 10:43 +0200, ext Jerome Glisse wrote:

> I believe a central place where user can set/change hw state to save
> power or to increase computational power is definitely a goal to pursue.
> But i truly think that the OHM approach is the best one ie using plugins
> so that one can make a plugin specific for each device. The point is that
> i believe there is no way to do an abstract interface for this and trying to
> do so will end up with ugly code, and any interface would fail to encompass
> all possible tweaks that might exist for all devices.
> 
> For instance on graphics card you could do the following (maybe more):
> -change GPU clock
> -change memory clock
> -disable part of engine
> -disable unit
> i truly don't think you can make a common interface for all this; moreover,
> there might be constraints on how you can change things (GPU &
> memory clock might need to follow a given ratio). So you definitely
> need knowledge in the user space program to handle this.

Even simpler case: LCD backlight can come in many flavors, both in terms
of brightness levels and fixed amount of current required to keep it ON.

Trying to abstract such details from the decision-making makes little
sense.
Isolating that into a separate module, instead, brings the best of both
worlds:
-containment of the HW-specific code
-leveraging every possible, no matter how exotic, power saving mode
available.


-- 
Cheers, Igor

Igor Stoppa <igor.stoppa@nokia.com>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [linux-pm] Power Management framework proposal
  2007-07-24  8:43             ` Jerome Glisse
  2007-07-24 14:18               ` Igor Stoppa
@ 2007-07-24 20:06               ` david
  2007-07-24 23:14                 ` Jerome Glisse
  1 sibling, 1 reply; 40+ messages in thread
From: david @ 2007-07-24 20:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Tue, 24 Jul 2007, Jerome Glisse wrote:

> On 7/23/07, david@lang.hm <david@lang.hm> wrote:
>>  On Mon, 23 Jul 2007, Igor Stoppa wrote:
>> >  again, HAL / OHM / Moblin
>>
>>  I was trying to define the lower level interfaces that these tools need.
>>  today they can only know what is possible by reading the source code for
>>  each driver and implementing the driver-specific interfaces necessary to
>>  set things, I was proposing a common interface that tools like this could
>>  use instead of requiring all the driver-specific knowledge.
>> 
>>
>>  in a nutshell (and I know this is probably not detailed enough to be acceptable)
>>
>>  1. the software needs to know what the interconnects and dependencies
>>      between devices are (supposedly this is provided via sysfs)
>>
>>  2. the software needs to know what type of device this is (again,
>>      supposedly this is provided via sysfs)
>>
>>  3. the software needs to know what modes exist for a driver/piece of
>>      hardware. to make any decisions this information needs to provide some
>>      information about the capability of the mode and the power consumed in
>>      that mode. in addition there will need to be flags to indicate any
>>      special restrictions of a mode
>>
>>  4. the software needs to know the cost of switching from any mode to any
>>      other mode. since some transitions will interact with other devices
>>      there will need to be flags to indicate such requirements for specific
>>      transitions.
>>
>>  5. the software needs to be able to find out what mode a device is in.
>>
>>  6. the software needs to be able to tell the driver to switch to a
>>      different mode (I think it would be a very good thing if going to a
>>      particular mode was always the same command, no matter what mode it is
>>      currently in)
>>
>>  7. the software needs to figure out the desire of the user.
>>
>>  my proposal was addressing items #3-#6. it isn't trying to decide what to
>>  do, simply to allow the software that _is_ trying to decide what to do a
>>  way to find out what it can do.
>>
>>  David Lang
>
> I believe a central place where user can set/change hw state to save
> power or to increase computational power is definitely a goal to pursue.
> But i truly think that the OHM approach is the best one ie using plugins
> so that one can make a plugin specific for each device. The point is that
> i believe there is no way to do an abstract interface for this and trying to
> do so will end up with ugly code, and any interface would fail to encompass
> all possible tweaks that might exist for all devices.

will each plugin have its own interface? or will you have one interface 
to access the plugins and then the plugins do things behind the scenes?

I'll bet that the API for the plugins is common, and if so then it could 
be similar to the API that I suggested.

> For instance on graphics card you could do the following (maybe more):
> -change GPU clock
> -change memory clock
> -disable part of engine
> -disable unit
> i truly don't think you can make a common interface for all this; moreover,
> there might be constraints on how you can change things (GPU &
> memory clock might need to follow a given ratio). So you definitely
> need knowledge in the user space program to handle this.

sure you can, just enumerate every combination the driver writer wants to 
offer as a mode. yes this could be a lengthy list, so what?
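
As a rough illustration of that enumeration, the sketch below builds a flat mode list from a few independent knobs and drops the combinations that break a clock-ratio constraint. The clock values, the 3:2 memory-to-GPU ratio, and the unit counts are all invented; a real driver would supply its own, along with the %power and %capability figures for each mode.

```python
from itertools import product

# Hypothetical knobs for a graphics card (values are made up).
GPU_CLOCKS_MHZ = [200, 400]
MEM_CLOCKS_MHZ = [300, 600]
ENGINE_UNITS_ON = [1, 2]

def ratio_ok(gpu_mhz, mem_mhz):
    """Hypothetical constraint: memory clock must be 1.5x the GPU clock."""
    return 2 * mem_mhz == 3 * gpu_mhz

def enumerate_modes():
    """Return {mode_id: settings}. Mode 0 is off, mode 1 is the highest
    setting, and the remaining valid combinations follow in descending order."""
    combos = [
        (gpu, mem, units)
        for gpu, mem, units in product(GPU_CLOCKS_MHZ, MEM_CLOCKS_MHZ, ENGINE_UNITS_ON)
        if ratio_ok(gpu, mem)
    ]
    combos.sort(reverse=True)
    modes = {0: None}  # mode 0: completely powered off
    for mode_id, combo in enumerate(combos, start=1):
        modes[mode_id] = combo
    return modes

for mode_id, settings in enumerate_modes().items():
    print(mode_id, settings)
```

The ratio constraint never reaches the caller: invalid combinations simply do not appear in the list, so the decision-making code only ever sees mode IDs it is allowed to request.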

David Lang

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [linux-pm] Power Management framework proposal
  2007-07-24 14:18               ` Igor Stoppa
@ 2007-07-24 20:13                 ` david
  0 siblings, 0 replies; 40+ messages in thread
From: david @ 2007-07-24 20:13 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: ext Jerome Glisse,
	ext linux-pm-bounces@lists.linux-foundation.org, LKML, linux-pm

On Tue, 24 Jul 2007, Igor Stoppa wrote:

> On Tue, 2007-07-24 at 10:43 +0200, ext Jerome Glisse wrote:
>
>> I believe a central place where user can set/change hw state to save
>> power or to increase computational power is definitely a goal to pursue.
>> But i truly think that the OHM approach is the best one ie using plugins
>> so that one can make a plugin specific for each device. The point is that
>> i believe there is no way to do an abstract interface for this and trying to
>> do so will end up with ugly code, and any interface would fail to encompass
>> all possible tweaks that might exist for all devices.
>>
>> For instance on graphics card you could do the following (maybe more):
>> -change GPU clock
>> -change memory clock
>> -disable part of engine
>> -disable unit
>> i truly don't think you can make a common interface for all this; moreover,
>> there might be constraints on how you can change things (GPU &
>> memory clock might need to follow a given ratio). So you definitely
>> need knowledge in the user space program to handle this.
>
> Even simpler case: LCD backlight can come in many flavors, both in terms
> of brightness levels and fixed amount of current required to keep it ON.
>
> Trying to abstract such details from the decision-making makes little
> sense.
> Isolating that into a separate module, instead, brings the best of both
> worlds:
> -containment of the HW-specific code
> -leveraging every possible, no matter how exotic, power saving mode
> available.

huh??

in the proposal that I made all the HW specific code would be in the 
device driver. I was just proposing a way for the driver to advertise what 
it is able to do.

why would you want to pull the code out into a separate module?

many levels of backlight with different power consumption is trivial to 
do.

backlight 1

mode %capability %power
    aka brightness
0        0         0
1      100       100
2       75        75
3       50        50
4       25        25

backlight 2

mode %capability %power
    aka brightness
0        0         0
1      100       100
2       80        50
3       60        30
4       40        20
5       25        15

backlight 3

mode %capability %power
    aka brightness
0        0         0
1      100       100
2       50        90


why do you think the decision making logic needs to know the details of 
the hardware? if you can abstract the details out then the same control 
logic can be used for different things. if you infuse the hardware 
knowledge with the control logic then you have to change the control 
section every time you want to support a new piece of hardware.
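
a rough sketch of what hardware-independent control logic could look like, 
using the second backlight table above with its rows numbered 0 to 5. the 
names Mode and cheapest_mode are made up for illustration and are not part 
of any existing kernel or userspace API.

```python
from typing import NamedTuple


class Mode(NamedTuple):
    mode_id: int
    capability: int  # percent of full capability (brightness here)
    power: int       # percent of full power


# Second backlight table from this message, rows numbered 0..5.
BACKLIGHT_2 = [
    Mode(0, 0, 0),
    Mode(1, 100, 100),
    Mode(2, 80, 50),
    Mode(3, 60, 30),
    Mode(4, 40, 20),
    Mode(5, 25, 15),
]

# Third backlight table from this message.
BACKLIGHT_3 = [
    Mode(0, 0, 0),
    Mode(1, 100, 100),
    Mode(2, 50, 90),
]


def cheapest_mode(modes, min_capability):
    """Return the lowest-power mode offering at least min_capability percent.

    The function only looks at the advertised numbers, so the same code
    serves any device that publishes a table in this shape.
    """
    candidates = [m for m in modes if m.capability >= min_capability]
    if not candidates:
        raise ValueError("no mode meets the requested capability")
    return min(candidates, key=lambda m: (m.power, m.mode_id))
```

the same call works on both tables without the caller knowing which 
backlight it is talking to, which is the point being argued here.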

David Lang



* Re: Power Management framework proposal
  2007-07-23 22:23   ` Benjamin Herrenschmidt
@ 2007-07-24 20:14     ` david
  2007-07-24 21:38       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-24 20:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Arjan van de Ven, LKML, linux-pm

On Tue, 24 Jul 2007, Benjamin Herrenschmidt wrote:

> On Sun, 2007-07-22 at 10:26 -0700, Arjan van de Ven wrote:
>> On Sat, 2007-07-21 at 23:49 -0700, david@lang.hm wrote:
>>
>>> this approach would allow the transition of ALL drivers to the new mode of
>>> operation in one fell swoop, and then adding additional power management
>>> features is just adding to the existing list rather then implementing new
>>> functions.
>>
>>
>> I have a concern with this approach though. It seems to assume that
>> there is one global thing somewhere that sets the system state; in my
>> experience that is the wrong approach; in fact there is a very definite
>> evidence that there are many decisions on power that are to be made
>> local at a high frequency. An example of this is the processor speed;
>> the ondemand governer does exactly this for the cpus that can switch
>> speeds fast; it's just impossible to beat such a local, fast decision
>> with anything on a global scale.
>>
>> On the other hand, some things (the high level goals and constraints)
>> are obviously global.
>
> I think we need a set of constraints that trickle down the power tree
> and limit what a given driver can do locally.

what sort of constraints are you thinking of?

David Lang

* Re: Power Management framework proposal
  2007-07-24 20:14     ` david
@ 2007-07-24 21:38       ` Benjamin Herrenschmidt
  2007-07-24 23:02         ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-24 21:38 UTC (permalink / raw)
  To: david; +Cc: Arjan van de Ven, LKML, linux-pm

On Tue, 2007-07-24 at 13:14 -0700, david@lang.hm wrote:
> > I think we need a set of constraints that trickle down the power
> tree
> > and limit what a given driver can do locally.
> 
> what sort of contraints are you thinking of?

For example, a parent's power state defines which states its children
can be in. A way to express those dependencies would be nice. Or
alternatively, the power states of all the children define the power
state a parent can go into automatically.

Ben.


* Re: Power Management framework proposal
  2007-07-24 21:38       ` Benjamin Herrenschmidt
@ 2007-07-24 23:02         ` david
  2007-07-24 23:47           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-24 23:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Arjan van de Ven, LKML, linux-pm

On Wed, 25 Jul 2007, Benjamin Herrenschmidt wrote:

> On Tue, 2007-07-24 at 13:14 -0700, david@lang.hm wrote:
>>> I think we need a set of constraints that trickle down the power
>> tree
>>> and limit what a given driver can do locally.
>>
>> what sort of contraints are you thinking of?
>
> A parent power state defines what states children can be in. For
> example. A way to express those dependencies would be nice. Or
> alternatiely, the power state of all the children defines the power
> state a parent can go in automatically.

Ok, I see two things here.

1. do you really want to try and propagate things like this from one to 
the other, or would it be good enough to flag the issue and let the 
software selecting the modes implement this constraint?

2. how can you standardize the requirements?

at the very least you have

for this mode all children must be off

for this mode all children must be in a mode that includes a 'suspended' 
flag (this could be made implicit by saying that you must suspend children 
before parents, and then just flagging the 'suspended, but not off' modes)

what requirements are needed? (I'm sure that there are others, but 
hopefully it's possible to avoid requirements like 'the clock speed for 
device A must be >X to allow device B to operate in mode Y')
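
a minimal sketch of how the two requirements listed above could be checked 
by the software selecting the modes. everything here (the Device class, 
the requirement labels, and treating 'off' as also satisfying 'suspended') 
is an assumption for illustration, not an existing interface.

```python
from dataclasses import dataclass, field

OFF = 0  # mode 0 is always "complete power off" in the proposal


@dataclass
class Device:
    name: str
    mode: int = 1  # current mode ID; 1 is full capacity in the proposal
    # Hypothetical: IDs of modes that carry the 'suspended' flag.
    suspended_modes: frozenset = frozenset()
    children: list = field(default_factory=list)


def children_allow(parent, requirement):
    """Check one parent-mode requirement against the current child modes.

    requirement is "children_off" or "children_suspended"; both labels are
    invented for this sketch.
    """
    if requirement == "children_off":
        return all(c.mode == OFF for c in parent.children)
    if requirement == "children_suspended":
        # Assumption: a child that is fully off also counts as suspended.
        return all(
            c.mode == OFF or c.mode in c.suspended_modes
            for c in parent.children
        )
    raise ValueError(f"unknown requirement: {requirement}")
```

with only these two flags the check stays a simple walk over the children, 
and no per-device knowledge is needed in the caller.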

David Lang

* Re: [linux-pm] Power Management framework proposal
  2007-07-24 20:06               ` david
@ 2007-07-24 23:14                 ` Jerome Glisse
  2007-07-25  0:40                   ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Jerome Glisse @ 2007-07-24 23:14 UTC (permalink / raw)
  To: david@lang.hm
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On 7/24/07, david@lang.hm <david@lang.hm> wrote:
> On Tue, 24 Jul 2007, Jerome Glisse wrote:
>
> > On 7/23/07, david@lang.hm <david@lang.hm> wrote:
> >>  On Mon, 23 Jul 2007, Igor Stoppa wrote:
> >> >  again, HAL / OHM / Mobilin
> >>
> >>  I was trying to define the lower level interfaces that these tools need.
> >>  today they can only know what is possible by reading the source code for
> >>  each driver and implementing the driver-specific interfaces nessasary to
> >>  set things, I was proposing a common interface that tools like this could
> >>  use instead of requiring all the driver-specific knowledge.
> >>
> >>
> >>  in a nutshell (and I know this is probably not detailed to be acceptable)
> >>
> >>  1. the software needs to know what the interconnects and dependancies
> >>      between devices are (supposedly this is provided via sysfs)
> >>
> >>  2. the software needs to know what type of device this is (again,
> >>      supposedly this is provided via sysfs)
> >>
> >>  3. the software needs to know what modes exist for a driver/piece of
> >>      hardware. to make any decisions this infomation needs to provide some
> >>      information about the capability of the mode and the power consumed in
> >>      that mode. in addition there will need to be flags to indicate any
> >>      special restrictions of a mode
> >>
> >>  4. the software needs to know the cost of switching from any mode to any
> >>      other mode. since some transitions will interact with other devices
> >>      there will need to be flags to indicate such requirements for specific
> >>      transitions.
> >>
> >>  5. the software needs to be able to find out what mode a device is in.
> >>
> >>  6. the software needs to be able to tell the driver to switch to a
> >>      different mode (I think it would be a very good thing if going to a
> >>      particular mode was always the same command, no matter what mode it is
> >>      currently in)
> >>
> >>  7. the software needs to figure out the desire of the user.
> >>
> >>  my proposal was addressing items #3-#6. it isn't trying to decide what to
> >>  do, simply to allow the software that _is_ trying to decide what to do a
> >>  way to find out what it can do.
> >>
> >>  David Lang
> >
> > I believe a central place where user can set/change hw state to save
> > power or to increase computational power is definitely a goal to pursue.
> > But i truly think that the OHM approach is the best one ie using plugins
> > so that one can make a plugin specific for each device. The point is that
> > i believe there is no way to do an abstract interface for this and trying to
> > do so will endup doing ugly code and any interface would fail to encompass
> > all possible tweak that might exist for all devices.
>
> will each plugin have it's own interface? or will you have one interface
> to access the plugins and then the plugins do things behind the scenes?
>
> I'll bet that the API for the plugins is common, and if so then it could
> be similar to the API that I suggested.

I take ohm as a reference here (this comes from my limited understanding
of this daemon, so there might be inaccuracies): drivers export their
power management tuning capabilities through HAL, then an ohm plugin
would use HAL to give a higher-level view of these capabilities and also
manage policy, preferences, permissions, ...

The last consumer in the power management food chain would be a user
interface which communicates with ohm (and with all ohm plugins), so
desktop writers (gnome, kde, ...) can write some kind of power management
center where each ohm plugin can have its own panel. So in the end the
user gets one place to do all his power management, which is the goal i
think you are aiming for.

> > For instance on graphics card you could do the following (maybe more):
> > -change GPU clock
> > -change memory clock
> > -disable part of engine
> > -disable unit
> > i truly don't think you can make a common interface for all this, more
> > over there might be constraint on how you can change things (GPU &
> > memory clock might need to follow a given ratio). So you definitely
> > need knowledge in the user space program to handle this.
>
> sure you can, just enumerate all the options the driver writer wants to
> offer as options. yes this could be a lengthy list, so what?
>

My point was that your interface, by trying to fit square pegs into round
holes, will fail to expose all the subtleties of each device, which might
in the end lead to wrong power management decisions. So i believe we
can't reduce power management to a list of modes whose attributes are
power consumption & capacity.

And there is no way to design an abstraction, given that the hw we
have to deal with is too varied and does not follow any standard
(besides ACPI there are other ways to save power: brightness, gpu/memory
clock, pll, ...), so i don't see how one might give a common view of
things which are fundamentally different in how they affect consumption
(same end result with many different paths leading to it).

best,
Jerome Glisse

* Re: Power Management framework proposal
  2007-07-24 23:02         ` david
@ 2007-07-24 23:47           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 40+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-24 23:47 UTC (permalink / raw)
  To: david; +Cc: Arjan van de Ven, LKML, linux-pm

On Tue, 2007-07-24 at 16:02 -0700, david@lang.hm wrote:
> 
> what requirements are needed? (I'm sure that there are others, but 
> hopefully it's possible to avoid requirements like 'the clock speed
> for 
> device A must be >X to allow device B to operate in mode Y') 

I had an idea a while ago, might still be in the pm list archives, of
exposing constraints as opaque bitmaps. The bits have defined meaning
for a given bus, but are opaque to the core.

The devices however, provide tables indicating to the core their list of
power states (with names) and their requirements in term of parent
states (using such bitmasks).

Thus, the core can resolve the dependency requirements without having to
know about the actual meaning of the states of the various busses.
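
A rough sketch of that idea, under my own assumptions about the details:
each parent state advertises a bitmask of what it provides, each child
state a bitmask of what it requires, and the core only combines and
compares masks. The bit names and state tables below are invented for
illustration.

```python
# Bit meanings are defined by the bus and are opaque to the core.
# These two bits are made up for the example.
BUS_CLOCK = 0x1
BUS_POWER = 0x2

# Child device states: name -> mask of what the state needs from its parent.
CHILD_STATES = {"on": BUS_CLOCK | BUS_POWER, "standby": BUS_POWER, "off": 0}

# Parent (bus) states: name -> mask of what the state provides to children.
PARENT_STATES = {"full": BUS_CLOCK | BUS_POWER, "low": BUS_POWER, "off": 0}


def parent_states_allowed(parent_states, child_requirements):
    """Return the parent states that satisfy every child's requirement mask.

    The core ORs the children's masks together and keeps each parent state
    whose provided mask covers the result. It never interprets the bits.
    """
    needed = 0
    for mask in child_requirements:
        needed |= mask
    return [
        name
        for name, provided in parent_states.items()
        if (provided & needed) == needed
    ]
```

With one child in "standby" and one "off", the bus could drop to "low"
but not to "off"; with any child "on", only "full" remains.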

Ben.

* Re: [linux-pm] Power Management framework proposal
  2007-07-24 23:14                 ` Jerome Glisse
@ 2007-07-25  0:40                   ` david
  2007-07-25 12:49                     ` Jerome Glisse
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-25  0:40 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

On Wed, 25 Jul 2007, Jerome Glisse wrote:

> On 7/24/07, david@lang.hm <david@lang.hm> wrote:
>>  On Tue, 24 Jul 2007, Jerome Glisse wrote:
>> 
>> >  On 7/23/07, david@lang.hm <david@lang.hm> wrote:
>> > >   On Mon, 23 Jul 2007, Igor Stoppa wrote:
>> > > >   again, HAL / OHM / Mobilin
>> > > 
>> > >   I was trying to define the lower level interfaces that these tools 
>> > >   need.
>> > >   today they can only know what is possible by reading the source code 
>> > >   for
>> > >   each driver and implementing the driver-specific interfaces nessasary 
>> > >   to
>> > >   set things, I was proposing a common interface that tools like this 
>> > >   could
>> > >   use instead of requiring all the driver-specific knowledge.
>> > > 
>> > > 
>> > >   in a nutshell (and I know this is probably not detailed to be 
>> > >   acceptable)
>> > > 
>> > >   1. the software needs to know what the interconnects and dependancies
>> > >       between devices are (supposedly this is provided via sysfs)
>> > > 
>> > >   2. the software needs to know what type of device this is (again,
>> > >       supposedly this is provided via sysfs)
>> > > 
>> > >   3. the software needs to know what modes exist for a driver/piece of
>> > >       hardware. to make any decisions this infomation needs to provide 
>> > >       some
>> > >       information about the capability of the mode and the power 
>> > >       consumed in
>> > >       that mode. in addition there will need to be flags to indicate 
>> > >       any
>> > >       special restrictions of a mode
>> > > 
>> > >   4. the software needs to know the cost of switching from any mode to 
>> > >   any
>> > >       other mode. since some transitions will interact with other 
>> > >       devices
>> > >       there will need to be flags to indicate such requirements for 
>> > >       specific
>> > >       transitions.
>> > > 
>> > >   5. the software needs to be able to find out what mode a device is 
>> > >   in.
>> > > 
>> > >   6. the software needs to be able to tell the driver to switch to a
>> > >       different mode (I think it would be a very good thing if going to 
>> > >       a
>> > >       particular mode was always the same command, no matter what mode 
>> > >       it is
>> > >       currently in)
>> > > 
>> > >   7. the software needs to figure out the desire of the user.
>> > > 
>> > >   my proposal was addressing items #3-#6. it isn't trying to decide 
>> > >   what to
>> > >   do, simply to allow the software that _is_ trying to decide what to 
>> > >   do a
>> > >   way to find out what it can do.
>> > > 
>> > >   David Lang
>> > 
>> >  I believe a central place where user can set/change hw state to save
>> >  power or to increase computational power is definitely a goal to pursue.
>> >  But i truly think that the OHM approach is the best one ie using plugins
>> >  so that one can make a plugin specific for each device. The point is 
>> >  that
>> >  i believe there is no way to do an abstract interface for this and 
>> >  trying to
>> >  do so will endup doing ugly code and any interface would fail to 
>> >  encompass
>> >  all possible tweak that might exist for all devices.
>>
>>  will each plugin have it's own interface? or will you have one interface
>>  to access the plugins and then the plugins do things behind the scenes?
>>
>>  I'll bet that the API for the plugins is common, and if so then it could
>>  be similar to the API that I suggested.
>
> I take here ohm as a reference (this come from my limited understanding of
> this daemon so there might be inaccuracy) driver export through HAL
> there power management tunning capacity, Then an ohm plugin would use
> HAL to give a higher
> view of this capacity and also manage policy, preference, permission, ...
>
> Last consumer in power management food chain would be an user interface which
> will communicate with ohm (and with all ohm plugin) so desktop writter 
> (gnome,
> kde, ...) can write some kind of power management center where each ohm 
> plugin
> can have its own panel. So in the end the user got one place to do all its
> power management which is the goal i think you are trying to aim.

no. I am talking about the interface to the drivers that things like HAL 
would use

>> >  For instance on graphics card you could do the following (maybe more):
>> >  -change GPU clock
>> >  -change memory clock
>> >  -disable part of engine
>> >  -disable unit
>> >  i truly don't think you can make a common interface for all this, more
>> >  over there might be constraint on how you can change things (GPU &
>> >  memory clock might need to follow a given ratio). So you definitely
>> >  need knowledge in the user space program to handle this.
>>
>>  sure you can, just enumerate all the options the driver writer wants to
>>  offer as options. yes this could be a lengthy list, so what?
>> 
>
> My point was that your interface by trying to fit square pegs into round hole
> will fail to expose all subtility of each device which might in the end bring
> to wrong power management decision. So i believe we can't sum up
> power management to list of mode whose attribute are power consumption
> & capacity.

it's possible (which is part of the reason I started the thread), but so 
far there hasn't been anything identified that is a really bad fit.

> And there is no way to design an abstraction given that all hw we will have
> to deal with are too much different and do not follow any standard things
> (beside ACPI there is other way to save power brightness, gpu/memory
> clock, pll, ...) so i don't see how one might give a common view of things
> which are fundamentally different in how they affect consumption (same end
> result with many different paths leading to it).

so you are saying that the power management software must know the details 
of each and every driver, and if you add a new driver you must change the 
power management software before it can do anything (including allowing 
manual control of the modes)

seems to me I heard similar arguments several years ago about the CPU 
speed settings, it turns out that the cpufreq interface works really well 
for them and the software that's controlling things no longer needs to 
know the details of every CPU.

why did it work there but can't work anywhere else?
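
for comparison, this is roughly how a tool can discover what a cpu can do 
through cpufreq without knowing anything about the cpu itself. the sketch 
assumes the driver exposes the scaling_available_frequencies file in the 
cpufreq sysfs directory (not every driver does); the function name is 
made up.

```python
from pathlib import Path


def available_frequencies(cpufreq_dir):
    """Return the frequencies (in kHz) a cpufreq driver advertises, ascending.

    Returns an empty list when the file is missing or unreadable, since not
    every driver provides it.
    """
    path = Path(cpufreq_dir) / "scaling_available_frequencies"
    try:
        text = path.read_text()
    except OSError:
        return []
    return sorted(int(value) for value in text.split())


if __name__ == "__main__":
    # Typical location on a Linux system with cpufreq enabled.
    print(available_frequencies("/sys/devices/system/cpu/cpu0/cpufreq"))
```

the caller gets a plain list of choices and never needs to know which cpu 
model is underneath.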

David Lang

* Re: [linux-pm] Power Management framework proposal
  2007-07-25  0:40                   ` david
@ 2007-07-25 12:49                     ` Jerome Glisse
  2007-07-29 21:56                       ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Jerome Glisse @ 2007-07-25 12:49 UTC (permalink / raw)
  To: david
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

david@lang.hm wrote:
> On Wed, 25 Jul 2007, Jerome Glisse wrote:
>
>> On 7/24/07, david@lang.hm <david@lang.hm> wrote:
>>>  will each plugin have it's own interface? or will you have one 
>>> interface
>>>  to access the plugins and then the plugins do things behind the 
>>> scenes?
>>>
>>>  I'll bet that the API for the plugins is common, and if so then it 
>>> could
>>>  be similar to the API that I suggested.
>>
>> I take here ohm as a reference (this come from my limited 
>> understanding of
>> this daemon so there might be inaccuracy) driver export through HAL
>> there power management tunning capacity, Then an ohm plugin would use
>> HAL to give a higher
>> view of this capacity and also manage policy, preference, permission, 
>> ...
>>
>> Last consumer in power management food chain would be an user 
>> interface which
>> will communicate with ohm (and with all ohm plugin) so desktop 
>> writter (gnome,
>> kde, ...) can write some kind of power management center where each 
>> ohm plugin
>> can have its own panel. So in the end the user got one place to do 
>> all its
>> power management which is the goal i think you are trying to aim.
>
> no. I am talking about the interface to the drivers that things like 
> HAL would use
>
Ok, i was just trying to stress that the end result is the same from the
user's point of view.
>>> >  For instance on graphics card you could do the following (maybe 
>>> more):
>>> >  -change GPU clock
>>> >  -change memory clock
>>> >  -disable part of engine
>>> >  -disable unit
>>> >  i truly don't think you can make a common interface for all this, 
>>> more
>>> >  over there might be constraint on how you can change things (GPU &
>>> >  memory clock might need to follow a given ratio). So you definitely
>>> >  need knowledge in the user space program to handle this.
>>>
>>>  sure you can, just enumerate all the options the driver writer 
>>> wants to
>>>  offer as options. yes this could be a lengthy list, so what?
>>>
>>
>> My point was that your interface by trying to fit square pegs into 
>> round hole
>> will fail to expose all subtility of each device which might in the 
>> end bring
>> to wrong power management decision. So i believe we can't sum up
>> power management to list of mode whose attribute are power consumption
>> & capacity.
>
> it's possible (which is part of the reason I started the thread), but 
> so far there hasn't been anything identified that is a really bad fit.
>
Tell me how i do this in your model:
GPU and VRAM clocks both change the power consumption of the card, and
the consumption is often not a trivial function of these two parameters
(i am even simplifying the problem here by omitting pipeline shutdown).
So with two separate mode lists (one for GPU speed, another for VRAM
speed), how can you provide consumption information when the consumption
of each depends on the other's setting? And if your solution is to make
only one list, you end up with a much bigger list than previously needed
(nrGPUmodes * nrVRAMmodes). Do you expect the user to go through a
lengthy list to find what he wants? (Remember that we will have to add
pipeline power off, pll tweaking and many other ways of saving power on
such a card.)

So by choosing power consumption as a unit of measure you end up with
non-trivial cases. There is also the question of overclocking, and other
points already identified where unfortunately a global design such as
your proposal does not seem to fit properly: local power decisions
(ethernet, wifi cards, ... can power themselves down if they are doing
nothing, but the only place where this can be known is the driver),
the child/parent relation, and how to estimate power usage (on some
configurations one device's consumption can be marginal compared to
everything else, while on others the same device can be the most
power-hungry one)... I see all of these as a bad fit.
>> And there is no way to design an abstraction given that all hw we 
>> will have
>> to deal with are too much different and do not follow any standard 
>> things
>> (beside ACPI there is other way to save power brightness, gpu/memory
>> clock, pll, ...) so i don't see how one might give a common view of 
>> things
>> which are fundamentally different in how they affect consumption 
>> (same end
>> result with many different paths leading to it).
>
> so you are saying that the power management software must know the 
> details of each and every driver, and if you add a new driver you must 
> change the power management software before it can do anything 
> (including allowing manual control of the modes)
>
You have to provide an ohm plugin (in an ohm world) where policy for
this device will be handled, and this plugin needs to be designed
knowing what the hw exports through HAL. Yes, it's painful, but you
don't want to put policy in the driver, and to do policy you need
knowledge of the things you deal with.
> seems to me I heard similar arguments several years ago about the CPU 
> speed settings, it turns out that the cpufreq interface works really 
> well for them and the software that's controlling things no longer 
> needs to know the details of every CPU.
>
> why did it work there but can't work anywhere else?
>
> David Lang
cpufreq is a unification of processor frequency management; what you are
trying to unify here can range from a toaster (i am sure there is a
usb-driven toaster available somewhere on earth, don't ask why) to a
$100000 specialized DSP card. We can save power on both of them, but how,
which policy to follow, and what parameters to tweak will be very
different. At least i don't think you will want to drive your
specialized DSP card as a toaster; personally i won't.

Oh and as a side note, i would like a common interface for dealing with
power, but i just don't think it's possible. I might be wrong, but so far
i don't see any other OS offering such a thing. That said, you might be
able to factor out _some_ common things for a given class of hardware
(network cards, usb devices, graphics cards, ...).

best,
Jerome Glisse

* Re: Power Management framework proposal
  2007-07-23  4:09           ` david
@ 2007-07-27 11:46             ` Pavel Machek
  2007-07-29 22:00               ` david
  0 siblings, 1 reply; 40+ messages in thread
From: Pavel Machek @ 2007-07-27 11:46 UTC (permalink / raw)
  To: david; +Cc: Arjan van de Ven, LKML, linux-pm

Hi!

> >>example 1: a laptop screen
> >>
> >>mode  capacity power description
> >>0        0        0    off
> >>1      100      100    full brightness
> >>2       70       60    half power to the backlight
> >>3       50       35    quarter power to the backlight
> >>4       30       25    eighth power to the backlight
> >>5        5       10    backlight off.
> >>
> >>example 2: a front-panel display on a server (no 
> >>variable backlight
> >>control)
> >>
> >>mode capacity power description
> >>0       0        0   off
> >>1     100      100   backlight on
> >>2      50       10   backlight off
> >
> >
> >the problem is: the person who SETS these needs to know 
> >what they mean.
> 
> that's what the description is for. this info can be 
> provided by the driver as part of the list_modes() 
> function.

That's what /sys/class/backlight/ is for :-).

> >as someone who wrote (part of) a power policy manager; 
> >sorry but you
> >take away information I need, and in addition the 
> >different API's are
> >absolutely no big deal.
> 
> assuming that nobody else chimes in to disagree with you 
> I'll accept your judgement and drop the issue.

Just for the record, I agree with Arjan here.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: [linux-pm] Power Management framework proposal
  2007-07-23  8:56                   ` Ondrej Zajicek
  2007-07-23 17:33                     ` david
@ 2007-07-27 12:04                     ` Pavel Machek
  1 sibling, 0 replies; 40+ messages in thread
From: Pavel Machek @ 2007-07-27 12:04 UTC (permalink / raw)
  To: Ondrej Zajicek; +Cc: Arjan van de Ven, david, Igor Stoppa, LKML, linux-pm

Hi!

> > let me give you a real world example then, and the numbers I'm using are
> > ballpark the same as you'll find in a (mobile) core 2 duo datasheet, I
> > just rounded them a little so that the math works out nice.
> > 
> > power at full speed: 34W
> > power at half speed: 24W
> > power at idle: 1W
> 
> I have usually seen different numbers, for example:
> 
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/30430.pdf

Trust Arjan, modern cpus work as he describes.

> Although this paper speaks about thermal design power instead of power
> consumption, i suppose that it should be roughly equal.
> 
> For example Athlon 64 3700 (ADA3700AEP5AR):
> 
> 2.4 GHz, 1.5 V -> 89 W
> 2.2 GHz, 1.4 V -> 72 W
> 2.0 GHz, 1.3 V -> 53 W
> 1.8 GHz, 1.2 V -> 39 W
> 1.0 GHz, 1.1 V -> 22 W

I guess that means athlon 64 is 'old'.

> Even my measurement on PC (Athlon X2, VIA K8T890) of complete PC power
> consumption shows that it is more efficient to be busy for 2 time units
> on 1 GHz than be busy for 1 time unit and be idle for 1 time unit
> on 2 GHz.
> 
> 1 GHz:
> both cores idle:	48 W
> one core busy:		57 W
> two cores busy:		66 W

2 sec decoding video at both cores: 132J

> 2 GHz:
> both cores idle:	54 W
> one core busy:		78 W
> two cores busy:		95 W

1 sec decode @ 2GHz + 1 sec idle @ 1GHz: 143J

So even on your hw difference is not too big... and take a look at
numbers from core2duo.

Actually...

4 sec decode @ 1 core @ 1GHz: 57*4=228J
1 sec decode @ 2 cores @ 2GHz, then 3 sec idle @ 1GHz: 95 + 48*3 = 95 + 144 = 239J...

Ok, so it is still a win, but in relative terms an even smaller one..
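
The arithmetic can be rechecked mechanically. This is only a sketch: it
takes the quoted whole-system wattages at face value and assumes the
machine idles at 1 GHz once the work is done; the names are mine.

```python
# Whole-system power draw in watts, keyed by (clock in GHz, load),
# taken from the measurements quoted above.
POWER_W = {
    (1, "idle"): 48, (1, "one_core"): 57, (1, "two_cores"): 66,
    (2, "idle"): 54, (2, "one_core"): 78, (2, "two_cores"): 95,
}


def energy_joules(phases):
    """Sum watts * seconds over a list of (ghz, load, seconds) phases."""
    return sum(POWER_W[(ghz, load)] * seconds for ghz, load, seconds in phases)


# Same work done slowly on both cores versus fast, then idling.
slow_both_cores = energy_joules([(1, "two_cores", 2)])
fast_then_idle = energy_joules([(2, "two_cores", 1), (1, "idle", 1)])

# Same work done slowly on one core versus fast on both, then idling.
slow_one_core = energy_joules([(1, "one_core", 4)])
fast_then_long_idle = energy_joules([(2, "two_cores", 1), (1, "idle", 3)])

if __name__ == "__main__":
    print(slow_both_cores, fast_then_idle, slow_one_core, fast_then_long_idle)
```

This gives 132 J against 143 J for the first pair and 228 J against 239 J
for the second, so on this hardware the slow clock wins by 11 J each time.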
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: [linux-pm] Power Management framework proposal
  2007-07-25 12:49                     ` Jerome Glisse
@ 2007-07-29 21:56                       ` david
  0 siblings, 0 replies; 40+ messages in thread
From: david @ 2007-07-29 21:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Igor Stoppa, ext linux-pm-bounces@lists.linux-foundation.org,
	LKML, linux-pm

sorry for the delay in responding

On Wed, 25 Jul 2007, Jerome Glisse wrote:
> david@lang.hm wrote:
>>  On Wed, 25 Jul 2007, Jerome Glisse wrote:
>> 
>> >  On 7/24/07, david@lang.hm <david@lang.hm> wrote:
>> > > >   For instance on graphics card you could do the following (maybe 
>> > >  more):
>> > > >   -change GPU clock
>> > > >   -change memory clock
>> > > >   -disable part of engine
>> > > >   -disable unit
>> > > >   i truly don't think you can make a common interface for all this, 
>> > >  more
>> > > >   over there might be constraint on how you can change things (GPU &
>> > > >   memory clock might need to follow a given ratio). So you definitely
>> > > >   need knowledge in the user space program to handle this.
>> > > 
>> > >  sure you can, just enumerate all the options the driver writer wants 
>> > >  to
>> > >   offer as options. yes this could be a lengthy list, so what?
>> > > 
>> > 
>> >  My point was that your interface by trying to fit square pegs into round 
>> >  hole
>> >  will fail to expose all subtility of each device which might in the end 
>> >  bring
>> >  to wrong power management decision. So i believe we can't sum up
>> >  power management to list of mode whose attribute are power consumption
>>> & capacity.
>>
>>  it's possible (which is part of the reason I started the thread), but so
>>  far there hasn't been anything identified that is a really bad fit.
>> 
> Tell me how i do this in your model:
> GPU/VRAM memory clock change power consumption of the card and the
> power consumption is often not a trivial function of both of this
> parameters (i even here simplify the problem by omitting pipeline
> shutdown). So how with two different separate mode list (one for GPU
> speed another one for VRAM speed) can you provide consumption
> information while this consumption depends on the others settings.
> Then if you give as a solution to make only one list you end up with a
> more bigger list than previously needed (nrGPUmodes * nrVRAMmodes) do
> you expect the user to go through a lengthy list to find what he wants ?
> (remember that we will have to add pipeline power off, pll tweaking or
> many others way of saving power on such card).

yes I expect that it would be a large list in some conditions. but one 
purpose of this API is to make these options discoverable by software. 
right now nothing can be done at all without driver specific knowledge. 
even a lengthy list is better than that.

presenting the list to the user directly is a last resort, only for 
experimentation or when nothing else wants to deal with devices of that 
type.

with a description field (which I didn't include initially, but seems 
obviously needed now) it should be fairly easy to create descriptions that 
let the software see that there are multiple factors involved.
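
as a rough sketch of what such a list could look like for a card with two 
GPU clocks and two VRAM clocks: the struct, its field names, the helper 
functions and every number below are made up for illustration, none of 
this is an existing kernel interface. the combined power figure is simply 
whatever the driver writer chooses to report for each combination, so it 
does not have to be a simple function of the two clocks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mode entry; not an existing kernel structure. */
struct power_mode {
	int id;                  /* modeID: 0 = off, 1 = full capacity */
	int capacity_pct;        /* approximate capability, % of full */
	int power_pct;           /* approximate power used, % of full */
	const char *description; /* free-form text supplied by the driver */
};

/* Made-up numbers for an imaginary card: 2 GPU clocks x 2 VRAM clocks. */
static const struct power_mode gpu_modes[] = {
	{ 0,   0,   0, "off" },
	{ 1, 100, 100, "gpu=full vram=full" },
	{ 2,  75,  80, "gpu=full vram=half" },
	{ 3,  60,  70, "gpu=half vram=full" },
	{ 4,  40,  45, "gpu=half vram=half" },
};

#define NR_GPU_MODES (sizeof(gpu_modes) / sizeof(gpu_modes[0]))

/* Return the entry for a modeID, or NULL if the driver does not list it. */
static const struct power_mode *find_mode(const struct power_mode *modes,
					  size_t n, int id)
{
	for (size_t i = 0; i < n; i++)
		if (modes[i].id == id)
			return &modes[i];
	return NULL;
}

/* Count the modes whose description mentions a given setting. */
static size_t count_matching(const struct power_mode *modes, size_t n,
			     const char *needle)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		if (strstr(modes[i].description, needle))
			hits++;
	return hits;
}
```

software that knows nothing about this card can still walk the list, and 
software that does understand the "gpu=" / "vram=" convention (again, an 
assumption of this sketch) can group the entries by factor.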

> So by choosing this power consumption as a unit of measure you end up
> in non trivial case. There is also the question of overclocking

if the driver supports overclocking then list it in the modes (nothing 
says that % capacity couldn't go over 100% for example)

> , and other points already identified where unfortunately a global 
> design such as your proposal does not seems to fit properly: local power 
> decision (ethernet, wifi card, ... can power down them self is they are 
> doing nothings but the place where you can know this is the driver)

if they power themselves down with no notice to the system they should 
power themselves back up with no need for the rest of the system to tell 
them either. so this can either be ignored or presented as a mode between 
off and on that enables this behavior.

> , there is also the child/parent relation, how to
> estimate power usage (on some configuration one device consumption can
> be marginal toward all others things while on other this same device can be
> the most power hungry device)... I see all this as bad fit.

ahh, here we see a disconnect. I was not intending for the power field to 
be that exact. there are just too many variables. for example: even for a 
cpu, the power used isn't exactly tied to the clock speed and voltage, the 
mix of instructions that the cpu is running will affect the power it eats, 
sometimes by a significant amount. the field was intended to be an 
ordering factor and an approximation of the power used, so that software 
could make a performance/power tradeoff with a good chance of making a 
reasonable choice.

it's not intended for 'make this laptop use 24w of power instead of 25w of 
power'
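
to illustrate the kind of coarse tradeoff I mean, here is a minimal sketch 
using the laptop screen table from earlier in the thread. the struct 
layout and the pick_mode() helper are hypothetical, they are not part of 
any existing API; the only thing the helper relies on is that the power 
numbers order the modes sensibly.

```c
#include <stddef.h>

/* Hypothetical mode entry, columns in the order used in the thread. */
struct power_mode {
	int id;
	int capacity_pct;
	int power_pct;
	const char *description;
};

/* The laptop screen example from earlier in the thread. */
static const struct power_mode laptop_screen[] = {
	{ 0,   0,   0, "off" },
	{ 1, 100, 100, "full brightness" },
	{ 2,  70,  60, "half power to the backlight" },
	{ 3,  50,  35, "quarter power to the backlight" },
	{ 4,  30,  25, "eighth power to the backlight" },
	{ 5,   5,  10, "backlight off" },
};

/*
 * Pick the lowest-power mode that still offers at least min_capacity.
 * Returns the modeID, or -1 if no listed mode is good enough.
 */
static int pick_mode(const struct power_mode *modes, size_t n,
		     int min_capacity)
{
	const struct power_mode *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (modes[i].capacity_pct < min_capacity)
			continue;
		if (!best || modes[i].power_pct < best->power_pct)
			best = &modes[i];
	}
	return best ? best->id : -1;
}
```

asking for at least 50% capacity picks mode 3, asking for at least 60% 
picks mode 2. the answer only depends on the relative order of the power 
values, not on them being accurate to the watt.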

>> >  And there is no way to design an abstraction given that all hw we
>> >  will have to deal with are too much different and do not follow any
>> >  standard things (beside ACPI there is other way to save power
>> >  brightness, gpu/memory clock, pll, ...) so i don't see how one might
>> >  give a common view of things which are fundamentally different in
>> >  how they affect consumption (same end result with many different
>> >  paths leading to it).
>>
>>  so you are saying that the power management software must know the details
>>  of each and every driver, and if you add a new driver you must change the
>>  power management software before it can do anything (including allowing
>>  manual control of the modes)
>> 
> You have to provide an ohm plug in (in an ohm world) where policy for
> this device will be handled and this plug in need to be designed
> knowing what the hw export through HAL. Yes it's pain full but you
> don't want to put policy in the driver and to do policy you need
> knowledge on the things you deal with.

I'll have to look into what the ohm plugin does.....

ok, after a quick google search and reading the wiki, what I was proposing 
would potentially allow ohm to use it directly instead of having to go 
through HAL (there may still be a place for HAL, but some parts of it 
would get significantly simpler if the abstraction were available directly 
from the driver instead of having to teach a second piece of software all 
the nuances of the driver for it to then create an abstraction for the 
system to use)

>>  seems to me I heard similar arguments several years ago about the CPU
>>  speed settings, it turns out that the cpufreq interface works really well
>>  for them and the software that's controlling things no longer needs to
>>  know the details of every CPU.
>>
>>  why did it work there but can't work anywhere else?
>>
>>  David Lang
> cpufreq is unification of processor frequency management, what you try 
> to unify here can go from toaster (i am sure there is usb driven toaster 
> available somewhere on earth don't ask why) to 100000$ specialized DSP 
> card, we can save power on both on them but how, which policy to follow, 
> and what parameters to tweak will be very different. At least i don't 
> think you will want to driver your specialized DSP card as a toaster, 
> personally i won't.

no you don't want to use the same driver for both, however it would be 
very nice to have an interface that would let you load the driver for your 
specialized DSP card and have all the open source software see it and be 
able to manipulate it, rather than having to modify the HAL, create ohm 
plugins, etc (for how many programs?) before they can know that the card 
is there.

David Lang

> Oh and as a side note i would like a common interface for dealing with 
> power but i just don't think it's possible, i might be wrong but so far 
> i don't see in any other os offering such things. That said you might be 
> able to factor out _some_ common things for given class of hardware 
> (network card, usb device, graphics cards, ...).
>
> best,
> Jerome Glisse
>


* Re: Power Management framework proposal
  2007-07-27 11:46             ` Pavel Machek
@ 2007-07-29 22:00               ` david
  2007-07-30  1:05                 ` Matthew Garrett
  0 siblings, 1 reply; 40+ messages in thread
From: david @ 2007-07-29 22:00 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Arjan van de Ven, LKML, linux-pm

On Fri, 27 Jul 2007, Pavel Machek wrote:

> Hi!
>
>>>> example 1: a laptop screen
>>>>
>>>> mode  capacity power description
>>>> 0        0        0    off
>>>> 1      100      100    full brightness
>>>> 2       70       60    half power to the backlight
>>>> 3       50       35    quarter power to the backlight
>>>> 4       30       25    eighth power to the backlight
>>>> 5        5       10    backlight off.
>>>>
>>>> example 2: a front-panel display on a server (no
>>>> variable backlight
>>>> control)
>>>>
>>>> mode capacity power description
>>>> 0       0        0   off
>>>> 1     100      100   backlight on
>>>> 2      50       10   backlight off
>>>
>>>
>>> the problem is: the person who SETS these needs to know
>>> what they mean.
>>
>> that's what the description is for. this info can be
>> provided by the driver as part of the list_modes()
>> function.
>
> That's what /sys/class/backlight/ is for :-).

yes it is, and each type of device is growing its own, incompatible, 
interfaces for controlling things like this. I was aiming to do two 
things.

1. head this off and try to get a more common api

2. reduce the confusion that there is with multiple functions needing to 
be implemented during the suspend/resume cycle by changing them from 
independent suspend-only functions to being just part of the device 
settings
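
a minimal sketch of what point 2 could mean in practice, assuming that 
mode 0 is always 'off' as in the original proposal. set_operational_mode() 
is the name from the proposal; struct pm_device, its fields, and the 
suspend_all()/resume_all() helpers are hypothetical and exist only for 
this example. mode IDs are assumed to run from 0 to nr_modes - 1.

```c
#include <stddef.h>

/* Hypothetical per-device state; not an existing kernel structure. */
struct pm_device {
	const char *name;
	int nr_modes;     /* valid modeIDs are 0 .. nr_modes - 1 */
	int current_mode;
	int saved_mode;   /* mode to go back to on resume */
};

/* Switch a device to the requested mode; -1 if the mode is not listed. */
static int set_operational_mode(struct pm_device *dev, int mode_id)
{
	if (mode_id < 0 || mode_id >= dev->nr_modes)
		return -1;
	dev->current_mode = mode_id;
	return 0;
}

/* Suspend is just "remember the current mode, then go to mode 0". */
static void suspend_all(struct pm_device *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		devs[i].saved_mode = devs[i].current_mode;
		set_operational_mode(&devs[i], 0);
	}
}

/* Resume walks the array in reverse and restores each saved mode. */
static void resume_all(struct pm_device *devs, size_t n)
{
	for (size_t i = n; i-- > 0; )
		set_operational_mode(&devs[i], devs[i].saved_mode);
}
```

the driver then only has to implement the one mode-setting entry point, 
and suspend/resume becomes a particular sequence of mode changes rather 
than a separate set of callbacks. real suspend ordering (parents, 
children, buses) is ignored here.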

>>> as someone who wrote (part of) a power policy manager;
>>> sorry but you
>>> take away information I need, and in addition the
>>> different API's are
>>> absolutely no big deal.
>>
>> assuming that nobody else chimes in to disagree with you
>> I'll accept your judgement and drop the issue.
>
> Just for the record, I agree with Arjan here.

oh well. sorry to take up everyone's time.

David Lang


* Re: Power Management framework proposal
  2007-07-29 22:00               ` david
@ 2007-07-30  1:05                 ` Matthew Garrett
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Garrett @ 2007-07-30  1:05 UTC (permalink / raw)
  To: david; +Cc: Pavel Machek, Arjan van de Ven, LKML, linux-pm

On Sun, Jul 29, 2007 at 03:00:07PM -0700, david@lang.hm wrote:

> yes it is, and each type of device is growing it's own, incompatible, 
> interfaces for controlling things like this. I was aiming to do two 
> things.

Anything playing with power management needs to be aware of the 
limitations of the hardware. Many devices have reduced functionality 
when in reduced power states, and it's vital that the caller be aware of 
that. There's no way to express that information in a consistent way 
because the limitations vary widely between different types of device. 
So, given that software will need to be aware of the different special 
cases for different types of hardware, there's very little cost to each 
of them exposing a different interface.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


end of thread, other threads:[~2007-07-30  1:50 UTC | newest]

Thread overview: 40+ messages
2007-07-22  6:49 Power Management framework proposal david
2007-07-22  7:57 ` [linux-pm] " Igor Stoppa
2007-07-22  8:58   ` david
2007-07-22 12:05     ` Igor Stoppa
2007-07-22 21:21       ` david
2007-07-22 23:09         ` Arjan van de Ven
2007-07-23  2:45           ` david
2007-07-23  3:50             ` Arjan van de Ven
2007-07-23  4:04               ` david
2007-07-23  4:19                 ` Arjan van de Ven
2007-07-23  5:25                   ` david
2007-07-23 14:12                     ` Arjan van de Ven
2007-07-23 18:19                       ` david
2007-07-23  8:56                   ` Ondrej Zajicek
2007-07-23 17:33                     ` david
2007-07-27 12:04                     ` Pavel Machek
2007-07-23 10:48         ` Igor Stoppa
2007-07-23 18:14           ` david
2007-07-24  8:43             ` Jerome Glisse
2007-07-24 14:18               ` Igor Stoppa
2007-07-24 20:13                 ` david
2007-07-24 20:06               ` david
2007-07-24 23:14                 ` Jerome Glisse
2007-07-25  0:40                   ` david
2007-07-25 12:49                     ` Jerome Glisse
2007-07-29 21:56                       ` david
2007-07-22 17:26 ` Arjan van de Ven
2007-07-22 18:56   ` david
2007-07-22 22:27     ` Arjan van de Ven
2007-07-23  3:51       ` david
2007-07-23  4:00         ` Arjan van de Ven
2007-07-23  4:09           ` david
2007-07-27 11:46             ` Pavel Machek
2007-07-29 22:00               ` david
2007-07-30  1:05                 ` Matthew Garrett
2007-07-23 22:23   ` Benjamin Herrenschmidt
2007-07-24 20:14     ` david
2007-07-24 21:38       ` Benjamin Herrenschmidt
2007-07-24 23:02         ` david
2007-07-24 23:47           ` Benjamin Herrenschmidt
