* PowerOP 0/3: System power operating point management API
@ 2005-08-09 2:49 Todd Poynor
2005-08-09 18:12 ` [linux-pm] " Patrick Mochel
2005-08-10 10:07 ` Pavel Machek
0 siblings, 2 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-09 2:49 UTC (permalink / raw)
To: linux-kernel, linux-pm, cpufreq
PowerOP is a system power parameter management API submitted for
discussion. PowerOP writes and reads power "operating points",
comprised of arbitrary integer-valued values, called power parameters,
that correspond to registers, clocks, dividers, voltage regulators,
etc. that may be modified to set a basic power/performance point for the
system. The core basically passes an array of integer-valued power
parameters (with very little additional structure imposed by the core)
to a platform-specific backend that interprets those values and makes
the requested adjustments. PowerOP is intended to leave all power
policy decisions to higher layers. An optional sysfs representation of
power parameters is also available, primarily for diagnostic use.
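As a rough illustration of the paragraph above (this is not code from the
patch series), the core can be pictured as passing an array of integers
through to backend hooks. All names here (powerop_point, powerop_backend,
the PARAM_* indices, the fake backend) are hypothetical, and the "hardware"
is a plain variable so that the sketch runs in userspace:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter indices; a real backend defines its own. */
enum { PARAM_CPU_MHZ, PARAM_CORE_MV, PARAM_BUS_DIV, POWEROP_NPARAMS };

/* An operating point is just an array of integers; the core imposes no meaning. */
struct powerop_point {
	int param[POWEROP_NPARAMS];
};

/* Backend hooks: the only code that knows what each integer means. */
struct powerop_backend {
	int (*set_point)(const struct powerop_point *pt);
	int (*get_point)(struct powerop_point *pt);
};

/* Stand-in for hardware registers so the sketch runs in userspace. */
static struct powerop_point fake_hw = { { 600, 956, 2 } };

static int fake_set(const struct powerop_point *pt)
{
	if (pt->param[PARAM_BUS_DIV] <= 0)
		return -1;	/* backend-specific sanity check */
	fake_hw = *pt;
	return 0;
}

static int fake_get(struct powerop_point *pt)
{
	*pt = fake_hw;
	return 0;
}

static const struct powerop_backend fake_backend = { fake_set, fake_get };

/* Core entry points: pass the values through and make no policy decision. */
static int powerop_set_point(const struct powerop_backend *be,
			     const struct powerop_point *pt)
{
	assert(be != NULL && pt != NULL);
	return be->set_point(pt);
}

static int powerop_get_point(const struct powerop_backend *be,
			     struct powerop_point *pt)
{
	assert(be != NULL && pt != NULL);
	return be->get_point(pt);
}
```

The point of the split is that everything above the backend treats the
integers as uninterpreted values; which point to set, and when, stays with
the caller.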
PowerOP can be thought of as a layer below cpufreq that actually
accesses the hardware to make cpu frequency, voltage, core bus, and
perhaps other modifications to set a power point, leaving cpufreq to
manage the interfaces based around the "cpu frequency" abstraction, the
policies and governors that select the frequency, its notifiers, and so
forth. An example hooking up support for one cpufreq platform to
PowerOP is in patch 3/3.
Depending on the ability of the hardware to make software-controlled
power/performance adjustments, this may be useful to select custom
voltages, bus speeds, etc. in desktop/server systems. Various embedded
systems have several parameters that can be set. For example, an XScale
PXA27x could be considered to have six basic power parameters (mainly
cpu run mode and memory and bus dividers) that for the most part should
be set in tandem to known good sets of values as validated by the
silicon vendor, plus possibly further parameters for disabling PLLs during
low-speed execution, and so forth. PowerOP is aimed at supporting this
kind of system, where the cpu frequency abstraction specifies only part
of the operating point that may be managed from software. It also
pushes the hardware-level power parameter management down to a level
that can be shared with other power management policy frameworks, in use
in some embedded systems, that wish to deal with entire operating points
as the basic unit of system power management.
There are many ways to tackle those issues, of course, and a new API
layer is arguably rather heavyweight. This is one suggested way that
tries to minimize disturbing existing power management code. Comments
very much appreciated.
Patch 2/3 is a desktop-oriented example of PowerOP; embedded examples
will follow soon.
--
Todd
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [linux-pm] PowerOP 0/3: System power operating point management API
2005-08-09 2:49 PowerOP 0/3: System power operating point management API Todd Poynor
@ 2005-08-09 18:12 ` Patrick Mochel
2005-08-10 2:18 ` Todd Poynor
2005-08-10 10:07 ` Pavel Machek
1 sibling, 1 reply; 9+ messages in thread
From: Patrick Mochel @ 2005-08-09 18:12 UTC (permalink / raw)
To: Todd Poynor; +Cc: linux-kernel, linux-pm, cpufreq
On Mon, 8 Aug 2005, Todd Poynor wrote:
> PowerOP is a system power parameter management API submitted for
> discussion. PowerOP writes and reads power "operating points",
> comprised of arbitrary integer-valued values, called power parameters,
> that correspond to registers, clocks, dividers, voltage regulators,
> etc. that may be modified to set a basic power/performance point for the
> system. The core basically passes an array of integer-valued power
> parameters (with very little additional structure imposed by the core)
> to a platform-specific backend that interprets those values and makes
> the requested adjustments. PowerOP is intended to leave all power
> policy decisions to higher layers. An optional sysfs representation of
> power parameters is also available, primarily for diagnostic use.
What do those higher layers look like? Do you have a userspace component
that uses this interface?
Who is using this code? Are there vendors that are already shipping
systems with this enabled?
Is this part of the DPM project? If so, what other components are left in
DPM?
What are your plans to integrate this more with the cpufreq code?
Thanks,
Pat
* Re: [linux-pm] PowerOP 0/3: System power operating point management API
2005-08-09 18:12 ` [linux-pm] " Patrick Mochel
@ 2005-08-10 2:18 ` Todd Poynor
[not found] ` <20050809030000.GA25112@slurryseal.ddns.mvista.com>
0 siblings, 1 reply; 9+ messages in thread
From: Todd Poynor @ 2005-08-10 2:18 UTC (permalink / raw)
To: Patrick Mochel; +Cc: linux-kernel, linux-pm, cpufreq
Patrick Mochel wrote:
> On Mon, 8 Aug 2005, Todd Poynor wrote:
(apologies for use of obsolete cpufreq mailing list address in my
initial message.)
...
>>PowerOP is intended to leave all power
>>policy decisions to higher layers.
>
> What do those higher layers look like? Do you have a userspace component
> that uses this interface?
cpufreq is one example: it manages an abstraction of system
power/performance levels based on cpu speed, which maps onto the
PowerOP-level hardware capabilities in some fashion, and has both kernel
and userspace components to manage the desired policy associated with
this. Regardless of whether this notion of configurable operating
points would remain a separate layer from cpufreq or was more tightly
integrated, the code to set these operating points can handle things
such as setting validated voltage levels to match cpu speeds, etc.
For embedded systems, I am aware only of the Dynamic Power Management
project, which you also mention and which does indeed manage power policy
based on the notions of power parameters and operating points. The
settings of these are configured entirely from userspace via sysfs,
using shell scripts or convenience libraries that access the sysfs
attributes. A system designer chooses the operating points to be
employed in the system based on the information from the processor or
board vendor that describes validated, supported operating points and
based on the characteristics of the system (how fast it needs to run
while in use for different purposes and how much battery power can be
spent for those purposes).
For example, a designer implementing a system based on an Intel XScale
PXA27x processor can choose from among about 16 validated operating
points listed in the most recent specification update. Those operating
points are comprised of register settings with inscrutable names such as
CCCR[L], CCCR[2N], CLKCFG[T], CCCR[A], and two or three others. A few
of those operating points run the CPU at identical frequencies, but have
other changes in memory clocking, system bus clocking, and the ability
to quickly switch between certain cpu frequencies based on other
properties of the platform (so-called "Turbo-mode" frequency scaling).
A DPM- or PowerOP-based system can be configured with the subset of
desired operating points and a particular operating point activated as
needed. The policy decision as to what operating point is appropriate
to activate is a matter for custom code provided by the designer,
tailored to their system. It is also possible to write automated
operating point selection algorithms based on such criteria as system
busyness.
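A minimal sketch of that last idea, assuming a designer-chosen subset of
operating points sorted from slowest to fastest and a load figure supplied
by some other part of the system. The struct, the table contents and
select_by_load() are hypothetical; the numbers are illustrative only and
are not taken from the specification update:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating point; a real one would carry the register values. */
struct op_point {
	const char *name;
	int cpu_mhz;
	int bus_mhz;
};

/* The designer configures only the subset of validated points the product
 * needs, sorted from slowest to fastest. */
static const struct op_point configured[] = {
	{ "low",    104, 104 },
	{ "medium", 208, 104 },
	{ "high",   416, 208 },
};
#define N_CONFIGURED (sizeof(configured) / sizeof(configured[0]))

/* Pick a point from recent CPU load (0..100 percent) by scaling the load
 * onto the table index; dividing by 101 keeps 100% inside the table. */
static const struct op_point *select_by_load(unsigned int load_pct)
{
	size_t idx;

	assert(load_pct <= 100);
	idx = (load_pct * N_CONFIGURED) / 101;
	return &configured[idx];
}
```

A real selection algorithm would likely add hysteresis so that the system
does not bounce between neighbouring points; that is left out here.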
> Who is using this code? Are there vendors that are already shipping
> systems with this enabled?
>
> Is this part of the DPM project? If so, what other components are left in
> DPM?
The concepts and general Linux implementation of power parameters and
operating points stem from the power-aware computing work done by
Bishop Brock and Karthick Rajamani of IBM Research, and a somewhat
different implementation is a part of the DPM project, which MontaVista
(and reportedly others in the near future) does ship. So far as I
understand there are or soon will be mobile phones that use that code as
the low- to mid-layers of the power management stack (the high-layer
policy management is performed by a custom application of which I have
no knowledge).
I mentioned in a previous email the next step of creating and activating
operating points from userspace. If that were in place, DPM would
additionally consist primarily of:
1. Machine-specific backends to set operating points for the systems
that DPM has been ported to. If something like PowerOP is accepted into
a broader community then that code would come along for the ride.
XScale PXA27x and various ARM OMAPs are among the systems supported, as
well as potentially others not yet making an appearance in open source.
2. DPM has further concepts of "operating state" (generally, whether the
system is idle, processing interrupts, running a normal-power-usage
task, running a background task without deadlines that can be assigned a
low power/performance level, etc.) and the unfortunately-named "policy"
that maps each operating state to an operating point, along with the
code to switch in different operating points as the system switches
operating states. The "policy" is a bit of a misnomer; a system
designer must create the desired operating points and decide upon the
state -> point mappings appropriate, as well as make decisions on when
to update the mappings based on external events, changing workloads,
etc. There are a few extra ramifications of modifying operating points
in this fashion, including the need to handle such transitions while in
interrupt context or in the idle loop, as well as a general concern for
low overhead since switching may occur very frequently (such as at every
entry and exit from idle).
3. Kernel-to-userspace power event notification is temporarily based on
executing hotplug scripts. This is outside the true domain of DPM, but
in the absence of an acpid-like de facto standard for communicating
power events it seemed best to provide some sort of mechanism. kobject
uevents are now the proper choice, and I'd propose use of that, as a
separate matter from what I'm hoping to accomplish with PowerOP or the
rest of DPM.
All of these are GPL software available on the project site.
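To make item 2 above more concrete, here is a small sketch of a state to
operating point mapping. It is not DPM code: the state names, the
dpm_policy struct and dpm_enter_state() are assumptions made for
illustration, the voltages and frequencies are placeholders, and the
hardware write is simulated by a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Operating states in the spirit of item 2; the exact set is an assumption. */
enum op_state { STATE_IDLE, STATE_IRQ, STATE_TASK, STATE_BACKGROUND, N_STATES };

struct op_point { int cpu_mhz; int core_mv; };

/* Placeholder points; real values come from the vendor's validated list. */
static const struct op_point pt_slow = { 104,  900 };
static const struct op_point pt_fast = { 416, 1350 };

/* A DPM-style "policy": one operating point per operating state. */
struct dpm_policy {
	const struct op_point *map[N_STATES];
};

static const struct dpm_policy default_policy = {
	.map = {
		[STATE_IDLE]       = &pt_slow,
		[STATE_IRQ]        = &pt_fast,
		[STATE_TASK]       = &pt_fast,
		[STATE_BACKGROUND] = &pt_slow,
	},
};

static const struct op_point *current_point;
static unsigned int hw_switches;	/* counts simulated hardware transitions */

/* Called on every state change, so it must be cheap: a table lookup plus
 * a (simulated) hardware write only when the point actually changes. */
static void dpm_enter_state(const struct dpm_policy *pol, enum op_state st)
{
	const struct op_point *next;

	assert(st < N_STATES);
	next = pol->map[st];
	if (next != current_point) {
		current_point = next;
		hw_switches++;
	}
}
```

Because the switch can happen at every entry to and exit from idle, the
lookup avoids locks and allocation; a real implementation would also have
to be safe in interrupt context.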
> What are your plans to integrate this more with the cpufreq code?
At this point it's a proposed layer that does not disturb existing
cpufreq code much, but if the cpufreq folks are receptive to these ideas
I'd be all for a tighter integration. Others have already asked for the
ability to manage voltages along with cpu speed, so in one way or
another it seems likely that an expanded set of power parameters may be
provided in the future. But I don't have any insight into the wishes or
goals of the project. Thanks,
--
Todd
* Re: PowerOP 0/3: System power operating point management API
2005-08-09 2:49 PowerOP 0/3: System power operating point management API Todd Poynor
2005-08-09 18:12 ` [linux-pm] " Patrick Mochel
@ 2005-08-10 10:07 ` Pavel Machek
2005-08-10 22:02 ` Todd Poynor
1 sibling, 1 reply; 9+ messages in thread
From: Pavel Machek @ 2005-08-10 10:07 UTC (permalink / raw)
To: Todd Poynor; +Cc: linux-kernel, linux-pm, cpufreq
Hi!
> PowerOP is a system power parameter management API submitted for
> discussion. PowerOP writes and reads power "operating points",
> comprised of arbitrary integer-valued values, called power parameters,
> that correspond to registers, clocks, dividers, voltage regulators,
> etc. that may be modified to set a basic power/performance point for the
> system. The core basically passes an array of integer-valued power
> parameters (with very little additional structure imposed by the core)
> to a platform-specific backend that interprets those values and makes
> the requested adjustments. PowerOP is intended to leave all power
> policy decisions to higher layers. An optional sysfs representation of
> power parameters is also available, primarily for diagnostic use.
>
> PowerOP can be thought of as a layer below cpufreq that actually
> accesses the hardware to make cpu frequency, voltage, core bus, and
> perhaps other modifications to set a power point, leaving cpufreq to
> manage the interfaces based around the "cpu frequency" abstraction, the
> policies and governors that select the frequency, its notifiers, and so
> forth. An example hooking up support for one cpufreq platform to
> PowerOP is in patch 3/3.
>
> Depending on the ability of the hardware to make software-controlled
> power/performance adjustments, this may be useful to select custom
> voltages, bus speeds, etc. in desktop/server systems. Various embedded
> systems have several parameters that can be set. For example, an XScale
> PXA27x could be considered to have six basic power parameters (mainly
> cpu run mode and memory and bus dividers) that for the most part
> should
This scares me a bit. Is a table enough to handle this? I'm afraid that
the table will get very large on systems that allow you to do "almost
anything".
Pavel
--
if you have sharp zaurus hardware you don't need... you know my address
* Re: PowerOP 0/3: System power operating point management API
2005-08-10 10:07 ` Pavel Machek
@ 2005-08-10 22:02 ` Todd Poynor
0 siblings, 0 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-10 22:02 UTC (permalink / raw)
To: Pavel Machek; +Cc: linux-kernel, linux-pm, cpufreq
Pavel Machek wrote:
>>Depending on the ability of the hardware to make software-controlled
>>power/performance adjustments, this may be useful to select custom
>>voltages, bus speeds, etc. in desktop/server systems. Various embedded
>>systems have several parameters that can be set. For example, an XScale
>>PXA27x could be considered to have six basic power parameters (mainly
>>cpu run mode and memory and bus dividers) that for the most part
>>should
>
>
> This scares me a bit. Is a table enough to handle this? I'm afraid that
> the table will get very large on systems that allow you to do "almost
> anything".
Exhaustive tables for all combinations of possible parameters aren't
expected (or practical for many systems as you note). In practice, a
subset of these possible operating points are created and activated over
the lifetime of the system, where the subset is chosen by a system
designer according to the needs of the particular system. It's a matter
for the higher-layer power management software to decide whether to have
in-kernel tables of the possible operating points (as cpufreq does for
various platforms) or whether to require userspace to create only the
ones wanted (as does DPM). There are cpufreq patches for PXA27x
somewhere, for example, and in that case a subset of the supported
operating points (and there are still only about 16 of those even for
such a complicated piece of hardware) are represented in the kernel
tables, choosing one of the possible combinations of memory/bus/etc.
parameters for each unique cpu frequency. Thanks,
--
Todd
* Re: PowerOP 0/3: System power operating point management API
[not found] ` <20050809030000.GA25112@slurryseal.ddns.mvista.com>
@ 2005-08-16 8:53 ` Dominik Brodowski
2005-08-16 8:57 ` Dominik Brodowski
2005-08-17 1:39 ` Todd Poynor
0 siblings, 2 replies; 9+ messages in thread
From: Dominik Brodowski @ 2005-08-16 8:53 UTC (permalink / raw)
To: Todd Poynor; +Cc: cpufreq, Patrick Mochel, linux-pm, linux-kernel, Pavel Machek
Hi!
The PowerOP infrastructure you suggest surely is one path to better runtime
power management in the Linux kernel. However, I don't like it at all in its
current implementation. Here are a few suggestions for improvements,
rewrites, and so on:
First, the table interface you suggest is ugly. If there's indeed the need for
such an abstraction, I'd favour something like
	struct powerop {
		struct list_head powerop_values; /* linked list of powerop_values */
		...
	};
	struct powerop_value {
		unsigned long value_cur;
		unsigned long value_min;
		unsigned long value_max;
		struct list_head next;	/* link in powerop->powerop_values */
		u32 type;		/* POWEROP_TYPE_* bits; u16 cannot hold 0x00010000 */
		struct powerop_value *cross_dependency;
		struct powerop_driver *driver;
	};
	#define POWEROP_TYPE_CPU_FREQUENCY		0x00000001
	#define POWEROP_TYPE_CPU_VOLTAGE		0x00000002
	#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED	0x00000004
	...
	#define POWEROP_TYPE_GPU_FREQUENCY		0x00010000
	...
and if CPU_VOLTAGE and CPU_FREQUENCY can only be modified at the same time (as
most cpufreq drivers require), type is 0x00000003.
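A short sketch of how such a type bitmask might be tested. The three
POWEROP_TYPE_* values are the ones listed above; the combined macro and
the two helper functions are hypothetical additions for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define POWEROP_TYPE_CPU_FREQUENCY		0x00000001u
#define POWEROP_TYPE_CPU_VOLTAGE		0x00000002u
#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED	0x00000004u

/* A value whose frequency and voltage must change together carries both bits. */
#define POWEROP_TYPE_CPU_FREQ_AND_VOLT \
	(POWEROP_TYPE_CPU_FREQUENCY | POWEROP_TYPE_CPU_VOLTAGE)

/* Hypothetical helper: does this value control every parameter in `wanted`? */
static int powerop_value_covers(uint32_t value_type, uint32_t wanted)
{
	return (value_type & wanted) == wanted;
}

/* Hypothetical helper: true if more than one parameter is tied to this value
 * (more than one bit set in the mask). */
static int powerop_value_is_coupled(uint32_t value_type)
{
	return (value_type & (value_type - 1)) != 0;
}
```

With this encoding a governor can ask for "anything that moves the CPU
voltage" and find the combined frequency/voltage value without knowing
that the two are coupled on this platform.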
Secondly, you do not address the cross-relationships between operating points
correctly. If you change the CPU frequency, you may have to switch other
(memory, video) settings; you might even have to validate the frequency
settings for these or even additional reasons (thermal and battery reasons -
ACPI _PPC).
Thirdly, who is to decide on the power management settings? The first and
intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
exist. Only under rare circumstances, you want full userspace control --
that's what the userspace cpufreq governor is for.
Fourthly, the code duplication which your implementation leads to is obvious
for the speedstep-centrino case. And in contrast to Pavel, I do not consider
it a "tiny cleanup".
I'd suggest that you try upgrading the cpufreq infrastructure to provide
full support for multiple types of POWEROPs:
a) Setting of "policies"
- New "min" or "max" values for all powerop_values are set, verified
by powerop lowlevel drivers, powerop governors and external
notifiers. E.g. if a new frequency min/max pair is required, the
voltage level gets a new min and max value as well --> you need to
handle recursion.
- If necessary a new "powerop governor" is started.
- Each powerop governor specifies which POWEROPs it can handle
- current cpufreq governors can handle CPU_FREQUENCY,
CPU_VOLTAGE and FRONT_SIDE_BUS_SPEED
- a userspace fallback-governor always "handles" the
parameters no other governor handles
b) Setting of "values"
- Each governor can initiate transitions between the "min" and "max"
values for the operating points it has acquired ownership of.
- The new setting is announced to all other governors and to external
notifiers. If some entity decides it cannot live with this
new setting, it breaks out. Note that this should not happen very
often, as the "normal" verification takes place in a) above.
Nonetheless, if you want to break out CPU_VOLTAGE and CPU_FREQUENCY, you
need it. And as it makes life for the kernel so much more
difficult, I'm against doing so.
- The low-level driver handling the powerop_value is called
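The two steps can be sketched roughly as follows. This is only an
illustration of the min/max propagation in a) and the bounded transition
in b): governors and notifiers are left out, and the hw_min/hw_max fields,
the to_dependent hook and the linear frequency-to-voltage relation are
assumptions, not part of the structure proposed above.

```c
#include <assert.h>
#include <stddef.h>

struct powerop_value {
	unsigned long value_cur;
	unsigned long value_min;
	unsigned long value_max;
	unsigned long hw_min;	/* hardware limits, assumed known to the driver */
	unsigned long hw_max;
	struct powerop_value *cross_dependency;
	/* Hypothetical hook: map this value's limit to the dependent's limit. */
	unsigned long (*to_dependent)(unsigned long v);
};

static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Step a): install a new min/max pair, verify it against the hardware limits,
 * and recurse into the dependent value (e.g. frequency -> voltage).
 * A cyclic cross_dependency chain would recurse forever; not handled here. */
static int powerop_set_policy(struct powerop_value *pv,
			      unsigned long min, unsigned long max)
{
	assert(pv != NULL);
	if (min > max)
		return -1;
	pv->value_min = clamp_ul(min, pv->hw_min, pv->hw_max);
	pv->value_max = clamp_ul(max, pv->hw_min, pv->hw_max);
	pv->value_cur = clamp_ul(pv->value_cur, pv->value_min, pv->value_max);

	if (pv->cross_dependency && pv->to_dependent)
		return powerop_set_policy(pv->cross_dependency,
					  pv->to_dependent(pv->value_min),
					  pv->to_dependent(pv->value_max));
	return 0;
}

/* Step b): a governor may only move the value inside the current policy. */
static int powerop_set_value(struct powerop_value *pv, unsigned long v)
{
	assert(pv != NULL);
	if (v < pv->value_min || v > pv->value_max)
		return -1;
	pv->value_cur = v;
	return 0;
}

/* Made-up linear frequency (MHz) -> voltage (mV) relation for the demo. */
static unsigned long mhz_to_mv(unsigned long mhz)
{
	return 700 + mhz / 2;
}
```

Even in this reduced form the recursion in step a) shows where the extra
complexity comes from once frequency and voltage are separate values.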
Thanks,
Dominik
* Re: PowerOP 0/3: System power operating point management API
2005-08-16 8:53 ` Dominik Brodowski
@ 2005-08-16 8:57 ` Dominik Brodowski
2005-08-17 1:52 ` Todd Poynor
2005-08-17 1:39 ` Todd Poynor
1 sibling, 1 reply; 9+ messages in thread
From: Dominik Brodowski @ 2005-08-16 8:57 UTC (permalink / raw)
To: Todd Poynor, cpufreq, Patrick Mochel, linux-pm, linux-kernel,
Pavel Machek
A small add-on:
We need to make sure that we're capable of handling smart CPUs like Transmeta
Crusoe processors in a sane way. This means
> b) Setting of "values"
is optional if the hardware itself can be set to a min/max value (step a
above in previous mail).
Dominik
* Re: PowerOP 0/3: System power operating point management API
2005-08-16 8:53 ` Dominik Brodowski
2005-08-16 8:57 ` Dominik Brodowski
@ 2005-08-17 1:39 ` Todd Poynor
1 sibling, 0 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-17 1:39 UTC (permalink / raw)
To: Dominik Brodowski, Todd Poynor, cpufreq, Patrick Mochel, linux-pm,
linux-kernel, Pavel Machek
Dominik Brodowski wrote:
> First, the table interface you suggest is ugly. If there's indeed the need for
> such an abstraction, I'd favour something like
I'm planning to adopt the previous suggestions of an opaque data
structure and stop trying to have any generic structure to it. I'll try
to leave dependency checking etc. to the upper layers as much as
possible, since platforms vary greatly in this and so do the needs of
different PM s/w stacks.
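Roughly what I have in mind, as a sketch only (names and layout are
hypothetical and may well change): the core sees an incomplete struct
type and hands the pointer to the backend untouched, and only the
machine-specific code defines the contents.

```c
#include <assert.h>
#include <stddef.h>

/* Core view: an operating point is an opaque blob owned by the backend. */
struct powerop_point;	/* deliberately left incomplete for core code */

struct powerop_ops {
	int (*set_point)(const struct powerop_point *pt);
};

static int powerop_set_point(const struct powerop_ops *ops,
			     const struct powerop_point *pt)
{
	assert(ops != NULL && ops->set_point != NULL);
	return pt ? ops->set_point(pt) : -1;
}

/* ---- backend side (would live in a machine-specific file) ---- */

/* Hypothetical layout; only the backend knows it. */
struct powerop_point {
	unsigned int cpu_mhz;
	unsigned int core_mv;
};

static struct powerop_point hw_state;	/* stand-in for real registers */

static int mach_set_point(const struct powerop_point *pt)
{
	if (pt->core_mv == 0)
		return -1;	/* minimal sanity check; the rest stays upstairs */
	hw_state = *pt;
	return 0;
}

static const struct powerop_ops mach_ops = { mach_set_point };
```

The core never dereferences the point, so each platform is free to put
registers, dividers, flags or ranges in it as needed.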
> Secondly, you do not address the cross-relationships between operating points
> correctly. If you change the CPU frequency, you may have to switch other
> (memory, video) settings; you might even have to validate the frequency
> settings for these or even additional reasons (thermal and battery reasons -
> ACPI _PPC).
This lowest layer basically assumes that upper-layer software has
created an appropriate operating point (for example, in DPM we pretty
much require a system designer to create operating points that match the
h/w specs and don't go to great lengths to encode rules about this),
and/or will call driver notifiers etc. as needed to adapt to the
changes. Although there may be some sanity checking appropriate at the
PowerOP level, cpufreq, DPM, etc. can for the most part continue to
handle the larger issues of how valid operating points are constructed,
driver callbacks, etc. If you do want to handle various dependencies at
the PowerOP layer then there's nothing that prevents that, but PM
frameworks tend to embody assumptions about how frequently operating
points will change and in what contexts (interrupt, idle...), and this
can influence the code for such things.
> Thirdly, who is to decide on the power management settings? The first and
> intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
> exist. Only under rare circumstances, you want full userspace control --
> that's what the userspace cpufreq governor is for.
Also something left to the existing upper layers; PowerOP isn't intended
to handle any of that. In the embedded space we usually let the system
designer choose operating points supported by their h/w vendor and that
match their particular system states (hardware enabled at any point in
time, type and power/performance needs of software currently running).
We do recommend that a userspace power policy manager be the component
in charge of PM settings, based on messages from drivers and other apps
on the state of the system. And so that userspace component activates
the operating point (or set of operating points in the case of DPM)
appropriate for current state.
> Fourthly, the code duplication which your implementation leads to is obvious
> for the speedstep-centrino case.
We could move the tables of valid cpu speeds and corresponding voltages
down to the PowerOP level, and there would probably be little
duplication at that point (in fact, with the current patch there's not a
lot of duplication since the actual MSR access was moved to PowerOP and
PowerOP contains little else, but both levels know how to understand the
MSR format, and a more aggressive port to PowerOP could do away with that).
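For illustration, the knowledge that would then live in one place is
roughly the following. This is a sketch and not code from the patch: the
helper names are hypothetical, and the layout (bus ratio in bits 15:8 in
units of 100 MHz, voltage ID in bits 7:0 in 16 mV steps above 700 mV) is
my understanding of what the OP() macro in speedstep-centrino encodes and
should be checked against the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Pentium M layout: bits 15:8 = bus ratio (x100 MHz),
 * bits 7:0 = voltage ID in 16 mV steps above 700 mV. */
static uint16_t perf_ctl_encode(unsigned int mhz, unsigned int mv)
{
	assert(mv >= 700);
	return (uint16_t)(((mhz / 100) << 8) | ((mv - 700) / 16));
}

/* Recover the frequency in MHz from the encoded value. */
static unsigned int perf_ctl_mhz(uint16_t msr)
{
	return ((msr >> 8) & 0xff) * 100;
}

/* Recover the voltage in mV from the encoded value. */
static unsigned int perf_ctl_mv(uint16_t msr)
{
	return 700 + (msr & 0xff) * 16;
}
```

If only the PowerOP layer had these helpers, cpufreq could deal purely in
MHz and mV and the duplicated knowledge of the MSR format would go away.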
Your suggestions of changes to cpufreq governors and policies to handle
governance of non-cpu-speed parameters sound interesting, and I'd be
happy to help figure out what to do about those vs. the lower machine
access layer I've discussed up until now. I'll think more about this
real soon now. Thanks,
--
Todd
* Re: PowerOP 0/3: System power operating point management API
2005-08-16 8:57 ` Dominik Brodowski
@ 2005-08-17 1:52 ` Todd Poynor
0 siblings, 0 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-17 1:52 UTC (permalink / raw)
To: Dominik Brodowski, Todd Poynor, cpufreq, Patrick Mochel, linux-pm,
linux-kernel, Pavel Machek
Dominik Brodowski wrote:
> A small add-on:
>
> We need to make sure that we're capable of handling smart CPUs like Transmeta
> Crusoe processors in a sane way. This means
>
>
>>b) Setting of "values"
>
>
> is optional if the hardware itself can be set to a min/max value (step a
> above in previous mail).
Although I haven't looked into the Crusoe processor support, it may be
that there is a different set of power parameters, not cpu speed
directly, that are appropriate to manage on these platforms (after a
brief look, seems to be a range of frequencies and some sort of flags)?
If so, these sorts of machine-specific power parameters are what
PowerOP is trying to address: it lets upper layers, which may be
presenting an abstracted view of power/performance (such as CPU speed
or speed ranges) to the user, manage the underlying machine-specific
details. Thanks,
--
Todd