PowerOP 0/3: System power operating point management API

* PowerOP 0/3: System power operating point management API
@ 2005-08-09  2:49 Todd Poynor
  2005-08-09 18:12 ` [linux-pm] " Patrick Mochel
  2005-08-10 10:07 ` Pavel Machek
  0 siblings, 2 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-09  2:49 UTC (permalink / raw)
  To: linux-kernel, linux-pm, cpufreq

PowerOP is a system power parameter management API submitted for
discussion.  PowerOP writes and reads power "operating points",
composed of arbitrary integer values, called power parameters,
that correspond to registers, clocks, dividers, voltage regulators,
etc. that may be modified to set a basic power/performance point for the
system.  The core basically passes an array of integer-valued power
parameters (with very little additional structure imposed by the core)
to a platform-specific backend that interprets those values and makes
the requested adjustments.  PowerOP is intended to leave all power
policy decisions to higher layers.  An optional sysfs representation of
power parameters is also available, primarily for diagnostic use.
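
As a rough illustration of the shape described above, here is a minimal
user-space sketch, not the code from the patches.  The parameter names
(PARAM_CPU_MHZ etc.), struct op_point, and the backend_* functions are
all hypothetical, and hw_state merely stands in for the registers a real
backend would program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter indices for an imaginary platform. */
enum { PARAM_CPU_MHZ, PARAM_CORE_MV, PARAM_BUS_DIV, PARAM_COUNT };

/* An operating point is just a fixed-size array of integer parameters. */
struct op_point {
	int param[PARAM_COUNT];
};

/* Stand-in for the hardware state a real backend would program. */
static struct op_point hw_state;

/* Backend "set": interpret the parameters and apply them. */
static int backend_set_point(const struct op_point *pt)
{
	assert(pt != NULL);
	if (pt->param[PARAM_BUS_DIV] <= 0)
		return -1;	/* platform-specific sanity check only */
	hw_state = *pt;
	return 0;
}

/* Backend "get": read the current parameters back. */
static void backend_get_point(struct op_point *pt)
{
	assert(pt != NULL);
	*pt = hw_state;
}
```

Which point to set, and when, is deliberately absent from the sketch;
that decision belongs to the layers above.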

PowerOP can be thought of as a layer below cpufreq that actually
accesses the hardware to make cpu frequency, voltage, core bus, and
perhaps other modifications to set a power point, leaving cpufreq to
manage the interfaces based around the "cpu frequency" abstraction, the
policies and governors that select the frequency, its notifiers, and so
forth.  An example hooking up support for one cpufreq platform to
PowerOP is in patch 3/3.

Depending on the ability of the hardware to make software-controlled
power/performance adjustments, this may be useful to select custom
voltages, bus speeds, etc. in desktop/server systems.  Various embedded
systems have several parameters that can be set.  For example, an XScale
PXA27x could be considered to have six basic power parameters (mainly
cpu run mode and memory and bus dividers) that for the most part should
be set in tandem to known good sets of values as validated by the
silicon vendor, plus other parameters possible for disabling PLLs during
low-speed execution, and so forth.  PowerOP is aimed at supporting this
kind of system, where the cpu frequency abstraction specifies only part
of the operating point that may be managed from software.  It also
pushes the hardware-level power parameter management down to a level
that can be shared with other power management policy frameworks, in use
in some embedded systems, that wish to deal with entire operating points
as the basic unit of system power management.

There are many ways to tackle those issues, of course, and a new API
layer is arguably rather heavyweight.  This is one suggested way that
tries to minimize disturbing existing power management code.  Comments
very much appreciated.

Patch 2/3 is a desktop-oriented example of PowerOP; embedded examples
will follow soon.

--
Todd

* Re: [linux-pm] PowerOP 0/3: System power operating point management API
  2005-08-09  2:49 PowerOP 0/3: System power operating point management API Todd Poynor
@ 2005-08-09 18:12 ` Patrick Mochel
  2005-08-10  2:18   ` Todd Poynor
  2005-08-10 10:07 ` Pavel Machek
  1 sibling, 1 reply; 9+ messages in thread
From: Patrick Mochel @ 2005-08-09 18:12 UTC (permalink / raw)
  To: Todd Poynor; +Cc: linux-kernel, linux-pm, cpufreq


On Mon, 8 Aug 2005, Todd Poynor wrote:

> PowerOP is a system power parameter management API submitted for
> discussion.  PowerOP writes and reads power "operating points",
> composed of arbitrary integer values, called power parameters,
> that correspond to registers, clocks, dividers, voltage regulators,
> etc. that may be modified to set a basic power/performance point for the
> system.  The core basically passes an array of integer-valued power
> parameters (with very little additional structure imposed by the core)
> to a platform-specific backend that interprets those values and makes
> the requested adjustments.  PowerOP is intended to leave all power
> policy decisions to higher layers.  An optional sysfs representation of
> power parameters is also available, primarily for diagnostic use.

What do those higher layers look like? Do you have a userspace component
that uses this interface?

Who is using this code? Are there vendors that are already shipping
systems with this enabled?

Is this part of the DPM project? If so, what other components are left in
DPM?

What are your plans to integrate this more with the cpufreq code?

Thanks,


	Pat


* Re: [linux-pm] PowerOP 0/3: System power operating point management API
  2005-08-09 18:12 ` [linux-pm] " Patrick Mochel
@ 2005-08-10  2:18   ` Todd Poynor
       [not found]     ` <20050809030000.GA25112@slurryseal.ddns.mvista.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Todd Poynor @ 2005-08-10  2:18 UTC (permalink / raw)
  To: Patrick Mochel; +Cc: linux-kernel, linux-pm, cpufreq

Patrick Mochel wrote:
> On Mon, 8 Aug 2005, Todd Poynor wrote:
(apologies for use of obsolete cpufreq mailing list address in my 
initial message.)
...
>>PowerOP is intended to leave all power
>>policy decisions to higher layers.
> 
> What do those higher layers look like? Do you have a userspace component
> that uses this interface?

cpufreq is one example: it manages an abstraction of system
power/performance levels based on cpu speed, which maps onto the
PowerOP-level hardware capabilities in some fashion, and has both kernel
and userspace components to manage the desired policy associated with
this.  Regardless of whether this notion of configurable operating
points remains a separate layer from cpufreq or is more tightly
integrated, the code to set these operating points can handle things
such as setting validated voltage levels to match cpu speeds, etc.

For embedded systems, I am aware only of the Dynamic Power Management
project, which you also mention and which does indeed manage power policy
based on the notions of power parameters and operating points.  The
settings of these are configured entirely from userspace via sysfs, 
using shell scripts or convenience libraries that access the sysfs 
attributes.  A system designer chooses the operating points to be 
employed in the system based on the information from the processor or 
board vendor that describes validated, supported operating points and 
based on the characteristics of the system (how fast it needs to run 
while in use for different purposes and how much battery power can be 
spent for those purposes).

For example, a designer implementing a system based on an Intel XScale 
PXA27x processor can choose from among about 16 validated operating 
points listed in the most recent specification update.  Those operating 
points are comprised of register settings with inscrutable names such as 
CCCR[L], CCCR[2N], CLKCFG[T], CCCR[A], and two or three others.  A few 
of those operating points run the CPU at identical frequencies, but have 
other changes in memory clocking, system bus clocking, and the ability 
to quickly switch between certain cpu frequencies based on other 
properties of the platform (so-called "Turbo-mode" frequency scaling). 
A DPM- or PowerOP-based system can be configured with the subset of 
desired operating points and a particular operating point activated as 
needed.  The policy decision as to what operating point is appropriate 
to activate is a matter for custom code provided by the designer, 
tailored to their system.  It is also possible to write automated 
operating point selection algorithms based on such criteria as system 
busyness.

> Who is using this code? Are there vendors that are already shipping
> systems with this enabled?
> 
> Is this part of the DPM project? If so, what other components are left in
> DPM?

The concepts and general Linux implementation of power parameters and
operating points stem from the power-aware computing work done by
Bishop Brock and Karthick Rajamani of IBM Research, and a somewhat
different implementation is a part of the DPM project, which MontaVista
ships (and reportedly others will in the near future).  So far as I
understand there are or soon will be mobile phones that use that code as 
the low- to mid-layers of the power management stack (the high-layer 
policy management is performed by a custom application of which I have 
no knowledge).

I mentioned in a previous email the next step of creating and activating 
operating points from userspace.  If that were in place, DPM would 
additionally consist primarily of:

1. Machine-specific backends to set operating points for the systems 
that DPM has been ported to.  If something like PowerOP is accepted into 
a broader community then that code would come along for the ride. 
XScale PXA27x and various ARM OMAPs are among the systems supported, as 
well as potentially others not yet making an appearance in open source.

2. DPM has further concepts of "operating state" (generally, whether the 
system is idle, processing interrupts, running a normal-power-usage 
task, running a background task without deadlines that can be assigned a 
low power/performance level, etc.) and the unfortunately-named "policy" 
that maps each operating state to an operating point, along with the 
code to switch in different operating points as the system switches 
operating states.  The "policy" is a bit of a misnomer; a system 
designer must create the desired operating points and decide upon the 
state -> point mappings appropriate, as well as make decisions on when 
to update the mappings based on external events, changing workloads, 
etc.  There are a few extra ramifications of modifying operating points 
in this fashion, including the need to handle such transitions while in 
interrupt context or in the idle loop, as well as a general concern for 
low overhead since switching may occur very frequently (such as at every 
entry and exit from idle).

3. Kernel-to-userspace power event notification is temporarily based on 
executing hotplug scripts.  This is outside the true domain of DPM, but 
in the absence of an acpid-like de facto standard for communicating 
power events it seemed best to provide some sort of mechanism.  kobject 
uevents are now the proper choice, and I'd propose use of that, as a 
separate matter from what I'm hoping to accomplish with PowerOP or the 
rest of DPM.

All of these are GPL software available on the project site.
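
The state -> point mapping in item 2 can be pictured roughly as below.
This is only an illustrative sketch, not DPM's actual data structures:
the state names, struct policy, set_state(), and any frequency/voltage
values used with it are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating states, loosely following the list above. */
enum op_state { STATE_IDLE, STATE_TASK, STATE_BACKGROUND, STATE_COUNT };

/* A named operating point; the two parameters are illustrative. */
struct op_point {
	const char *name;
	int cpu_mhz;
	int core_mv;
};

/* A "policy" maps each operating state to one operating point. */
struct policy {
	const struct op_point *point[STATE_COUNT];
};

/* The point currently in effect (stand-in for programming hardware). */
static const struct op_point *active_point;

/* Switch to the operating point the policy assigns to the new state. */
static int set_state(const struct policy *pol, enum op_state state)
{
	assert(pol != NULL);
	if ((unsigned)state >= STATE_COUNT || pol->point[state] == NULL)
		return -1;	/* no mapping: leave the current point alone */
	active_point = pol->point[state];
	return 0;
}
```

Because a switch like this may run at every idle entry and exit, the
sketch keeps it to a table lookup and a pointer assignment.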

> What are your plans to integrate this more with the cpufreq code?

At this point it's a proposed layer that does not disturb existing 
cpufreq code much, but if the cpufreq folks are receptive to these ideas 
I'd be all for a tighter integration.  Others have already asked for the 
ability to manage voltages along with cpu speed, so in one way or 
another it seems likely that an expanded set of power parameters may be 
provided in the future.  But I don't have any insight into the wishes or 
goals of the project.  Thanks,

-- 
Todd

* Re: PowerOP 0/3: System power operating point management API
  2005-08-09  2:49 PowerOP 0/3: System power operating point management API Todd Poynor
  2005-08-09 18:12 ` [linux-pm] " Patrick Mochel
@ 2005-08-10 10:07 ` Pavel Machek
  2005-08-10 22:02   ` Todd Poynor
  1 sibling, 1 reply; 9+ messages in thread
From: Pavel Machek @ 2005-08-10 10:07 UTC (permalink / raw)
  To: Todd Poynor; +Cc: linux-kernel, linux-pm, cpufreq

Hi!

> PowerOP is a system power parameter management API submitted for
> discussion.  PowerOP writes and reads power "operating points",
> composed of arbitrary integer values, called power parameters,
> that correspond to registers, clocks, dividers, voltage regulators,
> etc. that may be modified to set a basic power/performance point for the
> system.  The core basically passes an array of integer-valued power
> parameters (with very little additional structure imposed by the core)
> to a platform-specific backend that interprets those values and makes
> the requested adjustments.  PowerOP is intended to leave all power
> policy decisions to higher layers.  An optional sysfs representation of
> power parameters is also available, primarily for diagnostic use.
> 
> PowerOP can be thought of as a layer below cpufreq that actually
> accesses the hardware to make cpu frequency, voltage, core bus, and
> perhaps other modifications to set a power point, leaving cpufreq to
> manage the interfaces based around the "cpu frequency" abstraction, the
> policies and governors that select the frequency, its notifiers, and so
> forth.  An example hooking up support for one cpufreq platform to
> PowerOP is in patch 3/3.
> 
> Depending on the ability of the hardware to make software-controlled
> power/performance adjustments, this may be useful to select custom
> voltages, bus speeds, etc. in desktop/server systems.  Various embedded
> systems have several parameters that can be set.  For example, an XScale
> PXA27x could be considered to have six basic power parameters (mainly
> cpu run mode and memory and bus dividers) that for the most part
> should

This scares me a bit. Is a table enough to handle this? I'm afraid that
the table will get very large on systems that allow you to do "almost
anything".
								Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address

* Re: PowerOP 0/3: System power operating point management API
  2005-08-10 10:07 ` Pavel Machek
@ 2005-08-10 22:02   ` Todd Poynor
  0 siblings, 0 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-10 22:02 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel, linux-pm, cpufreq

Pavel Machek wrote:
>>Depending on the ability of the hardware to make software-controlled
>>power/performance adjustments, this may be useful to select custom
>>voltages, bus speeds, etc. in desktop/server systems.  Various embedded
>>systems have several parameters that can be set.  For example, an XScale
>>PXA27x could be considered to have six basic power parameters (mainly
>>cpu run mode and memory and bus dividers) that for the most part
>>should
> 
> 
> This scares me a bit. Is a table enough to handle this? I'm afraid that
> the table will get very large on systems that allow you to do "almost
> anything".

Exhaustive tables for all combinations of possible parameters aren't 
expected (or practical for many systems as you note).  In practice, a 
subset of these possible operating points are created and activated over 
the lifetime of the system, where the subset is chosen by a system 
designer according to the needs of the particular system.  It's a matter 
for the higher-layer power management software to decide whether to have 
in-kernel tables of the possible operating points (as cpufreq does for 
various platforms) or whether to require userspace to create only the 
ones wanted (as does DPM).  There are cpufreq patches for PXA27x 
somewhere, for example, and in that case a subset of the supported 
operating points (and there are still only about 16 of those even for 
such a complicated piece of hardware) are represented in the kernel 
tables, choosing one of the possible combinations of memory/bus/etc. 
parameters for each unique cpu frequency.  Thanks,

-- 
Todd

* Re: PowerOP 0/3: System power operating point management API
       [not found]     ` <20050809030000.GA25112@slurryseal.ddns.mvista.com>
@ 2005-08-16  8:53       ` Dominik Brodowski
  2005-08-16  8:57         ` Dominik Brodowski
  2005-08-17  1:39         ` Todd Poynor
  0 siblings, 2 replies; 9+ messages in thread
From: Dominik Brodowski @ 2005-08-16  8:53 UTC (permalink / raw)
  To: Todd Poynor; +Cc: cpufreq, Patrick Mochel, linux-pm, linux-kernel, Pavel Machek

Hi!

The PowerOP infrastructure you suggest surely is one path to better runtime
power management in the Linux kernel. However, I don't like it at all in its
current implementation. Here are a few suggestions for improvements,
rewrites, and so on:

First, the table interface you suggest is ugly. If there's indeed the need for
such an abstraction, I'd favour something like

	struct powerop {
		struct list_head	powerop_values; /* linked list of powerop_values */
		...
	};

	struct powerop_value {
		unsigned long		value_cur;
		unsigned long		value_min;
		unsigned long		value_max;
		struct list_head	next;
		u32			type;	/* POWEROP_TYPE_* bitmask; GPU bit needs > 16 bits */
		struct powerop_value	*cross_dependency;
		struct powerop_driver	*driver;
	};

	#define POWEROP_TYPE_CPU_FREQUENCY		0x00000001
	#define POWEROP_TYPE_CPU_VOLTAGE		0x00000002
	#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED	0x00000004
	...
	#define POWEROP_TYPE_GPU_FREQUENCY	0x00010000
	...

and if CPU_VOLTAGE and CPU_FREQUENCY can only be modified at the same time (as
most cpufreq drivers require), type is 0x00000003.
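
To spell out the bitmask arithmetic, here is a small stand-alone sketch.
The three POWEROP_TYPE_* values are the ones listed above; the helper
names powerop_type_join() and powerop_type_covers() are invented for the
example.  It stores the mask in a 32-bit integer so that bits above
0x0000ffff, such as the GPU one, fit.

```c
#include <assert.h>
#include <stdint.h>

#define POWEROP_TYPE_CPU_FREQUENCY	0x00000001u
#define POWEROP_TYPE_CPU_VOLTAGE	0x00000002u
#define POWEROP_TYPE_GPU_FREQUENCY	0x00010000u

/* Combine two disjoint parameter types into one mask. */
static uint32_t powerop_type_join(uint32_t a, uint32_t b)
{
	assert((a & b) == 0);	/* the two masks must not overlap */
	return a | b;
}

/* True if a value of type `type` controls every parameter in `want`. */
static int powerop_type_covers(uint32_t type, uint32_t want)
{
	return (type & want) == want;
}
```

Joining CPU_FREQUENCY and CPU_VOLTAGE this way yields 0x00000003, the
value mentioned above for parameters that can only change together.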

Secondly, you do not address the cross-relationships between operating points
correctly. If you change the CPU frequency, you may have to switch other
(memory, video) settings; you might even have to validate the frequency
settings for these or even additional reasons (thermal and battery reasons -
ACPI _PPC).

Thirdly, who is to decide on the power management settings? The first and
intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
exist. Only under rare circumstances, you want full userspace control --
that's what the userspace cpufreq governor is for.

Fourthly, the code duplication which your implementation leads to is obvious
for the speedstep-centrino case. And in contrast to Pavel, I do not consider
it a "tiny cleanup".

I'd suggest that you try upgrading the cpufreq infrastructure to provide
full support for multiple types of POWEROPs:

a)	Setting of "policies"
	- New "min" or "max" values for all powerop_values are set, verified
	  by powerop lowlevel drivers, powerop governors and external
	  notifiers. E.g. if a new frequency min/max pair is required, the
	  voltage level gets a new min and max value as well --> you need to
	  handle recursion.
	- If necessary a new "powerop governor" is started.
	   - Each powerop governor specifies which POWEROPs it can handle
		- current cpufreq governors can handle CPU_FREQUENCY,
		  CPU_VOLTAGE and FRONT_SIDE_BUS_SPEED
		- a userspace fallback-governor always "handles" the
		  parameters no other governor handles

b)	Setting of "values"
	- Each governor can initiate transitions between the "min" and "max"
	  values for operating points it has acquired ownership of.
	- The new setting is notified to all other governors and to external
	  notifiers. If some entity decides it cannot live well with this
	  new setting, it breaks out. Note that this should not happen
	  often, as the "normal" verification takes place in a) above.
	  Nonetheless, if you want to break out CPU_VOLTAGE and CPU_FREQUENCY, you
	  need it. And as it makes life for the kernel so much more
	  difficult, I'm against doing so.
	- The low-level driver handling the powerop_value is called
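
A rough user-space sketch of the two steps, to make the idea concrete.
Everything here is hypothetical (struct pv, pv_set_policy(),
pv_set_value(), the example thermal_veto() notifier and its 800 limit),
and it ignores the recursion across dependent values mentioned in a):

```c
#include <assert.h>
#include <stddef.h>

/* One power parameter with its current value and policy limits. */
struct pv {
	unsigned long cur, min, max;
};

/* a) Set a policy: clip the request to hardware limits, then adopt it. */
static int pv_set_policy(struct pv *v, unsigned long hw_min,
			 unsigned long hw_max, unsigned long min,
			 unsigned long max)
{
	assert(v != NULL);
	if (min < hw_min)
		min = hw_min;
	if (max > hw_max)
		max = hw_max;
	if (min > max)
		return -1;		/* nothing valid left */
	v->min = min;
	v->max = max;
	if (v->cur < min)
		v->cur = min;		/* pull current value into range */
	if (v->cur > max)
		v->cur = max;
	return 0;
}

/* A notifier returns non-zero to refuse a proposed value. */
typedef int (*pv_notifier)(const struct pv *v, unsigned long target);

/* Example notifier: refuse anything above an arbitrary limit of 800. */
static int thermal_veto(const struct pv *v, unsigned long target)
{
	(void)v;
	return target > 800;
}

/* b) Set a value: must lie within the policy and survive the notifier. */
static int pv_set_value(struct pv *v, unsigned long target,
			pv_notifier veto)
{
	assert(v != NULL);
	if (target < v->min || target > v->max)
		return -1;
	if (veto != NULL && veto(v, target) != 0)
		return -1;		/* some entity "breaks out" */
	v->cur = target;		/* low-level driver would run here */
	return 0;
}
```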

Thanks,
	Dominik

* Re: PowerOP 0/3: System power operating point management API
  2005-08-16  8:53       ` Dominik Brodowski
@ 2005-08-16  8:57         ` Dominik Brodowski
  2005-08-17  1:52           ` Todd Poynor
  2005-08-17  1:39         ` Todd Poynor
  1 sibling, 1 reply; 9+ messages in thread
From: Dominik Brodowski @ 2005-08-16  8:57 UTC (permalink / raw)
  To: Todd Poynor, cpufreq, Patrick Mochel, linux-pm, linux-kernel,
	Pavel Machek

A small add-on:

We need to make sure that we're capable of handling smart CPUs like Transmeta
Crusoe processors in a sane way. This means

> b)	Setting of "values"

is optional if the hardware itself can be set to a min/max value (step a
above in previous mail).

	Dominik

* Re: PowerOP 0/3: System power operating point management API
  2005-08-16  8:53       ` Dominik Brodowski
  2005-08-16  8:57         ` Dominik Brodowski
@ 2005-08-17  1:39         ` Todd Poynor
  1 sibling, 0 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-17  1:39 UTC (permalink / raw)
  To: Dominik Brodowski, Todd Poynor, cpufreq, Patrick Mochel, linux-pm,
	linux-kernel, Pavel Machek

Dominik Brodowski wrote:

> First, the table interface you suggest is ugly. If there's indeed the need for
> such an abstraction, I'd favour something like

I'm planning to adopt the previous suggestions of an opaque data 
structure and stop trying to have any generic structure to it.  I'll try 
to leave dependency checking etc. to the upper layers as much as 
possible, since platforms vary greatly in this and so do the needs of 
different PM s/w stacks.

> Secondly, you do not address the cross-relationships between operating points
> correctly. If you change the CPU frequency, you may have to switch other
> (memory, video) settings; you might even have to validate the frequency
> settings for these or even additional reasons (thermal and battery reasons -
> ACPI _PPC).

This lowest layer basically assumes that upper-layer software has 
created an appropriate operating point (for example, in DPM we pretty 
much require a system designer to create operating points that match the 
h/w specs and don't go to great lengths to encode rules about this), 
and/or will call driver notifiers etc. as needed to adapt to the 
changes.  Although there may be some sanity checking appropriate at the 
PowerOP level, cpufreq, DPM, etc. can for the most part continue to 
handle the larger issues of how valid operating points are constructed, 
driver callbacks, etc.  If you do want to handle various dependencies at 
the PowerOP layer then there's nothing that prevents that, but PM 
frameworks tend to embody assumptions about how frequently operating 
points will change and in what contexts (interrupt, idle...), and this 
can influence the code for such things.

> Thirdly, who is to decide on the power management settings? The first and
> intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
> exist. Only under rare circumstances, you want full userspace control --
> that's what the userspace cpufreq governor is for.

Also something left to the existing upper layers; PowerOP isn't intended 
to handle any of that.  In the embedded space we usually let the system 
designer choose operating points supported by their h/w vendor and that 
match their particular system states (hardware enabled at any point in 
time, type and power/performance needs of software currently running). 
We do recommend that a userspace power policy manager be the component 
in charge of PM settings, based on messages from drivers and other apps 
on the state of the system.  And so that userspace component activates 
the operating point (or set of operating points in the case of DPM) 
appropriate for current state.

> Fourthly, the code duplication which your implementation leads to is obvious
> for the speedstep-centrino case. 

We could move the tables of valid cpu speeds and corresponding voltages 
down to the PowerOP level, and there would probably be little 
duplication at that point (in fact, with the current patch there's not a 
lot of duplication since the actual MSR access was moved to PowerOP and 
PowerOP contains little else, but both levels know how to understand the 
MSR format, and a more aggressive port to PowerOP could do away with that).

Your suggestions of changes to cpufreq governors and policies to handle 
governance of non-cpu-speed parameters sound interesting, and I'd be 
happy to help figure out what to do about those vs. the lower machine 
access layer I've discussed up until now.  I'll think more about this 
real soon now.  Thanks,

-- 
Todd

* Re: PowerOP 0/3: System power operating point management API
  2005-08-16  8:57         ` Dominik Brodowski
@ 2005-08-17  1:52           ` Todd Poynor
  0 siblings, 0 replies; 9+ messages in thread
From: Todd Poynor @ 2005-08-17  1:52 UTC (permalink / raw)
  To: Dominik Brodowski, Todd Poynor, cpufreq, Patrick Mochel, linux-pm,
	linux-kernel, Pavel Machek

Dominik Brodowski wrote:
> A small add-on:
> 
> We need to make sure that we're capable of handling smart CPUs like Transmeta
> Crusoe processors in a sane way. This means
> 
> 
>>b)	Setting of "values"
> 
> 
> is optional if the hardware itself can be set to a min/max value (step a
> above in previous mail).

Although I haven't looked into the Crusoe processor support, it may be 
that there is a different set of power parameters, not cpu speed 
directly, that are appropriate to manage on these platforms (after a 
brief look, seems to be a range of frequencies and some sort of flags)? 
If so, these sorts of machine-specific power parameters are what
PowerOP is trying to address, allowing the underlying machine-specific
stuff to be managed by upper layers that may be presenting an
abstracted view of power/performance, such as CPU speed or speed ranges,
to the user.  Thanks,

-- 
Todd
