* PowerOP 0/3: System power operating point management API
From: Todd Poynor @ 2005-08-09 2:49 UTC
To: linux-kernel, linux-pm, cpufreq

PowerOP is a system power parameter management API submitted for
discussion. PowerOP writes and reads power "operating points",
comprised of arbitrary integer-valued values, called power parameters,
that correspond to registers, clocks, dividers, voltage regulators,
etc. that may be modified to set a basic power/performance point for the
system. The core basically passes an array of integer-valued power
parameters (with very little additional structure imposed by the core)
to a platform-specific backend that interprets those values and makes
the requested adjustments. PowerOP is intended to leave all power
policy decisions to higher layers. An optional sysfs representation of
power parameters is also available, primarily for diagnostic use.

PowerOP can be thought of as a layer below cpufreq that actually
accesses the hardware to make cpu frequency, voltage, core bus, and
perhaps other modifications to set a power point, leaving cpufreq to
manage the interfaces based around the "cpu frequency" abstraction, the
policies and governors that select the frequency, its notifiers, and so
forth. An example hooking up support for one cpufreq platform to
PowerOP is in patch 3/3.

Depending on the ability of the hardware to make software-controlled
power/performance adjustments, this may be useful to select custom
voltages, bus speeds, etc. in desktop/server systems. Various embedded
systems have several parameters that can be set. For example, an XScale
PXA27x could be considered to have six basic power parameters (mainly
cpu run mode and memory and bus dividers) that for the most part
should be set in tandem to known good sets of values as validated by the
silicon vendor, plus other parameters possible for disabling PLLs during
low-speed execution, and so forth. PowerOP is aimed at supporting this
kind of system, where the cpu frequency abstraction specifies only part
of the operating point that may be managed from software. It also
pushes the hardware-level power parameter management down to a level
that can be shared with other power management policy frameworks, in use
in some embedded systems, that wish to deal with entire operating points
as the basic unit of system power management.

There are many ways to tackle those issues, of course, and a new API
layer is arguably rather heavyweight. This is one suggested way that
tries to minimize disturbing existing power management code. Comments
very much appreciated.

Patch 2/3 is a desktop-oriented example of PowerOP; embedded examples
will follow soon.

--
Todd
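The model described in this message (an operating point as a flat array of integers handed to a platform backend, with no policy in the core) can be sketched in plain C as follows. This is only an illustration under assumptions: the names `op_point`, `op_backend`, `op_set`, `op_get`, the three parameter indices, and the fake in-memory "hardware" are all hypothetical and are not taken from the PowerOP patches.

```c
#include <stddef.h>

/* Hypothetical parameter indices for an imaginary platform. */
enum { PARAM_CPU_MHZ, PARAM_CORE_MV, PARAM_BUS_DIV, PARAM_COUNT };

/* An operating point is just a fixed-size array of integer parameters. */
struct op_point {
    int param[PARAM_COUNT];
};

/* Platform backend: the only code that knows what the integers mean. */
struct op_backend {
    int (*set_point)(const struct op_point *pt);
    int (*get_point)(struct op_point *pt);
};

/* Stand-in for real registers and regulators (assumption for the sketch). */
static struct op_point fake_hw = { { 600, 1100, 2 } };

static int fake_set_point(const struct op_point *pt)
{
    fake_hw = *pt;      /* a real backend would program the hardware here */
    return 0;
}

static int fake_get_point(struct op_point *pt)
{
    *pt = fake_hw;
    return 0;
}

static const struct op_backend fake_backend = {
    fake_set_point,
    fake_get_point,
};

/* Core layer: no policy, it only forwards the array to the backend. */
static int op_set(const struct op_backend *be, const struct op_point *pt)
{
    if (be == NULL || be->set_point == NULL || pt == NULL)
        return -1;
    return be->set_point(pt);
}

static int op_get(const struct op_backend *be, struct op_point *pt)
{
    if (be == NULL || be->get_point == NULL || pt == NULL)
        return -1;
    return be->get_point(pt);
}
```

A higher layer such as cpufreq would, in this picture, translate its own abstraction (a target frequency) into a complete `op_point` and call `op_set()`; the core itself never decides which point is appropriate.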
* Re: [linux-pm] PowerOP 0/3: System power operating point management API
From: Patrick Mochel @ 2005-08-09 18:12 UTC
To: Todd Poynor
Cc: linux-kernel, linux-pm, cpufreq

On Mon, 8 Aug 2005, Todd Poynor wrote:

> PowerOP is a system power parameter management API submitted for
> discussion. PowerOP writes and reads power "operating points",
> comprised of arbitrary integer-valued values, called power parameters,
> that correspond to registers, clocks, dividers, voltage regulators,
> etc. that may be modified to set a basic power/performance point for the
> system. The core basically passes an array of integer-valued power
> parameters (with very little additional structure imposed by the core)
> to a platform-specific backend that interprets those values and makes
> the requested adjustments. PowerOP is intended to leave all power
> policy decisions to higher layers. An optional sysfs representation of
> power parameters is also available, primarily for diagnostic use.

What do those higher layers look like? Do you have a userspace component
that uses this interface?

Who is using this code? Are there vendors that are already shipping
systems with this enabled?

Is this part of the DPM project? If so, what other components are left in
DPM?

What are your plans to integrate this more with the cpufreq code?

Thanks,

	Pat
* Re: [linux-pm] PowerOP 0/3: System power operating point management API
From: Todd Poynor @ 2005-08-10 2:18 UTC
To: Patrick Mochel
Cc: linux-kernel, linux-pm, cpufreq

Patrick Mochel wrote:
> On Mon, 8 Aug 2005, Todd Poynor wrote:

(apologies for use of obsolete cpufreq mailing list address in my
initial message.)

...
>>PowerOP is intended to leave all power
>>policy decisions to higher layers.
>
> What do those higher layers look like? Do you have a userspace component
> that uses this interface?

cpufreq is one example: it manages an abstraction of system
power/performance levels based on cpu speed, which maps onto the
PowerOP-level hardware capabilities in some fashion, and has both kernel
and userspace components to manage the desired policy associated with
this. Regardless of whether this notion of configurable operating
points would remain a separate layer from cpufreq or was more tightly
integrated, the code to set these operating points can handle things
such as setting validated voltage levels to match cpu speeds, etc.

For embedded systems, I am aware only of the Dynamic Power Management
project, which you also mention and which does indeed manage power
policy based on the notions of power parameters and operating points.
The settings of these are configured entirely from userspace via sysfs,
using shell scripts or convenience libraries that access the sysfs
attributes. A system designer chooses the operating points to be
employed in the system based on the information from the processor or
board vendor that describes validated, supported operating points and
based on the characteristics of the system (how fast it needs to run
while in use for different purposes and how much battery power can be
spent for those purposes).

For example, a designer implementing a system based on an Intel XScale
PXA27x processor can choose from among about 16 validated operating
points listed in the most recent specification update. Those operating
points are comprised of register settings with inscrutable names such as
CCCR[L], CCCR[2N], CLKCFG[T], CCCR[A], and two or three others. A few
of those operating points run the CPU at identical frequencies, but have
other changes in memory clocking, system bus clocking, and the ability
to quickly switch between certain cpu frequencies based on other
properties of the platform (so-called "Turbo-mode" frequency scaling).

A DPM- or PowerOP-based system can be configured with the subset of
desired operating points and a particular operating point activated as
needed. The policy decision as to what operating point is appropriate
to activate is a matter for custom code provided by the designer,
tailored to their system. It is also possible to write automated
operating point selection algorithms based on such criteria as system
busyness.

> Who is using this code? Are there vendors that are already shipping
> systems with this enabled?
>
> Is this part of the DPM project? If so, what other components are left in
> DPM?

The concepts and general Linux implementation of power parameters and
operating points stem from the power-aware computing work done by
Bishop Brock and Karthick Rajamani of IBM Research, and a somewhat
different implementation is a part of the DPM project, which MontaVista
(and reportedly others in the near future) does ship. So far as I
understand there are or soon will be mobile phones that use that code as
the low- to mid-layers of the power management stack (the high-layer
policy management is performed by a custom application of which I have
no knowledge).

I mentioned in a previous email the next step of creating and activating
operating points from userspace. If that were in place, DPM would
additionally consist primarily of:

1. Machine-specific backends to set operating points for the systems
that DPM has been ported to. If something like PowerOP is accepted into
a broader community then that code would come along for the ride.
XScale PXA27x and various ARM OMAPs are among the systems supported, as
well as potentially others not yet making an appearance in open source.

2. DPM has further concepts of "operating state" (generally, whether the
system is idle, processing interrupts, running a normal-power-usage
task, running a background task without deadlines that can be assigned a
low power/performance level, etc.) and the unfortunately-named "policy"
that maps each operating state to an operating point, along with the
code to switch in different operating points as the system switches
operating states. The "policy" is a bit of a misnomer; a system
designer must create the desired operating points and decide upon the
state -> point mappings appropriate, as well as make decisions on when
to update the mappings based on external events, changing workloads,
etc. There are a few extra ramifications of modifying operating points
in this fashion, including the need to handle such transitions while in
interrupt context or in the idle loop, as well as a general concern for
low overhead since switching may occur very frequently (such as at every
entry and exit from idle).

3. Kernel-to-userspace power event notification is temporarily based on
executing hotplug scripts. This is outside the true domain of DPM, but
in the absence of an acpid-like de facto standard for communicating
power events it seemed best to provide some sort of mechanism. kobject
uevents are now the proper choice, and I'd propose use of that, as a
separate matter from what I'm hoping to accomplish with PowerOP or the
rest of DPM.

All of these are GPL software available on the project site.

> What are your plans to integrate this more with the cpufreq code?

At this point it's a proposed layer that does not disturb existing
cpufreq code much, but if the cpufreq folks are receptive to these ideas
I'd be all for a tighter integration. Others have already asked for the
ability to manage voltages along with cpu speed, so in one way or
another it seems likely that an expanded set of power parameters may be
provided in the future. But I don't have any insight into the wishes or
goals of the project. Thanks,

--
Todd
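The DPM notion of a "policy" described in item 2 above (a map from each operating state to an operating point, consulted on every state change) might look roughly like the sketch below. It is an assumption-laden illustration, not DPM code: the state names, `op_policy`, `enter_state()`, and the two example points with their frequencies and voltages are all invented for the example.

```c
#include <stddef.h>

/* Hypothetical operating states, loosely following the description above. */
enum op_state {
    STATE_IDLE,
    STATE_IRQ,
    STATE_TASK,
    STATE_BACKGROUND,
    STATE_COUNT
};

/* Simplified operating point; the values used below are made up. */
struct op_point {
    const char *name;
    int cpu_mhz;
    int core_mv;
};

/* A "policy" in the DPM sense: one operating point per operating state. */
struct op_policy {
    const struct op_point *map[STATE_COUNT];
};

static const struct op_point pt_low  = { "low",  100,  900 };
static const struct op_point pt_high = { "high", 400, 1300 };

static const struct op_policy example_policy = {
    .map = {
        [STATE_IDLE]       = &pt_low,
        [STATE_IRQ]        = &pt_high,
        [STATE_TASK]       = &pt_high,
        [STATE_BACKGROUND] = &pt_low,
    },
};

static const struct op_point *current_point;
static unsigned int switch_count;

/*
 * Called on every state transition, so it has to stay cheap: the hardware
 * is only touched when the mapped point actually differs from the current one.
 */
static const struct op_point *enter_state(const struct op_policy *pol,
                                          enum op_state st)
{
    const struct op_point *next;

    if (pol == NULL || (unsigned int)st >= STATE_COUNT)
        return current_point;

    next = pol->map[st];
    if (next != NULL && next != current_point) {
        current_point = next;   /* real code would program the hardware here */
        switch_count++;
    }
    return current_point;
}
```

Skipping the switch when two states map to the same point reflects the low-overhead concern raised above, since transitions such as idle entry and exit can happen very frequently.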
[parent not found: <20050809030000.GA25112@slurryseal.ddns.mvista.com>]
* Re: PowerOP 0/3: System power operating point management API
From: Dominik Brodowski @ 2005-08-16 8:53 UTC
To: Todd Poynor
Cc: cpufreq, Patrick Mochel, linux-pm, linux-kernel, Pavel Machek

Hi!

The PowerOP infrastructure you suggest surely is one path to better runtime
power management in the Linux kernel. However, I don't like it at all in its
current implementation. Here are a few suggestions for improvements,
rewrites, and so on:

First, the table interface you suggest is ugly. If there's indeed the need for
such an abstraction, I'd favour something like

struct powerop {
	struct list_head	powerop_values;	/* linked list of powerop_values */
	...
};

struct powerop_value {
	unsigned long		value_cur;
	unsigned long		value_min;
	unsigned long		value_max;
	struct list_head	next;
	u32			type;		/* bitmask of POWEROP_TYPE_* */
	struct powerop_value	*cross_dependency;
	struct powerop_driver	*driver;
};

#define POWEROP_TYPE_CPU_FREQUENCY		0x00000001
#define POWEROP_TYPE_CPU_VOLTAGE		0x00000002
#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED	0x00000004
...
#define POWEROP_TYPE_GPU_FREQUENCY		0x00010000
...

and if CPU_VOLTAGE and CPU_FREQUENCY can only be modified at the same time
(as most cpufreq drivers require), type is 0x00000003.

Secondly, you do not address the cross-relationships between operation points
correctly. If you change the CPU frequency, you may have to switch other
(memory, video) settings; you might even have to validate the frequency
settings for these or even additional reasons (thermal and battery reasons -
ACPI _PPC).

Thirdly, who is to decide on the power management settings? The first and
intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
exist. Only under rare circumstances, you want full userspace control --
that's what the userspace cpufreq governor is for.

Fourthly, the code duplication which your implementation leads to is obvious
for the speedstep-centrino case. And in contrast to Pavel, I do not consider
it a "tiny cleanup".

I'd suggest that you try upgrading the cpufreq infrastructure to provide
full support for multiple types of POWEROPs:

a) Setting of "policies"
	- New "min" or "max" values for all powerop_values are set, verified
	  by powerop lowlevel drivers, powerop governors and external
	  notifiers. E.g. if a new frequency min/max pair is required, the
	  voltage level gets a new min and max value as well --> you need to
	  handle recursion.
	- If necessary a new "powerop governor" is started.
	- Each powerop governor specifies which POWEROPs it can handle
	- current cpufreq governors can handle CPU_FREQUENCY, CPU_VOLTAGE
	  and FRONT_SIDE_BUS_SPEED
	- a userspace fallback-governor always "handles" the parameters no
	  other governor handles

b) Setting of "values"
	- Each governor can initiate transitions between the "min" and "max"
	  values for operating points it acquired ownership for.
	- The new setting is notified to all other governors and to external
	  notifiers. If some entity decides it cannot live well with this
	  new setting, it breaks out. Note that this should not happen quite
	  often, as the "normal" verification takes place in a) above.
	  Nonetheless, if you want to break out CPU_VOLTAGE and
	  CPU_FREQUENCY, you need it. And as it makes life for the kernel so
	  much more difficult, I'm against doing so.
	- The low-level driver handling the powerop_value is called

Thanks,
	Dominik
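The two-step scheme proposed in this message (a: verify and install min/max limits, b: let a governor that owns the parameter type move the value within those limits) can be approximated in userspace C as below. This is a loose sketch under assumptions, not proposed kernel code: the struct is cut down from the `powerop_value` above (no list, driver, or cross-dependency pointers), the `hw_min`/`hw_max` fields stand in for the low-level driver's verification, and `powerop_set_limits()`, `powerop_set_value()` and `governor_can_handle()` are hypothetical names. Notifiers and recursion into dependent values are left out.

```c
#include <stdint.h>

#define POWEROP_TYPE_CPU_FREQUENCY        0x00000001u
#define POWEROP_TYPE_CPU_VOLTAGE          0x00000002u
#define POWEROP_TYPE_FRONT_SIDE_BUS_SPEED 0x00000004u

/* Cut-down stand-in for the powerop_value proposed above. */
struct powerop_value {
    unsigned long value_cur;
    unsigned long value_min;   /* policy limits, step a) */
    unsigned long value_max;
    unsigned long hw_min;      /* what the (hypothetical) driver accepts */
    unsigned long hw_max;
    uint32_t type;             /* bitmask of POWEROP_TYPE_* */
};

/* A governor declares which parameter types it is able to manage. */
struct powerop_governor {
    uint32_t handles;
};

/* A governor may own a value only if it handles every type bit set on it. */
static int governor_can_handle(const struct powerop_governor *gov,
                               const struct powerop_value *val)
{
    return (gov->handles & val->type) == val->type;
}

/* Step a): install new limits after checking them against the hardware range. */
static int powerop_set_limits(struct powerop_value *val,
                              unsigned long min, unsigned long max)
{
    if (min > max || min < val->hw_min || max > val->hw_max)
        return -1;

    val->value_min = min;
    val->value_max = max;

    /* Keep the current value inside the newly installed window. */
    if (val->value_cur < min)
        val->value_cur = min;
    if (val->value_cur > max)
        val->value_cur = max;
    return 0;
}

/* Step b): a governor moves the value, but only inside the verified limits. */
static int powerop_set_value(const struct powerop_governor *gov,
                             struct powerop_value *val, unsigned long target)
{
    if (!governor_can_handle(gov, val))
        return -1;
    if (target < val->value_min || target > val->value_max)
        return -1;

    val->value_cur = target;   /* the low-level driver would be called here */
    return 0;
}
```

With frequency and voltage tied together (type 0x00000003, as in the message), a governor that only declares FRONT_SIDE_BUS_SPEED is refused, while one that declares all three types may adjust the value.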
* Re: PowerOP 0/3: System power operating point management API
From: Dominik Brodowski @ 2005-08-16 8:57 UTC
To: Todd Poynor, cpufreq, Patrick Mochel, linux-pm, linux-kernel, Pavel Machek

A small add-on:

We need to make sure that we're capable of handling smart CPUs like Transmeta
Crusoe processors in a sane way. This means

> b) Setting of "values"

is optional if the hardware itself can be set to a min/max value (step a
above in previous mail).

	Dominik
* Re: PowerOP 0/3: System power operating point management API
From: Todd Poynor @ 2005-08-17 1:52 UTC
To: Dominik Brodowski, Todd Poynor, cpufreq, Patrick Mochel, linux-pm, linux-kernel, Pavel Machek

Dominik Brodowski wrote:
> A small add-on:
>
> We need to make sure that we're capable of handling smart CPUs like Transmeta
> Crusoe processors in a sane way. This means
>
>
>>b) Setting of "values"
>
>
> is optional if the hardware itself can be set to a min/max value (step a
> above in previous mail).

Although I haven't looked into the Crusoe processor support, it may be
that there is a different set of power parameters, not cpu speed
directly, that are appropriate to manage on these platforms (after a
brief look, seems to be a range of frequencies and some sort of flags)?
If so, these sorts of machine-specific power parameters are what PowerOP
is trying to address, allowing management of the underlying
machine-specific stuff to upper layers that may be presenting an
abstracted view of power/performance, such as CPU speed or speed ranges,
to the user. Thanks,

--
Todd
* Re: PowerOP 0/3: System power operating point management API
From: Todd Poynor @ 2005-08-17 1:39 UTC
To: Dominik Brodowski, Todd Poynor, cpufreq, Patrick Mochel, linux-pm, linux-kernel, Pavel Machek

Dominik Brodowski wrote:

> First, the table interface you suggest is ugly. If there's indeed the need for
> such an abstraction, I'd favour something like

I'm planning to adopt the previous suggestions of an opaque data
structure and stop trying to have any generic structure to it. I'll try
to leave dependency checking etc. to the upper layers as much as
possible, since platforms vary greatly in this and so do the needs of
different PM s/w stacks.

> Secondly, you do not address the cross-relationships between operation points
> correctly. If you change the CPU frequency, you may have to switch other
> (memory, video) settings; you might even have to validate the frequency
> settings for these or even additional reasons (thermal and battery reasons -
> ACPI _PPC).

This lowest layer basically assumes that upper-layer software has
created an appropriate operating point (for example, in DPM we pretty
much require a system designer to create operating points that match the
h/w specs and don't go to great lengths to encode rules about this),
and/or will call driver notifiers etc. as needed to adapt to the
changes. Although there may be some sanity checking appropriate at the
PowerOP level, cpufreq, DPM, etc. can for the most part continue to
handle the larger issues of how valid operating points are constructed,
driver callbacks, etc. If you do want to handle various dependencies at
the PowerOP layer then there's nothing that prevents that, but PM
frameworks tend to embody assumptions about how frequently operating
points will change and in what contexts (interrupt, idle...), and this
can influence the code for such things.

> Thirdly, who is to decide on the power management settings? The first and
> intuitive answer is the kernel. Therefore, kernel-space cpufreq governors
> exist. Only under rare circumstances, you want full userspace control --
> that's what the userspace cpufreq governor is for.

Also something left to the existing upper layers; PowerOP isn't intended
to handle any of that. In the embedded space we usually let the system
designer choose operating points supported by their h/w vendor and that
match their particular system states (hardware enabled at any point in
time, type and power/performance needs of software currently running).
We do recommend that a userspace power policy manager be the component
in charge of PM settings, based on messages from drivers and other apps
on the state of the system. And so that userspace component activates
the operating point (or set of operating points in the case of DPM)
appropriate for current state.

> Fourthly, the code duplication which your implementation leads to is obvious
> for the speedstep-centrino case.

We could move the tables of valid cpu speeds and corresponding voltages
down to the PowerOP level, and there would probably be little
duplication at that point (in fact, with the current patch there's not a
lot of duplication since the actual MSR access was moved to PowerOP and
PowerOP contains little else, but both levels know how to understand the
MSR format, and a more aggressive port to PowerOP could do away with
that).

Your suggestions of changes to cpufreq governors and policies to handle
governance of non-cpu-speed parameters sound interesting, and I'd be
happy to help figure out what to do about those vs. the lower machine
access layer I've discussed up until now. I'll think more about this
real soon now. Thanks,

--
Todd
* Re: PowerOP 0/3: System power operating point management API
From: Pavel Machek @ 2005-08-10 10:07 UTC
To: Todd Poynor
Cc: linux-kernel, linux-pm, cpufreq

Hi!

> PowerOP is a system power parameter management API submitted for
> discussion. PowerOP writes and reads power "operating points",
> comprised of arbitrary integer-valued values, called power parameters,
> that correspond to registers, clocks, dividers, voltage regulators,
> etc. that may be modified to set a basic power/performance point for the
> system. The core basically passes an array of integer-valued power
> parameters (with very little additional structure imposed by the core)
> to a platform-specific backend that interprets those values and makes
> the requested adjustments. PowerOP is intended to leave all power
> policy decisions to higher layers. An optional sysfs representation of
> power parameters is also available, primarily for diagnostic use.
>
> PowerOP can be thought of as a layer below cpufreq that actually
> accesses the hardware to make cpu frequency, voltage, core bus, and
> perhaps other modifications to set a power point, leaving cpufreq to
> manage the interfaces based around the "cpu frequency" abstraction, the
> policies and governors that select the frequency, its notifiers, and so
> forth. An example hooking up support for one cpufreq platform to
> PowerOP is in patch 3/3.
>
> Depending on the ability of the hardware to make software-controlled
> power/performance adjustments, this may be useful to select custom
> voltages, bus speeds, etc. in desktop/server systems. Various embedded
> systems have several parameters that can be set. For example, an XScale
> PXA27x could be considered to have six basic power parameters (mainly
> cpu run mode and memory and bus dividers) that for the most part
> should

This scares me a bit. Is table enough to handle this? I'm afraid that
table will get very large on systems that allow you to do "almost
anything".
								Pavel
--
if you have sharp zaurus hardware you don't need... you know my address
* Re: PowerOP 0/3: System power operating point management API
From: Todd Poynor @ 2005-08-10 22:02 UTC
To: Pavel Machek
Cc: linux-kernel, linux-pm, cpufreq

Pavel Machek wrote:

>>Depending on the ability of the hardware to make software-controlled
>>power/performance adjustments, this may be useful to select custom
>>voltages, bus speeds, etc. in desktop/server systems. Various embedded
>>systems have several parameters that can be set. For example, an XScale
>>PXA27x could be considered to have six basic power parameters (mainly
>>cpu run mode and memory and bus dividers) that for the most part
>>should
>
>
> This scares me a bit. Is table enough to handle this? I'm afraid that
> table will get very large on systems that allow you to do "almost
> anything".

Exhaustive tables for all combinations of possible parameters aren't
expected (or practical for many systems as you note). In practice, a
subset of these possible operating points are created and activated over
the lifetime of the system, where the subset is chosen by a system
designer according to the needs of the particular system. It's a matter
for the higher-layer power management software to decide whether to have
in-kernel tables of the possible operating points (as cpufreq does for
various platforms) or whether to require userspace to create only the
ones wanted (as does DPM).

There are cpufreq patches for PXA27x somewhere, for example, and in that
case a subset of the supported operating points (and there are still
only about 16 of those even for such a complicated piece of hardware)
are represented in the kernel tables, choosing one of the possible
combinations of memory/bus/etc. parameters for each unique cpu
frequency. Thanks,

--
Todd
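The approach described in this reply (start from the vendor-validated operating points and keep one combination of memory/bus parameters per unique cpu frequency) could be sketched as below. The struct layout, the function name `build_freq_table()` and the "first validated entry wins" rule are assumptions made for illustration; they are not taken from the PXA27x cpufreq patches mentioned above, and any values fed to it would be placeholders rather than real PXA27x settings.

```c
#include <stddef.h>

/* Simplified operating point: a cpu frequency plus two divider settings. */
struct op_point {
    unsigned int cpu_khz;
    int mem_div;
    int bus_div;
};

/*
 * Copy at most out_cap entries from the validated list into out, keeping
 * only the first operating point seen for each distinct cpu frequency.
 * Returns the number of entries written.
 */
static size_t build_freq_table(const struct op_point *valid, size_t n,
                               struct op_point *out, size_t out_cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        size_t j;

        /* Has this frequency already been given an operating point? */
        for (j = 0; j < count; j++) {
            if (out[j].cpu_khz == valid[i].cpu_khz)
                break;
        }
        if (j == count && count < out_cap)
            out[count++] = valid[i];
    }
    return count;
}
```

The resulting table grows with the number of distinct frequencies rather than with the number of parameter combinations, which is why it stays small even when the hardware exposes many independent settings.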