* [PATCH v2 1/3] PM / EM: Fix broken kerneldoc
2019-01-21 11:17 [PATCH v2 0/3] Documentation: Explain EAS and EM Quentin Perret
@ 2019-01-21 11:17 ` Quentin Perret
2019-01-21 11:17 ` [PATCH v2 2/3] PM / EM: Document the Energy Model framework Quentin Perret
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Quentin Perret @ 2019-01-21 11:17 UTC (permalink / raw)
To: corbet, peterz, rjw, juri.lelli
Cc: mingo, morten.rasmussen, qais.yousef, patrick.bellasi,
dietmar.eggemann, linux-doc, linux-pm, linux-kernel
Some of the kerneldoc comments about the Energy Model framework are
slightly broken, hence causing errors when compiling the html doc.
Fix them.
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
include/linux/energy_model.h | 4 ++--
kernel/power/energy_model.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index aa027f7bcb3e..57889589e638 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -11,7 +11,7 @@
#ifdef CONFIG_ENERGY_MODEL
/**
- * em_cap_state - Capacity state of a performance domain
+ * struct em_cap_state - Capacity state of a performance domain
* @frequency: The CPU frequency in KHz, for consistency with CPUFreq
* @power: The power consumed by 1 CPU at this level, in milli-watts
* @cost: The cost coefficient associated with this level, used during
@@ -24,7 +24,7 @@ struct em_cap_state {
};
/**
- * em_perf_domain - Performance domain
+ * struct em_perf_domain - Performance domain
* @table: List of capacity states, in ascending order
* @nr_cap_states: Number of capacity states
* @cpus: Cpumask covering the CPUs of the domain
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d9dc2c38764a..1e3a88ea4728 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -137,7 +137,7 @@ EXPORT_SYMBOL_GPL(em_cpu_get);
* If multiple clients register the same performance domain, all but the first
* registration will be ignored.
*
- * Return 0 on success
+ * Return: 0 on success
*/
int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
struct em_data_callback *cb)
--
2.20.1
* [PATCH v2 2/3] PM / EM: Document the Energy Model framework
2019-01-21 11:17 [PATCH v2 0/3] Documentation: Explain EAS and EM Quentin Perret
2019-01-21 11:17 ` [PATCH v2 1/3] PM / EM: Fix broken kerneldoc Quentin Perret
@ 2019-01-21 11:17 ` Quentin Perret
2019-01-21 11:17 ` [PATCH v2 3/3] sched: Document Energy Aware Scheduling Quentin Perret
2019-02-07 0:32 ` [PATCH v2 0/3] Documentation: Explain EAS and EM Jonathan Corbet
3 siblings, 0 replies; 6+ messages in thread
From: Quentin Perret @ 2019-01-21 11:17 UTC (permalink / raw)
To: corbet, peterz, rjw, juri.lelli
Cc: mingo, morten.rasmussen, qais.yousef, patrick.bellasi,
dietmar.eggemann, linux-doc, linux-pm, linux-kernel
Introduce a documentation file summarizing the key design points and
APIs of the newly introduced Energy Model framework.
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
Juri: Although I did change some things to the doc in v2 (translated to
rst mainly), I kept your 'Reviewed-by' as the content is still pretty
much the same. Please scream if you disagree :-)
Thanks,
Quentin
---
Documentation/driver-api/pm/energy-model.rst | 150 +++++++++++++++++++
Documentation/driver-api/pm/index.rst | 1 +
2 files changed, 151 insertions(+)
create mode 100644 Documentation/driver-api/pm/energy-model.rst
diff --git a/Documentation/driver-api/pm/energy-model.rst b/Documentation/driver-api/pm/energy-model.rst
new file mode 100644
index 000000000000..c447528c4e29
--- /dev/null
+++ b/Documentation/driver-api/pm/energy-model.rst
@@ -0,0 +1,150 @@
+====================
+Energy Model of CPUs
+====================
+
+Overview
+========
+
+The Energy Model (EM) framework serves as an interface between drivers knowing
+the power consumed by CPUs at various performance levels, and the kernel
+subsystems willing to use that information to make energy-aware decisions.
+
+The source of the information about the power consumed by CPUs can vary greatly
+from one platform to another. These power costs can be estimated using
+devicetree data in some cases. In others, the firmware will know better.
+Alternatively, userspace might be best positioned. And so on. To spare each
+client subsystem from re-implementing support for every possible source of
+information on its own, the EM framework acts as an abstraction layer which
+standardizes the format of power cost tables in the kernel, and so avoids
+redundant work.
+
+The figure below depicts an example of drivers (Arm-specific here, but the
+approach is applicable to any architecture) providing power costs to the EM
+framework, and interested clients reading the data from it.
+
+.. code-block:: none
+
+ +---------------+ +-----------------+ +---------------+
+ | Thermal (IPA) | | Scheduler (EAS) | | Other |
+ +---------------+ +-----------------+ +---------------+
+ | | em_pd_energy() |
+ | | em_cpu_get() |
+ +---------+ | +---------+
+ | | |
+ v v v
+ +---------------------+
+ | Energy Model |
+ | Framework |
+ +---------------------+
+ ^ ^ ^
+ | | | em_register_perf_domain()
+ +----------+ | +---------+
+ | | |
+ +---------------+ +---------------+ +--------------+
+ | cpufreq-dt | | arm_scmi | | Other |
+ +---------------+ +---------------+ +--------------+
+ ^ ^ ^
+ | | |
+ +--------------+ +---------------+ +--------------+
+ | Device Tree | | Firmware | | ? |
+ +--------------+ +---------------+ +--------------+
+
+The EM framework manages power cost tables per 'performance domain' in the
+system. A performance domain is a group of CPUs whose performance is scaled
+together. Performance domains generally have a 1-to-1 mapping with CPUFreq
+policies. All CPUs in a performance domain are required to have the same
+micro-architecture. CPUs in different performance domains can have different
+micro-architectures.
+
+
+Core APIs overview
+==================
+
+Config options
+--------------
+
+`CONFIG_ENERGY_MODEL` must be enabled to use the EM framework.
+
+
+Registration of performance domains
+-----------------------------------
+
+Drivers are expected to register performance domains into the EM framework by
+calling the :c:func:`em_register_perf_domain()` API. Drivers must specify the
+CPUs of the performance domains using a cpumask, and provide a callback function
+returning <frequency, power> tuples for each capacity state. The callback
+function provided by the driver is free to fetch data from any relevant location
+(DT, firmware, ...), and by any means deemed necessary.
+
+
+Accessing performance domains
+-----------------------------
+
+Subsystems interested in the energy model of a CPU can retrieve it using the
+:c:func:`em_cpu_get()` API. The energy model tables are allocated once upon
+creation of the performance domains, and kept in memory untouched.
+
+The energy consumed by a performance domain can be estimated using the
+:c:func:`em_pd_energy()` API. The estimation is performed assuming that the
+schedutil CPUfreq governor is in use.
+
+
+Example driver
+==============
+
+This section provides a simple example of a CPUFreq driver registering a
+performance domain in the Energy Model framework using the (fake) `foo`
+protocol. The driver implements an :c:func:`est_power()` function to be provided
+to the EM framework.
+
+:file:`drivers/cpufreq/foo_cpufreq.c`:
+
+.. code-block:: c
+ :linenos:
+
+ static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
+ {
+ long freq, power;
+
+ /* Use the 'foo' protocol to ceil the frequency */
+ freq = foo_get_freq_ceil(cpu, *KHz);
+ if (freq < 0)
+ return freq;
+
+ /* Estimate the power cost for the CPU at the relevant freq. */
+ power = foo_estimate_power(cpu, freq);
+ if (power < 0)
+ return power;
+
+ /* Return the values to the EM framework */
+ *mW = power;
+ *KHz = freq;
+
+ return 0;
+ }
+
+ static int foo_cpufreq_init(struct cpufreq_policy *policy)
+ {
+ struct em_data_callback em_cb = EM_DATA_CB(est_power);
+ int nr_opp, ret;
+
+ /* Do the actual CPUFreq init work ... */
+ ret = do_foo_cpufreq_init(policy);
+ if (ret)
+ return ret;
+
+ /* Find the number of OPPs for this policy */
+ nr_opp = foo_get_nr_opp(policy);
+
+ /* And register the new performance domain */
+ em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
+
+ return 0;
+ }
+
+
+Inline kernel documentation
+===========================
+
+.. kernel-doc:: include/linux/energy_model.h
+.. kernel-doc:: kernel/power/energy_model.c
diff --git a/Documentation/driver-api/pm/index.rst b/Documentation/driver-api/pm/index.rst
index 2f6d0e9cf6b7..d3a582adef4c 100644
--- a/Documentation/driver-api/pm/index.rst
+++ b/Documentation/driver-api/pm/index.rst
@@ -5,6 +5,7 @@ Device Power Management
.. toctree::
devices
+ energy-model
notifiers
types
--
2.20.1
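
The callback contract described under 'Registration of performance domains'
can be tried out in userspace. Below is a minimal sketch, not kernel code:
struct foo_opp, foo_table, its frequency/power values and foo_build_table()
are all hypothetical, and the 'request the previous frequency plus 1 KHz'
enumeration is only an assumption about how a caller might walk the capacity
states, not a description of what em_register_perf_domain() does.

```c
#include <stddef.h>

/* Hypothetical <frequency, power> tuple for one capacity state. */
struct foo_opp {
	unsigned long khz;	/* frequency in KHz */
	unsigned long mw;	/* power of one CPU at that frequency, in mW */
};

/* Made-up values, in ascending frequency order. */
static const struct foo_opp foo_table[] = {
	{  500000,  50 },
	{ 1000000, 150 },
	{ 1500000, 300 },
};

#define FOO_NR_OPP (sizeof(foo_table) / sizeof(foo_table[0]))

/*
 * Same shape as est_power() in the patch: on entry *KHz is a frequency
 * request; on success *KHz and *mW describe the lowest table entry at or
 * above it. Returns 0 on success, -1 if the request exceeds the table.
 */
static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
{
	size_t i;

	(void)cpu;	/* a single performance domain in this sketch */

	for (i = 0; i < FOO_NR_OPP; i++) {
		if (foo_table[i].khz >= *KHz) {
			*KHz = foo_table[i].khz;
			*mW = foo_table[i].mw;
			return 0;
		}
	}

	return -1;
}

/*
 * Assumed enumeration strategy: ask for 1 KHz above the previous state so
 * that the callback ceils to the next one. Returns the number of states
 * written to out[], or -1 if the callback fails.
 */
static int foo_build_table(struct foo_opp *out, int nr_states)
{
	unsigned long freq = 0, power = 0;
	int i;

	for (i = 0; i < nr_states; i++) {
		freq++;
		if (est_power(&power, &freq, 0))
			return -1;
		out[i].khz = freq;
		out[i].mw = power;
	}

	return nr_states;
}
```

Asking for more states than the table holds makes the callback fail, which is
one reason the number of states handed over at registration time has to match
what the callback can actually report.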
* [PATCH v2 3/3] sched: Document Energy Aware Scheduling
2019-01-21 11:17 [PATCH v2 0/3] Documentation: Explain EAS and EM Quentin Perret
2019-01-21 11:17 ` [PATCH v2 1/3] PM / EM: Fix broken kerneldoc Quentin Perret
2019-01-21 11:17 ` [PATCH v2 2/3] PM / EM: Document the Energy Model framework Quentin Perret
@ 2019-01-21 11:17 ` Quentin Perret
2019-02-07 0:32 ` [PATCH v2 0/3] Documentation: Explain EAS and EM Jonathan Corbet
3 siblings, 0 replies; 6+ messages in thread
From: Quentin Perret @ 2019-01-21 11:17 UTC (permalink / raw)
To: corbet, peterz, rjw, juri.lelli
Cc: mingo, morten.rasmussen, qais.yousef, patrick.bellasi,
dietmar.eggemann, linux-doc, linux-pm, linux-kernel
Add some documentation detailing the main design points of EAS, as well
as a list of its dependencies.
Parts of this documentation are taken from Morten Rasmussen's original
EAS posting: https://lkml.org/lkml/2015/7/7/754
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Co-authored-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
Documentation/scheduler/sched-energy.txt | 431 +++++++++++++++++++++++
1 file changed, 431 insertions(+)
create mode 100644 Documentation/scheduler/sched-energy.txt
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 000000000000..b91899cb2846
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,431 @@
+ =======================
+ Energy Aware Scheduling
+ =======================
+
+1. Introduction
+---------------
+
+Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
+the impact of its decisions on the energy consumed by CPUs. EAS relies on an
+Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
+with a minimal impact on throughput. This document provides an introduction
+to how EAS works and the main design decisions behind it, and details what is
+needed to get it to run.
+
+Before going any further, please note that at the time of writing:
+
+ /!\ EAS does not support platforms with symmetric CPU topologies /!\
+
+EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
+because this is where the potential for saving energy through scheduling is
+the highest.
+
+The actual EM used by EAS is _not_ maintained by the scheduler, but by a
+dedicated framework. For details about this framework and what it provides,
+please refer to its documentation (which is available under
+Documentation/driver-api/pm/energy-model.rst).
+
+
+2. Background and Terminology
+-----------------------------
+
+To make it clear from the start:
+ - energy = [joule] (resource like a battery on battery-powered devices)
+ - power = energy/time = [joule/second] = [watt]
+
+The goal of EAS is to minimize energy, while still getting the job done. That
+is, we want to maximize:
+
+ performance [inst/s]
+ --------------------
+ power [W]
+
+which is equivalent to minimizing:
+
+ energy [J]
+ -----------
+ instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance.
+
+The idea behind introducing an EM is to allow the scheduler to evaluate the
+implications of its decisions rather than blindly applying energy-saving
+techniques that may have positive effects only on some platforms. At the same
+time, the EM must be as simple as possible to minimize the scheduler latency
+impact.
+
+In short, EAS changes the way tasks are assigned to CPUs. When it is time
+for the scheduler to decide where a task should run (during wake-up), the EM
+is used to break the tie between several good CPU candidates and pick the one
+that is predicted to yield the best energy consumption without harming the
+system's throughput. EAS is applied only to CFS tasks at the time of writing,
+but it could be extended to other scheduling classes in the future.
+
+The predictions made by EAS rely on specific elements of knowledge about the
+platform's topology, which include the 'capacity' of CPUs (defined in Section 3.
+below), and their respective energy costs.
+
+
+3. Topology information
+-----------------------
+
+EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
+differentiate CPUs with different computing throughput. The 'capacity' of a CPU
+represents the amount of work it can absorb when running at its highest
+frequency compared to the most capable CPU of the system. Capacity values are
+normalized in a 1024 range, and are comparable with the utilization signals of
+tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
+to capacity and utilization values, EAS is able to estimate how big/busy a
+task/CPU is, and to take this into consideration when evaluating performance vs
+energy trade-offs. The capacity of CPUs is provided via arch-specific code
+through the arch_scale_cpu_capacity() callback. As an example, arm and arm64
+share an implementation of this callback which uses a combination of CPUFreq
+data and device-tree bindings to compute the capacity of CPUs (see
+Documentation/devicetree/bindings/arm/cpu-capacity.txt for more details).
+
+The rest of platform knowledge used by EAS is directly read from the Energy
+Model (EM) framework. The EM of a platform is composed of a power cost table
+per 'performance domain' in the system (for further details about performance
+domains, see Documentation/driver-api/pm/energy-model.rst).
+
+The scheduler manages references to the EM objects in the topology code when the
+scheduling domains are built, or re-built. For each root domain (rd), the
+scheduler maintains a singly linked list of all performance domains intersecting
+the current rd->span. Each node in the list contains a pointer to a struct
+em_perf_domain as provided by the EM framework.
+
+The lists are attached to the root domains in order to cope with exclusive
+cpuset configurations. Since the boundaries of exclusive cpusets do not
+necessarily match those of performance domains, the lists of different root
+domains can contain duplicate elements.
+
+Example 1.
+ Let us consider a platform with 12 CPUs, split in 3 performance domains
+ (pd0, pd4 and pd8), organized as follows:
+
+ CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
+ PDs: |--pd0--|--pd4--|---pd8---|
+ RDs: |----rd1----|-----rd2-----|
+
+ Now, consider that userspace decided to split the system with two
+ exclusive cpusets, hence creating two independent root domains, each
+ containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
+ above figure. Since pd4 intersects with both rd1 and rd2, it will be
+ present in the linked list '->pd' attached to each of them:
+ * rd1->pd: pd0 -> pd4
+ * rd2->pd: pd4 -> pd8
+
+ Please note that the scheduler will create two duplicate list nodes for
+ pd4 (one for each list). However, both just hold a pointer to the same
+ shared data structure of the EM framework.
+
+Since the access to these lists can happen concurrently with hotplug and other
+things, they are protected by RCU, like the rest of topology structures
+manipulated by the scheduler.
+
+EAS also maintains a static key (sched_energy_present) which is enabled when at
+least one root domain meets all conditions for EAS to start. Those conditions
+are summarized in Section 6.
+
+
+4. Energy-Aware task placement
+------------------------------
+
+EAS overrides the CFS task wake-up balancing code. It uses the EM of the
+platform and the PELT signals to choose an energy-efficient target CPU during
+wake-up balance. When EAS is enabled, the periodic and idle load-balance paths
+are disabled (please see Section 5. for more details about this) so all the
+balancing happens during wake-up. With EAS, select_task_rq_fair() calls
+find_energy_efficient_cpu() to make the placement decision. This function looks
+for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
+each performance domain since it is the one which will allow us to keep the
+frequency the lowest. Then, the function checks if placing the task there could
+save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
+in its previous activation.
+
+find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
+energy consumed by the system if the waking task was migrated. compute_energy()
+looks at the current utilization landscape of the CPUs and adjusts it to
+'simulate' the task migration. The EM framework provides the em_pd_energy() API
+which computes the expected energy consumption of each performance domain for
+the given utilization landscape.
+
+An example of energy-optimized task placement decision is detailed below.
+
+Example 2.
+ Let us consider a (fake) platform with 2 independent performance domains
+ composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
+ are big.
+
+ The scheduler must decide where to place a task P whose util_avg = 200
+ and prev_cpu = 0.
+
+ The current utilization landscape of the CPUs is depicted on the graph
+ below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
+ Each performance domain has three Operating Performance Points (OPPs).
+ The CPU capacity and power cost associated with each OPP is listed in
+ the Energy Model table. The util_avg of P is shown on the figures
+ below as 'PP'.
+
+ CPU util.
+ 1024 - - - - - - - Energy Model
+ +-----------+-------------+
+ | Little | Big |
+ 768 ============= +-----+-----+------+------+
+ | Cap | Pwr | Cap | Pwr |
+ +-----+-----+------+------+
+ 512 =========== - ##- - - - - | 170 | 50 | 512 | 400 |
+ ## ## | 341 | 150 | 768 | 800 |
+ 341 -PP - - - - ## ## | 512 | 300 | 1024 | 1700 |
+ PP ## ## +-----+-----+------+------+
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+ Current OPP: ===== Other OPP: - - - util_avg (100 each): ##
+
+
+ find_energy_efficient_cpu() will first look for the CPUs with the
+ maximum spare capacity in the two performance domains. In this example,
+ CPU1 and CPU3. Then it will estimate the energy of the system if P was
+ placed on either of them, and check if that would save some energy
+ compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
+ (which is coherent with the behaviour of the schedutil CPUFreq
+ governor, see Section 6. for more details on this topic).
+
+ Case 1. P is migrated to CPU1
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 200 / 341 * 150 = 88
+ * CPU1: 300 / 341 * 150 = 131
+ * CPU2: 600 / 768 * 800 = 625
+ 512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520
+ ## ## => total_energy = 1364
+ 341 =========== ## ##
+ PP ## ##
+ 170 -## - - PP- ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+
+ Case 2. P is migrated to CPU3
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 200 / 341 * 150 = 88
+ * CPU1: 100 / 341 * 150 = 43
+ PP * CPU2: 600 / 768 * 800 = 625
+ 512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729
+ ## ## => total_energy = 1485
+ 341 =========== ## ##
+ ## ##
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+
+ Case 3. P stays on prev_cpu / CPU 0
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 400 / 512 * 300 = 234
+ * CPU1: 100 / 512 * 300 = 58
+ * CPU2: 600 / 768 * 800 = 625
+ 512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520
+ ## ## => total_energy = 1437
+ 341 -PP - - - - ## ##
+ PP ## ##
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+
+ From these calculations, Case 1 has the lowest total energy. So CPU 1
+ is the best candidate from an energy-efficiency standpoint.
+
+Big CPUs are generally more power hungry than the little ones and are thus used
+mainly when a task doesn't fit the littles. However, little CPUs aren't always
+necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
+of the little CPUs can be less energy-efficient than the lowest OPPs of the
+bigs, for example. So, if the little CPUs happen to have enough utilization at
+a specific point in time, a small task waking up at that moment could be better
+off executing on the big side in order to save energy, even though it would fit
+on the little side.
+
+And even in the case where all OPPs of the big CPUs are less energy-efficient
+than those of the little, using the big CPUs for a small task might still, under
+specific conditions, save energy. Indeed, placing a task on a little CPU can
+result in raising the OPP of the entire performance domain, and that will
+increase the cost of the tasks already running there. If the waking task is
+placed on a big CPU, its own execution cost might be higher than if it was
+running on a little, but it won't impact the other tasks of the little CPUs
+which will keep running at a lower OPP. So, when considering the total energy
+consumed by CPUs, the extra cost of running that one task on a big core can be
+smaller than the cost of raising the OPP on the little CPUs for all the other
+tasks.
+
+The examples above would be nearly impossible to get right in a generic way, and
+for all platforms, without knowing the cost of running at different OPPs on all
+CPUs of the system. Thanks to its EM-based design, EAS should cope with them
+correctly without too much trouble. However, in order to ensure a minimal
+impact on throughput for high-utilization scenarios, EAS also implements another
+mechanism called 'over-utilization'.
+
+
+5. Over-utilization
+-------------------
+
+From a general standpoint, the use-cases where EAS can help the most are those
+involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
+being run, they will require all of the available CPU capacity, and there isn't
+much that can be done by the scheduler to save energy without severely harming
+throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
+'over-utilized' as soon as they are used at more than 80% of their compute
+capacity. As long as no CPUs are over-utilized in a root domain, load balancing
+is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
+the most energy efficient CPUs of the system more than the others if that can be
+done without harming throughput. So, the load-balancer is disabled to prevent
+it from breaking the energy-efficient task placement found by EAS. It is safe to
+do so when the system isn't overutilized since being below the 80% tipping point
+implies that:
+
+ a. there is some idle time on all CPUs, so the utilization signals used by
+ EAS are likely to accurately represent the 'size' of the various tasks
+ in the system;
+ b. all tasks should already be provided with enough CPU capacity,
+ regardless of their nice values;
+ c. since there is spare capacity all tasks must be blocking/sleeping
+ regularly and balancing at wake-up is sufficient.
+
+As soon as one CPU goes above the 80% tipping point, at least one of the three
+assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
+is raised for the entire root domain, EAS is disabled, and the load-balancer is
+re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
+wake-up and load balance under CPU-bound conditions. This better respects the
+nice values of tasks.
+
+Since the notion of overutilization largely relies on detecting whether or not
+there is some idle time in the system, the CPU capacity 'stolen' by higher
+(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
+such, the detection of overutilization accounts for the capacity used not only
+by CFS tasks, but also by the other scheduling classes and IRQ.
+
+
+6. Dependencies and requirements for EAS
+----------------------------------------
+
+Energy Aware Scheduling depends on the CPUs of the system having specific
+hardware properties and on other features of the kernel being enabled. This
+section lists these dependencies and provides hints as to how they can be met.
+
+
+ 6.1 - Asymmetric CPU topology
+
+As mentioned in the introduction, EAS is only supported on platforms with
+asymmetric CPU topologies for now. This requirement is checked at run-time by
+looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
+domains are built.
+
+The flag is set/cleared automatically by the scheduler topology code whenever
+there are CPUs with different capacities in a root domain. The capacities of
+CPUs are provided by arch-specific code through the arch_scale_cpu_capacity()
+callback.
+
+So, in order to use EAS, the architecture code is required to implement the
+arch_scale_cpu_capacity() callback. Moreover, some of the CPUs must have a
+lower capacity than others.
+
+Please note that EAS is not fundamentally incompatible with SMP, but no
+significant savings on SMP platforms have been observed yet. This restriction
+could be amended in the future if proven otherwise.
+
+
+ 6.2 - Energy Model presence
+
+EAS uses the EM of a platform to estimate the impact of scheduling decisions on
+energy. So, the platform code (drivers) must provide power cost tables to the
+EM framework in order to make EAS start. To do so, please refer to documentation
+of the independent EM framework in Documentation/driver-api/pm/energy-model.rst.
+
+Please also note that the scheduling domains need to be re-built after the
+EM has been registered in order to start EAS.
+
+
+ 6.3 - Energy Model complexity
+
+The task wake-up path is very latency-sensitive. When the EM of a platform is
+too complex (too many CPUs, too many performance domains, too many performance
+states, ...), the cost of using it in the wake-up path can become prohibitive.
+The energy-aware wake-up algorithm has a complexity of:
+
+ C = Nd * (Nc + Ns)
+
+with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
+total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
+
+A complexity check is performed at the root domain level, when scheduling
+domains are built. EAS will not start on a root domain if its C happens to be
+higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
+time of writing).
+
+In order to use EAS on a platform whose EM is too complex, the only two possible
+options are:
+
+ 1. splitting the system into separate, smaller, root domains using exclusive
+ cpusets and enabling EAS locally on each of them. This option has the
+ benefit of working out of the box but the drawback of preventing load
+ balance between root domains, which can result in an unbalanced system
+ overall;
+ 2. submitting patches to reduce the complexity of the EAS wake-up algorithm,
+ hence enabling it to cope with larger EMs in reasonable time.
+
+
+ 6.4 - Schedutil governor
+
+EAS tries to predict at which OPP the CPUs will be running in the near future
+in order to estimate their energy consumption. To do so, it is assumed that OPPs
+of CPUs follow their utilization.
+
+Although it is very difficult to provide hard guarantees regarding the accuracy
+of this assumption in practice (because the hardware might not do what it is
+told to do, for example), schedutil as opposed to other CPUFreq governors at
+least _requests_ frequencies calculated using the utilization signals.
+Consequently, the only sane governor to use together with EAS is schedutil,
+because it is the only one providing some degree of consistency between
+frequency requests and energy predictions.
+
+Using EAS with any governor other than schedutil is not supported.
+
+
+ 6.5 - Scale-invariant utilization signals
+
+In order to make accurate prediction across CPUs and for all performance
+states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
+be obtained using the architecture-defined arch_scale_{cpu,freq}_capacity()
+callbacks.
+
+Using EAS on a platform that doesn't implement these two callbacks is not
+supported.
+
+
+ 6.6 - Multithreading (SMT)
+
+EAS in its current form is SMT unaware and is not able to leverage
+multithreaded hardware to save energy. EAS considers threads as independent
+CPUs, which can actually be counter-productive for both performance and energy.
+
+EAS on SMT is not supported.
--
2.20.1
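
The arithmetic of Example 2 can be replayed outside the kernel. The sketch
below is a userspace approximation and makes several assumptions: the types
and the helpers pd_energy() and system_energy() are hypothetical names, the
OPP of a domain is taken to be the lowest one whose capacity covers its
busiest CPU (as the figures suggest), and floating point is used instead of
the kernel's integer arithmetic. It is not the implementation of
compute_energy() or em_pd_energy().

```c
#include <stddef.h>

/* One OPP: CPU capacity reached and power drawn by one CPU (Example 2 units). */
struct cap_state {
	unsigned long cap;
	unsigned long power;
};

struct perf_domain {
	const struct cap_state *table;	/* ascending order */
	size_t nr_states;
};

/* Energy Model table of Example 2. */
static const struct cap_state little_states[] = {
	{ 170, 50 }, { 341, 150 }, { 512, 300 },
};
static const struct cap_state big_states[] = {
	{ 512, 400 }, { 768, 800 }, { 1024, 1700 },
};

/*
 * Estimate the energy of one performance domain: select the lowest OPP
 * whose capacity covers the busiest CPU, then sum util / cap * power over
 * the CPUs of the domain.
 */
static double pd_energy(const struct perf_domain *pd,
			const unsigned long *util, size_t nr_cpus)
{
	unsigned long max_util = 0;
	double energy = 0.0;
	size_t i, s;

	for (i = 0; i < nr_cpus; i++)
		if (util[i] > max_util)
			max_util = util[i];

	/* Stay on the highest state if nothing covers max_util. */
	for (s = 0; s + 1 < pd->nr_states; s++)
		if (pd->table[s].cap >= max_util)
			break;

	for (i = 0; i < nr_cpus; i++)
		energy += (double)util[i] / pd->table[s].cap * pd->table[s].power;

	return energy;
}

/* CPU0-1 form the little domain, CPU2-3 the big one, as in Example 2. */
static double system_energy(const unsigned long util[4])
{
	const struct perf_domain little = { little_states, 3 };
	const struct perf_domain big = { big_states, 3 };

	return pd_energy(&little, &util[0], 2) + pd_energy(&big, &util[2], 2);
}
```

Feeding it the per-CPU utilization of the three cases ({200, 300, 600, 500},
{200, 100, 600, 700} and {400, 100, 600, 500}) gives the same ordering as the
example, with Case 1 the cheapest. The totals land within a couple of units of
the figures quoted in the document, since those round each per-CPU term.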