linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/17] Introduce runtime modifiable Energy Model
@ 2023-05-12  9:57 Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 01/17] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments Lukasz Luba
                   ` (18 more replies)
  0 siblings, 19 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

Hi all,

This patch set adds a new feature which allows to modify Energy Model (EM)
power values at runtime. It will allow to better reflect power model of
a recent SoCs and silicon. Different characteristics of the power usage
can be leveraged and thus better decisions made during task placement in EAS.

It's part of feature set know as Dynamic Energy Model. It has been presented
and discussed recently at OSPM2023 [3]. This patch set implements the 1st
improvement for the EM.

The concepts:
1. The CPU power usage can vary due to the workload that it's running or due
to the temperature of the SoC. The same workload can use more power when the
temperature of the silicon has increased (e.g. due to hot GPU or ISP).
In such situation or EM can be adjusted and reflect the fact of increased
power usage. That power increase is due to a factor called static power
(sometimes called simply: leakage). The CPUs in recent SoCs are different.
We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
They are also built differently with High Performance (HP) cells or
Low Power (LP) cells. They are affected by the temperature increase
differently: HP cells have bigger leakage. The SW model can leverage that
knowledge.
2. It is also possible to change the EM to better reflect the currently
running workload. Usually the EM is derived from some average power values
taken from experiments with benchmark (e.g. Dhrystone). The model derived
from such scenario might not represent properly the workloads usually running
on the device. Therefore, runtime modification of the EM allows to switch to
a different model, when there is a need.
3. The EM can be adjusted after boot, when all the modules are loaded and
more information about the SoC is available e.g. chip binning. This would help
to better reflect the silicon characteristics. Thus, this EM modification
API allows it now. It wasn't possible in the past and the EM had to be
'set in stone'.

Some design details:
The internal mechanisms for the memory allocation are handled internally in the 
EM. Kernel modules can just call the new API to update the EM data and the 
new memory would be provided and owned by the EM. The EM memory is used by
EAS, which impacts those design decisions. The EM writers are protected by
a mutex. This new runtime modified EM table is protected using RCU mechanism,
which fits the current EAS hot path (which already uses RCU read lock).
The unregister API handles only non-CPU (e.g. GPU, ISP) devices and uses the
same mutex as EM modifiers to make sure the memory is safely freed.

More detailed explanation and background can be found in presentations
during LPC2022 [1][2] or in the documentation patches.

Changelog:
v2:
- solved build warning of unused variable in patch 13/17 when EM is
  not compiled in, e.g. on Intel platform for this cpufreq_cooling
- re-based on top of v6.4-rc1
v1:
- implementation can be found here [4]

Regards,
Lukasz Luba

[1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
[2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
[3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
[4] https://lore.kernel.org/lkml/20230314103357.26010-1-lukasz.luba@arm.com/

Lukasz Luba (17):
  PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
  PM: EM: Find first CPU online while updating OPP efficiency
  PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  PM: EM: Create a new function em_compute_costs()
  trace: energy_model: Add trace event for EM runtime modifications
  PM: EM: Add update_power() callback for runtime modifications
  PM: EM: Check if the get_cost() callback is present in
    em_compute_costs()
  PM: EM: Introduce runtime modifiable table
  PM: EM: Add RCU mechanism which safely cleans the old data
  PM: EM: Add runtime update interface to modify EM power
  PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  PM: EM: Add argument to get_cost() for runtime modification
  PM: EM: Refactor struct em_perf_domain and add default_table
  Documentation: EM: Add a new section about the design
  Documentation: EM: Add a runtime modifiable EM design description
  Documentation: EM: Add example with driver modifying the EM
  Documentation: EM: Describe the API of runtime modifications

 Documentation/power/energy-model.rst | 134 +++++++++++-
 drivers/cpufreq/cppc_cpufreq.c       |   2 +-
 drivers/powercap/dtpm_cpu.c          |  27 ++-
 drivers/powercap/dtpm_devfreq.c      |  23 +-
 drivers/thermal/cpufreq_cooling.c    |  23 +-
 drivers/thermal/devfreq_cooling.c    |  23 +-
 include/linux/energy_model.h         |  93 ++++++--
 include/trace/events/energy_model.h  |  46 ++++
 kernel/power/energy_model.c          | 305 +++++++++++++++++++++++----
 9 files changed, 580 insertions(+), 96 deletions(-)
 create mode 100644 include/trace/events/energy_model.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 01/17] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 02/17] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

In order to prepare the code for the modifiable EM perf_state table,
refactor existing function em_cpufreq_update_efficiencies().

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 7b44f5b89fa1..0d037f3c4e58 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -237,10 +237,10 @@ static int em_create_pd(struct device *dev, int nr_states,
 	return 0;
 }
 
-static void em_cpufreq_update_efficiencies(struct device *dev)
+static void
+em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
 {
 	struct em_perf_domain *pd = dev->em_pd;
-	struct em_perf_state *table;
 	struct cpufreq_policy *policy;
 	int found = 0;
 	int i;
@@ -254,8 +254,6 @@ static void em_cpufreq_update_efficiencies(struct device *dev)
 		return;
 	}
 
-	table = pd->table;
-
 	for (i = 0; i < pd->nr_perf_states; i++) {
 		if (!(table[i].flags & EM_PERF_STATE_INEFFICIENT))
 			continue;
@@ -397,7 +395,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 
 	dev->em_pd->flags |= flags;
 
-	em_cpufreq_update_efficiencies(dev);
+	em_cpufreq_update_efficiencies(dev, dev->em_pd->table);
 
 	em_debug_create_pd(dev);
 	dev_info(dev, "EM: created perf domain\n");
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 02/17] PM: EM: Find first CPU online while updating OPP efficiency
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 01/17] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 03/17] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The Energy Model might be updated at runtime and the energy efficiency
for each OPP may change. Thus, there is a need to update also the
cpufreq framework and make it aligned to the new values. In order to
do that, use a first online CPU from the Performance Domain.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 0d037f3c4e58..85a70b7da023 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
 	struct em_perf_domain *pd = dev->em_pd;
 	struct cpufreq_policy *policy;
 	int found = 0;
-	int i;
+	int i, cpu;
 
 	if (!_is_cpu_device(dev) || !pd)
 		return;
 
-	policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
+	/* Try to get a CPU which is online and in this PD */
+	cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);
+	if (cpu >= nr_cpu_ids) {
+		dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
+		return;
+	}
+
+	policy = cpufreq_cpu_get(cpu);
 	if (!policy) {
 		dev_warn(dev, "EM: Access to CPUFreq policy failed");
 		return;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 03/17] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 01/17] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 02/17] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30 11:06   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 04/17] PM: EM: Create a new function em_compute_costs() Lukasz Luba
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

Prepare em_pd_get_efficient_state() for the upcoming changes and
make it possible to re-use. Return an index for the best performance
state. The function arguments that are introduced should allow to
work on different performance state arrays. The caller of
em_pd_get_efficient_state() should be able to use the index either
on the default or the modifiable EM table.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b9caa01dfac4..8069f526c9d8 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -175,33 +175,35 @@ void em_dev_unregister_perf_domain(struct device *dev);
 
 /**
  * em_pd_get_efficient_state() - Get an efficient performance state from the EM
- * @pd   : Performance domain for which we want an efficient frequency
- * @freq : Frequency to map with the EM
+ * @state:		List of performance states, in ascending order
+ * @nr_perf_states:	Number of performance states
+ * @freq:		Frequency to map with the EM
+ * @pd_flags:		Performance Domain flags
  *
  * It is called from the scheduler code quite frequently and as a consequence
  * doesn't implement any check.
  *
- * Return: An efficient performance state, high enough to meet @freq
+ * Return: An efficient performance state id, high enough to meet @freq
  * requirement.
  */
-static inline
-struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
-						unsigned long freq)
+static inline int
+em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
+			  unsigned long freq, unsigned long pd_flags)
 {
 	struct em_perf_state *ps;
 	int i;
 
-	for (i = 0; i < pd->nr_perf_states; i++) {
-		ps = &pd->table[i];
+	for (i = 0; i < nr_perf_states; i++) {
+		ps = &table[i];
 		if (ps->frequency >= freq) {
-			if (pd->flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+			if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
 			    ps->flags & EM_PERF_STATE_INEFFICIENT)
 				continue;
-			break;
+			return i;
 		}
 	}
 
-	return ps;
+	return nr_perf_states - 1;
 }
 
 /**
@@ -226,7 +228,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 {
 	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
-	int cpu;
+	int cpu, i;
 
 	if (!sum_util)
 		return 0;
@@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested frequency.
 	 */
-	ps = em_pd_get_efficient_state(pd, freq);
+	i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
+				      pd->flags);
+	ps = &pd->table[i];
 
 	/*
 	 * The capacity of a CPU in the domain at the performance state (ps)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 04/17] PM: EM: Create a new function em_compute_costs()
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (2 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 03/17] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30  9:51   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 05/17] trace: energy_model: Add trace event for EM runtime modifications Lukasz Luba
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

Create a dedicated function which will be easier to maintain and re-use
in future. The upcoming changes for the modifiable EM perf_state table
will use it.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 72 ++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 85a70b7da023..fd1066dcf38b 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -103,14 +103,52 @@ static void em_debug_create_pd(struct device *dev) {}
 static void em_debug_remove_pd(struct device *dev) {}
 #endif
 
+static int em_compute_costs(struct device *dev, struct em_perf_state *table,
+			    struct em_data_callback *cb, int nr_states,
+			    unsigned long flags)
+{
+	unsigned long prev_cost = ULONG_MAX;
+	u64 fmax;
+	int i, ret;
+
+	/* Compute the cost of each performance state. */
+	fmax = (u64) table[nr_states - 1].frequency;
+	for (i = nr_states - 1; i >= 0; i--) {
+		unsigned long power_res, cost;
+
+		if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
+			ret = cb->get_cost(dev, table[i].frequency, &cost);
+			if (ret || !cost || cost > EM_MAX_POWER) {
+				dev_err(dev, "EM: invalid cost %lu %d\n",
+					cost, ret);
+				return -EINVAL;
+			}
+		} else {
+			power_res = table[i].power;
+			cost = div64_u64(fmax * power_res, table[i].frequency);
+		}
+
+		table[i].cost = cost;
+
+		if (table[i].cost >= prev_cost) {
+			table[i].flags = EM_PERF_STATE_INEFFICIENT;
+			dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
+				table[i].frequency);
+		} else {
+			prev_cost = table[i].cost;
+		}
+	}
+
+	return 0;
+}
+
 static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 				int nr_states, struct em_data_callback *cb,
 				unsigned long flags)
 {
-	unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
+	unsigned long power, freq, prev_freq = 0;
 	struct em_perf_state *table;
 	int i, ret;
-	u64 fmax;
 
 	table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
 	if (!table)
@@ -154,33 +192,9 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 		table[i].frequency = prev_freq = freq;
 	}
 
-	/* Compute the cost of each performance state. */
-	fmax = (u64) table[nr_states - 1].frequency;
-	for (i = nr_states - 1; i >= 0; i--) {
-		unsigned long power_res, cost;
-
-		if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
-			ret = cb->get_cost(dev, table[i].frequency, &cost);
-			if (ret || !cost || cost > EM_MAX_POWER) {
-				dev_err(dev, "EM: invalid cost %lu %d\n",
-					cost, ret);
-				goto free_ps_table;
-			}
-		} else {
-			power_res = table[i].power;
-			cost = div64_u64(fmax * power_res, table[i].frequency);
-		}
-
-		table[i].cost = cost;
-
-		if (table[i].cost >= prev_cost) {
-			table[i].flags = EM_PERF_STATE_INEFFICIENT;
-			dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
-				table[i].frequency);
-		} else {
-			prev_cost = table[i].cost;
-		}
-	}
+	ret = em_compute_costs(dev, table, cb, nr_states, flags);
+	if (ret)
+		goto free_ps_table;
 
 	pd->table = table;
 	pd->nr_perf_states = nr_states;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 05/17] trace: energy_model: Add trace event for EM runtime modifications
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (3 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 04/17] PM: EM: Create a new function em_compute_costs() Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30 10:03   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 06/17] PM: EM: Add update_power() callback for " Lukasz Luba
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The Energy Model (EM) supports runtime modifications. Track the changes
in order to do post-processing analysis. Don't use arrays in the trace
event, since they are not properly supported by the tools. Instead use
simple "unroll" with emitting the trace event for each EM array entry
with proper ID information. The older debugging mechanism which was
the simple debugfs which dumping the EM content won't be sufficient for
the modifiable EM purpose. This trace event mechanism would address the
needs.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/trace/events/energy_model.h | 46 +++++++++++++++++++++++++++++
 kernel/power/energy_model.c         |  3 ++
 2 files changed, 49 insertions(+)
 create mode 100644 include/trace/events/energy_model.h

diff --git a/include/trace/events/energy_model.h b/include/trace/events/energy_model.h
new file mode 100644
index 000000000000..f70babeb5dde
--- /dev/null
+++ b/include/trace/events/energy_model.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM energy_model
+
+#if !defined(_TRACE_ENERGY_MODEL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ENERGY_MODEL_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(em_perf_state,
+	TP_PROTO(const char *dev_name, int nr_perf_states, int state,
+		 unsigned long ps_frequency, unsigned long ps_power,
+		 unsigned long ps_cost, unsigned long ps_flags),
+
+	TP_ARGS(dev_name, nr_perf_states, state, ps_frequency, ps_power, ps_cost,
+		ps_flags),
+
+	TP_STRUCT__entry(
+		__string(name, dev_name)
+		__field(int, num_states)
+		__field(int, state)
+		__field(unsigned long, frequency)
+		__field(unsigned long, power)
+		__field(unsigned long, cost)
+		__field(unsigned long, flags)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, dev_name);
+		__entry->num_states = nr_perf_states;
+		__entry->state = state;
+		__entry->frequency = ps_frequency;
+		__entry->power = ps_power;
+		__entry->cost = ps_cost;
+		__entry->flags = ps_flags;
+	),
+
+	TP_printk("dev_name=%s nr_perf_states=%d state=%d frequency=%lu power=%lu cost=%lu flags=%lu",
+		__get_str(name), __entry->num_states, __entry->state,
+		__entry->frequency, __entry->power, __entry->cost,
+		__entry->flags)
+);
+#endif /* _TRACE_ENERGY_MODEL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index fd1066dcf38b..61d349fec545 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -17,6 +17,9 @@
 #include <linux/sched/topology.h>
 #include <linux/slab.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/energy_model.h>
+
 /*
  * Mutex serializing the registrations of performance domains and letting
  * callbacks defined by drivers sleep.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 06/17] PM: EM: Add update_power() callback for runtime modifications
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (4 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 05/17] trace: energy_model: Add trace event for EM runtime modifications Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30  9:31   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 07/17] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The Energy Model (EM) is going to support runtime modifications. This
new callback would be used in the upcoming EM changes. The drivers
or frameworks which want to modify the EM have to implement the
update_power() callback and provide it via EM API
em_dev_update_perf_domain(). The callback is then used by the EM
framework to get new power values for each frequency in existing EM.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 8069f526c9d8..cc2bf607191e 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -158,6 +158,26 @@ struct em_data_callback {
 	 */
 	int (*get_cost)(struct device *dev, unsigned long freq,
 			unsigned long *cost);
+
+	/**
+	 * update_power() - Provide new power at the given performance state of
+	 *		a device
+	 * @dev		: Device for which we do this operation (can be a CPU)
+	 * @freq	: Frequency at the performance state in kHz
+	 * @power	: New power value at the performance state
+	 *		(modified)
+	 * @priv	: Pointer to private data useful for tracking context
+	 *		during run-time modifications of EM.
+	 *
+	 * The update_power() is used by run-time modifiable EM. It aims to
+	 * provide updated power value for a given frequency, which is stored
+	 * in the performance state. The power value provided by this callback
+	 * should fit in the [0, EM_MAX_POWER] range.
+	 *
+	 * Return 0 on success, or appropriate error value in case of failure.
+	 */
+	int (*update_power)(struct device *dev, unsigned long freq,
+			    unsigned long *power, void *priv);
 };
 #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) ((em_cb).active_power = cb)
 #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb)	\
@@ -165,6 +185,7 @@ struct em_data_callback {
 	  .get_cost = _cost_cb }
 #define EM_DATA_CB(_active_power_cb)			\
 		EM_ADV_DATA_CB(_active_power_cb, NULL)
+#define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb }
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 07/17] PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (5 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 06/17] PM: EM: Add update_power() callback for " Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table Lukasz Luba
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The em_compute_cost() is going to be re-used in runtime modified EM
code path. Thus, make sure that this common code is safe and won't
try to use the NULL pointer. The former em_compute_cost() didn't have to
care about runtime modification code path. The upcoming changes introduce
such option, but with different callback. Those two paths which use
get_cost() (during first EM registration) or update_power() (during
runtime modification) need to be safely handled in em_compute_costs().

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 61d349fec545..8866d217714e 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -119,7 +119,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	for (i = nr_states - 1; i >= 0; i--) {
 		unsigned long power_res, cost;
 
-		if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
+		if (flags & EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost) {
 			ret = cb->get_cost(dev, table[i].frequency, &cost);
 			if (ret || !cost || cost > EM_MAX_POWER) {
 				dev_err(dev, "EM: invalid cost %lu %d\n",
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (6 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 07/17] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-14  4:28   ` kernel test robot
  2023-05-30 10:18   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 09/17] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

This patch introduces the new feature: modifiable EM perf_state table.
The new runtime table would be populated with a new power data to better
reflect the actual power. The power can vary over time e.g. due to the
SoC temperature change. Higher temperature can increase power values.
For longer running scenarios, such as game or camera, when also other
devices are used (e.g. GPU, ISP) the CPU power can change. The new
EM framework is able to addresses this issue and change the data
at runtime safely. The runtime modifiable EM data is used by the Energy
Aware Scheduler (EAS) for the task placement.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 13 +++++++++++++
 kernel/power/energy_model.c  | 24 ++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index cc2bf607191e..a616006a8130 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -36,9 +36,21 @@ struct em_perf_state {
  */
 #define EM_PERF_STATE_INEFFICIENT BIT(0)
 
+/**
+ * struct em_perf_table - Performance states table, which can be
+ *		runtime modifiable and protected with RCU
+ * @state:	List of performance states, in ascending order
+ * @rcu:	RCU used for safe access and destruction
+ */
+struct em_perf_table {
+	struct em_perf_state *state;
+	struct rcu_head rcu;
+};
+
 /**
  * struct em_perf_domain - Performance domain
  * @table:		List of performance states, in ascending order
+ * @runtime_table:	Pointer to the runtime modified em_perf_table
  * @nr_perf_states:	Number of performance states
  * @flags:		See "em_perf_domain flags"
  * @cpus:		Cpumask covering the CPUs of the domain. It's here
@@ -54,6 +66,7 @@ struct em_perf_state {
  */
 struct em_perf_domain {
 	struct em_perf_state *table;
+	struct em_perf_table __rcu *runtime_table;
 	int nr_perf_states;
 	unsigned long flags;
 	unsigned long cpus[];
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 8866d217714e..39d47028ef3d 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -213,6 +213,7 @@ static int em_create_pd(struct device *dev, int nr_states,
 			struct em_data_callback *cb, cpumask_t *cpus,
 			unsigned long flags)
 {
+	struct em_perf_table *runtime_table;
 	struct em_perf_domain *pd;
 	struct device *cpu_dev;
 	int cpu, ret, num_cpus;
@@ -237,12 +238,23 @@ static int em_create_pd(struct device *dev, int nr_states,
 			return -ENOMEM;
 	}
 
+	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
+	if (!runtime_table) {
+		kfree(pd);
+		return -ENOMEM;
+	}
+
 	ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
 	if (ret) {
 		kfree(pd);
+		kfree(runtime_table);
 		return ret;
 	}
 
+	/* Re-use temporally (till 1st modification) the memory */
+	runtime_table->state = pd->table;
+	rcu_assign_pointer(pd->runtime_table, runtime_table);
+
 	if (_is_cpu_device(dev))
 		for_each_cpu(cpu, cpus) {
 			cpu_dev = get_cpu_device(cpu);
@@ -438,20 +450,32 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
  */
 void em_dev_unregister_perf_domain(struct device *dev)
 {
+	struct em_perf_domain *pd;
+	struct em_perf_table *tmp;
+
 	if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
 		return;
 
 	if (_is_cpu_device(dev))
 		return;
 
+	pd = dev->em_pd;
 	/*
 	 * The mutex separates all register/unregister requests and protects
 	 * from potential clean-up/setup issues in the debugfs directories.
 	 * The debugfs directory name is the same as device's name.
 	 */
 	mutex_lock(&em_pd_mutex);
+
 	em_debug_remove_pd(dev);
 
+	tmp = pd->runtime_table;
+
+	rcu_assign_pointer(pd->runtime_table, NULL);
+	synchronize_rcu();
+
+	kfree(tmp);
+
 	kfree(dev->em_pd->table);
 	kfree(dev->em_pd);
 	dev->em_pd = NULL;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 09/17] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (7 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30 10:02   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 10/17] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The EM is going to support runtime modifications of the power data.
In order to achieve that prepare the internal mechanism. This patch
introduces RCU safe mechanism to clean up the old allocated EM data.
It also adds a mutex for the EM structure to serialize the modifiers.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 57 +++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 39d47028ef3d..bc9b7dec0763 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -26,6 +26,9 @@
  */
 static DEFINE_MUTEX(em_pd_mutex);
 
+static void em_cpufreq_update_efficiencies(struct device *dev,
+					   struct em_perf_state *table);
+
 static bool _is_cpu_device(struct device *dev)
 {
 	return (dev->bus == &cpu_subsys);
@@ -106,6 +109,60 @@ static void em_debug_create_pd(struct device *dev) {}
 static void em_debug_remove_pd(struct device *dev) {}
 #endif
 
+static void em_destroy_rt_table_rcu(struct rcu_head *rp)
+{
+	struct em_perf_table *runtime_table;
+
+	runtime_table = container_of(rp, struct em_perf_table, rcu);
+	kfree(runtime_table->state);
+	kfree(runtime_table);
+}
+
+static void em_destroy_tmp_setup_rcu(struct rcu_head *rp)
+{
+	struct em_perf_table *runtime_table;
+
+	runtime_table = container_of(rp, struct em_perf_table, rcu);
+	kfree(runtime_table);
+}
+
+static void em_perf_runtime_table_set(struct device *dev,
+				      struct em_perf_table *runtime_table)
+{
+	struct em_perf_domain *pd = dev->em_pd;
+	struct em_perf_table *tmp;
+
+	tmp = pd->runtime_table;
+
+	rcu_assign_pointer(pd->runtime_table, runtime_table);
+
+	em_cpufreq_update_efficiencies(dev, runtime_table->state);
+
+	if (trace_em_perf_state_enabled()) {
+		unsigned long freq, power, cost, flags;
+		int i;
+
+		for (i = 0; i < pd->nr_perf_states; i++) {
+			freq = runtime_table->state[i].frequency;
+			power = runtime_table->state[i].power;
+			cost = runtime_table->state[i].cost;
+			flags = runtime_table->state[i].flags;
+
+			trace_em_perf_state(dev_name(dev), pd->nr_perf_states,
+					    i, freq, power, cost, flags);
+		}
+	}
+
+	/*
+	 * Check if the 'state' array is not actually the one from setup.
+	 * If it is then don't free it.
+	 */
+	if (tmp->state == pd->table)
+		call_rcu(&tmp->rcu, em_destroy_tmp_setup_rcu);
+	else
+		call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
+}
+
 static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 			    struct em_data_callback *cb, int nr_states,
 			    unsigned long flags)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 10/17] PM: EM: Add runtime update interface to modify EM power
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (8 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 09/17] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 11/17] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS Lukasz Luba
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

Add an interface which allows to modify EM power data at runtime.
The new power information is populated by the provided callback, which
is called for each performance state. The CPU frequencies' efficiency is
re-calculated since that might be affected as well. The old EM memory
is going to be freed later using RCU mechanism.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h |   8 +++
 kernel/power/energy_model.c  | 109 +++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index a616006a8130..e1772aa6c843 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -202,6 +202,8 @@ struct em_data_callback {
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv);
 int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				struct em_data_callback *cb, cpumask_t *span,
 				bool microwatts);
@@ -382,6 +384,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
 {
 	return 0;
 }
+static inline
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	return -EINVAL;
+}
 #endif
 
 #endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index bc9b7dec0763..b5675dda00e1 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -202,6 +202,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	return 0;
 }
 
+/**
+ * em_dev_update_perf_domain() - Update run-time EM table for a device
+ * @dev		: Device for which the EM is to be updated
+ * @cb		: Callback function providing the power data for the EM
+ * @priv	: Pointer to private data useful for passing context
+ *		which might be required while calling @cb
+ *
+ * Update EM run-time modifiable table for a @dev using the callback
+ * defined in @cb. The EM new power values are then used for calculating
+ * the em_perf_state::cost for associated performance state.
+ *
+ * This function uses mutex to serialize writers, so it must not be called
+ * from non-sleeping context.
+ *
+ * Return 0 on success or a proper error in case of failure.
+ */
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	struct em_perf_table *runtime_table;
+	unsigned long power, freq;
+	struct em_perf_domain *pd;
+	int ret, i;
+
+	if (!cb || !cb->update_power)
+		return -EINVAL;
+
+	/*
+	 * The lock serializes update and unregister code paths. When the
+	 * EM has been unregistered in the meantime, we should capture that
+	 * when entering this critical section. It also makes sure that
+	 * two concurrent updates will be serialized.
+	 */
+	mutex_lock(&em_pd_mutex);
+
+	if (!dev || !dev->em_pd) {
+		ret = -EINVAL;
+		goto unlock_em;
+	}
+
+	pd = dev->em_pd;
+
+	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
+	if (!runtime_table) {
+		ret = -ENOMEM;
+		goto unlock_em;
+	}
+
+	runtime_table->state = kcalloc(pd->nr_perf_states,
+				       sizeof(struct em_perf_state),
+				       GFP_KERNEL);
+	if (!runtime_table->state) {
+		ret = -ENOMEM;
+		goto free_runtime_table;
+	}
+
+	/* Populate runtime table with updated values using driver callback */
+	for (i = 0; i < pd->nr_perf_states; i++) {
+		freq = pd->table[i].frequency;
+		runtime_table->state[i].frequency = freq;
+
+		/*
+		 * Call driver callback to get a new power value for
+		 * a given frequency.
+		 */
+		ret = cb->update_power(dev, freq, &power, priv);
+		if (ret) {
+			dev_dbg(dev, "EM: run-time update error: %d\n", ret);
+			goto free_runtime_state_table;
+		}
+
+		runtime_table->state[i].power = power;
+	}
+
+	ret = em_compute_costs(dev, runtime_table->state, cb,
+			       pd->nr_perf_states, pd->flags);
+	if (ret)
+		goto free_runtime_state_table;
+
+	em_perf_runtime_table_set(dev, runtime_table);
+
+	mutex_unlock(&em_pd_mutex);
+	return 0;
+
+free_runtime_state_table:
+	kfree(runtime_table->state);
+free_runtime_table:
+	kfree(runtime_table);
+unlock_em:
+	mutex_unlock(&em_pd_mutex);
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
+
 static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 				int nr_states, struct em_data_callback *cb,
 				unsigned long flags)
@@ -521,6 +616,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
 	 * The mutex separates all register/unregister requests and protects
 	 * from potential clean-up/setup issues in the debugfs directories.
 	 * The debugfs directory name is the same as device's name.
+	 * The lock also protects the updater of the runtime modifiable
+	 * EM and this remover.
 	 */
 	mutex_lock(&em_pd_mutex);
 
@@ -528,9 +625,21 @@ void em_dev_unregister_perf_domain(struct device *dev)
 
 	tmp = pd->runtime_table;
 
+	/*
+	 * Safely destroy runtime modifiable EM. By using the call
+	 * synchronize_rcu() we make sure we don't progress till last user
+	 * finished the RCU section and our update got applied.
+	 */
 	rcu_assign_pointer(pd->runtime_table, NULL);
 	synchronize_rcu();
 
+	/*
+	 * After the sync no updates will be in-flight, so free the old
+	 * memory.
+	 */
+	if (tmp->state != pd->table)
+		kfree(tmp->state);
+
 	kfree(tmp);
 
 	kfree(dev->em_pd->table);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 11/17] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (9 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 10/17] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 12/17] PM: EM: Add argument to get_cost() for runtime modification Lukasz Luba
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The new Energy Model (EM) supports runtime modification of the performance
state table to better model the power used by the SoC. Use this new
feature to improve energy estimation and therefore task placement in
Energy Aware Scheduler (EAS).

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e1772aa6c843..8e3fa2b6bf28 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -262,6 +262,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 				unsigned long max_util, unsigned long sum_util,
 				unsigned long allowed_cpu_cap)
 {
+	struct em_perf_table *runtime_table;
 	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
 	int cpu, i;
@@ -279,7 +280,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 */
 	cpu = cpumask_first(to_cpumask(pd->cpus));
 	scale_cpu = arch_scale_cpu_capacity(cpu);
-	ps = &pd->table[pd->nr_perf_states - 1];
+
+	/*
+	 * No rcu_read_lock() since it's already called by task scheduler.
+	 * The runtime_table is always there for CPUs, so we don't check.
+	 */
+	runtime_table = rcu_dereference(pd->runtime_table);
+
+	ps = &runtime_table->state[pd->nr_perf_states - 1];
 
 	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
@@ -289,9 +297,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested frequency.
 	 */
-	i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
-				      pd->flags);
-	ps = &pd->table[i];
+	i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
+				      freq, pd->flags);
+	ps = &runtime_table->state[i];
 
 	/*
 	 * The capacity of a CPU in the domain at the performance state (ps)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 12/17] PM: EM: Add argument to get_cost() for runtime modification
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (10 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 11/17] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30  9:53   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 13/17] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The Energy Model (EM) supports runtime modifications. Let also the
artificial EM use this new feature and allow to update the 'cost' values
at runtime. When the artificial EM is used there is a need to provide
two callbacks: get_cost() and update_power(), not only the last one.

Update also CPPC driver code, since the new argument is needed there
to compile properly and register EM.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/cpufreq/cppc_cpufreq.c | 2 +-
 include/linux/energy_model.h   | 7 ++++++-
 kernel/power/energy_model.c    | 9 +++++----
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 022e3555407c..f5353c10552b 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -574,7 +574,7 @@ static int cppc_get_cpu_power(struct device *cpu_dev,
 }
 
 static int cppc_get_cpu_cost(struct device *cpu_dev, unsigned long KHz,
-		unsigned long *cost)
+		unsigned long *cost, void *priv)
 {
 	unsigned long perf_step, perf_prev;
 	struct cppc_perf_caps *perf_caps;
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 8e3fa2b6bf28..b8506df9af2d 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -162,6 +162,8 @@ struct em_data_callback {
 	 * @freq	: Frequency at the performance state in kHz
 	 * @cost	: The cost value for the performance state
 	 *		(modified)
+	 * @priv	: Pointer to private data useful for tracking context
+	 *		during run-time modifications of EM.
 	 *
 	 * In case of CPUs, the cost is the one of a single CPU in the domain.
 	 * It is expected to fit in the [0, EM_MAX_POWER] range due to internal
@@ -170,7 +172,7 @@ struct em_data_callback {
 	 * Return 0 on success, or appropriate error value in case of failure.
 	 */
 	int (*get_cost)(struct device *dev, unsigned long freq,
-			unsigned long *cost);
+			unsigned long *cost, void *priv);
 
 	/**
 	 * update_power() - Provide new power at the given performance state of
@@ -199,6 +201,9 @@ struct em_data_callback {
 #define EM_DATA_CB(_active_power_cb)			\
 		EM_ADV_DATA_CB(_active_power_cb, NULL)
 #define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb }
+#define EM_ADV_UPDATE_CB(_update_power_cb, _cost_cb)	\
+	{ .update_power = &_update_power_cb,		\
+	  .get_cost = _cost_cb }
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index b5675dda00e1..456d9f2b4370 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -165,7 +165,7 @@ static void em_perf_runtime_table_set(struct device *dev,
 
 static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 			    struct em_data_callback *cb, int nr_states,
-			    unsigned long flags)
+			    unsigned long flags, void *priv)
 {
 	unsigned long prev_cost = ULONG_MAX;
 	u64 fmax;
@@ -177,7 +177,8 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 		unsigned long power_res, cost;
 
 		if (flags & EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost) {
-			ret = cb->get_cost(dev, table[i].frequency, &cost);
+			ret = cb->get_cost(dev, table[i].frequency, &cost,
+					   priv);
 			if (ret || !cost || cost > EM_MAX_POWER) {
 				dev_err(dev, "EM: invalid cost %lu %d\n",
 					cost, ret);
@@ -277,7 +278,7 @@ int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
 	}
 
 	ret = em_compute_costs(dev, runtime_table->state, cb,
-			       pd->nr_perf_states, pd->flags);
+			       pd->nr_perf_states, pd->flags, priv);
 	if (ret)
 		goto free_runtime_state_table;
 
@@ -347,7 +348,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 		table[i].frequency = prev_freq = freq;
 	}
 
-	ret = em_compute_costs(dev, table, cb, nr_states, flags);
+	ret = em_compute_costs(dev, table, cb, nr_states, flags, NULL);
 	if (ret)
 		goto free_ps_table;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 13/17] PM: EM: Refactor struct em_perf_domain and add default_table
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (11 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 12/17] PM: EM: Add argument to get_cost() for runtime modification Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30 10:23   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 14/17] Documentation: EM: Add a new section about the design Lukasz Luba
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The Energy Model support runtime modifications. Refactor old
implementation which accessed struct em_perf_state and introduce
em_perf_domain::default_table to clean up the design. This new field
is better aligned with em_perf_domain::runtime_table and helps to
distinguish them better.

Update all drivers or frameworks which used the old field:
em_perf_domain::table and now should use em_perf_domain::default_table.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/powercap/dtpm_cpu.c       | 27 ++++++++++++++++++--------
 drivers/powercap/dtpm_devfreq.c   | 23 +++++++++++++++-------
 drivers/thermal/cpufreq_cooling.c | 23 ++++++++++++++--------
 drivers/thermal/devfreq_cooling.c | 23 ++++++++++++++++------
 include/linux/energy_model.h      |  4 ++--
 kernel/power/energy_model.c       | 32 ++++++++++++++++++++++---------
 6 files changed, 92 insertions(+), 40 deletions(-)

diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index 2ff7717530bf..743a0ac8ecdf 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -43,6 +43,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
 	struct em_perf_domain *pd = em_cpu_get(dtpm_cpu->cpu);
+	struct em_perf_state *table;
 	struct cpumask cpus;
 	unsigned long freq;
 	u64 power;
@@ -51,19 +52,21 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 	cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
 	nr_cpus = cpumask_weight(&cpus);
 
+	table = pd->default_table->state;
+
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		power = pd->table[i].power * nr_cpus;
+		power = table[i].power * nr_cpus;
 
 		if (power > power_limit)
 			break;
 	}
 
-	freq = pd->table[i - 1].frequency;
+	freq = table[i - 1].frequency;
 
 	freq_qos_update_request(&dtpm_cpu->qos_req, freq);
 
-	power_limit = pd->table[i - 1].power * nr_cpus;
+	power_limit = table[i - 1].power * nr_cpus;
 
 	return power_limit;
 }
@@ -88,12 +91,14 @@ static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
 static u64 get_pd_power_uw(struct dtpm *dtpm)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
+	struct em_perf_state *table;
 	struct em_perf_domain *pd;
 	struct cpumask *pd_mask;
 	unsigned long freq;
 	int i;
 
 	pd = em_cpu_get(dtpm_cpu->cpu);
+	table = pd->default_table->state;
 
 	pd_mask = em_span_cpus(pd);
 
@@ -101,10 +106,10 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
 
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		if (pd->table[i].frequency < freq)
+		if (table[i].frequency < freq)
 			continue;
 
-		return scale_pd_power_uw(pd_mask, pd->table[i].power *
+		return scale_pd_power_uw(pd_mask, table[i].power *
 					 MICROWATT_PER_MILLIWATT);
 	}
 
@@ -115,17 +120,20 @@ static int update_pd_power_uw(struct dtpm *dtpm)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
 	struct em_perf_domain *em = em_cpu_get(dtpm_cpu->cpu);
+	struct em_perf_state *table;
 	struct cpumask cpus;
 	int nr_cpus;
 
 	cpumask_and(&cpus, cpu_online_mask, to_cpumask(em->cpus));
 	nr_cpus = cpumask_weight(&cpus);
 
-	dtpm->power_min = em->table[0].power;
+	table = em->default_table->state;
+
+	dtpm->power_min = table[0].power;
 	dtpm->power_min *= MICROWATT_PER_MILLIWATT;
 	dtpm->power_min *= nr_cpus;
 
-	dtpm->power_max = em->table[em->nr_perf_states - 1].power;
+	dtpm->power_max = table[em->nr_perf_states - 1].power;
 	dtpm->power_max *= MICROWATT_PER_MILLIWATT;
 	dtpm->power_max *= nr_cpus;
 
@@ -182,6 +190,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
 {
 	struct dtpm_cpu *dtpm_cpu;
 	struct cpufreq_policy *policy;
+	struct em_perf_state *table;
 	struct em_perf_domain *pd;
 	char name[CPUFREQ_NAME_LEN];
 	int ret = -ENOMEM;
@@ -198,6 +207,8 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
 	if (!pd || em_is_artificial(pd))
 		return -EINVAL;
 
+	table = pd->default_table->state;
+
 	dtpm_cpu = kzalloc(sizeof(*dtpm_cpu), GFP_KERNEL);
 	if (!dtpm_cpu)
 		return -ENOMEM;
@@ -216,7 +227,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
 
 	ret = freq_qos_add_request(&policy->constraints,
 				   &dtpm_cpu->qos_req, FREQ_QOS_MAX,
-				   pd->table[pd->nr_perf_states - 1].frequency);
+				   table[pd->nr_perf_states - 1].frequency);
 	if (ret)
 		goto out_dtpm_unregister;
 
diff --git a/drivers/powercap/dtpm_devfreq.c b/drivers/powercap/dtpm_devfreq.c
index 91276761a31d..6ef0f2b4a683 100644
--- a/drivers/powercap/dtpm_devfreq.c
+++ b/drivers/powercap/dtpm_devfreq.c
@@ -37,11 +37,14 @@ static int update_pd_power_uw(struct dtpm *dtpm)
 	struct devfreq *devfreq = dtpm_devfreq->devfreq;
 	struct device *dev = devfreq->dev.parent;
 	struct em_perf_domain *pd = em_pd_get(dev);
+	struct em_perf_state *table;
 
-	dtpm->power_min = pd->table[0].power;
+	table = pd->default_table->state;
+
+	dtpm->power_min = table[0].power;
 	dtpm->power_min *= MICROWATT_PER_MILLIWATT;
 
-	dtpm->power_max = pd->table[pd->nr_perf_states - 1].power;
+	dtpm->power_max = table[pd->nr_perf_states - 1].power;
 	dtpm->power_max *= MICROWATT_PER_MILLIWATT;
 
 	return 0;
@@ -53,22 +56,25 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 	struct devfreq *devfreq = dtpm_devfreq->devfreq;
 	struct device *dev = devfreq->dev.parent;
 	struct em_perf_domain *pd = em_pd_get(dev);
+	struct em_perf_state *table;
 	unsigned long freq;
 	u64 power;
 	int i;
 
+	table = pd->default_table->state;
+
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		power = pd->table[i].power * MICROWATT_PER_MILLIWATT;
+		power = table[i].power * MICROWATT_PER_MILLIWATT;
 		if (power > power_limit)
 			break;
 	}
 
-	freq = pd->table[i - 1].frequency;
+	freq = table[i - 1].frequency;
 
 	dev_pm_qos_update_request(&dtpm_devfreq->qos_req, freq);
 
-	power_limit = pd->table[i - 1].power * MICROWATT_PER_MILLIWATT;
+	power_limit = table[i - 1].power * MICROWATT_PER_MILLIWATT;
 
 	return power_limit;
 }
@@ -94,6 +100,7 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
 	struct device *dev = devfreq->dev.parent;
 	struct em_perf_domain *pd = em_pd_get(dev);
 	struct devfreq_dev_status status;
+	struct em_perf_state *table;
 	unsigned long freq;
 	u64 power;
 	int i;
@@ -102,15 +109,17 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
 	status = devfreq->last_status;
 	mutex_unlock(&devfreq->lock);
 
+	table = pd->default_table->state;
+
 	freq = DIV_ROUND_UP(status.current_frequency, HZ_PER_KHZ);
 	_normalize_load(&status);
 
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		if (pd->table[i].frequency < freq)
+		if (table[i].frequency < freq)
 			continue;
 
-		power = pd->table[i].power * MICROWATT_PER_MILLIWATT;
+		power = table[i].power * MICROWATT_PER_MILLIWATT;
 		power *= status.busy_time;
 		power >>= 10;
 
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index e2cc7bd30862..1d979c5e05ed 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -91,10 +91,11 @@ struct cpufreq_cooling_device {
 static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
 			       unsigned int freq)
 {
+	struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
 	int i;
 
 	for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
-		if (freq > cpufreq_cdev->em->table[i].frequency)
+		if (freq > table[i].frequency)
 			break;
 	}
 
@@ -104,15 +105,16 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
 static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
 			     u32 freq)
 {
+	struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
 	unsigned long power_mw;
 	int i;
 
 	for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
-		if (freq > cpufreq_cdev->em->table[i].frequency)
+		if (freq > table[i].frequency)
 			break;
 	}
 
-	power_mw = cpufreq_cdev->em->table[i + 1].power;
+	power_mw = table[i + 1].power;
 	power_mw /= MICROWATT_PER_MILLIWATT;
 
 	return power_mw;
@@ -121,18 +123,19 @@ static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
 static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 			     u32 power)
 {
+	struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
 	unsigned long em_power_mw;
 	int i;
 
 	for (i = cpufreq_cdev->max_level; i > 0; i--) {
 		/* Convert EM power to milli-Watts to make safe comparison */
-		em_power_mw = cpufreq_cdev->em->table[i].power;
+		em_power_mw = table[i].power;
 		em_power_mw /= MICROWATT_PER_MILLIWATT;
 		if (power >= em_power_mw)
 			break;
 	}
 
-	return cpufreq_cdev->em->table[i].frequency;
+	return table[i].frequency;
 }
 
 /**
@@ -262,8 +265,9 @@ static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
 static int cpufreq_state2power(struct thermal_cooling_device *cdev,
 			       unsigned long state, u32 *power)
 {
-	unsigned int freq, num_cpus, idx;
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+	unsigned int freq, num_cpus, idx;
+	struct em_perf_state *table;
 
 	/* Request state should be less than max_level */
 	if (state > cpufreq_cdev->max_level)
@@ -271,8 +275,9 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
 
 	num_cpus = cpumask_weight(cpufreq_cdev->policy->cpus);
 
+	table = cpufreq_cdev->em->default_table->state;
 	idx = cpufreq_cdev->max_level - state;
-	freq = cpufreq_cdev->em->table[idx].frequency;
+	freq = table[idx].frequency;
 	*power = cpu_freq_to_power(cpufreq_cdev, freq) * num_cpus;
 
 	return 0;
@@ -378,8 +383,10 @@ static unsigned int get_state_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 #ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
 	/* Use the Energy Model table if available */
 	if (cpufreq_cdev->em) {
+		struct em_perf_state *table;
+		table = cpufreq_cdev->em->default_table->state;
 		idx = cpufreq_cdev->max_level - state;
-		return cpufreq_cdev->em->table[idx].frequency;
+		return table[idx].frequency;
 	}
 #endif
 
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 262e62ab6cf2..4207ef850582 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -87,6 +87,7 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
 	struct device *dev = df->dev.parent;
+	struct em_perf_state *table;
 	unsigned long freq;
 	int perf_idx;
 
@@ -99,8 +100,9 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 		return -EINVAL;
 
 	if (dfc->em_pd) {
+		table = dfc->em_pd->default_table->state;
 		perf_idx = dfc->max_state - state;
-		freq = dfc->em_pd->table[perf_idx].frequency * 1000;
+		freq = table[perf_idx].frequency * 1000;
 	} else {
 		freq = dfc->freq_table[state];
 	}
@@ -123,10 +125,11 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
  */
 static int get_perf_idx(struct em_perf_domain *em_pd, unsigned long freq)
 {
+	struct em_perf_state *table = em_pd->default_table->state;
 	int i;
 
 	for (i = 0; i < em_pd->nr_perf_states; i++) {
-		if (em_pd->table[i].frequency == freq)
+		if (table[i].frequency == freq)
 			return i;
 	}
 
@@ -181,6 +184,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
 	struct devfreq_dev_status status;
+	struct em_perf_state *table;
 	unsigned long state;
 	unsigned long freq;
 	unsigned long voltage;
@@ -192,6 +196,8 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 
 	freq = status.current_frequency;
 
+	table = dfc->em_pd->default_table->state;
+
 	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		voltage = get_voltage(df, freq);
 		if (voltage == 0) {
@@ -204,7 +210,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 			state = dfc->capped_state;
 
 			/* Convert EM power into milli-Watts first */
-			dfc->res_util = dfc->em_pd->table[state].power;
+			dfc->res_util = table[state].power;
 			dfc->res_util /= MICROWATT_PER_MILLIWATT;
 
 			dfc->res_util *= SCALE_ERROR_MITIGATION;
@@ -225,7 +231,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 		_normalize_load(&status);
 
 		/* Convert EM power into milli-Watts first */
-		*power = dfc->em_pd->table[perf_idx].power;
+		*power = table[perf_idx].power;
 		*power /= MICROWATT_PER_MILLIWATT;
 		/* Scale power for utilization */
 		*power *= status.busy_time;
@@ -245,13 +251,15 @@ static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,
 				       unsigned long state, u32 *power)
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
+	struct em_perf_state *table;
 	int perf_idx;
 
 	if (state > dfc->max_state)
 		return -EINVAL;
 
+	table = dfc->em_pd->default_table->state;
 	perf_idx = dfc->max_state - state;
-	*power = dfc->em_pd->table[perf_idx].power;
+	*power = table[perf_idx].power;
 	*power /= MICROWATT_PER_MILLIWATT;
 
 	return 0;
@@ -264,6 +272,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 	struct devfreq *df = dfc->devfreq;
 	struct devfreq_dev_status status;
 	unsigned long freq, em_power_mw;
+	struct em_perf_state *table;
 	s32 est_power;
 	int i;
 
@@ -273,6 +282,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 
 	freq = status.current_frequency;
 
+	table = dfc->em_pd->default_table->state;
+
 	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		/* Scale for resource utilization */
 		est_power = power * dfc->res_util;
@@ -290,7 +301,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 	 */
 	for (i = dfc->max_state; i > 0; i--) {
 		/* Convert EM power to milli-Watts to make safe comparison */
-		em_power_mw = dfc->em_pd->table[i].power;
+		em_power_mw = table[i].power;
 		em_power_mw /= MICROWATT_PER_MILLIWATT;
 		if (est_power >= em_power_mw)
 			break;
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b8506df9af2d..eb28920b1b2c 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -49,7 +49,7 @@ struct em_perf_table {
 
 /**
  * struct em_perf_domain - Performance domain
- * @table:		List of performance states, in ascending order
+ * @default_table:	Pointer to the default em_perf_table
  * @runtime_table:	Pointer to the runtime modified em_perf_table
  * @nr_perf_states:	Number of performance states
  * @flags:		See "em_perf_domain flags"
@@ -65,7 +65,7 @@ struct em_perf_table {
  * field is unused.
  */
 struct em_perf_domain {
-	struct em_perf_state *table;
+	struct em_perf_table *default_table;
 	struct em_perf_table __rcu *runtime_table;
 	int nr_perf_states;
 	unsigned long flags;
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 456d9f2b4370..204fd415ebc9 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -72,6 +72,7 @@ DEFINE_SHOW_ATTRIBUTE(em_debug_flags);
 
 static void em_debug_create_pd(struct device *dev)
 {
+	struct em_perf_table *table = dev->em_pd->default_table;
 	struct dentry *d;
 	int i;
 
@@ -87,7 +88,7 @@ static void em_debug_create_pd(struct device *dev)
 
 	/* Create a sub-directory for each performance state */
 	for (i = 0; i < dev->em_pd->nr_perf_states; i++)
-		em_debug_create_ps(&dev->em_pd->table[i], d);
+		em_debug_create_ps(&table->state[i], d);
 
 }
 
@@ -157,7 +158,7 @@ static void em_perf_runtime_table_set(struct device *dev,
 	 * Check if the 'state' array is not actually the one from setup.
 	 * If it is then don't free it.
 	 */
-	if (tmp->state == pd->table)
+	if (tmp->state == pd->default_table->state)
 		call_rcu(&tmp->rcu, em_destroy_tmp_setup_rcu);
 	else
 		call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
@@ -261,7 +262,7 @@ int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
 
 	/* Populate runtime table with updated values using driver callback */
 	for (i = 0; i < pd->nr_perf_states; i++) {
-		freq = pd->table[i].frequency;
+		freq = pd->default_table->state[i].frequency;
 		runtime_table->state[i].frequency = freq;
 
 		/*
@@ -352,7 +353,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 	if (ret)
 		goto free_ps_table;
 
-	pd->table = table;
+	pd->default_table->state = table;
 	pd->nr_perf_states = nr_states;
 
 	return 0;
@@ -366,7 +367,7 @@ static int em_create_pd(struct device *dev, int nr_states,
 			struct em_data_callback *cb, cpumask_t *cpus,
 			unsigned long flags)
 {
-	struct em_perf_table *runtime_table;
+	struct em_perf_table *default_table, *runtime_table;
 	struct em_perf_domain *pd;
 	struct device *cpu_dev;
 	int cpu, ret, num_cpus;
@@ -391,21 +392,31 @@ static int em_create_pd(struct device *dev, int nr_states,
 			return -ENOMEM;
 	}
 
+	default_table = kzalloc(sizeof(*default_table), GFP_KERNEL);
+	if (!default_table) {
+		kfree(pd);
+		return -ENOMEM;
+	}
+
 	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
 	if (!runtime_table) {
 		kfree(pd);
+		kfree(default_table);
 		return -ENOMEM;
 	}
 
+	pd->default_table = default_table;
+
 	ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
 	if (ret) {
 		kfree(pd);
+		kfree(default_table);
 		kfree(runtime_table);
 		return ret;
 	}
 
 	/* Re-use temporally (till 1st modification) the memory */
-	runtime_table->state = pd->table;
+	runtime_table->state = pd->default_table->state;
 	rcu_assign_pointer(pd->runtime_table, runtime_table);
 
 	if (_is_cpu_device(dev))
@@ -526,6 +537,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				bool microwatts)
 {
 	unsigned long cap, prev_cap = 0;
+	struct em_perf_state *table;
 	unsigned long flags = 0;
 	int cpu, ret;
 
@@ -584,7 +596,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 
 	dev->em_pd->flags |= flags;
 
-	em_cpufreq_update_efficiencies(dev, dev->em_pd->table);
+	table = dev->em_pd->default_table->state;
+	em_cpufreq_update_efficiencies(dev, table);
 
 	em_debug_create_pd(dev);
 	dev_info(dev, "EM: created perf domain\n");
@@ -638,12 +651,13 @@ void em_dev_unregister_perf_domain(struct device *dev)
 	 * After the sync no updates will be in-flight, so free the old
 	 * memory.
 	 */
-	if (tmp->state != pd->table)
+	if (tmp->state != pd->default_table->state)
 		kfree(tmp->state);
 
 	kfree(tmp);
 
-	kfree(dev->em_pd->table);
+	kfree(pd->default_table->state);
+	kfree(pd->default_table);
 	kfree(dev->em_pd);
 	dev->em_pd = NULL;
 	mutex_unlock(&em_pd_mutex);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 14/17] Documentation: EM: Add a new section about the design
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (12 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 13/17] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30 10:33   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 15/17] Documentation: EM: Add a runtime modifiable EM design description Lukasz Luba
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

Add a new section 'Design' which covers the information about Energy
Model. It contains the design decisions, describes models and how they
reflect the reality. Add description of the basic const. EM. Change the
other section IDs.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 Documentation/power/energy-model.rst | 36 +++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index ef341be2882b..e97c7f18d8bd 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -72,16 +72,34 @@ required to have the same micro-architecture. CPUs in different performance
 domains can have different micro-architectures.
 
 
-2. Core APIs
+2. Design
+-----------------
+
+2.1 Basic EM
+^^^^^^^^^^^^
+
+The basic EM is built around const. power information for each performance
+state. This model can be derived based on power measurements of the device
+e.g. CPU while running some benchmark. The benchmark might be integer heavy
+or floating point computation with a data set fitting into the CPU cache or
+registers. Bare in mind that this model might not be covering all possible
+workloads running on CPUs. Thus, please run a few different benchmarks and
+verify with some real workloads your power model values. The power variation
+due to the workload instruction mix and data set is not modeled.
+Also static power which can change during runtime due to variation of SOC
+temperature is not modeled in EM.
+
+
+3. Core APIs
 ------------
 
-2.1 Config options
+3.1 Config options
 ^^^^^^^^^^^^^^^^^^
 
 CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
 
 
-2.2 Registration of performance domains
+3.2 Registration of performance domains
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Registration of 'advanced' EM
@@ -110,8 +128,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel
 subsystems which use EM might rely on this flag to check if all EM devices use
 the same scale. If there are different scales, these subsystems might decide
 to return warning/error, stop working or panic.
-See Section 3. for an example of driver implementing this
-callback, or Section 2.4 for further documentation on this API
+See Section 4. for an example of driver implementing this
+callback, or Section 3.4 for further documentation on this API
 
 Registration of EM using DT
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -156,7 +174,7 @@ The EM which is registered using this method might not reflect correctly the
 physics of a real device, e.g. when static power (leakage) is important.
 
 
-2.3 Accessing performance domains
+3.3 Accessing performance domains
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 There are two API functions which provide the access to the energy model:
@@ -175,10 +193,10 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
 not provided for other type of devices.
 
 More details about the above APIs can be found in ``<linux/energy_model.h>``
-or in Section 2.4
+or in Section 3.4
 
 
-2.4 Description details of this API
+3.4 Description details of this API
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. kernel-doc:: include/linux/energy_model.h
    :internal:
@@ -187,7 +205,7 @@ or in Section 2.4
    :export:
 
 
-3. Example driver
+4. Example driver
 -----------------
 
 The CPUFreq framework supports dedicated callback for registering
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 15/17] Documentation: EM: Add a runtime modifiable EM design description
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (13 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 14/17] Documentation: EM: Add a new section about the design Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-30 10:42   ` Dietmar Eggemann
  2023-05-12  9:57 ` [PATCH v2 16/17] Documentation: EM: Add example with driver modifying the EM Lukasz Luba
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

Document the new runtime modifiable EM design and how it can be used.
Change the last section layout and allow to provide another example
how to use this new API in a driver code.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 Documentation/power/energy-model.rst | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index e97c7f18d8bd..64c2462dc9a6 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -89,6 +89,26 @@ due to the workload instruction mix and data set is not modeled.
 Also static power which can change during runtime due to variation of SOC
 temperature is not modeled in EM.
 
+2.2 Runtime modifiable EM
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To better reflect power variation due to static power (leakage) the EM
+supports runtime modifications of the power values. The mechanism relies on
+RCU to free the modifiable EM perf_state table memory. Its user, the task
+scheduler, also uses RCU to access this memory. The EM framework is
+responsible for allocating the new memory for the modifiable EM perf_state
+table. The old memory is freed automatically using RCU callback mechanism.
+This design decision is made based on task scheduler using that data and
+to prevent wrong usage of kernel modules if they would be responsible for the
+memory management.
+The drivers which want to modify the EM values are protected from concurrent
+access using a mutex. Therefore, the drivers must use sleeping context when
+they want to modify the EM. The runtime modifiable EM might also be used for
+better reflecting real workload scenarios, e.g. when they pop-up on the screen
+and will run for longer period, such as: games, video recoding or playing,
+video calls, etc. It is up to the platform engineers to experiment and choose
+the right approach for their device.
+
 
 3. Core APIs
 ------------
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 16/17] Documentation: EM: Add example with driver modifying the EM
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (14 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 15/17] Documentation: EM: Add a runtime modifiable EM design description Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-12  9:57 ` [PATCH v2 17/17] Documentation: EM: Describe the API of runtime modifications Lukasz Luba
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

The Energy Model (EM) support runtime modifications. Add description
with example driver code which updates EM.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 Documentation/power/energy-model.rst | 53 ++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index 64c2462dc9a6..a6ceeeb72868 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -225,8 +225,11 @@ or in Section 3.4
    :export:
 
 
-4. Example driver
------------------
+4. Examples
+-----------
+
+4.1 Example driver with EM registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The CPUFreq framework supports dedicated callback for registering
 the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
@@ -280,3 +283,49 @@ EM framework::
   39	static struct cpufreq_driver foo_cpufreq_driver = {
   40		.register_em = foo_cpufreq_register_em,
   41	};
+
+
+4.2 Example driver with EM modification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This section provides a simple example of a thermal driver modifying the EM.
+The driver implements a foo_mod_power() function to be provided to the
+EM framework. The driver is woken up periodically to check the temperature
+and modify the EM data if needed::
+
+  -> drivers/thermal/foo_thermal.c
+
+  01	static int foo_mod_power(struct device *dev, unsigned long freq,
+  02			unsigned long *power, void *priv)
+  03	{
+  04		struct foo_context *ctx = priv;
+  05
+  06		/* Estimate power for the given frequency and temperature */
+  07		*power = foo_estimate_power(dev, freq, ctx->temperature);
+  08		if (*power >= EM_MAX_POWER);
+  09			return -EINVAL;
+  10
+  11		return 0;
+  12	}
+  13
+  14	/*
+  15	 * Function called periodically to check the temperature and
+  16	 * update the EM if needed
+  17	 */
+  18	static void foo_thermal_em_update(struct foo_context *ctx)
+  19	{
+  20		struct em_data_callback em_cb = EM_UPDATE_CB(mod_power);
+  21		struct cpufreq_policy *policy = ctx->policy;
+  22		struct device *cpu_dev;
+  23
+  24		cpu_dev = get_cpu_device(cpumask_first(policy->cpus));
+  25
+  26		ctx->temperature = foo_get_temp(cpu_dev, ctx);
+  27		if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
+  28			return;
+  29
+  30		/* Update EM for the CPUs' performance domain */
+  31		ret = em_dev_update_perf_domain(cpu_dev, &em_cb, ctx);
+  32		if (ret)
+  33			pr_warn("foo_thermal: EM update failed\n");
+  34	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 17/17] Documentation: EM: Describe the API of runtime modifications
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (15 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 16/17] Documentation: EM: Add example with driver modifying the EM Lukasz Luba
@ 2023-05-12  9:57 ` Lukasz Luba
  2023-05-24 17:25 ` [PATCH v2 00/17] Introduce runtime modifiable Energy Model Rafael J. Wysocki
  2023-05-30 11:07 ` Dietmar Eggemann
  18 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-05-12  9:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	Pierre.Gondois, ionela.voinescu, rostedt, mhiramat

Describe the Energy Model runtime modification API and how it can
be used.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 Documentation/power/energy-model.rst | 31 ++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index a6ceeeb72868..2fd6e82a8124 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -213,10 +213,37 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
 not provided for other type of devices.
 
 More details about the above APIs can be found in ``<linux/energy_model.h>``
-or in Section 3.4
+or in Section 3.5
 
 
-3.4 Description details of this API
+3.4 Runtime modifications
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Drivers willing to modify the EM at runtime should use the following API::
+
+
+  int em_dev_update_perf_domain(struct device *dev,
+			struct em_data_callback *cb, void *priv);
+
+Drivers must provide a callback .update_power() returning power value for each
+performance state. The callback function provided by the driver is free
+to fetch data from any relevant location (DT, firmware, ...) or sensor.
+The .update_power() callback is called by the EM for each performance state to
+provide new power value. In the Section 4.2 there is an example driver
+which shows simple implementation of this mechanism. The callback can be
+declared with EM_UPDATE_CB() macro. The caller of that callback also passes
+a private void pointer back to the driver which tries to update EM.
+It is useful and helps to maintain the consistent context for all performance
+state calls for a given EM.
+The artificial EM also supports runtime modifications. For this type of EM
+there is a need to provide one more callback: .get_cost(). The .get_cost()
+returns the cost value for each performance state, which better reflects the
+efficiency of the CPUs which use artificial EM. Those two callbacks:
+.update_power() and get .get_cost() can be declared with one macro
+EM_ADV_UPDATE_CB() and then passed to the em_dev_update_perf_domain().
+
+
+3.5 Description details of this API
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. kernel-doc:: include/linux/energy_model.h
    :internal:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table
  2023-05-12  9:57 ` [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table Lukasz Luba
@ 2023-05-14  4:28   ` kernel test robot
  2023-05-30 10:18   ` Dietmar Eggemann
  1 sibling, 0 replies; 43+ messages in thread
From: kernel test robot @ 2023-05-14  4:28 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: oe-kbuild-all, lukasz.luba, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, Pierre.Gondois, ionela.voinescu, rostedt,
	mhiramat

Hi Lukasz,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rafael-pm/linux-next]
[also build test WARNING on rafael-pm/thermal linus/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lukasz-Luba/PM-EM-Refactor-em_cpufreq_update_efficiencies-arguments/20230512-180158
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link:    https://lore.kernel.org/r/20230512095743.3393563-9-lukasz.luba%40arm.com
patch subject: [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table
config: arm64-randconfig-s041-20230514 (https://download.01.org/0day-ci/archive/20230514/202305141200.aaTHzYOJ-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/d12d8d1010d7b093d6b64c204d77484d6fc268ab
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Lukasz-Luba/PM-EM-Refactor-em_cpufreq_update_efficiencies-arguments/20230512-180158
        git checkout d12d8d1010d7b093d6b64c204d77484d6fc268ab
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=arm64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=arm64 SHELL=/bin/bash kernel/power/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202305141200.aaTHzYOJ-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/power/energy_model.c:472:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct em_perf_table *tmp @@     got struct em_perf_table [noderef] __rcu *runtime_table @@
   kernel/power/energy_model.c:472:13: sparse:     expected struct em_perf_table *tmp
   kernel/power/energy_model.c:472:13: sparse:     got struct em_perf_table [noderef] __rcu *runtime_table

vim +472 kernel/power/energy_model.c

   444	
   445	/**
   446	 * em_dev_unregister_perf_domain() - Unregister Energy Model (EM) for a device
   447	 * @dev		: Device for which the EM is registered
   448	 *
   449	 * Unregister the EM for the specified @dev (but not a CPU device).
   450	 */
   451	void em_dev_unregister_perf_domain(struct device *dev)
   452	{
   453		struct em_perf_domain *pd;
   454		struct em_perf_table *tmp;
   455	
   456		if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
   457			return;
   458	
   459		if (_is_cpu_device(dev))
   460			return;
   461	
   462		pd = dev->em_pd;
   463		/*
   464		 * The mutex separates all register/unregister requests and protects
   465		 * from potential clean-up/setup issues in the debugfs directories.
   466		 * The debugfs directory name is the same as device's name.
   467		 */
   468		mutex_lock(&em_pd_mutex);
   469	
   470		em_debug_remove_pd(dev);
   471	
 > 472		tmp = pd->runtime_table;

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 00/17] Introduce runtime modifiable Energy Model
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (16 preceding siblings ...)
  2023-05-12  9:57 ` [PATCH v2 17/17] Documentation: EM: Describe the API of runtime modifications Lukasz Luba
@ 2023-05-24 17:25 ` Rafael J. Wysocki
  2023-07-03 11:08   ` Lukasz Luba
  2023-05-30 11:07 ` Dietmar Eggemann
  18 siblings, 1 reply; 43+ messages in thread
From: Rafael J. Wysocki @ 2023-05-24 17:25 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, Pierre.Gondois, ionela.voinescu, rostedt,
	mhiramat

Hi Lukasz,

On Fri, May 12, 2023 at 11:58 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi all,
>
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
>
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.
>
> The concepts:
> 1. The CPU power usage can vary due to the workload that it's running or due
> to the temperature of the SoC. The same workload can use more power when the
> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
> In such situation or EM can be adjusted and reflect the fact of increased
> power usage. That power increase is due to a factor called static power
> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
> They are also built differently with High Performance (HP) cells or
> Low Power (LP) cells. They are affected by the temperature increase
> differently: HP cells have bigger leakage. The SW model can leverage that
> knowledge.
> 2. It is also possible to change the EM to better reflect the currently
> running workload. Usually the EM is derived from some average power values
> taken from experiments with benchmark (e.g. Dhrystone). The model derived
> from such scenario might not represent properly the workloads usually running
> on the device. Therefore, runtime modification of the EM allows to switch to
> a different model, when there is a need.
> 3. The EM can be adjusted after boot, when all the modules are loaded and
> more information about the SoC is available e.g. chip binning. This would help
> to better reflect the silicon characteristics. Thus, this EM modification
> API allows it now. It wasn't possible in the past and the EM had to be
> 'set in stone'.
>
> Some design details:
> The internal mechanisms for the memory allocation are handled internally in the
> EM. Kernel modules can just call the new API to update the EM data and the
> new memory would be provided and owned by the EM. The EM memory is used by
> EAS, which impacts those design decisions. The EM writers are protected by
> a mutex. This new runtime modified EM table is protected using RCU mechanism,
> which fits the current EAS hot path (which already uses RCU read lock).
> The unregister API handles only non-CPU (e.g. GPU, ISP) devices and uses the
> same mutex as EM modifiers to make sure the memory is safely freed.
>
> More detailed explanation and background can be found in presentations
> during LPC2022 [1][2] or in the documentation patches.
>
> Changelog:
> v2:
> - solved build warning of unused variable in patch 13/17 when EM is
>   not compiled in, e.g. on Intel platform for this cpufreq_cooling
> - re-based on top of v6.4-rc1
> v1:
> - implementation can be found here [4]
>
> [1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
> [2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
> [3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
> [4] https://lore.kernel.org/lkml/20230314103357.26010-1-lukasz.luba@arm.com/
>
> Lukasz Luba (17):
>   PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
>   PM: EM: Find first CPU online while updating OPP efficiency
>   PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
>   PM: EM: Create a new function em_compute_costs()
>   trace: energy_model: Add trace event for EM runtime modifications
>   PM: EM: Add update_power() callback for runtime modifications
>   PM: EM: Check if the get_cost() callback is present in
>     em_compute_costs()
>   PM: EM: Introduce runtime modifiable table
>   PM: EM: Add RCU mechanism which safely cleans the old data
>   PM: EM: Add runtime update interface to modify EM power
>   PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
>   PM: EM: Add argument to get_cost() for runtime modification
>   PM: EM: Refactor struct em_perf_domain and add default_table
>   Documentation: EM: Add a new section about the design
>   Documentation: EM: Add a runtime modifiable EM design description
>   Documentation: EM: Add example with driver modifying the EM
>   Documentation: EM: Describe the API of runtime modifications

I haven't seen any responses from anyone having a vested interest in
the Energy Model code.

I'm not sure what this means, but I surely can't do much about it
myself without any input from the potentially interested parties.

Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 06/17] PM: EM: Add update_power() callback for runtime modifications
  2023-05-12  9:57 ` [PATCH v2 06/17] PM: EM: Add update_power() callback for " Lukasz Luba
@ 2023-05-30  9:31   ` Dietmar Eggemann
  2023-07-03 15:06     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30  9:31 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> The Energy Model (EM) is going to support runtime modifications. This
> new callback would be used in the upcoming EM changes. The drivers
> or frameworks which want to modify the EM have to implement the
> update_power() callback and provide it via EM API
> em_dev_update_perf_domain(). The callback is then used by the EM
> framework to get new power values for each frequency in existing EM.

Do we have any numbers or feedback that the chosen design (i.e. update
per performance state through update_power()) is performant enough for
the anticipated use case on real devices?

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 8069f526c9d8..cc2bf607191e 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -158,6 +158,26 @@ struct em_data_callback {
>  	 */
>  	int (*get_cost)(struct device *dev, unsigned long freq,
>  			unsigned long *cost);
> +
> +	/**
> +	 * update_power() - Provide new power at the given performance state of
> +	 *		a device
> +	 * @dev		: Device for which we do this operation (can be a CPU)
> +	 * @freq	: Frequency at the performance state in kHz
> +	 * @power	: New power value at the performance state
> +	 *		(modified)
> +	 * @priv	: Pointer to private data useful for tracking context
> +	 *		during run-time modifications of EM.
> +	 *
> +	 * The update_power() is used by run-time modifiable EM. It aims to
> +	 * provide updated power value for a given frequency, which is stored
> +	 * in the performance state. The power value provided by this callback
> +	 * should fit in the [0, EM_MAX_POWER] range.
> +	 *
> +	 * Return 0 on success, or appropriate error value in case of failure.
> +	 */
> +	int (*update_power)(struct device *dev, unsigned long freq,
> +			    unsigned long *power, void *priv);
>  };
>  #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) ((em_cb).active_power = cb)
>  #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb)	\
> @@ -165,6 +185,7 @@ struct em_data_callback {
>  	  .get_cost = _cost_cb }
>  #define EM_DATA_CB(_active_power_cb)			\
>  		EM_ADV_DATA_CB(_active_power_cb, NULL)
> +#define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb }
>  
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 04/17] PM: EM: Create a new function em_compute_costs()
  2023-05-12  9:57 ` [PATCH v2 04/17] PM: EM: Create a new function em_compute_costs() Lukasz Luba
@ 2023-05-30  9:51   ` Dietmar Eggemann
  2023-07-03 15:09     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30  9:51 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> Create a dedicated function which will be easier to maintain and re-use

I guess `refactor` would be more suitable than `create` here?

AFAICS, you factor out em_compute_costs() from em_create_perf_table() so
you can use it later for the updater of the runtime modifiable EM
em_dev_update_perf_domain() (in 10/17) as well.

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 12/17] PM: EM: Add argument to get_cost() for runtime modification
  2023-05-12  9:57 ` [PATCH v2 12/17] PM: EM: Add argument to get_cost() for runtime modification Lukasz Luba
@ 2023-05-30  9:53   ` Dietmar Eggemann
  2023-07-03 15:30     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30  9:53 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> The Energy Model (EM) supports runtime modifications. Let also the
> artificial EM use this new feature and allow to update the 'cost' values
> at runtime. When the artificial EM is used there is a need to provide
> two callbacks: get_cost() and update_power(), not only the last one.
> 
> Update also CPPC driver code, since the new argument is needed there
> to compile properly and register EM.

Is there a real use case behind this? It can't be mobile which IMHO
drivers the rest of the 'Runtime modifiable EM' feature.

Do we know of any machine using the artificial EM. And do they care
about EM matching workload or static power?

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 09/17] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-05-12  9:57 ` [PATCH v2 09/17] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
@ 2023-05-30 10:02   ` Dietmar Eggemann
  2023-07-03 15:49     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 10:02 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> The EM is going to support runtime modifications of the power data.
> In order to achieve that prepare the internal mechanism. This patch
> introduces RCU safe mechanism to clean up the old allocated EM data.

s/In order to achieve that prepare the internal mechanism. This patch
introduces/Introduce ... much shorter, same information.

[...]

> +static void em_perf_runtime_table_set(struct device *dev,
> +				      struct em_perf_table *runtime_table)
> +{
> +	struct em_perf_domain *pd = dev->em_pd;
> +	struct em_perf_table *tmp;
> +
> +	tmp = pd->runtime_table;
> +
> +	rcu_assign_pointer(pd->runtime_table, runtime_table);
> +
> +	em_cpufreq_update_efficiencies(dev, runtime_table->state);
> +
> +	if (trace_em_perf_state_enabled()) {
> +		unsigned long freq, power, cost, flags;
> +		int i;
> +
> +		for (i = 0; i < pd->nr_perf_states; i++) {
> +			freq = runtime_table->state[i].frequency;
> +			power = runtime_table->state[i].power;
> +			cost = runtime_table->state[i].cost;
> +			flags = runtime_table->state[i].flags;
> +
> +			trace_em_perf_state(dev_name(dev), pd->nr_perf_states,
> +					    i, freq, power, cost, flags);
> +		}
> +	}
> +
> +	/*
> +	 * Check if the 'state' array is not actually the one from setup.
> +	 * If it is then don't free it.
> +	 */
> +	if (tmp->state == pd->table)

It's unfortunate that you have the refactoring in 13/17 which would lead to:

 if (pd->runtime_table>state == pd->default_table->state)
         ^^^^^^^^^^^^^              ^^^^^^^^^^^^^

so people would immediately grasp one of the core concepts of the
design: non-modifiable default_table and modifiable runtime_table.

I still belief that it would be the better idea to do the refactoring first.

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 05/17] trace: energy_model: Add trace event for EM runtime modifications
  2023-05-12  9:57 ` [PATCH v2 05/17] trace: energy_model: Add trace event for EM runtime modifications Lukasz Luba
@ 2023-05-30 10:03   ` Dietmar Eggemann
  2023-07-03 15:53     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 10:03 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> The Energy Model (EM) supports runtime modifications. Track the changes
> in order to do post-processing analysis. Don't use arrays in the trace
> event, since they are not properly supported by the tools. Instead use
> simple "unroll" with emitting the trace event for each EM array entry
> with proper ID information. The older debugging mechanism which was
> the simple debugfs which dumping the EM content won't be sufficient for
> the modifiable EM purpose. This trace event mechanism would address the
> needs.

Do we really need a full trace_event for this? Can we not follow the
task scheduler rule which says no new trace_events and use a trace_point
here? The footprint in the kernel would be so much smaller.

E.g. pelt_cfs_tp

0 sched.h  694 DECLARE_TRACE(pelt_cfs_tp,
1 core.c   106 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_cfs_tp);
2 fair.c  3937 trace_pelt_cfs_tp(cfs_rq);

And then this patch should be after the section with the functional changes.

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table
  2023-05-12  9:57 ` [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table Lukasz Luba
  2023-05-14  4:28   ` kernel test robot
@ 2023-05-30 10:18   ` Dietmar Eggemann
  2023-07-03 15:58     ` Lukasz Luba
  1 sibling, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 10:18 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> This patch introduces the new feature: modifiable EM perf_state table.
> The new runtime table would be populated with a new power data to better
> reflect the actual power. The power can vary over time e.g. due to the
> SoC temperature change. Higher temperature can increase power values.
> For longer running scenarios, such as game or camera, when also other
> devices are used (e.g. GPU, ISP) the CPU power can change. The new
> EM framework is able to addresses this issue and change the data
> at runtime safely. The runtime modifiable EM data is used by the Energy
> Aware Scheduler (EAS) for the task placement.

It's important to say that EAS is the _only_user of the `runtime
modifiable EM`. All the other users (thermal, etc.) are still using the
default (basic) EM. IMHO, this fact drove the design here.

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h | 13 +++++++++++++
>  kernel/power/energy_model.c  | 24 ++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index cc2bf607191e..a616006a8130 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -36,9 +36,21 @@ struct em_perf_state {
>   */
>  #define EM_PERF_STATE_INEFFICIENT BIT(0)
>  
> +/**
> + * struct em_perf_table - Performance states table, which can be
> + *		runtime modifiable and protected with RCU

which is `runtime modifiable` ? So `runtime modifiable performance state
table`? RCU is obvious since we have `struct rcu_head rcu`.

> + * @state:	List of performance states, in ascending order
> + * @rcu:	RCU used for safe access and destruction
> + */
> +struct em_perf_table {
> +	struct em_perf_state *state;
> +	struct rcu_head rcu;
> +};
> +
>  /**
>   * struct em_perf_domain - Performance domain
>   * @table:		List of performance states, in ascending order
> + * @runtime_table:	Pointer to the runtime modified em_perf_table

s/modified/modifiable

[...]

> @@ -237,12 +238,23 @@ static int em_create_pd(struct device *dev, int nr_states,
>  			return -ENOMEM;
>  	}
>  
> +	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> +	if (!runtime_table) {
> +		kfree(pd);
> +		return -ENOMEM;
> +	}
> +
>  	ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
>  	if (ret) {
>  		kfree(pd);
> +		kfree(runtime_table);
>  		return ret;
>  	}
>  
> +	/* Re-use temporally (till 1st modification) the memory */

So this means that the runtime (modifiable) table
(pd->runtime_table>state) is mapped to the default (basic) table
(pd->default_table->state) until the first call to
em_dev_update_perf_domain() (here mentioned as the 1st modification)?

IMHO, not easy to understand since neither the cover letter, nor
documentation patch 15/17 describes this in a consistent story.

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 13/17] PM: EM: Refactor struct em_perf_domain and add default_table
  2023-05-12  9:57 ` [PATCH v2 13/17] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
@ 2023-05-30 10:23   ` Dietmar Eggemann
  2023-07-03 16:00     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 10:23 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> The Energy Model support runtime modifications. Refactor old
> implementation which accessed struct em_perf_state and introduce
> em_perf_domain::default_table to clean up the design. This new field
> is better aligned with em_perf_domain::runtime_table and helps to
> distinguish them better.
> 
> Update all drivers or frameworks which used the old field:
> em_perf_domain::table and now should use em_perf_domain::default_table.

I still believe that doing this refactoring before the introducation o
the new feature `runtime modifiable EM` would do wonders on the
tangibility of the other patches in this set.

[...]


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/17] Documentation: EM: Add a new section about the design
  2023-05-12  9:57 ` [PATCH v2 14/17] Documentation: EM: Add a new section about the design Lukasz Luba
@ 2023-05-30 10:33   ` Dietmar Eggemann
  2023-07-03 16:09     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 10:33 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> Add a new section 'Design' which covers the information about Energy
> Model. It contains the design decisions, describes models and how they
> reflect the reality. Add description of the basic const. EM. Change the
> other section IDs.

I would vote for coalescing the 4 doc patches into 1.

> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  Documentation/power/energy-model.rst | 36 +++++++++++++++++++++-------
>  1 file changed, 27 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
> index ef341be2882b..e97c7f18d8bd 100644
> --- a/Documentation/power/energy-model.rst
> +++ b/Documentation/power/energy-model.rst
> @@ -72,16 +72,34 @@ required to have the same micro-architecture. CPUs in different performance
>  domains can have different micro-architectures.
>  
>  
> -2. Core APIs
> +2. Design
> +-----------------
> +
> +2.1 Basic EM
> +^^^^^^^^^^^^

Can we make sure that people get the relation between 'basic' and
'default' easier?

It should be obvious that `pd->default_table->state` refers to this
`basic (const.) EM. Using the same identifier always helps in this
situation.

> +
> +The basic EM is built around const. power information for each performance
> +state. This model can be derived based on power measurements of the device
> +e.g. CPU while running some benchmark. The benchmark might be integer heavy
> +or floating point computation with a data set fitting into the CPU cache or
> +registers. Bare in mind that this model might not be covering all possible
> +workloads running on CPUs. Thus, please run a few different benchmarks and
> +verify with some real workloads your power model values. The power variation
> +due to the workload instruction mix and data set is not modeled.
> +Also static power which can change during runtime due to variation of SOC
> +temperature is not modeled in EM.
> +
> +
> +3. Core APIs
>  ------------
>  
> -2.1 Config options
> +3.1 Config options
>  ^^^^^^^^^^^^^^^^^^
>  
>  CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
>  
>  
> -2.2 Registration of performance domains
> +3.2 Registration of performance domains
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
>  Registration of 'advanced' EM
> @@ -110,8 +128,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel
>  subsystems which use EM might rely on this flag to check if all EM devices use
>  the same scale. If there are different scales, these subsystems might decide
>  to return warning/error, stop working or panic.
> -See Section 3. for an example of driver implementing this
> -callback, or Section 2.4 for further documentation on this API
> +See Section 4. for an example of driver implementing this
> +callback, or Section 3.4 for further documentation on this API
>  
>  Registration of EM using DT
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -156,7 +174,7 @@ The EM which is registered using this method might not reflect correctly the
>  physics of a real device, e.g. when static power (leakage) is important.
>  
>  
> -2.3 Accessing performance domains
> +3.3 Accessing performance domains
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
>  There are two API functions which provide the access to the energy model:
> @@ -175,10 +193,10 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
>  not provided for other type of devices.
>  
>  More details about the above APIs can be found in ``<linux/energy_model.h>``
> -or in Section 2.4
> +or in Section 3.4
>  
>  
> -2.4 Description details of this API
> +3.4 Description details of this API
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  .. kernel-doc:: include/linux/energy_model.h
>     :internal:
> @@ -187,7 +205,7 @@ or in Section 2.4
>     :export:
>  
>  
> -3. Example driver
> +4. Example driver
>  -----------------
>  
>  The CPUFreq framework supports dedicated callback for registering


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 15/17] Documentation: EM: Add a runtime modifiable EM design description
  2023-05-12  9:57 ` [PATCH v2 15/17] Documentation: EM: Add a runtime modifiable EM design description Lukasz Luba
@ 2023-05-30 10:42   ` Dietmar Eggemann
  2023-07-03 16:13     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 10:42 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> Document the new runtime modifiable EM design and how it can be used.
> Change the last section layout and allow to provide another example
> how to use this new API in a driver code.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  Documentation/power/energy-model.rst | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
> index e97c7f18d8bd..64c2462dc9a6 100644
> --- a/Documentation/power/energy-model.rst
> +++ b/Documentation/power/energy-model.rst
> @@ -89,6 +89,26 @@ due to the workload instruction mix and data set is not modeled.
>  Also static power which can change during runtime due to variation of SOC
>  temperature is not modeled in EM.
>  
> +2.2 Runtime modifiable EM
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +To better reflect power variation due to static power (leakage) the EM
> +supports runtime modifications of the power values. The mechanism relies on
> +RCU to free the modifiable EM perf_state table memory. Its user, the task
> +scheduler, also uses RCU to access this memory. The EM framework is
> +responsible for allocating the new memory for the modifiable EM perf_state
> +table. The old memory is freed automatically using RCU callback mechanism.
> +This design decision is made based on task scheduler using that data and
> +to prevent wrong usage of kernel modules if they would be responsible for the
> +memory management.
> +The drivers which want to modify the EM values are protected from concurrent
> +access using a mutex. Therefore, the drivers must use sleeping context when
> +they want to modify the EM. The runtime modifiable EM might also be used for
> +better reflecting real workload scenarios, e.g. when they pop-up on the screen
> +and will run for longer period, such as: games, video recoding or playing,
> +video calls, etc. It is up to the platform engineers to experiment and choose
> +the right approach for their device.

IMHO, there are a lot of design aspects missing here.

E.g.

Why 2 tables, modifiable (a) and default (b)?

Why does only EAS use (a)?

(a) and (b) being the same performance state table until first call to
modify (a) ()

> +
>  
>  3. Core APIs
>  ------------


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 03/17] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  2023-05-12  9:57 ` [PATCH v2 03/17] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
@ 2023-05-30 11:06   ` Dietmar Eggemann
  2023-07-03 16:22     ` Lukasz Luba
  0 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 11:06 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> Prepare em_pd_get_efficient_state() for the upcoming changes and
> make it possible to re-use. Return an index for the best performance

Don't get the `possible to re-use`? Did you mean `possible to be
re-used`? But then `re-used` for what?

> state. The function arguments that are introduced should allow to
> work on different performance state arrays. The caller of
> em_pd_get_efficient_state() should be able to use the index either
> on the default or the modifiable EM table.

This describes the WHAT but not the WHY.

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h | 30 +++++++++++++++++-------------
>  1 file changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index b9caa01dfac4..8069f526c9d8 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -175,33 +175,35 @@ void em_dev_unregister_perf_domain(struct device *dev);
>  
>  /**
>   * em_pd_get_efficient_state() - Get an efficient performance state from the EM
> - * @pd   : Performance domain for which we want an efficient frequency
> - * @freq : Frequency to map with the EM
> + * @state:		List of performance states, in ascending order
> + * @nr_perf_states:	Number of performance states
> + * @freq:		Frequency to map with the EM
> + * @pd_flags:		Performance Domain flags
>   *
>   * It is called from the scheduler code quite frequently and as a consequence
>   * doesn't implement any check.
>   *
> - * Return: An efficient performance state, high enough to meet @freq
> + * Return: An efficient performance state id, high enough to meet @freq
>   * requirement.
>   */
> -static inline
> -struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
> -						unsigned long freq)
> +static inline int
> +em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
> +			  unsigned long freq, unsigned long pd_flags)
>  {
>  	struct em_perf_state *ps;
>  	int i;
>  
> -	for (i = 0; i < pd->nr_perf_states; i++) {
> -		ps = &pd->table[i];
> +	for (i = 0; i < nr_perf_states; i++) {
> +		ps = &table[i];
>  		if (ps->frequency >= freq) {
> -			if (pd->flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
> +			if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
>  			    ps->flags & EM_PERF_STATE_INEFFICIENT)
>  				continue;
> -			break;
> +			return i;
>  		}
>  	}
>  
> -	return ps;
> +	return nr_perf_states - 1;
>  }
>  
>  /**
> @@ -226,7 +228,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  {
>  	unsigned long freq, scale_cpu;
>  	struct em_perf_state *ps;
> -	int cpu;
> +	int cpu, i;
>  
>  	if (!sum_util)
>  		return 0;
> @@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  	 * Find the lowest performance state of the Energy Model above the
>  	 * requested frequency.
>  	 */
> -	ps = em_pd_get_efficient_state(pd, freq);
> +	i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
> +				      pd->flags);
> +	ps = &pd->table[i];
>  
>  	/*
>  	 * The capacity of a CPU in the domain at the performance state (ps)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 00/17] Introduce runtime modifiable Energy Model
  2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (17 preceding siblings ...)
  2023-05-24 17:25 ` [PATCH v2 00/17] Introduce runtime modifiable Energy Model Rafael J. Wysocki
@ 2023-05-30 11:07 ` Dietmar Eggemann
  2023-07-03 16:35   ` Lukasz Luba
  18 siblings, 1 reply; 43+ messages in thread
From: Dietmar Eggemann @ 2023-05-30 11:07 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat

On 12/05/2023 11:57, Lukasz Luba wrote:
> Hi all,
> 
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
> 
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.

Why is the feature set called Dynamic Energy Model?

Dynamic Energy Model:

  Runtime modifiable EM

  Proper CPU performance state evaluation

  CPU idle wakeup costs

  CPU capacity as new EM data

Didn't this `Dynamic` stand for the modifiability of the EM only?

> The concepts:
> 1. The CPU power usage can vary due to the workload that it's running or due
> to the temperature of the SoC. The same workload can use more power when the
> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
> In such situation or EM can be adjusted and reflect the fact of increased
> power usage. That power increase is due to a factor called static power
> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
> They are also built differently with High Performance (HP) cells or
> Low Power (LP) cells. They are affected by the temperature increase
> differently: HP cells have bigger leakage. The SW model can leverage that
> knowledge.
> 2. It is also possible to change the EM to better reflect the currently
> running workload. Usually the EM is derived from some average power values
> taken from experiments with benchmark (e.g. Dhrystone). The model derived
> from such scenario might not represent properly the workloads usually running
> on the device. Therefore, runtime modification of the EM allows to switch to
> a different model, when there is a need.
> 3. The EM can be adjusted after boot, when all the modules are loaded and
> more information about the SoC is available e.g. chip binning. This would help
> to better reflect the silicon characteristics. Thus, this EM modification
> API allows it now. It wasn't possible in the past and the EM had to be
> 'set in stone'.
> 
> Some design details:
> The internal mechanisms for the memory allocation are handled internally in the 
> EM. Kernel modules can just call the new API to update the EM data and the 
> new memory would be provided and owned by the EM. The EM memory is used by
> EAS, which impacts those design decisions. The EM writers are protected by
> a mutex. This new runtime modified EM table is protected using RCU mechanism,
> which fits the current EAS hot path (which already uses RCU read lock).
> The unregister API handles only non-CPU (e.g. GPU, ISP) devices and uses the
> same mutex as EM modifiers to make sure the memory is safely freed.

This only mentions memory allocation and locking? A global design
overview, containing e.g.

  Why 2 tables, modifiable (a) and default (b)?

  Why does only EAS use (a)?

  (a) and (b) being the same performance state table until first call to
  modify (a) ()

as an introduction into the patches would be more helpful here.

> More detailed explanation and background can be found in presentations
> during LPC2022 [1][2] or in the documentation patches.

I checked 15/17 as well but could find any of this information there either.

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 00/17] Introduce runtime modifiable Energy Model
  2023-05-24 17:25 ` [PATCH v2 00/17] Introduce runtime modifiable Energy Model Rafael J. Wysocki
@ 2023-07-03 11:08   ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 11:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, Pierre.Gondois, ionela.voinescu, rostedt,
	mhiramat

Hi Rafael,


On 5/24/23 18:25, Rafael J. Wysocki wrote:
> Hi Lukasz,
> 
> On Fri, May 12, 2023 at 11:58 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM)
>> power values at runtime. It will allow to better reflect power model of
>> a recent SoCs and silicon. Different characteristics of the power usage
>> can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of feature set know as Dynamic Energy Model. It has been presented
>> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
>> improvement for the EM.
>>
>> The concepts:
>> 1. The CPU power usage can vary due to the workload that it's running or due
>> to the temperature of the SoC. The same workload can use more power when the
>> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
>> In such situation or EM can be adjusted and reflect the fact of increased
>> power usage. That power increase is due to a factor called static power
>> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
>> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
>> They are also built differently with High Performance (HP) cells or
>> Low Power (LP) cells. They are affected by the temperature increase
>> differently: HP cells have bigger leakage. The SW model can leverage that
>> knowledge.
>> 2. It is also possible to change the EM to better reflect the currently
>> running workload. Usually the EM is derived from some average power values
>> taken from experiments with benchmark (e.g. Dhrystone). The model derived
>> from such scenario might not represent properly the workloads usually running
>> on the device. Therefore, runtime modification of the EM allows to switch to
>> a different model, when there is a need.
>> 3. The EM can be adjusted after boot, when all the modules are loaded and
>> more information about the SoC is available e.g. chip binning. This would help
>> to better reflect the silicon characteristics. Thus, this EM modification
>> API allows it now. It wasn't possible in the past and the EM had to be
>> 'set in stone'.
>>
>> Some design details:
>> The internal mechanisms for the memory allocation are handled internally in the
>> EM. Kernel modules can just call the new API to update the EM data and the
>> new memory would be provided and owned by the EM. The EM memory is used by
>> EAS, which impacts those design decisions. The EM writers are protected by
>> a mutex. This new runtime modified EM table is protected using RCU mechanism,
>> which fits the current EAS hot path (which already uses RCU read lock).
>> The unregister API handles only non-CPU (e.g. GPU, ISP) devices and uses the
>> same mutex as EM modifiers to make sure the memory is safely freed.
>>
>> More detailed explanation and background can be found in presentations
>> during LPC2022 [1][2] or in the documentation patches.
>>
>> Changelog:
>> v2:
>> - solved build warning of unused variable in patch 13/17 when EM is
>>    not compiled in, e.g. on Intel platform for this cpufreq_cooling
>> - re-based on top of v6.4-rc1
>> v1:
>> - implementation can be found here [4]
>>
>> [1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
>> [2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
>> [3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
>> [4] https://lore.kernel.org/lkml/20230314103357.26010-1-lukasz.luba@arm.com/
>>
>> Lukasz Luba (17):
>>    PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
>>    PM: EM: Find first CPU online while updating OPP efficiency
>>    PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
>>    PM: EM: Create a new function em_compute_costs()
>>    trace: energy_model: Add trace event for EM runtime modifications
>>    PM: EM: Add update_power() callback for runtime modifications
>>    PM: EM: Check if the get_cost() callback is present in
>>      em_compute_costs()
>>    PM: EM: Introduce runtime modifiable table
>>    PM: EM: Add RCU mechanism which safely cleans the old data
>>    PM: EM: Add runtime update interface to modify EM power
>>    PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
>>    PM: EM: Add argument to get_cost() for runtime modification
>>    PM: EM: Refactor struct em_perf_domain and add default_table
>>    Documentation: EM: Add a new section about the design
>>    Documentation: EM: Add a runtime modifiable EM design description
>>    Documentation: EM: Add example with driver modifying the EM
>>    Documentation: EM: Describe the API of runtime modifications
> 
> I haven't seen any responses from anyone having a vested interest in
> the Energy Model code.
> 
> I'm not sure what this means, but I surely can't do much about it
> myself without any input from the potentially interested parties.

My apologies for the delay. Correct, it has been missing attention,
but now Dietmar is reviewing the stuff. He commented a few things
and I'm going to respond and address them in v3.

Thanks for having a look into this thread!

Lukasz

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 06/17] PM: EM: Add update_power() callback for runtime modifications
  2023-05-30  9:31   ` Dietmar Eggemann
@ 2023-07-03 15:06     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 15:06 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael

Hi Dietmar,

On 5/30/23 10:31, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> The Energy Model (EM) is going to support runtime modifications. This
>> new callback would be used in the upcoming EM changes. The drivers
>> or frameworks which want to modify the EM have to implement the
>> update_power() callback and provide it via EM API
>> em_dev_update_perf_domain(). The callback is then used by the EM
>> framework to get new power values for each frequency in existing EM.
> 
> Do we have any numbers or feedback that the chosen design (i.e. update
> per performance state through update_power()) is performant enough for
> the anticipated use case on real devices?
> 

Yes, we have. I have a testing kernel module which updates the EM
with queue_delayed_work() every 100ms. That update is for Little's EM
where we have 11 OPPs. We call the new callback for each OPP
in the em_dev_update_perf_domain(). I have measured that total function
time.

When we fix all CPUs freq to max freq on pixel6 and disable deep idle
states and leave only WFI, then we can run some tracing and capture the
results:

(The 4 CPUs from top are the little (1.8MHz), than 2 Mid (2.2GHz) and
then 2 big (2.8GHz))
------------------------------------
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   em_dev_update_perf_domain             3104    51236.39 us     16.506 
us       75.344 us
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   em_dev_update_perf_domain             1264    20768.15 us     16.430 
us       62.257 us
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   em_dev_update_perf_domain             1166    18632.95 us     15.980 
us       70.707 us
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   em_dev_update_perf_domain              770    12334.43 us     16.018 
us       66.337 us
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   em_dev_update_perf_domain              101    920.613 us      9.114 
us        21.380 us
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   em_dev_update_perf_domain               20    211.830 us      10.591 
us       23.998 us
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   Function                               Hit    Time            Avg 
         s^2
   --------                               ---    ----            --- 
         ---
   em_dev_update_perf_domain               15    78.085 us       5.205 
us        7.444 us

------------------------------------

As you can see in avg on Little CPUs it takes ~16us, on Mid ~10us and on
Big ~5us.

If such updating kernel module is implemented correctly, it would be
most often scheduled on the Littles as you can see based on 'Hit'
column.

Therefore, IMO this cost can be OK for the upstream. This EM runtime
change won't be triggered very often. If it would be e.g. every
100ms than the cost ~1.5us per 1 OPP is negligible.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 04/17] PM: EM: Create a new function em_compute_costs()
  2023-05-30  9:51   ` Dietmar Eggemann
@ 2023-07-03 15:09     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 15:09 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 10:51, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> Create a dedicated function which will be easier to maintain and re-use
> 
> I guess `refactor` would be more suitable than `create` here?

Good point, I'll rephrase that.

> 
> AFAICS, you factor out em_compute_costs() from em_create_perf_table() so
> you can use it later for the updater of the runtime modifiable EM
> em_dev_update_perf_domain() (in 10/17) as well.
> 
> [...]

yes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 12/17] PM: EM: Add argument to get_cost() for runtime modification
  2023-05-30  9:53   ` Dietmar Eggemann
@ 2023-07-03 15:30     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 15:30 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 10:53, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> The Energy Model (EM) supports runtime modifications. Let also the
>> artificial EM use this new feature and allow to update the 'cost' values
>> at runtime. When the artificial EM is used there is a need to provide
>> two callbacks: get_cost() and update_power(), not only the last one.
>>
>> Update also CPPC driver code, since the new argument is needed there
>> to compile properly and register EM.
> 
> Is there a real use case behind this? It can't be mobile which IMHO
> drivers the rest of the 'Runtime modifiable EM' feature.

Correct, CPPC+EM is not for mobile phones, but notebooks. For now
the notebooks are not tested with that feature and were not in
requirement scope.

> 
> Do we know of any machine using the artificial EM. And do they care
> about EM matching workload or static power?
> 
> [...]

For now we don't know about such notebook which uses CPPC + EM
and if it's a candidate for this feature.

I thought, since it's just one patch w/ small change, it would be
consistent to not 'forget' about that notebooks angle. I know
that there are some folks who 'run'/'are willing to run' Arm laptop with
CPPC w/ artificial EM. They might pick that feature as well.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 09/17] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-05-30 10:02   ` Dietmar Eggemann
@ 2023-07-03 15:49     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 15:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 11:02, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> The EM is going to support runtime modifications of the power data.
>> In order to achieve that prepare the internal mechanism. This patch
>> introduces RCU safe mechanism to clean up the old allocated EM data.
> 
> s/In order to achieve that prepare the internal mechanism. This patch
> introduces/Introduce ... much shorter, same information.

OK

> 
> [...]
> 
>> +static void em_perf_runtime_table_set(struct device *dev,
>> +				      struct em_perf_table *runtime_table)
>> +{
>> +	struct em_perf_domain *pd = dev->em_pd;
>> +	struct em_perf_table *tmp;
>> +
>> +	tmp = pd->runtime_table;
>> +
>> +	rcu_assign_pointer(pd->runtime_table, runtime_table);
>> +
>> +	em_cpufreq_update_efficiencies(dev, runtime_table->state);
>> +
>> +	if (trace_em_perf_state_enabled()) {
>> +		unsigned long freq, power, cost, flags;
>> +		int i;
>> +
>> +		for (i = 0; i < pd->nr_perf_states; i++) {
>> +			freq = runtime_table->state[i].frequency;
>> +			power = runtime_table->state[i].power;
>> +			cost = runtime_table->state[i].cost;
>> +			flags = runtime_table->state[i].flags;
>> +
>> +			trace_em_perf_state(dev_name(dev), pd->nr_perf_states,
>> +					    i, freq, power, cost, flags);
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Check if the 'state' array is not actually the one from setup.
>> +	 * If it is then don't free it.
>> +	 */
>> +	if (tmp->state == pd->table)
> 
> It's unfortunate that you have the refactoring in 13/17 which would lead to:
> 
>   if (pd->runtime_table>state == pd->default_table->state)
>           ^^^^^^^^^^^^^              ^^^^^^^^^^^^^
> 
> so people would immediately grasp one of the core concepts of the
> design: non-modifiable default_table and modifiable runtime_table.
> 
> I still belief that it would be the better idea to do the refactoring first.
> 
> [...]

That make sense. Let me try that approach as see how the patch set would
look like.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 05/17] trace: energy_model: Add trace event for EM runtime modifications
  2023-05-30 10:03   ` Dietmar Eggemann
@ 2023-07-03 15:53     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 15:53 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 11:03, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> The Energy Model (EM) supports runtime modifications. Track the changes
>> in order to do post-processing analysis. Don't use arrays in the trace
>> event, since they are not properly supported by the tools. Instead use
>> simple "unroll" with emitting the trace event for each EM array entry
>> with proper ID information. The older debugging mechanism which was
>> the simple debugfs which dumping the EM content won't be sufficient for
>> the modifiable EM purpose. This trace event mechanism would address the
>> needs.
> 
> Do we really need a full trace_event for this? Can we not follow the
> task scheduler rule which says no new trace_events and use a trace_point
> here? The footprint in the kernel would be so much smaller.
> 
> E.g. pelt_cfs_tp
> 
> 0 sched.h  694 DECLARE_TRACE(pelt_cfs_tp,
> 1 core.c   106 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_cfs_tp);
> 2 fair.c  3937 trace_pelt_cfs_tp(cfs_rq);
> 
> And then this patch should be after the section with the functional changes.
> 
> [...]

I agree. I will change that approach and create tracepoint. Also, I'll
move it to the patch at the end of the functional changes.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table
  2023-05-30 10:18   ` Dietmar Eggemann
@ 2023-07-03 15:58     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 15:58 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 11:18, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> This patch introduces the new feature: modifiable EM perf_state table.
>> The new runtime table would be populated with a new power data to better
>> reflect the actual power. The power can vary over time e.g. due to the
>> SoC temperature change. Higher temperature can increase power values.
>> For longer running scenarios, such as game or camera, when also other
>> devices are used (e.g. GPU, ISP) the CPU power can change. The new
>> EM framework is able to addresses this issue and change the data
>> at runtime safely. The runtime modifiable EM data is used by the Energy
>> Aware Scheduler (EAS) for the task placement.
> 
> It's important to say that EAS is the _only_user of the `runtime
> modifiable EM`. All the other users (thermal, etc.) are still using the
> default (basic) EM. IMHO, this fact drove the design here.

OK, I'll add that information in the header.

> 
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/linux/energy_model.h | 13 +++++++++++++
>>   kernel/power/energy_model.c  | 24 ++++++++++++++++++++++++
>>   2 files changed, 37 insertions(+)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index cc2bf607191e..a616006a8130 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -36,9 +36,21 @@ struct em_perf_state {
>>    */
>>   #define EM_PERF_STATE_INEFFICIENT BIT(0)
>>   
>> +/**
>> + * struct em_perf_table - Performance states table, which can be
>> + *		runtime modifiable and protected with RCU
> 
> which is `runtime modifiable` ? So `runtime modifiable performance state
> table`? RCU is obvious since we have `struct rcu_head rcu`.

Thanks, 'Runtime modifiable performance state table' sounds better.

> 
>> + * @state:	List of performance states, in ascending order
>> + * @rcu:	RCU used for safe access and destruction
>> + */
>> +struct em_perf_table {
>> +	struct em_perf_state *state;
>> +	struct rcu_head rcu;
>> +};
>> +
>>   /**
>>    * struct em_perf_domain - Performance domain
>>    * @table:		List of performance states, in ascending order
>> + * @runtime_table:	Pointer to the runtime modified em_perf_table
> 
> s/modified/modifiable
> 
> [...]
> 
>> @@ -237,12 +238,23 @@ static int em_create_pd(struct device *dev, int nr_states,
>>   			return -ENOMEM;
>>   	}
>>   
>> +	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>> +	if (!runtime_table) {
>> +		kfree(pd);
>> +		return -ENOMEM;
>> +	}
>> +
>>   	ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
>>   	if (ret) {
>>   		kfree(pd);
>> +		kfree(runtime_table);
>>   		return ret;
>>   	}
>>   
>> +	/* Re-use temporally (till 1st modification) the memory */
> 
> So this means that the runtime (modifiable) table
> (pd->runtime_table>state) is mapped to the default (basic) table
> (pd->default_table->state) until the first call to
> em_dev_update_perf_domain() (here mentioned as the 1st modification)?

correct

> 
> IMHO, not easy to understand since neither the cover letter, nor
> documentation patch 15/17 describes this in a consistent story.

I'll add that to the patch header and also to the documentation patch
which is later in the series.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 13/17] PM: EM: Refactor struct em_perf_domain and add default_table
  2023-05-30 10:23   ` Dietmar Eggemann
@ 2023-07-03 16:00     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 16:00 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 11:23, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> The Energy Model support runtime modifications. Refactor old
>> implementation which accessed struct em_perf_state and introduce
>> em_perf_domain::default_table to clean up the design. This new field
>> is better aligned with em_perf_domain::runtime_table and helps to
>> distinguish them better.
>>
>> Update all drivers or frameworks which used the old field:
>> em_perf_domain::table and now should use em_perf_domain::default_table.
> 
> I still believe that doing this refactoring before the introducation o
> the new feature `runtime modifiable EM` would do wonders on the
> tangibility of the other patches in this set.
> 
> [...]
> 

Let me move that refactor patch a bit earlier in the series...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/17] Documentation: EM: Add a new section about the design
  2023-05-30 10:33   ` Dietmar Eggemann
@ 2023-07-03 16:09     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 16:09 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 11:33, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> Add a new section 'Design' which covers the information about Energy
>> Model. It contains the design decisions, describes models and how they
>> reflect the reality. Add description of the basic const. EM. Change the
>> other section IDs.
> 
> I would vote for coalescing the 4 doc patches into 1.

OK, I can make that. I will be a big one patch, though.

> 
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   Documentation/power/energy-model.rst | 36 +++++++++++++++++++++-------
>>   1 file changed, 27 insertions(+), 9 deletions(-)
>>
>> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
>> index ef341be2882b..e97c7f18d8bd 100644
>> --- a/Documentation/power/energy-model.rst
>> +++ b/Documentation/power/energy-model.rst
>> @@ -72,16 +72,34 @@ required to have the same micro-architecture. CPUs in different performance
>>   domains can have different micro-architectures.
>>   
>>   
>> -2. Core APIs
>> +2. Design
>> +-----------------
>> +
>> +2.1 Basic EM
>> +^^^^^^^^^^^^
> 
> Can we make sure that people get the relation between 'basic' and
> 'default' easier?
> 
> It should be obvious that `pd->default_table->state` refers to this
> `basic (const.) EM. Using the same identifier always helps in this
> situation.

OK, how about adding this:
"The basic EM is built around const. power information for each
performance state, which is accessible in:
'dev->em_pd->default_table->state'. This model can be derived based..."

> 
>> +
>> +The basic EM is built around const. power information for each performance
>> +state. This model can be derived based on power measurements of the device
>> +e.g. CPU while running some benchmark. The benchmark might be integer heavy
>> +or floating point computation with a data set fitting into the CPU cache or
>> +registers. Bare in mind that this model might not be covering all possible
>> +workloads running on CPUs. Thus, please run a few different benchmarks and
>> +verify with some real workloads your power model values. The power variation
>> +due to the workload instruction mix and data set is not modeled.
>> +Also static power which can change during runtime due to variation of SOC
>> +temperature is not modeled in EM.
>> +
>> +

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 15/17] Documentation: EM: Add a runtime modifiable EM design description
  2023-05-30 10:42   ` Dietmar Eggemann
@ 2023-07-03 16:13     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 16:13 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 11:42, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> Document the new runtime modifiable EM design and how it can be used.
>> Change the last section layout and allow to provide another example
>> how to use this new API in a driver code.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   Documentation/power/energy-model.rst | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
>> index e97c7f18d8bd..64c2462dc9a6 100644
>> --- a/Documentation/power/energy-model.rst
>> +++ b/Documentation/power/energy-model.rst
>> @@ -89,6 +89,26 @@ due to the workload instruction mix and data set is not modeled.
>>   Also static power which can change during runtime due to variation of SOC
>>   temperature is not modeled in EM.
>>   
>> +2.2 Runtime modifiable EM
>> +^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +To better reflect power variation due to static power (leakage) the EM
>> +supports runtime modifications of the power values. The mechanism relies on
>> +RCU to free the modifiable EM perf_state table memory. Its user, the task
>> +scheduler, also uses RCU to access this memory. The EM framework is
>> +responsible for allocating the new memory for the modifiable EM perf_state
>> +table. The old memory is freed automatically using RCU callback mechanism.
>> +This design decision is made based on task scheduler using that data and
>> +to prevent wrong usage of kernel modules if they would be responsible for the
>> +memory management.
>> +The drivers which want to modify the EM values are protected from concurrent
>> +access using a mutex. Therefore, the drivers must use sleeping context when
>> +they want to modify the EM. The runtime modifiable EM might also be used for
>> +better reflecting real workload scenarios, e.g. when they pop-up on the screen
>> +and will run for longer period, such as: games, video recoding or playing,
>> +video calls, etc. It is up to the platform engineers to experiment and choose
>> +the right approach for their device.
> 
> IMHO, there are a lot of design aspects missing here.
> 
> E.g.
> 
> Why 2 tables, modifiable (a) and default (b)?
> 
> Why does only EAS use (a)?
> 
> (a) and (b) being the same performance state table until first call to
> modify (a) ()
> 

I'll add that explanation in the v3. I wanted to avoid such detailed
description about e.g. 'being the same performance state table until 
first call to > modify (a)' since it's a memory optimization
bit. Although, I will add that reason to the doc as well.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 03/17] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  2023-05-30 11:06   ` Dietmar Eggemann
@ 2023-07-03 16:22     ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 16:22 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 12:06, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> Prepare em_pd_get_efficient_state() for the upcoming changes and
>> make it possible to re-use. Return an index for the best performance
> 
> Don't get the `possible to re-use`? Did you mean `possible to be
> re-used`? But then `re-used` for what?

The function will no longer get a pointer to 'struct em_perf_domain'
but instead to 'struct em_perf_state'. It would also require to
get the number of states from 'pd->nr_perf_states'.

This is preparation for handling 2 tables:
modifiable (a) and default (b).

Then it also returns and ID not the pointer to state.
It all makes it more generic and ready for those 2 tables.

> 
>> state. The function arguments that are introduced should allow to
>> work on different performance state arrays. The caller of
>> em_pd_get_efficient_state() should be able to use the index either
>> on the default or the modifiable EM table.
> 
> This describes the WHAT but not the WHY.

I will add that description as 'why' in the header. I wanted to
avoid mentioning in the patch description something which
is coming in the next patch.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 00/17] Introduce runtime modifiable Energy Model
  2023-05-30 11:07 ` Dietmar Eggemann
@ 2023-07-03 16:35   ` Lukasz Luba
  0 siblings, 0 replies; 43+ messages in thread
From: Lukasz Luba @ 2023-07-03 16:35 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, Pierre.Gondois, ionela.voinescu,
	rostedt, mhiramat, linux-kernel, linux-pm, rafael



On 5/30/23 12:07, Dietmar Eggemann wrote:
> On 12/05/2023 11:57, Lukasz Luba wrote:
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM)
>> power values at runtime. It will allow to better reflect power model of
>> a recent SoCs and silicon. Different characteristics of the power usage
>> can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of feature set know as Dynamic Energy Model. It has been presented
>> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
>> improvement for the EM.
> 
> Why is the feature set called Dynamic Energy Model?
> 
> Dynamic Energy Model:
> 
>    Runtime modifiable EM
> 
>    Proper CPU performance state evaluation
> 
>    CPU idle wakeup costs
> 
>    CPU capacity as new EM data
> 
> Didn't this `Dynamic` stand for the modifiability of the EM only?

The 'modifiability' is the main feature, but not the last one.

The 2nd: 'Proper CPU performance state evaluation' is also
a 'dynamic' thing, which is related to the currently set CPU
frequency (for Mid or Big) and the dependent Little's and L3
cache frequency. As you know that frequency can change in time,
so it's 'dynamic' situation and will lands to the 'Dynamic EM'.

(The 2 below need more thinking and experiments)
The 3rd can be also quite dynamic. You might change the wake-up cost
for in the EM if you want to avoid waking up big cores in some
workloads. You might pay penalty for bigger latency, because tasks
would be more queued on Mids/Littles, but that's for power
saving scenario.

The 4th is about CPU capacity. We still have to conduct more
investigations, but it might be useful to change the EM
and provide a new CPU capacity from it to the OS. In the 3-gear
SoC the CPUs might have quite different capacity in different
scenarios (workloads). This could be also a 'dynamic' thing,
e.g. triggered by middle-ware for a long running video call.

> 
>> The concepts:
>> 1. The CPU power usage can vary due to the workload that it's running or due
>> to the temperature of the SoC. The same workload can use more power when the
>> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
>> In such situation or EM can be adjusted and reflect the fact of increased
>> power usage. That power increase is due to a factor called static power
>> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
>> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
>> They are also built differently with High Performance (HP) cells or
>> Low Power (LP) cells. They are affected by the temperature increase
>> differently: HP cells have bigger leakage. The SW model can leverage that
>> knowledge.
>> 2. It is also possible to change the EM to better reflect the currently
>> running workload. Usually the EM is derived from some average power values
>> taken from experiments with benchmark (e.g. Dhrystone). The model derived
>> from such scenario might not represent properly the workloads usually running
>> on the device. Therefore, runtime modification of the EM allows to switch to
>> a different model, when there is a need.
>> 3. The EM can be adjusted after boot, when all the modules are loaded and
>> more information about the SoC is available e.g. chip binning. This would help
>> to better reflect the silicon characteristics. Thus, this EM modification
>> API allows it now. It wasn't possible in the past and the EM had to be
>> 'set in stone'.
>>
>> Some design details:
>> The internal mechanisms for the memory allocation are handled internally in the
>> EM. Kernel modules can just call the new API to update the EM data and the
>> new memory would be provided and owned by the EM. The EM memory is used by
>> EAS, which impacts those design decisions. The EM writers are protected by
>> a mutex. This new runtime modified EM table is protected using RCU mechanism,
>> which fits the current EAS hot path (which already uses RCU read lock).
>> The unregister API handles only non-CPU (e.g. GPU, ISP) devices and uses the
>> same mutex as EM modifiers to make sure the memory is safely freed.
> 
> This only mentions memory allocation and locking? A global design
> overview, containing e.g.
> 
>    Why 2 tables, modifiable (a) and default (b)?
> 
>    Why does only EAS use (a)?
> 
>    (a) and (b) being the same performance state table until first call to
>    modify (a) ()
> 
> as an introduction into the patches would be more helpful here.
> 
>> More detailed explanation and background can be found in presentations
>> during LPC2022 [1][2] or in the documentation patches.
> 
> I checked 15/17 as well but could find any of this information there either.
> 
> [...]

For the above two comments: Yes, I see your point and will address that
in the next v3.

Thank you fro your valuable comments and time for reviewing it!

Lukasz

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2023-07-03 16:35 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-12  9:57 [PATCH v2 00/17] Introduce runtime modifiable Energy Model Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 01/17] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 02/17] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 03/17] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
2023-05-30 11:06   ` Dietmar Eggemann
2023-07-03 16:22     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 04/17] PM: EM: Create a new function em_compute_costs() Lukasz Luba
2023-05-30  9:51   ` Dietmar Eggemann
2023-07-03 15:09     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 05/17] trace: energy_model: Add trace event for EM runtime modifications Lukasz Luba
2023-05-30 10:03   ` Dietmar Eggemann
2023-07-03 15:53     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 06/17] PM: EM: Add update_power() callback for " Lukasz Luba
2023-05-30  9:31   ` Dietmar Eggemann
2023-07-03 15:06     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 07/17] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 08/17] PM: EM: Introduce runtime modifiable table Lukasz Luba
2023-05-14  4:28   ` kernel test robot
2023-05-30 10:18   ` Dietmar Eggemann
2023-07-03 15:58     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 09/17] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
2023-05-30 10:02   ` Dietmar Eggemann
2023-07-03 15:49     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 10/17] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 11/17] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 12/17] PM: EM: Add argument to get_cost() for runtime modification Lukasz Luba
2023-05-30  9:53   ` Dietmar Eggemann
2023-07-03 15:30     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 13/17] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
2023-05-30 10:23   ` Dietmar Eggemann
2023-07-03 16:00     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 14/17] Documentation: EM: Add a new section about the design Lukasz Luba
2023-05-30 10:33   ` Dietmar Eggemann
2023-07-03 16:09     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 15/17] Documentation: EM: Add a runtime modifiable EM design description Lukasz Luba
2023-05-30 10:42   ` Dietmar Eggemann
2023-07-03 16:13     ` Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 16/17] Documentation: EM: Add example with driver modifying the EM Lukasz Luba
2023-05-12  9:57 ` [PATCH v2 17/17] Documentation: EM: Describe the API of runtime modifications Lukasz Luba
2023-05-24 17:25 ` [PATCH v2 00/17] Introduce runtime modifiable Energy Model Rafael J. Wysocki
2023-07-03 11:08   ` Lukasz Luba
2023-05-30 11:07 ` Dietmar Eggemann
2023-07-03 16:35   ` Lukasz Luba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).