* [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
@ 2025-04-16 17:44 Rafael J. Wysocki
2025-04-16 17:48 ` [RFT][PATCH v1 1/8] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
` (9 more replies)
0 siblings, 10 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 17:44 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle
Hi Everyone,
This is a new version of
https://lore.kernel.org/linux-pm/22640172.EfDdHjke4D@rjwysocki.net/
which is not regarded as RFC any more. It appears to be better than
https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
but still requires more testing, so I'd appreciate any help here.
The following paragraph from the original cover letter still applies:
"The underlying observation is that on the platforms targeted by these changes,
Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
the same performance level, are always more energy-efficient than the "big" or
"performance" CPUs (P-cores). This means that, regardless of the scale-
invariant utilization of a task, as long as there is enough spare capacity on
E-cores, the relative cost of running it there is always lower."
The first 3 patches have been updated since v0.3 and they now depend on the new
cpufreq material in linux-next.
The next 2 patches (Energy Model code changes) have been reviewed in the
meantime, but they are only needed for the last 3 patches.
Patch [6/8] is essentially the same as before. It causes perf domains to be
registered per CPU. In addition to the primary cost component, which is
related to the CPU type, there is a small component proportional to performance
whose role is to help balance the load between CPUs of the same type.
This is done to avoid migrating tasks too much between CPUs of the same type,
especially between E-cores, which has been observed in tests of
https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
The expected effect is still that the CPUs of the "low-cost" type will be
preferred so long as there is enough spare capacity on any of them.
The last 2 patches are new.
Patch [7/8] looks at the cache topology to avoid creating per-CPU perf domains
for CPUs sharing an L2 cache. Typically, on the chips that will be affected
by this patch, CPUs sharing an L2 cache also share a voltage regulator and a
clock, so they technically belong to the same OPP domain and they will be put
into a shared perf domain after this patch.
Patch [8/8] makes CPUs sharing the L3 cache look slightly more expensive to
cause the scheduler to prefer placing tasks on CPUs that don't use the L3,
which in some cases should allow all of the CPUs sharing the L3 to stay in
idle states, reducing energy usage.
Please refer to the individual patch changelogs for details.
Since patches [7-8/8] also apply on top of v0.3, I have created a git branch at
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
experimental/intel_pstate/eas-take2-extended
or
https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended
to allow the difference they make with respect to v0.3 to be seen (if any).
Later, I'm going to put this series as a whole into a new git branch on top of
the mainline and the cpufreq material queued up for 6.16.
Thanks!
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFT][PATCH v1 1/8] cpufreq/sched: schedutil: Add helper for governor checks
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
@ 2025-04-16 17:48 ` Rafael J. Wysocki
2025-04-17 12:23 ` Christian Loehle
2025-04-16 17:59 ` [RFT][PATCH v1 2/8] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq Rafael J. Wysocki
` (8 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 17:48 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a helper for checking if schedutil is the current governor for
a given cpufreq policy and use it in sched_is_eas_possible() to avoid
accessing cpufreq policy internals directly from there.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
v0.3 -> v1
* Change the name of the new function to sugov_is_governor().
This patch is regarded as a cleanup for 6.16.
---
include/linux/cpufreq.h | 9 +++++++++
kernel/sched/cpufreq_schedutil.c | 9 +++++++--
kernel/sched/sched.h | 2 --
kernel/sched/topology.c | 6 +++---
4 files changed, 19 insertions(+), 7 deletions(-)
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -650,6 +650,15 @@
struct cpufreq_governor *cpufreq_default_governor(void);
struct cpufreq_governor *cpufreq_fallback_governor(void);
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+bool sugov_is_governor(struct cpufreq_policy *policy);
+#else
+static inline bool sugov_is_governor(struct cpufreq_policy *policy)
+{
+ return false;
+}
+#endif
+
static inline void cpufreq_policy_apply_limits(struct cpufreq_policy *policy)
{
if (policy->max < policy->cur)
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -630,7 +630,7 @@
/********************** cpufreq governor interface *********************/
-struct cpufreq_governor schedutil_gov;
+static struct cpufreq_governor schedutil_gov;
static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
{
@@ -909,7 +909,7 @@
WRITE_ONCE(sg_policy->limits_changed, true);
}
-struct cpufreq_governor schedutil_gov = {
+static struct cpufreq_governor schedutil_gov = {
.name = "schedutil",
.owner = THIS_MODULE,
.flags = CPUFREQ_GOV_DYNAMIC_SWITCHING,
@@ -927,4 +927,9 @@
}
#endif
+bool sugov_is_governor(struct cpufreq_policy *policy)
+{
+ return policy->governor == &schedutil_gov;
+}
+
cpufreq_governor_init(schedutil_gov);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3509,8 +3509,6 @@
return static_branch_unlikely(&sched_energy_present);
}
-extern struct cpufreq_governor schedutil_gov;
-
#else /* ! (CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
#define perf_domain_span(pd) NULL
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -213,7 +213,7 @@
{
bool any_asym_capacity = false;
struct cpufreq_policy *policy;
- struct cpufreq_governor *gov;
+ bool policy_is_ready;
int i;
/* EAS is enabled for asymmetric CPU capacity topologies. */
@@ -258,9 +258,9 @@
}
return false;
}
- gov = policy->governor;
+ policy_is_ready = sugov_is_governor(policy);
cpufreq_cpu_put(policy);
- if (gov != &schedutil_gov) {
+ if (!policy_is_ready) {
if (sched_debug()) {
pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
cpumask_pr_args(cpu_mask));
* [RFT][PATCH v1 2/8] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
2025-04-16 17:48 ` [RFT][PATCH v1 1/8] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
@ 2025-04-16 17:59 ` Rafael J. Wysocki
2025-04-17 12:28 ` Christian Loehle
2025-04-16 18:01 ` [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS Rafael J. Wysocki
` (7 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 17:59 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Doing cpufreq-specific EAS checks that require accessing policy
internals directly from sched_is_eas_possible() is a bit unfortunate,
so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
into that new function and make sched_is_eas_possible() call it.
While at it, address a possible race between the EAS governor check
and governor change by doing the former under the policy rwsem.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
v0.3 -> v1
* Add a new helper called cpufreq_policy_is_good_for_eas() which is
properly synchronized with governor changes.
* Slightly modify debug messages.
This patch is regarded as a cleanup for 6.16.
---
drivers/cpufreq/cpufreq.c | 32 ++++++++++++++++++++++++++++++++
include/linux/cpufreq.h | 2 ++
kernel/sched/topology.c | 25 +++++--------------------
3 files changed, 39 insertions(+), 20 deletions(-)
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -3041,6 +3041,38 @@
return 0;
}
+
+static bool cpufreq_policy_is_good_for_eas(unsigned int cpu)
+{
+ struct cpufreq_policy *policy __free(put_cpufreq_policy);
+
+ policy = cpufreq_cpu_get(cpu);
+ if (!policy) {
+ pr_debug("cpufreq policy not set for CPU: %d\n", cpu);
+ return false;
+ }
+
+ guard(cpufreq_policy_read)(policy);
+
+ return sugov_is_governor(policy);
+}
+
+bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
+{
+ unsigned int cpu;
+
+ /* Do not attempt EAS if schedutil is not being used. */
+ for_each_cpu(cpu, cpu_mask) {
+ if (!cpufreq_policy_is_good_for_eas(cpu)) {
+ pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
+ cpumask_pr_args(cpu_mask));
+ return false;
+ }
+ }
+
+ return true;
+}
+
module_param(off, int, 0444);
module_param_string(default_governor, default_governor, CPUFREQ_NAME_LEN, 0444);
core_initcall(cpufreq_core_init);
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1212,6 +1212,8 @@
struct cpufreq_frequency_table *table,
unsigned int transition_latency);
+bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask);
+
static inline void cpufreq_register_em_with_opp(struct cpufreq_policy *policy)
{
dev_pm_opp_of_register_em(get_cpu_device(policy->cpu),
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -212,8 +212,6 @@
static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
{
bool any_asym_capacity = false;
- struct cpufreq_policy *policy;
- bool policy_is_ready;
int i;
/* EAS is enabled for asymmetric CPU capacity topologies. */
@@ -248,25 +246,12 @@
return false;
}
- /* Do not attempt EAS if schedutil is not being used. */
- for_each_cpu(i, cpu_mask) {
- policy = cpufreq_cpu_get(i);
- if (!policy) {
- if (sched_debug()) {
- pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
- cpumask_pr_args(cpu_mask), i);
- }
- return false;
- }
- policy_is_ready = sugov_is_governor(policy);
- cpufreq_cpu_put(policy);
- if (!policy_is_ready) {
- if (sched_debug()) {
- pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
- cpumask_pr_args(cpu_mask));
- }
- return false;
+ if (!cpufreq_ready_for_eas(cpu_mask)) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: Checking EAS: cpufreq is not ready\n",
+ cpumask_pr_args(cpu_mask));
}
+ return false;
}
return true;
* [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
2025-04-16 17:48 ` [RFT][PATCH v1 1/8] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
2025-04-16 17:59 ` [RFT][PATCH v1 2/8] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq Rafael J. Wysocki
@ 2025-04-16 18:01 ` Rafael J. Wysocki
2025-04-17 12:19 ` Christian Loehle
2025-04-16 18:04 ` [RFT][PATCH v1 4/8] PM: EM: Move CPU capacity check to em_adjust_new_capacity() Rafael J. Wysocki
` (6 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 18:01 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some cpufreq drivers, like intel_pstate, have built-in governors that
are used instead of regular cpufreq governors, schedutil in particular,
but they can work with EAS just fine, so allow EAS to be used with
those drivers.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
v0.3 -> v1
* Rebase on top of the new [1-2/8].
* Update the diagnostic message printed if the conditions are not met.
This patch is regarded as a cleanup for 6.16.
---
drivers/cpufreq/cpufreq.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -3054,7 +3054,16 @@
guard(cpufreq_policy_read)(policy);
- return sugov_is_governor(policy);
+ /*
+ * For EAS compatibility, require that either schedutil is the policy
+ * governor or the policy is governed directly by the cpufreq driver.
+ *
+ * In the latter case, it is assumed that EAS can only be enabled by the
+ * cpufreq driver itself, which will not enable EAS if the driver does
+ * not meet the EAS expectations regarding performance scaling response.
+ */
+ return sugov_is_governor(policy) || (!policy->governor &&
+ policy->policy != CPUFREQ_POLICY_UNKNOWN);
}
bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
@@ -3064,7 +3073,7 @@
/* Do not attempt EAS if schedutil is not being used. */
for_each_cpu(cpu, cpu_mask) {
if (!cpufreq_policy_is_good_for_eas(cpu)) {
- pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
+ pr_debug("rd %*pbl: EAS requirements not met\n",
cpumask_pr_args(cpu_mask));
return false;
}
* [RFT][PATCH v1 4/8] PM: EM: Move CPU capacity check to em_adjust_new_capacity()
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
` (2 preceding siblings ...)
2025-04-16 18:01 ` [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS Rafael J. Wysocki
@ 2025-04-16 18:04 ` Rafael J. Wysocki
2025-04-16 18:06 ` [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity() Rafael J. Wysocki
` (5 subsequent siblings)
9 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 18:04 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the check of the CPU capacity currently stored in the energy model
against the arch_scale_cpu_capacity() value to em_adjust_new_capacity()
so it will be done regardless of where the latter is called from.
This will be useful when a new em_adjust_new_capacity() caller is added
subsequently.
While at it, move the pd local variable declaration in
em_check_capacity_update() into the loop in which it is used.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
---
v0.3 -> v1:
* Added R-by from Lukasz.
---
kernel/power/energy_model.c | 40 +++++++++++++++++-----------------------
1 file changed, 17 insertions(+), 23 deletions(-)
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -721,10 +721,24 @@
* Adjustment of CPU performance values after boot, when all CPUs' capacities
* are correctly calculated.
*/
-static void em_adjust_new_capacity(struct device *dev,
+static void em_adjust_new_capacity(unsigned int cpu, struct device *dev,
struct em_perf_domain *pd)
{
+ unsigned long cpu_capacity = arch_scale_cpu_capacity(cpu);
struct em_perf_table *em_table;
+ struct em_perf_state *table;
+ unsigned long em_max_perf;
+
+ rcu_read_lock();
+ table = em_perf_state_from_pd(pd);
+ em_max_perf = table[pd->nr_perf_states - 1].performance;
+ rcu_read_unlock();
+
+ if (em_max_perf == cpu_capacity)
+ return;
+
+ pr_debug("updating cpu%d cpu_cap=%lu old capacity=%lu\n", cpu,
+ cpu_capacity, em_max_perf);
em_table = em_table_dup(pd);
if (!em_table) {
@@ -740,9 +754,6 @@
static void em_check_capacity_update(void)
{
cpumask_var_t cpu_done_mask;
- struct em_perf_state *table;
- struct em_perf_domain *pd;
- unsigned long cpu_capacity;
int cpu;
if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
@@ -753,7 +764,7 @@
/* Check if CPU capacity has changed, then update the EM */
for_each_possible_cpu(cpu) {
struct cpufreq_policy *policy;
- unsigned long em_max_perf;
+ struct em_perf_domain *pd;
struct device *dev;
if (cpumask_test_cpu(cpu, cpu_done_mask))
@@ -776,24 +787,7 @@
cpumask_or(cpu_done_mask, cpu_done_mask,
em_span_cpus(pd));
- cpu_capacity = arch_scale_cpu_capacity(cpu);
-
- rcu_read_lock();
- table = em_perf_state_from_pd(pd);
- em_max_perf = table[pd->nr_perf_states - 1].performance;
- rcu_read_unlock();
-
- /*
- * Check if the CPU capacity has been adjusted during boot
- * and trigger the update for new performance values.
- */
- if (em_max_perf == cpu_capacity)
- continue;
-
- pr_debug("updating cpu%d cpu_cap=%lu old capacity=%lu\n",
- cpu, cpu_capacity, em_max_perf);
-
- em_adjust_new_capacity(dev, pd);
+ em_adjust_new_capacity(cpu, dev, pd);
}
free_cpumask_var(cpu_done_mask);
* [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity()
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
` (3 preceding siblings ...)
2025-04-16 18:04 ` [RFT][PATCH v1 4/8] PM: EM: Move CPU capacity check to em_adjust_new_capacity() Rafael J. Wysocki
@ 2025-04-16 18:06 ` Rafael J. Wysocki
2025-04-27 14:01 ` Dietmar Eggemann
2025-04-16 18:09 ` [RFT][PATCH v1 6/8] cpufreq: intel_pstate: EAS support for hybrid platforms Rafael J. Wysocki
` (4 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 18:06 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a function for updating the Energy Model for a CPU after its
capacity has changed, which subsequently will be used by the
intel_pstate driver.
An EM_PERF_DOMAIN_ARTIFICIAL check is added to em_adjust_new_capacity()
to prevent it from calling em_compute_costs() for an "artificial" perf
domain with a NULL cb parameter, which would cause it to crash.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
---
v0.3 -> v1:
* Added R-by from Lukasz.
---
include/linux/energy_model.h | 2 ++
kernel/power/energy_model.c | 28 ++++++++++++++++++++++++----
2 files changed, 26 insertions(+), 4 deletions(-)
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -179,6 +179,7 @@
int em_dev_update_chip_binning(struct device *dev);
int em_update_performance_limits(struct em_perf_domain *pd,
unsigned long freq_min_khz, unsigned long freq_max_khz);
+void em_adjust_cpu_capacity(unsigned int cpu);
void em_rebuild_sched_domains(void);
/**
@@ -405,6 +406,7 @@
{
return -EINVAL;
}
+static inline void em_adjust_cpu_capacity(unsigned int cpu) {}
static inline void em_rebuild_sched_domains(void) {}
#endif
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -698,10 +698,12 @@
{
int ret;
- ret = em_compute_costs(dev, em_table->state, NULL, pd->nr_perf_states,
- pd->flags);
- if (ret)
- goto free_em_table;
+ if (!(pd->flags & EM_PERF_DOMAIN_ARTIFICIAL)) {
+ ret = em_compute_costs(dev, em_table->state, NULL,
+ pd->nr_perf_states, pd->flags);
+ if (ret)
+ goto free_em_table;
+ }
ret = em_dev_update_perf_domain(dev, em_table);
if (ret)
@@ -751,6 +753,24 @@
em_recalc_and_update(dev, pd, em_table);
}
+/**
+ * em_adjust_cpu_capacity() - Adjust the EM for a CPU after a capacity update.
+ * @cpu: Target CPU.
+ *
+ * Adjust the existing EM for @cpu after a capacity update under the assumption
+ * that the capacity has been updated in the same way for all of the CPUs in
+ * the same perf domain.
+ */
+void em_adjust_cpu_capacity(unsigned int cpu)
+{
+ struct device *dev = get_cpu_device(cpu);
+ struct em_perf_domain *pd;
+
+ pd = em_pd_get(dev);
+ if (pd)
+ em_adjust_new_capacity(cpu, dev, pd);
+}
+
static void em_check_capacity_update(void)
{
cpumask_var_t cpu_done_mask;
* [RFT][PATCH v1 6/8] cpufreq: intel_pstate: EAS support for hybrid platforms
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
` (4 preceding siblings ...)
2025-04-16 18:06 ` [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity() Rafael J. Wysocki
@ 2025-04-16 18:09 ` Rafael J. Wysocki
2025-04-16 18:10 ` [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache Rafael J. Wysocki
` (3 subsequent siblings)
9 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 18:09 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle, Tim Chen
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Modify intel_pstate to register EM perf domains for CPUs on hybrid
platforms and enable EAS on them.
This change is targeting platforms (for example, Lunar Lake) where the
"little" CPUs (E-cores) are always more energy-efficient than the "big"
or "performance" CPUs (P-cores) when run at the same HWP performance
level, so it is sufficient to tell EAS that E-cores are always preferred
(so long as there is enough spare capacity on one of them to run the
given task). However, migrating tasks between CPUs of the same type
too often is not desirable because it may hurt both performance and
energy efficiency due to leaving warm caches behind.
For this reason, register a separate perf domain for each CPU and assign
costs for them so that the cost mostly depends on the CPU type, but
there is also a small component of it depending on the performance
level (utilization), which helps to avoid substantial load imbalances
between CPUs of the same type.
The observation used here is that the IPC metric value for a given CPU
is inversely proportional to its performance-to-frequency scaling factor,
and the cost of running on that CPU can be assumed to be roughly
proportional to that IPC ratio (in principle, the higher the IPC ratio,
the more resources are utilized when running at a given frequency, so the
cost should be higher). This main component of the cost is amended with a
small addition proportional to performance.
EM perf domains for all CPUs that are online during system startup are
registered at the driver initialization time, after asymmetric capacity
support has been enabled. For the CPUs that become online later, EM
perf domains are registered after setting the asymmetric capacity for
them.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
v0.3 -> v1:
* Relocated INTEL_PSTATE_CORE_SCALING define (Tim).
* Rephrased a comment regarding EAS disabling (Tim).
---
drivers/cpufreq/intel_pstate.c | 131 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 126 insertions(+), 5 deletions(-)
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -221,6 +221,7 @@
* @sched_flags: Store scheduler flags for possible cross CPU update
* @hwp_boost_min: Last HWP boosted min performance
* @suspended: Whether or not the driver has been suspended.
+ * @em_registered: If set, an energy model has been registered.
* @hwp_notify_work: workqueue for HWP notifications.
*
* This structure stores per CPU instance data for all CPUs.
@@ -260,6 +261,9 @@
unsigned int sched_flags;
u32 hwp_boost_min;
bool suspended;
+#ifdef CONFIG_ENERGY_MODEL
+ bool em_registered;
+#endif
struct delayed_work hwp_notify_work;
};
@@ -303,6 +307,7 @@
static struct cpufreq_driver *intel_pstate_driver __read_mostly;
+#define INTEL_PSTATE_CORE_SCALING 100000
#define HYBRID_SCALING_FACTOR_ADL 78741
#define HYBRID_SCALING_FACTOR_MTL 80000
#define HYBRID_SCALING_FACTOR_LNL 86957
@@ -311,7 +316,7 @@
static inline int core_get_scaling(void)
{
- return 100000;
+ return INTEL_PSTATE_CORE_SCALING;
}
#ifdef CONFIG_ACPI
@@ -945,12 +950,105 @@
*/
static DEFINE_MUTEX(hybrid_capacity_lock);
+#ifdef CONFIG_ENERGY_MODEL
+#define HYBRID_EM_STATE_COUNT 4
+
+static int hybrid_active_power(struct device *dev, unsigned long *power,
+ unsigned long *freq)
+{
+ /*
+ * Create "utilization bins" of 0-40%, 40%-60%, 60%-80%, and 80%-100%
+ * of the maximum capacity such that two CPUs of the same type will be
+ * regarded as equally attractive if the utilization of each of them
+ * falls into the same bin, which should prevent tasks from being
+ * migrated between them too often.
+ *
+ * For this purpose, return the "frequency" of 2 for the first
+ * performance level and otherwise leave the value set by the caller.
+ */
+ if (!*freq)
+ *freq = 2;
+
+ /* No power information. */
+ *power = EM_MAX_POWER;
+
+ return 0;
+}
+
+static int hybrid_get_cost(struct device *dev, unsigned long freq,
+ unsigned long *cost)
+{
+ struct pstate_data *pstate = &all_cpu_data[dev->id]->pstate;
+
+ /*
+ * The smaller the perf-to-frequency scaling factor, the larger the IPC
+ * ratio between the given CPU and the least capable CPU in the system.
+ * Regard that IPC ratio as the primary cost component and assume that
+ * the scaling factors for different CPU types will differ by at least
+ * 5% and they will not be above INTEL_PSTATE_CORE_SCALING.
+ *
+ * Add the freq value to the cost, so that the cost of running on CPUs
+ * of the same type in different "utilization bins" is different.
+ */
+ *cost = div_u64(100ULL * INTEL_PSTATE_CORE_SCALING, pstate->scaling) + freq;
+
+ return 0;
+}
+
+static bool hybrid_register_perf_domain(unsigned int cpu)
+{
+ static const struct em_data_callback cb
+ = EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
+ struct cpudata *cpudata = all_cpu_data[cpu];
+ struct device *cpu_dev;
+
+ /*
+ * Registering EM perf domains without enabling asymmetric CPU capacity
+ * support is not really useful and one domain should not be registered
+ * more than once.
+ */
+ if (!hybrid_max_perf_cpu || cpudata->em_registered)
+ return false;
+
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ return false;
+
+ if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
+ cpumask_of(cpu), false))
+ return false;
+
+ cpudata->em_registered = true;
+
+ return true;
+}
+
+static void hybrid_register_all_perf_domains(void)
+{
+ unsigned int cpu;
+
+ for_each_online_cpu(cpu)
+ hybrid_register_perf_domain(cpu);
+}
+
+static void hybrid_update_perf_domain(struct cpudata *cpu)
+{
+ if (cpu->em_registered)
+ em_adjust_cpu_capacity(cpu->cpu);
+}
+#else /* !CONFIG_ENERGY_MODEL */
+static inline bool hybrid_register_perf_domain(unsigned int cpu) { return false; }
+static inline void hybrid_register_all_perf_domains(void) {}
+static inline void hybrid_update_perf_domain(struct cpudata *cpu) {}
+#endif /* CONFIG_ENERGY_MODEL */
+
static void hybrid_set_cpu_capacity(struct cpudata *cpu)
{
arch_set_cpu_capacity(cpu->cpu, cpu->capacity_perf,
hybrid_max_perf_cpu->capacity_perf,
cpu->capacity_perf,
cpu->pstate.max_pstate_physical);
+ hybrid_update_perf_domain(cpu);
pr_debug("CPU%d: perf = %u, max. perf = %u, base perf = %d\n", cpu->cpu,
cpu->capacity_perf, hybrid_max_perf_cpu->capacity_perf,
@@ -1039,6 +1137,11 @@
guard(mutex)(&hybrid_capacity_lock);
__hybrid_refresh_cpu_capacity_scaling();
+ /*
+ * Perf domains are not registered before setting hybrid_max_perf_cpu,
+ * so register them all after setting up CPU capacity scaling.
+ */
+ hybrid_register_all_perf_domains();
}
static void hybrid_init_cpu_capacity_scaling(bool refresh)
@@ -1066,7 +1169,7 @@
hybrid_refresh_cpu_capacity_scaling();
/*
* Disabling ITMT causes sched domains to be rebuilt to disable asym
- * packing and enable asym capacity.
+ * packing and enable asym capacity and EAS.
*/
sched_clear_itmt_support();
}
@@ -1144,6 +1247,14 @@
}
hybrid_set_cpu_capacity(cpu);
+ /*
+ * If the CPU was offline to start with and it is going online for the
+ * first time, a perf domain needs to be registered for it if hybrid
+ * capacity scaling has been enabled already. In that case, sched
+ * domains need to be rebuilt to take the new perf domain into account.
+ */
+ if (hybrid_register_perf_domain(cpu->cpu))
+ em_rebuild_sched_domains();
unlock:
mutex_unlock(&hybrid_capacity_lock);
@@ -3397,6 +3508,8 @@
static int intel_pstate_update_status(const char *buf, size_t size)
{
+ int ret = -EINVAL;
+
if (size == 3 && !strncmp(buf, "off", size)) {
if (!intel_pstate_driver)
return -EINVAL;
@@ -3406,6 +3519,8 @@
cpufreq_unregister_driver(intel_pstate_driver);
intel_pstate_driver_cleanup();
+ /* Disable EAS support in case it was used. */
+ rebuild_sched_domains_energy();
return 0;
}
@@ -3417,7 +3532,13 @@
cpufreq_unregister_driver(intel_pstate_driver);
}
- return intel_pstate_register_driver(&intel_pstate);
+ ret = intel_pstate_register_driver(&intel_pstate);
+ /*
+ * If the previous status had been "passive" and the schedutil
+ * governor had been used, it disabled EAS on exit, so trigger
+ * sched domains rebuild in case EAS needs to be enabled again.
+ */
+ rebuild_sched_domains_energy();
}
if (size == 7 && !strncmp(buf, "passive", size)) {
@@ -3429,10 +3550,10 @@
intel_pstate_sysfs_hide_hwp_dynamic_boost();
}
- return intel_pstate_register_driver(&intel_cpufreq);
+ ret = intel_pstate_register_driver(&intel_cpufreq);
}
- return -EINVAL;
+ return ret;
}
static int no_load __initdata;
* [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
` (5 preceding siblings ...)
2025-04-16 18:09 ` [RFT][PATCH v1 6/8] cpufreq: intel_pstate: EAS support for hybrid platforms Rafael J. Wysocki
@ 2025-04-16 18:10 ` Rafael J. Wysocki
2025-04-17 12:42 ` Christian Loehle
2025-04-27 16:23 ` Dietmar Eggemann
2025-04-16 18:12 ` [RFT][PATCH v1 8/8] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache Rafael J. Wysocki
` (2 subsequent siblings)
9 siblings, 2 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 18:10 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle, Tim Chen
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On some hybrid platforms a group of cores (referred to as a module) may
share an L2 cache in which case they also share a voltage regulator and
always run at the same frequency (while not in idle states).
For this reason, make hybrid_register_perf_domain() in the intel_pstate
driver add all CPUs sharing an L2 cache to the same perf domain for EAS.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
New in v1.
---
drivers/cpufreq/intel_pstate.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -999,8 +999,11 @@
{
static const struct em_data_callback cb
= EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
+ struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(cpu);
+ const struct cpumask *cpumask = cpumask_of(cpu);
struct cpudata *cpudata = all_cpu_data[cpu];
struct device *cpu_dev;
+ int ret;
/*
* Registering EM perf domains without enabling asymmetric CPU capacity
@@ -1014,9 +1017,25 @@
if (!cpu_dev)
return false;
- if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
- cpumask_of(cpu), false))
+ if (cacheinfo) {
+ unsigned int i;
+
+ /* Find the L2 cache and the CPUs sharing it. */
+ for (i = 0; i < cacheinfo->num_leaves; i++) {
+ if (cacheinfo->info_list[i].level == 2) {
+ cpumask = &cacheinfo->info_list[i].shared_cpu_map;
+ break;
+ }
+ }
+ }
+
+ ret = em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
+ cpumask, false);
+ if (ret) {
+ cpudata->em_registered = ret == -EEXIST;
+
return false;
+ }
cpudata->em_registered = true;
* [RFT][PATCH v1 8/8] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
` (6 preceding siblings ...)
2025-04-16 18:10 ` [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache Rafael J. Wysocki
@ 2025-04-16 18:12 ` Rafael J. Wysocki
2025-04-25 21:32 ` Christian Loehle
2025-04-18 9:58 ` [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Christian Loehle
2025-05-12 13:23 ` [PATCH v1] cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas() Rafael J. Wysocki
9 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 18:12 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On some hybrid platforms some efficient CPUs (E-cores) are not connected
to the L3 cache, but there are no other differences between them and the
other E-cores that use L3. In that case, it is generally more efficient
to run "light" workloads on the E-cores that do not use L3 and allow all
of the cores using L3, including P-cores, to go into idle states.
For this reason, slightly increase the cost for all CPUs sharing the L3
cache to make EAS prefer CPUs that do not use it to the other CPUs with
the same perf-to-frequency scaling factor (if any).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpufreq/intel_pstate.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -979,6 +979,7 @@
unsigned long *cost)
{
struct pstate_data *pstate = &all_cpu_data[dev->id]->pstate;
+ struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(dev->id);
/*
* The smaller the perf-to-frequency scaling factor, the larger the IPC
@@ -991,6 +992,13 @@
* of the same type in different "utilization bins" is different.
*/
*cost = div_u64(100ULL * INTEL_PSTATE_CORE_SCALING, pstate->scaling) + freq;
+ /*
+ * Inrease the cost slightly for CPUs able to access L3 to avoid litting
+ * it up too eagerly in case some other CPUs of the same type cannot
+ * access it.
+ */
+ if (cacheinfo->num_levels >= 3)
+ (*cost)++;
return 0;
}
* Re: [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS
2025-04-16 18:01 ` [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS Rafael J. Wysocki
@ 2025-04-17 12:19 ` Christian Loehle
2025-04-17 13:01 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Christian Loehle @ 2025-04-17 12:19 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois
On 4/16/25 19:01, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Some cpufreq drivers, like intel_pstate, have built-in governors that
> are used instead of regular cpufreq governors, schedutil in particular,
> but they can work with EAS just fine, so allow EAS to be used with
> those drivers.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> v0.3 -> v1
> * Rebase on top of the new [1-2/8].
> * Update the diagnostic message printed if the conditions are not met.
>
> This patch is regarded as a cleanup for 6.16.
>
> ---
> drivers/cpufreq/cpufreq.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -3054,7 +3054,16 @@
>
> guard(cpufreq_policy_read)(policy);
>
> - return sugov_is_governor(policy);
> + /*
> + * For EAS compatibility, require that either schedutil is the policy
> + * governor or the policy is governed directly by the cpufreq driver.
> + *
> + * In the latter case, it is assumed that EAS can only be enabled by the
> + * cpufreq driver itself which will not enable EAS if it does not meet
> + * the EAS' expectations regarding performance scaling response.
> + */
> + return sugov_is_governor(policy) || (!policy->governor &&
> + policy->policy != CPUFREQ_POLICY_UNKNOWN);
> }
>
> bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
> @@ -3064,7 +3073,7 @@
> /* Do not attempt EAS if schedutil is not being used. */
> for_each_cpu(cpu, cpu_mask) {
> if (!cpufreq_policy_is_good_for_eas(cpu)) {
> - pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
> + pr_debug("rd %*pbl: EAS requirements not met\n",
> cpumask_pr_args(cpu_mask));
I'd prefer to have at least "EAS cpufreq requirements" printed here.
with that caveat
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Maybe we should amend the EAS documentation to reflect this?
(And also emphasise that EAS will make cpufreq assumptions as if sugov
was the governor regardless.)
> return false;
> }
>
>
>
* Re: [RFT][PATCH v1 1/8] cpufreq/sched: schedutil: Add helper for governor checks
2025-04-16 17:48 ` [RFT][PATCH v1 1/8] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
@ 2025-04-17 12:23 ` Christian Loehle
0 siblings, 0 replies; 27+ messages in thread
From: Christian Loehle @ 2025-04-17 12:23 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois
On 4/16/25 18:48, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Add a helper for checking if schedutil is the current governor for
> a given cpufreq policy and use it in sched_is_eas_possible() to avoid
> accessing cpufreq policy internals directly from there.
>
> No intentional functional impact.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
* Re: [RFT][PATCH v1 2/8] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
2025-04-16 17:59 ` [RFT][PATCH v1 2/8] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq Rafael J. Wysocki
@ 2025-04-17 12:28 ` Christian Loehle
0 siblings, 0 replies; 27+ messages in thread
From: Christian Loehle @ 2025-04-17 12:28 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois
On 4/16/25 18:59, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Doing cpufreq-specific EAS checks that require accessing policy
> internals directly from sched_is_eas_possible() is a bit unfortunate,
> so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
> into that new function and make sched_is_eas_possible() call it.
>
> While at it, address a possible race between the EAS governor check
> and governor change by doing the former under the policy rwsem.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> v0.3 -> v1
> * Add a new helper called cpufreq_policy_is_good_for_eas() which is
> properly synchronized with governor changes.
> * Slightly modify debug messages.
>
> This patch is regarded as a cleanup for 6.16.
>
> ---
> drivers/cpufreq/cpufreq.c | 32 ++++++++++++++++++++++++++++++++
> include/linux/cpufreq.h | 2 ++
> kernel/sched/topology.c | 25 +++++--------------------
> 3 files changed, 39 insertions(+), 20 deletions(-)
>
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -3041,6 +3041,38 @@
>
> return 0;
> }
> +
> +static bool cpufreq_policy_is_good_for_eas(unsigned int cpu)
> +{
> + struct cpufreq_policy *policy __free(put_cpufreq_policy);
> +
> + policy = cpufreq_cpu_get(cpu);
> + if (!policy) {
> + pr_debug("cpufreq policy not set for CPU: %d", cpu);
> + return false;
> + }
> +
> + guard(cpufreq_policy_read)(policy);
> +
> + return sugov_is_governor(policy);
> +}
> +
> +bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
> +{
> + unsigned int cpu;
> +
> + /* Do not attempt EAS if schedutil is not being used. */
> + for_each_cpu(cpu, cpu_mask) {
> + if (!cpufreq_policy_is_good_for_eas(cpu)) {
> + pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
> + cpumask_pr_args(cpu_mask));
> + return false;
> + }
> + }
> +
> + return true;
> +}
> +
> module_param(off, int, 0444);
> module_param_string(default_governor, default_governor, CPUFREQ_NAME_LEN, 0444);
> core_initcall(cpufreq_core_init);
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -1212,6 +1212,8 @@
> struct cpufreq_frequency_table *table,
> unsigned int transition_latency);
>
> +bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask);
> +
> static inline void cpufreq_register_em_with_opp(struct cpufreq_policy *policy)
> {
> dev_pm_opp_of_register_em(get_cpu_device(policy->cpu),
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -212,8 +212,6 @@
> static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> {
> bool any_asym_capacity = false;
> - struct cpufreq_policy *policy;
> - bool policy_is_ready;
> int i;
>
> /* EAS is enabled for asymmetric CPU capacity topologies. */
> @@ -248,25 +246,12 @@
> return false;
> }
>
> - /* Do not attempt EAS if schedutil is not being used. */
> - for_each_cpu(i, cpu_mask) {
> - policy = cpufreq_cpu_get(i);
> - if (!policy) {
> - if (sched_debug()) {
> - pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
> - cpumask_pr_args(cpu_mask), i);
> - }
> - return false;
> - }
> - policy_is_ready = sugov_is_governor(policy);
> - cpufreq_cpu_put(policy);
> - if (!policy_is_ready) {
> - if (sched_debug()) {
> - pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> - cpumask_pr_args(cpu_mask));
> - }
> - return false;
> + if (!cpufreq_ready_for_eas(cpu_mask)) {
> + if (sched_debug()) {
> + pr_info("rd %*pbl: Checking EAS: cpufreq is not ready",
Missing \n here.
There is another one you touch, I've sent patches already last month:
https://lore.kernel.org/lkml/20250319131324.224228-1-christian.loehle@arm.com/
With that:
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> + cpumask_pr_args(cpu_mask));
> }
> + return false;
> }
>
> return true;
>
>
>
* Re: [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache
2025-04-16 18:10 ` [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache Rafael J. Wysocki
@ 2025-04-17 12:42 ` Christian Loehle
2025-04-27 16:23 ` Dietmar Eggemann
1 sibling, 0 replies; 27+ messages in thread
From: Christian Loehle @ 2025-04-17 12:42 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois, Tim Chen
On 4/16/25 19:10, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> On some hybrid platforms a group of cores (referred to as a module) may
> share an L2 cache in which case they also share a voltage regulator and
> always run at the same frequency (while not in idle states).
>
> For this reason, make hybrid_register_perf_domain() in the intel_pstate
> driver add all CPUs sharing an L2 cache to the same perf domain for EAS.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> New in v1.
>
> ---
> drivers/cpufreq/intel_pstate.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,8 +999,11 @@
> {
> static const struct em_data_callback cb
> = EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
> + struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(cpu);
> + const struct cpumask *cpumask = cpumask_of(cpu);
> struct cpudata *cpudata = all_cpu_data[cpu];
> struct device *cpu_dev;
> + int ret;
>
> /*
> * Registering EM perf domains without enabling asymmetric CPU capacity
> @@ -1014,9 +1017,25 @@
> if (!cpu_dev)
> return false;
>
> - if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> - cpumask_of(cpu), false))
> + if (cacheinfo) {
> + unsigned int i;
> +
> + /* Find the L2 cache and the CPUs sharing it. */
> + for (i = 0; i < cacheinfo->num_leaves; i++) {
> + if (cacheinfo->info_list[i].level == 2) {
> + cpumask = &cacheinfo->info_list[i].shared_cpu_map;
> + break;
> + }
> + }
> + }
> +
> + ret = em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> + cpumask, false);
> + if (ret) {
> + cpudata->em_registered = ret == -EEXIST;
> +
> return false;
> + }
>
> cpudata->em_registered = true;
>
>
debugfs already provides a way to retrieve that information, but with more
complex perf domain constructions like here maybe this would be useful
(maybe it already is):
--->8---
Subject: [PATCH] PM: EM: Print CPUs of perf domains
In preparation for future EAS users who make the relation from CPU
to perf-domain not strictly based on cpufreq policies print the
affected CPUs when registering a perf-domain.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
kernel/power/energy_model.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 99a1ae324c2d..a202968b2ee9 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -627,7 +627,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
em_cpufreq_update_efficiencies(dev, em_table->state);
em_debug_create_pd(dev);
- dev_info(dev, "EM: created perf domain\n");
+ dev_info(dev, "EM: created perf domain for CPUs %*pbl\n", cpumask_pr_args(cpus));
unlock:
mutex_unlock(&em_pd_mutex);
--
2.34.1
* Re: [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS
2025-04-17 12:19 ` Christian Loehle
@ 2025-04-17 13:01 ` Rafael J. Wysocki
2025-04-17 13:03 ` Christian Loehle
0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-17 13:01 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
Vincent Guittot, Ricardo Neri, Pierre Gondois
On Thu, Apr 17, 2025 at 2:19 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/16/25 19:01, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Some cpufreq drivers, like intel_pstate, have built-in governors that
> > are used instead of regular cpufreq governors, schedutil in particular,
> > but they can work with EAS just fine, so allow EAS to be used with
> > those drivers.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > v0.3 -> v1
> > * Rebase on top of the new [1-2/8].
> > * Update the diagnostic message printed if the conditions are not met.
> >
> > This patch is regarded as a cleanup for 6.16.
> >
> > ---
> > drivers/cpufreq/cpufreq.c | 13 +++++++++++--
> > 1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -3054,7 +3054,16 @@
> >
> > guard(cpufreq_policy_read)(policy);
> >
> > - return sugov_is_governor(policy);
> > + /*
> > + * For EAS compatibility, require that either schedutil is the policy
> > + * governor or the policy is governed directly by the cpufreq driver.
> > + *
> > + * In the latter case, it is assumed that EAS can only be enabled by the
> > + * cpufreq driver itself which will not enable EAS if it does not meet
> > + * the EAS' expectations regarding performance scaling response.
> > + */
> > + return sugov_is_governor(policy) || (!policy->governor &&
> > + policy->policy != CPUFREQ_POLICY_UNKNOWN);
> > }
> >
> > bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
> > @@ -3064,7 +3073,7 @@
> > /* Do not attempt EAS if schedutil is not being used. */
> > for_each_cpu(cpu, cpu_mask) {
> > if (!cpufreq_policy_is_good_for_eas(cpu)) {
> > - pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
> > + pr_debug("rd %*pbl: EAS requirements not met\n",
> > cpumask_pr_args(cpu_mask));
>
> I'd prefer to have at least "EAS cpufreq requirements" printed here.
Sure.
> with that caveat
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
>
> Maybe we should amend the EAS documentation to reflect this?
Yes, the documentation should be updated. Which piece of it in
particular do I need to look at?
> (And also emphasise that EAS will make cpufreq assumptions as if sugov
> was the governor regardless.)
Right.
* Re: [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS
2025-04-17 13:01 ` Rafael J. Wysocki
@ 2025-04-17 13:03 ` Christian Loehle
0 siblings, 0 replies; 27+ messages in thread
From: Christian Loehle @ 2025-04-17 13:03 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
Vincent Guittot, Ricardo Neri, Pierre Gondois
On 4/17/25 14:01, Rafael J. Wysocki wrote:
> On Thu, Apr 17, 2025 at 2:19 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 4/16/25 19:01, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Some cpufreq drivers, like intel_pstate, have built-in governors that
>>> are used instead of regular cpufreq governors, schedutil in particular,
>>> but they can work with EAS just fine, so allow EAS to be used with
>>> those drivers.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> v0.3 -> v1
>>> * Rebase on top of the new [1-2/8].
>>> * Update the diagnostic message printed if the conditions are not met.
>>>
>>> This patch is regarded as a cleanup for 6.16.
>>>
>>> ---
>>> drivers/cpufreq/cpufreq.c | 13 +++++++++++--
>>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>>
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -3054,7 +3054,16 @@
>>>
>>> guard(cpufreq_policy_read)(policy);
>>>
>>> - return sugov_is_governor(policy);
>>> + /*
>>> + * For EAS compatibility, require that either schedutil is the policy
>>> + * governor or the policy is governed directly by the cpufreq driver.
>>> + *
>>> + * In the latter case, it is assumed that EAS can only be enabled by the
>>> + * cpufreq driver itself which will not enable EAS if it does not meet
>>> + * the EAS' expectations regarding performance scaling response.
>>> + */
>>> + return sugov_is_governor(policy) || (!policy->governor &&
>>> + policy->policy != CPUFREQ_POLICY_UNKNOWN);
>>> }
>>>
>>> bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
>>> @@ -3064,7 +3073,7 @@
>>> /* Do not attempt EAS if schedutil is not being used. */
>>> for_each_cpu(cpu, cpu_mask) {
>>> if (!cpufreq_policy_is_good_for_eas(cpu)) {
>>> - pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
>>> + pr_debug("rd %*pbl: EAS requirements not met\n",
>>> cpumask_pr_args(cpu_mask));
>>
>> I'd prefer to have at least "EAS cpufreq requirements" printed here.
>
> Sure.
>
>> with that caveat
>> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
>>
>> Maybe we should amend the EAS documentation to reflect this?
>
> Yes, the documentation should be updated. Which piece of it in
> particular I need to look at?
Documentation/scheduler/sched-energy.rst
has:
6.4 - Schedutil governor
so at least there.
* Re: [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
` (7 preceding siblings ...)
2025-04-16 18:12 ` [RFT][PATCH v1 8/8] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache Rafael J. Wysocki
@ 2025-04-18 9:58 ` Christian Loehle
2025-04-18 19:52 ` Rafael J. Wysocki
2025-05-12 13:23 ` [PATCH v1] cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas() Rafael J. Wysocki
9 siblings, 1 reply; 27+ messages in thread
From: Christian Loehle @ 2025-04-18 9:58 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois
On 4/16/25 18:44, Rafael J. Wysocki wrote:
> Hi Everyone,
>
> This is a new version of
>
> https://lore.kernel.org/linux-pm/22640172.EfDdHjke4D@rjwysocki.net/
>
> which is not regarded as RFC any more. It appears to be better than
>
> https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
>
> but still requires more testing, so I'd appreciate any help here.
>
> The following paragraph from the original cover letter still applies:
>
> "The underlying observation is that on the platforms targeted by these changes,
> Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
> the same performance level, are always more energy-efficient than the "big" or
> "performance" CPUs (P-cores). This means that, regardless of the scale-
> invariant utilization of a task, as long as there is enough spare capacity on
> E-cores, the relative cost of running it there is always lower."
>
> The first 3 patches have been updated since v0.3 and they now depend on the new
> cpufreq material in linux-next.
>
> The next 2 patches (Energy Model code changes) have been reviewed in the
> meantime, but they are only needed for the last 3 patches.
>
> Patch [6/8] is essentially the same as before. It causes perf domains to be
> registered per CPU and in addition to the primary cost component, which is
> related to the CPU type, there is a small component proportional to performance
> whose role is to help balance the load between CPUs of the same type.
>
> This is done to avoid migrating tasks too much between CPUs of the same type,
> especially between E-cores, which has been observed in tests of
>
> https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
>
> The expected effect is still that the CPUs of the "low-cost" type will be
> preferred so long as there is enough spare capacity on any of them.
>
> The last 2 patches are new.
>
> Patch [7/8] looks at the cache topology to avoid creating per-CPU perf domains
> for CPUs sharing an L2 cache. Typically, on the chips that will be affected
> by this patch, CPUs sharing an L2 cache also share a voltage regulator and a
> clock, so they technically belong to the same OPP domain and they will be put
> into a shared perf domain after this patch.
>
> Patch [8/8] makes CPUs sharing the L3 cache look slightly more expensive to
> cause the scheduler to prefer placing tasks on CPUs that don't use the L3,
> which in some cases should allow all of the CPUs sharing the L3 to stay in
> idle states and the energy usage should be reduced.
>
> Please refer to the individual patch changelogs for details.
>
> Since patches [7-8/8] also apply on top of the v0.3, I have created a git branch at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
> experimental/intel_pstate/eas-take2-extended
>
> or
>
> https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended
>
> to allow the difference they make with respect to the v0.3 to be seen (if any).
>
> Later, I'm going to put this series as a whole into a new git branch on top of
> the mainline and the cpufreq material queued up for 6.16.
>
> Thanks!
>
Similar to the v0.3 tests done here:
https://lore.kernel.org/linux-pm/6ab0531a-d6d8-46ac-9afc-23cf87f37905@arm.com/
here are the results for the same raptor lake nosmt machine (now with
4 e-cores + 4 e-cores and 8x1 p-core PDs, 10 PDs in total).
Firefox YouTube 4K video playback:
EAS:
684.504 +-19.167841239372198
CAS:
929.83 +-50.41498564690636
(-26.3844% energy used with EAS)
(cf. -43.1% energy used with EAS v0.3)
(cf. -24.2% energy used with EAS v0.2)
Firefox Web Aquarium 500 fish.
EAS:
540.192 +-14.294833410089904
CAS:
712.896 +-16.821304745272684
(-24.2257% energy used with EAS)
(cf. -35.6% energy used with EAS v0.3)
Seems the per-CPU PD worked better, at least for this machine, which arguably
isn't the main target of the series.
Feel free to add
Tested-by: Christian Loehle <christian.loehle@arm.com>
to patches 1 to 7 (the tested system isn't affected by 8/8).
* Re: [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
2025-04-18 9:58 ` [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Christian Loehle
@ 2025-04-18 19:52 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-18 19:52 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
Vincent Guittot, Ricardo Neri, Pierre Gondois
On Fri, Apr 18, 2025 at 11:58 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/16/25 18:44, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This is a new version of
> >
> > https://lore.kernel.org/linux-pm/22640172.EfDdHjke4D@rjwysocki.net/
> >
> > which is not regarded as RFC any more. It appears to be better than
> >
> > https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
> >
> > but still requires more testing, so I'd appreciate any help here.
> >
> > The following paragraph from the original cover letter still applies:
> >
> > "The underlying observation is that on the platforms targeted by these changes,
> > Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
> > the same performance level, are always more energy-efficient than the "big" or
> > "performance" CPUs (P-cores). This means that, regardless of the scale-
> > invariant utilization of a task, as long as there is enough spare capacity on
> > E-cores, the relative cost of running it there is always lower."
> >
> > The first 3 patches have been updated since v0.3 and they now depend on the new
> > cpufreq material in linux-next.
> >
> > The next 2 patches (Energy Model code changes) have been reviewed in the
> > meantime, but they are only needed for the last 3 patches.
> >
> > Patch [6/8] is essentially the same as before. It causes perf domains to be
> > registered per CPU and in addition to the primary cost component, which is
> > related to the CPU type, there is a small component proportional to performance
> > whose role is to help balance the load between CPUs of the same type.
> >
> > This is done to avoid migrating tasks too much between CPUs of the same type,
> > especially between E-cores, which has been observed in tests of
> >
> > https://lore.kernel.org/linux-pm/5861970.DvuYhMxLoT@rjwysocki.net/
> >
> > The expected effect is still that the CPUs of the "low-cost" type will be
> > preferred so long as there is enough spare capacity on any of them.
> >
> > The last 2 patches are new.
> >
> > Patch [7/8] looks at the cache topology to avoid creating per-CPU perf domains
> > for CPUs sharing an L2 cache. Typically, on the chips that will be affected
> > by this patch, CPUs sharing an L2 cache also share a voltage regulator and a
> > clock, so they technically belong to the same OPP domain and they will be put
> > into a shared perf domain after this patch.
> >
> > Patch [8/8] makes CPUs sharing the L3 cache look slightly more expensive to
> > cause the scheduler to prefer placing tasks on CPUs that don't use the L3,
> > which in some cases should allow all of the CPUs sharing the L3 to stay in
> > idle states and the energy usage should be reduced.
> >
> > Please refer to the individual patch changelogs for details.
> >
> > Since patches [7-8/8] also apply on top of the v0.3, I have created a git branch at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
> > experimental/intel_pstate/eas-take2-extended
> >
> > or
> >
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended
> >
> > to allow the difference they make with respect to the v0.3 to be seen (if any).
> >
> > Later, I'm going to put this series as a whole into a new git branch on top of
> > the mainline and the cpufreq material queued up for 6.16.
> >
> > Thanks!
> >
>
> Similar to the v0.3 tests done here:
> https://lore.kernel.org/linux-pm/6ab0531a-d6d8-46ac-9afc-23cf87f37905@arm.com/
> here are the results for the same raptor lake nosmt machine (now with
> 4 e-cores + 4 e-cores and 8x1 p-core PDs, 10 PDs in total).
>
> Firefox YouTube 4K video playback:
> EAS:
> 684.504 +-19.167841239372198
> CAS:
> 929.83 +-50.41498564690636
> (-26.3844% energy used with EAS)
> (cf. -43.1% energy used with EAS v0.3)
> (cf. -24.2% energy used with EAS v0.2)
>
> Firefox Web Aquarium 500 fish.
> EAS:
> 540.192 +-14.294833410089904
> CAS:
> 712.896 +-16.821304745272684
> (-24.2257% energy used with EAS)
> (cf. -35.6% energy used with EAS v0.3)
>
> Seems the per-CPU PD worked better, at least for this machine, which arguably
> isn't the main target of the series.
But it is an interesting data point.
We still cannot get this much of a difference on other systems in our
labs, even with the same type of workload.
Thanks for the results, much appreciated!
> Feel free to add
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> to patches 1 to 7 (the tested system isn't affected by 8/8).
Thank you!
* Re: [RFT][PATCH v1 8/8] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
2025-04-16 18:12 ` [RFT][PATCH v1 8/8] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache Rafael J. Wysocki
@ 2025-04-25 21:32 ` Christian Loehle
2025-04-25 21:39 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Christian Loehle @ 2025-04-25 21:32 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
Pierre Gondois
On 4/16/25 19:12, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> On some hybrid platforms some efficient CPUs (E-cores) are not connected
> to the L3 cache, but there are no other differences between them and the
> other E-cores that use L3. In that case, it is generally more efficient
> to run "light" workloads on the E-cores that do not use L3 and allow all
> of the cores using L3, including P-cores, to go into idle states.
>
> For this reason, slightly increase the cost for all CPUs sharing the L3
> cache to make EAS prefer CPUs that do not use it to the other CPUs with
> the same perf-to-frequency scaling factor (if any).
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> drivers/cpufreq/intel_pstate.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -979,6 +979,7 @@
> unsigned long *cost)
> {
> struct pstate_data *pstate = &all_cpu_data[dev->id]->pstate;
> + struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(dev->id);
>
> /*
> * The smaller the perf-to-frequency scaling factor, the larger the IPC
> @@ -991,6 +992,13 @@
> * of the same type in different "utilization bins" is different.
> */
> *cost = div_u64(100ULL * INTEL_PSTATE_CORE_SCALING, pstate->scaling) + freq;
> + /*
> + * Inrease the cost slightly for CPUs able to access L3 to avoid litting
s/Inrease/Increase
and I guess s/litting/littering
> + * it up too eagerly in case some other CPUs of the same type cannot
> + * access it.
> + */
> + if (cacheinfo->num_levels >= 3)
> + (*cost)++;
This makes cost(OPP1) of the SoC Tile e-core as expensive as cost(OPP0) of a
normal e-core.
Is that the intended behaviour?
* Re: [RFT][PATCH v1 8/8] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
2025-04-25 21:32 ` Christian Loehle
@ 2025-04-25 21:39 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-25 21:39 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
Vincent Guittot, Ricardo Neri, Pierre Gondois
On Fri, Apr 25, 2025 at 11:32 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/16/25 19:12, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > On some hybrid platforms some efficient CPUs (E-cores) are not connected
> > to the L3 cache, but there are no other differences between them and the
> > other E-cores that use L3. In that case, it is generally more efficient
> > to run "light" workloads on the E-cores that do not use L3 and allow all
> > of the cores using L3, including P-cores, to go into idle states.
> >
> > For this reason, slightly increase the cost for all CPUs sharing the L3
> > cache to make EAS prefer CPUs that do not use it to the other CPUs with
> > the same perf-to-frequency scaling factor (if any).
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > drivers/cpufreq/intel_pstate.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -979,6 +979,7 @@
> > unsigned long *cost)
> > {
> > struct pstate_data *pstate = &all_cpu_data[dev->id]->pstate;
> > + struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(dev->id);
> >
> > /*
> > * The smaller the perf-to-frequency scaling factor, the larger the IPC
> > @@ -991,6 +992,13 @@
> > * of the same type in different "utilization bins" is different.
> > */
> > *cost = div_u64(100ULL * INTEL_PSTATE_CORE_SCALING, pstate->scaling) + freq;
> > + /*
> > + * Inrease the cost slightly for CPUs able to access L3 to avoid litting
>
> s/Inrease/Increase
> and I guess s/litting/littering
>
> > + * it up too eagerly in case some other CPUs of the same type cannot
> > + * access it.
> > + */
> > + if (cacheinfo->num_levels >= 3)
This check actually doesn't work on Intel processors; I have a
replacement patch for this one.
> > + (*cost)++;
>
> This makes cost(OPP1) of the SoC Tile e-core as expensive as cost(OPP0) of a
> normal e-core.
If "a normal Ecore" is one using L3, then yes.
> Is that the intended behaviour?
Yes, it is. I wanted the Ecores on L3 to appear somewhat more
expensive, but not too much.
It looks like *cost += 2 would work better, though.
* Re: [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity()
2025-04-16 18:06 ` [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity() Rafael J. Wysocki
@ 2025-04-27 14:01 ` Dietmar Eggemann
2025-04-30 19:23 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Dietmar Eggemann @ 2025-04-27 14:01 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois,
Christian Loehle
On 16/04/2025 20:06, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Add a function for updating the Energy Model for a CPU after its
> capacity has changed, which subsequently will be used by the
> intel_pstate driver.
>
> An EM_PERF_DOMAIN_ARTIFICIAL check is added to em_adjust_new_capacity()
> to prevent it from calling em_compute_costs() for an "artificial" perf
> domain with a NULL cb parameter which would cause it to crash.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>
> v0.3 -> v1:
> * Added R-by from Lukasz.
>
> ---
> include/linux/energy_model.h | 2 ++
> kernel/power/energy_model.c | 28 ++++++++++++++++++++++++----
> 2 files changed, 26 insertions(+), 4 deletions(-)
>
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -179,6 +179,7 @@
> int em_dev_update_chip_binning(struct device *dev);
> int em_update_performance_limits(struct em_perf_domain *pd,
> unsigned long freq_min_khz, unsigned long freq_max_khz);
> +void em_adjust_cpu_capacity(unsigned int cpu);
> void em_rebuild_sched_domains(void);
>
> /**
> @@ -405,6 +406,7 @@
> {
> return -EINVAL;
> }
> +static inline void em_adjust_cpu_capacity(unsigned int cpu) {}
> static inline void em_rebuild_sched_domains(void) {}
> #endif
>
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -698,10 +698,12 @@
> {
> int ret;
>
> - ret = em_compute_costs(dev, em_table->state, NULL, pd->nr_perf_states,
> - pd->flags);
> - if (ret)
> - goto free_em_table;
> + if (!(pd->flags & EM_PERF_DOMAIN_ARTIFICIAL)) {
This looks weird to me. How can an artificial EM ever have a non-ZERO
em_data_callback here?
There is already EM_PERF_DOMAIN_ARTIFICIAL specific handling in
em_compute_costs(). Which probably works well for the
em_create_perf_table() call-site.
Will there be cases for Hybrid CPU EM's in which 'em_max_perf !=
cpu_capacity':
em_adjust_new_capacity()
if (em_max_perf == cpu_capacity)
return
em_recalc_and_update()
em_compute_costs()
so that em_compute_costs() might be called?
Maybe:
@@ -233,11 +237,17 @@ static int em_compute_costs(struct device *dev,
struct em_perf_state *table,
unsigned long prev_cost = ULONG_MAX;
int i, ret;
+ if (!cb && (flags & EM_PERF_DOMAIN_ARTIFICIAL))
+ return 0;
is somehow clearer in this case?
[...]
* Re: [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache
2025-04-16 18:10 ` [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache Rafael J. Wysocki
2025-04-17 12:42 ` Christian Loehle
@ 2025-04-27 16:23 ` Dietmar Eggemann
2025-04-30 19:29 ` Rafael J. Wysocki
1 sibling, 1 reply; 27+ messages in thread
From: Dietmar Eggemann @ 2025-04-27 16:23 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois,
Christian Loehle, Tim Chen
On 16/04/2025 20:10, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> On some hybrid platforms a group of cores (referred to as a module) may
> share an L2 cache in which case they also share a voltage regulator and
> always run at the same frequency (while not in idle states).
>
> For this reason, make hybrid_register_perf_domain() in the intel_pstate
> driver add all CPUs sharing an L2 cache to the same perf domain for EAS.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> New in v1.
>
> ---
> drivers/cpufreq/intel_pstate.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -999,8 +999,11 @@
> {
> static const struct em_data_callback cb
> = EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
> + struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(cpu);
> + const struct cpumask *cpumask = cpumask_of(cpu);
> struct cpudata *cpudata = all_cpu_data[cpu];
> struct device *cpu_dev;
> + int ret;
>
> /*
> * Registering EM perf domains without enabling asymmetric CPU capacity
> @@ -1014,9 +1017,25 @@
> if (!cpu_dev)
> return false;
>
> - if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> - cpumask_of(cpu), false))
> + if (cacheinfo) {
> + unsigned int i;
> +
> + /* Find the L2 cache and the CPUs sharing it. */
> + for (i = 0; i < cacheinfo->num_leaves; i++) {
> + if (cacheinfo->info_list[i].level == 2) {
> + cpumask = &cacheinfo->info_list[i].shared_cpu_map;
> + break;
> + }
> + }
> + }
> +
> + ret = em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> + cpumask, false);
> + if (ret) {
> + cpudata->em_registered = ret == -EEXIST;
> +
> return false;
> + }
>
> cpudata->em_registered = true;
There seems to be an issue with late CPU-hotplug-in and this alignment
on L2 cache boundaries.
Example:
i7-13700K (P-cores: [0,2,4,6,8,10,12,14], E-cores: [16-23]) w/ 'nosmt' and
'maxcpus=12'; [16-19] & [20-23] share L2 cache.
root:/sys/kernel/debug/energy_model# cat cpu*/cpus
0
10
12
14
16-19
2
4
6
8
# echo 1 > /sys/devices/system/cpu/cpu20/online
...
[ 496.616050] root_domain 0,2,4,6,8,10,12,14,16-20: pd20:{ cpus=20
nr_pstate=4 } pd16:{ cpus=16-19 nr_pstate=4 } pd14:{ cpus=14 nr_pstate=4
} pd12:{ cpus=12 nr_pstate=4 } pd10:{ cpus=10 nr_pstate=4 } pd8:{ cpus=8
nr_pstate=4 } pd6:{ cpus=6 nr_pstate=4 } pd4:{ cpus=4 nr_pstate=4 }
pd2:{ cpus=2 nr_pstate=4 } pd0:{ cpus=0 nr_pstate=4 }
...
root:/sys/kernel/debug/energy_model# cat cpu*/cpus
0
10
12
14
16-19
20
2
4
6
8
# echo 1 > /sys/devices/system/cpu/cpu21/online
...
[ 589.001256] root domain span: 0,2,4,6,8,10,12,14,16-21
[ 589.001265] pd_init: no EM found for CPU21
[ 589.001266] sched_energy_set: stopping EAS
...
root:/sys/kernel/debug/energy_model# cat cpu*/cpus
0
10
12
14
16-19
20
2
4
6
8
* Re: [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity()
2025-04-27 14:01 ` Dietmar Eggemann
@ 2025-04-30 19:23 ` Rafael J. Wysocki
2025-05-01 12:30 ` Dietmar Eggemann
0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-30 19:23 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
Ricardo Neri, Pierre Gondois, Christian Loehle
On Sun, Apr 27, 2025 at 4:07 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 16/04/2025 20:06, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Add a function for updating the Energy Model for a CPU after its
> > capacity has changed, which subsequently will be used by the
> > intel_pstate driver.
> >
> > An EM_PERF_DOMAIN_ARTIFICIAL check is added to em_adjust_new_capacity()
> > to prevent it from calling em_compute_costs() for an "artificial" perf
> > domain with a NULL cb parameter which would cause it to crash.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
> > ---
> >
> > v0.3 -> v1:
> > * Added R-by from Lukasz.
> >
> > ---
> > include/linux/energy_model.h | 2 ++
> > kernel/power/energy_model.c | 28 ++++++++++++++++++++++++----
> > 2 files changed, 26 insertions(+), 4 deletions(-)
> >
> > --- a/include/linux/energy_model.h
> > +++ b/include/linux/energy_model.h
> > @@ -179,6 +179,7 @@
> > int em_dev_update_chip_binning(struct device *dev);
> > int em_update_performance_limits(struct em_perf_domain *pd,
> > unsigned long freq_min_khz, unsigned long freq_max_khz);
> > +void em_adjust_cpu_capacity(unsigned int cpu);
> > void em_rebuild_sched_domains(void);
> >
> > /**
> > @@ -405,6 +406,7 @@
> > {
> > return -EINVAL;
> > }
> > +static inline void em_adjust_cpu_capacity(unsigned int cpu) {}
> > static inline void em_rebuild_sched_domains(void) {}
> > #endif
> >
> > --- a/kernel/power/energy_model.c
> > +++ b/kernel/power/energy_model.c
> > @@ -698,10 +698,12 @@
> > {
> > int ret;
> >
> > - ret = em_compute_costs(dev, em_table->state, NULL, pd->nr_perf_states,
> > - pd->flags);
> > - if (ret)
> > - goto free_em_table;
> > + if (!(pd->flags & EM_PERF_DOMAIN_ARTIFICIAL)) {
>
> This looks weird to me. How can an artificial EM ever have a non-ZERO
> em_data_callback here?
>
> There is already EM_PERF_DOMAIN_ARTIFICIAL specific handling in
> em_compute_costs(). Which probably works well for the
> em_create_perf_table() call-site.
Yes, but that one doesn't pass a NULL cb pointer to it.
> Will there be cases for Hybrid CPU EM's in which 'em_max_perf !=
> cpu_capacity':
When the capacity is updated, the EM needs to be updated accordingly,
which is why the new function is being added.
> em_adjust_new_capacity()
>
> if (em_max_perf == cpu_capacity)
> return
>
> em_recalc_and_update()
> em_compute_costs()
>
> so that em_compute_costs() might be called?
>
> Maybe:
>
> @@ -233,11 +237,17 @@ static int em_compute_costs(struct device *dev,
> struct em_perf_state *table,
> unsigned long prev_cost = ULONG_MAX;
> int i, ret;
>
> + if (!cb && (flags & EM_PERF_DOMAIN_ARTIFICIAL))
> + return 0;
>
> is somehow clearer in this case?
This would work, but I prefer my version because it does one check
less and it does the check directly in em_recalc_and_update(), so it
is clear that this doesn't call em_compute_costs() for artificial PDs
at all.
* Re: [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache
2025-04-27 16:23 ` Dietmar Eggemann
@ 2025-04-30 19:29 ` Rafael J. Wysocki
2025-05-01 12:30 ` Dietmar Eggemann
0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-04-30 19:29 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
Ricardo Neri, Pierre Gondois, Christian Loehle, Tim Chen
On Sun, Apr 27, 2025 at 6:23 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 16/04/2025 20:10, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > On some hybrid platforms a group of cores (referred to as a module) may
> > share an L2 cache in which case they also share a voltage regulator and
> > always run at the same frequency (while not in idle states).
> >
> > For this reason, make hybrid_register_perf_domain() in the intel_pstate
> > driver add all CPUs sharing an L2 cache to the same perf domain for EAS.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > New in v1.
> >
> > ---
> > drivers/cpufreq/intel_pstate.c | 23 +++++++++++++++++++++--
> > 1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -999,8 +999,11 @@
> > {
> > static const struct em_data_callback cb
> > = EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
> > + struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(cpu);
> > + const struct cpumask *cpumask = cpumask_of(cpu);
> > struct cpudata *cpudata = all_cpu_data[cpu];
> > struct device *cpu_dev;
> > + int ret;
> >
> > /*
> > * Registering EM perf domains without enabling asymmetric CPU capacity
> > @@ -1014,9 +1017,25 @@
> > if (!cpu_dev)
> > return false;
> >
> > - if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> > - cpumask_of(cpu), false))
> > + if (cacheinfo) {
> > + unsigned int i;
> > +
> > + /* Find the L2 cache and the CPUs sharing it. */
> > + for (i = 0; i < cacheinfo->num_leaves; i++) {
> > + if (cacheinfo->info_list[i].level == 2) {
> > + cpumask = &cacheinfo->info_list[i].shared_cpu_map;
> > + break;
> > + }
> > + }
> > + }
> > +
> > + ret = em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
> > + cpumask, false);
> > + if (ret) {
> > + cpudata->em_registered = ret == -EEXIST;
> > +
> > return false;
> > + }
> >
> > cpudata->em_registered = true;
>
> There seems to be an issue with late CPU-hotplug-in and this alignment
> on L2 cache boundaries.
>
> Example:
>
> i7-13700K (P-cores: [0,2,4,6,8,10,12,14], E-cores: [16-23]) w/ 'nosmt' and
> 'maxcpus=12'; [16-19] & [20-23] share L2 cache.
>
> root:/sys/kernel/debug/energy_model# cat cpu*/cpus
> 0
> 10
> 12
> 14
> 16-19
> 2
> 4
> 6
> 8
>
> # echo 1 > /sys/devices/system/cpu/cpu20/online
>
> ...
> [ 496.616050] root_domain 0,2,4,6,8,10,12,14,16-20: pd20:{ cpus=20
> nr_pstate=4 } pd16:{ cpus=16-19 nr_pstate=4 } pd14:{ cpus=14 nr_pstate=4
> } pd12:{ cpus=12 nr_pstate=4 } pd10:{ cpus=10 nr_pstate=4 } pd8:{ cpus=8
> nr_pstate=4 } pd6:{ cpus=6 nr_pstate=4 } pd4:{ cpus=4 nr_pstate=4 }
> pd2:{ cpus=2 nr_pstate=4 } pd0:{ cpus=0 nr_pstate=4 }
> ...
>
> root:/sys/kernel/debug/energy_model# cat cpu*/cpus
> 0
> 10
> 12
> 14
> 16-19
> 20
> 2
> 4
> 6
> 8
>
> # echo 1 > /sys/devices/system/cpu/cpu21/online
>
> ...
> [ 589.001256] root domain span: 0,2,4,6,8,10,12,14,16-21
> [ 589.001265] pd_init: no EM found for CPU21
> [ 589.001266] sched_energy_set: stopping EAS
> ...
>
> root:/sys/kernel/debug/energy_model# cat cpu*/cpus
> 0
> 10
> 12
> 14
> 16-19
> 20
> 2
> 4
> 6
> 8
I see.
What happens is that cpu_cacheinfo hides information on offline CPUs,
so when CPU20 goes online, it doesn't see any other CPUs sharing the
L2 with it. Accordingly, a PD is created for that CPU alone.
When CPU21 goes online, it sees that CPU20 shares the L2 with it, so
the code attempts to create a PD covering them both, which fails.
This could be addressed, but the code would need to be a bit more
complex and the current hardware seems to do better with a PD per CPU,
so I'll drop the $subject patch for now.
* Re: [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity()
2025-04-30 19:23 ` Rafael J. Wysocki
@ 2025-05-01 12:30 ` Dietmar Eggemann
2025-05-06 19:46 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Dietmar Eggemann @ 2025-05-01 12:30 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
Ricardo Neri, Pierre Gondois, Christian Loehle
On 30/04/2025 21:23, Rafael J. Wysocki wrote:
> On Sun, Apr 27, 2025 at 4:07 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 16/04/2025 20:06, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[...]
>>> + if (!(pd->flags & EM_PERF_DOMAIN_ARTIFICIAL)) {
>>
>> This looks weird to me. How can an artificial EM ever have a non-ZERO
>> em_data_callback here?
>>
>> There is already EM_PERF_DOMAIN_ARTIFICIAL specific handling in
>> em_compute_costs(). Which probably works well for the
>> em_create_perf_table() call-site.
>
> Yes, but that one doesn't pass a NULL cb pointer to it.
>
>> Will there be cases for Hybrid CPU EM's in which 'em_max_perf !=
>> cpu_capacity':
>
> When the capacity is updated, the EM needs to be updated accordingly,
> which is why the new function is being added.
>
>> em_adjust_new_capacity()
>>
>> if (em_max_perf == cpu_capacity)
>> return
>>
>> em_recalc_and_update()
>> em_compute_costs()
>>
>> so that em_compute_costs() might be called?
>>
>> Maybe:
>>
>> @@ -233,11 +237,17 @@ static int em_compute_costs(struct device *dev,
>> struct em_perf_state *table,
>> unsigned long prev_cost = ULONG_MAX;
>> int i, ret;
>>
>> + if (!cb && (flags & EM_PERF_DOMAIN_ARTIFICIAL))
>> + return 0;
>>
>> is somehow clearer in this case?
>
> This would work, but I prefer my version because it does one check
> less and it does the check directly in em_recalc_and_update(), so it
> is clear that this doesn't call em_compute_costs() for artificial PDs
> at all.
OK, but checking it inside em_compute_costs() would also avoid this 'cb
= NULL' crash for an artificial EM in:
int em_dev_compute_costs(struct device *dev, struct em_perf_state
*table, int nr_states)
{
return em_compute_costs(dev, table, NULL, nr_states, 0);
}
BTW, there is this:
#define em_is_artificial(em) ((em)->flags & EM_PERF_DOMAIN_ARTIFICIAL)
(I guess s/em/pd ?) which lets you check this when you have the perf
domain. So far it's used in dtpm, cpu- and devfreq cooling.
Anyway, you can add my:
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
for the entire set.
* Re: [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache
2025-04-30 19:29 ` Rafael J. Wysocki
@ 2025-05-01 12:30 ` Dietmar Eggemann
0 siblings, 0 replies; 27+ messages in thread
From: Dietmar Eggemann @ 2025-05-01 12:30 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
Ricardo Neri, Pierre Gondois, Christian Loehle, Tim Chen
On 30/04/2025 21:29, Rafael J. Wysocki wrote:
> On Sun, Apr 27, 2025 at 6:23 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 16/04/2025 20:10, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[...]
>> There seems to be an issue with late CPU-hotplug-in and this alignment
>> on L2 cache boundaries.
[...]
> I see.
>
> What happens is that cpu_cacheinfo hides information on offline CPUs,
> so when CPU20 goes online, it doesn't see any other CPUs sharing the
> L2 with it. Accordingly, a PD is created just for itself.
>
> When CPU21 goes online, it sees that CPU20 shares the L2 with it, so
> the code attempts to create a PD for them both which fails.
>
> This could be addressed, but the code would need to be a bit more
> complex and the current hardware seems to do better with a PD per CPU,
> so I'll drop the $subject patch for now.
Ah OK, thanks!
* Re: [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity()
2025-05-01 12:30 ` Dietmar Eggemann
@ 2025-05-06 19:46 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 19:46 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba,
Peter Zijlstra, Srinivas Pandruvada, Morten Rasmussen,
Vincent Guittot, Ricardo Neri, Pierre Gondois, Christian Loehle
On Thu, May 1, 2025 at 2:30 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 30/04/2025 21:23, Rafael J. Wysocki wrote:
> > On Sun, Apr 27, 2025 at 4:07 PM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 16/04/2025 20:06, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> [...]
>
> >>> + if (!(pd->flags & EM_PERF_DOMAIN_ARTIFICIAL)) {
> >>
> >> This looks weird to me. How can an artificial EM ever have a non-ZERO
> >> em_data_callback here?
> >>
> >> There is already EM_PERF_DOMAIN_ARTIFICIAL specific handling in
> >> em_compute_costs(). Which probably works well for the
> >> em_create_perf_table() call-site.
> >
> > Yes, but that one doesn't pass a NULL cb pointer to it.
> >
> >> Will there be cases for Hybrid CPU EM's in which 'em_max_perf !=
> >> cpu_capacity':
> >
> > When the capacity is updated, the EM needs to be updated accordingly,
> > which is why the new function is being added.
> >
> >> em_adjust_new_capacity()
> >>
> >> if (em_max_perf == cpu_capacity)
> >> return
> >>
> >> em_recalc_and_update()
> >> em_compute_costs()
> >>
> >> so that em_compute_costs() might be called?
> >>
> >> Maybe:
> >>
> >> @@ -233,11 +237,17 @@ static int em_compute_costs(struct device *dev,
> >> struct em_perf_state *table,
> >> unsigned long prev_cost = ULONG_MAX;
> >> int i, ret;
> >>
> >> + if (!cb && (flags & EM_PERF_DOMAIN_ARTIFICIAL))
> >> + return 0;
> >>
> >> is somehow clearer in this case?
> >
> > This would work, but I prefer my version because it does one check
> > less and it does the check directly in em_recalc_and_update(), so it
> > is clear that this doesn't call em_compute_costs() for artificial PDs
> > at all.
>
> OK, but checking it inside em_compute_costs() would also avoid this 'cb
> = NULL' crash for an artificial EM in:
>
> int em_dev_compute_costs(struct device *dev, struct em_perf_state
> *table, int nr_states)
> {
> return em_compute_costs(dev, table, NULL, nr_states, 0);
> }
This is unused currently, so no worries, but you have a point. It
should return -EINVAL for artificial perf domains.
>
> BTW, there is this:
>
> #define em_is_artificial(em) ((em)->flags & EM_PERF_DOMAIN_ARTIFICIAL)
>
> (I guess s/em/pd ?) which lets you check this when you have the perf
> domain. So far it's used in dtpm, cpu- and devfreq cooling.
Thanks for letting me know about it, I'll use it in that check.
> Anyway, you can add my:
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> for the entire set.
Thank you!
* [PATCH v1] cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
` (8 preceding siblings ...)
2025-04-18 9:58 ` [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Christian Loehle
@ 2025-05-12 13:23 ` Rafael J. Wysocki
9 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-05-12 13:23 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Lukasz Luba, Dietmar Eggemann, Ricardo Neri,
Christian Loehle, Marek Szyprowski
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Policy locking was added to cpufreq_policy_is_good_for_eas() by commit
4854649b1fb4 ("cpufreq/sched: Move cpufreq-specific EAS checks to
cpufreq") to address a theoretical race condition, but it turned out to
introduce a circular locking dependency between the policy rwsem and
sched_domains_mutex via cpuset_mutex. This leads to a board lockup on
OdroidN2 that is based on the ARM64 Amlogic Meson SoC.
Drop the policy locking from cpufreq_policy_is_good_for_eas() to address
this issue.
Fixes: 4854649b1fb4 ("cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq")
Closes: https://lore.kernel.org/linux-pm/1bf3df62-0641-459f-99fc-fd511e564b84@samsung.com/
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpufreq/cpufreq.c | 2 --
1 file changed, 2 deletions(-)
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -3067,8 +3067,6 @@
return false;
}
- guard(cpufreq_policy_read)(policy);
-
return sugov_is_governor(policy);
}
Thread overview: 27+ messages
2025-04-16 17:44 [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
2025-04-16 17:48 ` [RFT][PATCH v1 1/8] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
2025-04-17 12:23 ` Christian Loehle
2025-04-16 17:59 ` [RFT][PATCH v1 2/8] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq Rafael J. Wysocki
2025-04-17 12:28 ` Christian Loehle
2025-04-16 18:01 ` [RFT][PATCH v1 3/8] cpufreq/sched: Allow .setpolicy() cpufreq drivers to enable EAS Rafael J. Wysocki
2025-04-17 12:19 ` Christian Loehle
2025-04-17 13:01 ` Rafael J. Wysocki
2025-04-17 13:03 ` Christian Loehle
2025-04-16 18:04 ` [RFT][PATCH v1 4/8] PM: EM: Move CPU capacity check to em_adjust_new_capacity() Rafael J. Wysocki
2025-04-16 18:06 ` [RFT][PATCH v1 5/8] PM: EM: Introduce em_adjust_cpu_capacity() Rafael J. Wysocki
2025-04-27 14:01 ` Dietmar Eggemann
2025-04-30 19:23 ` Rafael J. Wysocki
2025-05-01 12:30 ` Dietmar Eggemann
2025-05-06 19:46 ` Rafael J. Wysocki
2025-04-16 18:09 ` [RFT][PATCH v1 6/8] cpufreq: intel_pstate: EAS support for hybrid platforms Rafael J. Wysocki
2025-04-16 18:10 ` [RFT][PATCH v1 7/8] cpufreq: intel_pstate: Align perf domains with L2 cache Rafael J. Wysocki
2025-04-17 12:42 ` Christian Loehle
2025-04-27 16:23 ` Dietmar Eggemann
2025-04-30 19:29 ` Rafael J. Wysocki
2025-05-01 12:30 ` Dietmar Eggemann
2025-04-16 18:12 ` [RFT][PATCH v1 8/8] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache Rafael J. Wysocki
2025-04-25 21:32 ` Christian Loehle
2025-04-25 21:39 ` Rafael J. Wysocki
2025-04-18 9:58 ` [RFT][PATCH v1 0/8] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Christian Loehle
2025-04-18 19:52 ` Rafael J. Wysocki
2025-05-12 13:23 ` [PATCH v1] cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas() Rafael J. Wysocki