[PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT

linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
@ 2025-05-06 20:32 Rafael J. Wysocki
  2025-05-06 20:34 ` [PATCH v2 1/7] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:32 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

Hi Everyone,

This is a new (and most likely final) version of

https://lore.kernel.org/linux-pm/3344336.aeNJFYEL58@rjwysocki.net/

The most significant difference between it and the above is that schedutil is
now required for EAS to be enabled, like on the other platforms using EAS,
which means that intel_pstate needs to operate in the passive mode in order
to use it (the most straightforward way to switch it over to the passive mode
is to write "passive" to /sys/devices/system/cpu/intel_pstate/status).

Accordingly, the changes that were needed for EAS to work without schedutil are
not needed any more, so there is one patch less in the series because of that
and patch [5/7] is simpler than its previous version because some changes made
by it are not necessary any more.

Another patch that has been dropped is

https://lore.kernel.org/linux-pm/1964444.taCxCBeP46@rjwysocki.net/

because it didn't take CPU online into account properly and it is not essential
for the current hardware anyway.

There is a new patch, [7/7], which adds CAS/EAS/hybrid support description to
the intel_pstate admin-guide documentation.

The following paragraph from the original cover letter still applies:

"The underlying observation is that on the platforms targeted by these changes,
Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
the same performance level, are always more energy-efficient than the "big" or
"performance" CPUs (P-cores).  This means that, regardless of the scale-
invariant utilization of a task, as long as there is enough spare capacity on
E-cores, the relative cost of running it there is always lower."

The first 2 patches depend on the current cpufreq material queued up for 6.16
in linux-pm.git/linux-next (and in linux-next proper) and are not really
depended on by the rest of the series, but I've decided to include them into
this series because they have been slightly updated since the previous version,
mostly to take review feedback into account (I'm going to queue them up for
6.16 shortly because they don't appear to be objectionable).

The next 2 patches (Energy Model code changes) were reviewed previously, but
they are only needed because of patch [5/7].

Patch [5/7] has not changed much except that some changes made by the previous
version have been dropped from it.  Also its changelog has been updated.  It
causes perf domains to be registered per CPU and in addition to the primary cost
component, which is related to the CPU type, there is a small component
proportional to performance whose role is to help balance the load between CPUs
of the same type.

The expected effect is still that the CPUs of the "low-cost" type will be
preferred so long as there is enough spare capacity on any of them.

Patch [6/7] has been updated to walk all of the cache leaves and look for
the ones with level equal to 3 because the check used in the previous version
does not always work.

The documentation patch, [7/7], is new.

Please refer to the individual patch changelogs for details.

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/7] cpufreq/sched: schedutil: Add helper for governor checks
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
@ 2025-05-06 20:34 ` Rafael J. Wysocki
  2025-05-06 20:37 ` [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq Rafael J. Wysocki
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:34 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a helper for checking if schedutil is the current governor for
a given cpufreq policy and use it in sched_is_eas_possible() to avoid
accessing cpufreq policy internals directly from there.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

v1 -> v2: Pick up tags

---
 include/linux/cpufreq.h          |    9 +++++++++
 kernel/sched/cpufreq_schedutil.c |    9 +++++++--
 kernel/sched/sched.h             |    2 --
 kernel/sched/topology.c          |    6 +++---
 4 files changed, 19 insertions(+), 7 deletions(-)

--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -650,6 +650,15 @@
 struct cpufreq_governor *cpufreq_default_governor(void);
 struct cpufreq_governor *cpufreq_fallback_governor(void);
 
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+bool sugov_is_governor(struct cpufreq_policy *policy);
+#else
+static inline bool sugov_is_governor(struct cpufreq_policy *policy)
+{
+	return false;
+}
+#endif
+
 static inline void cpufreq_policy_apply_limits(struct cpufreq_policy *policy)
 {
 	if (policy->max < policy->cur)
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -630,7 +630,7 @@
 
 /********************** cpufreq governor interface *********************/
 
-struct cpufreq_governor schedutil_gov;
+static struct cpufreq_governor schedutil_gov;
 
 static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
 {
@@ -909,7 +909,7 @@
 	WRITE_ONCE(sg_policy->limits_changed, true);
 }
 
-struct cpufreq_governor schedutil_gov = {
+static struct cpufreq_governor schedutil_gov = {
 	.name			= "schedutil",
 	.owner			= THIS_MODULE,
 	.flags			= CPUFREQ_GOV_DYNAMIC_SWITCHING,
@@ -927,4 +927,9 @@
 }
 #endif
 
+bool sugov_is_governor(struct cpufreq_policy *policy)
+{
+	return policy->governor == &schedutil_gov;
+}
+
 cpufreq_governor_init(schedutil_gov);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3509,8 +3509,6 @@
 	return static_branch_unlikely(&sched_energy_present);
 }
 
-extern struct cpufreq_governor schedutil_gov;
-
 #else /* ! (CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
 
 #define perf_domain_span(pd) NULL
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -213,7 +213,7 @@
 {
 	bool any_asym_capacity = false;
 	struct cpufreq_policy *policy;
-	struct cpufreq_governor *gov;
+	bool policy_is_ready;
 	int i;
 
 	/* EAS is enabled for asymmetric CPU capacity topologies. */
@@ -258,9 +258,9 @@
 			}
 			return false;
 		}
-		gov = policy->governor;
+		policy_is_ready = sugov_is_governor(policy);
 		cpufreq_cpu_put(policy);
-		if (gov != &schedutil_gov) {
+		if (!policy_is_ready) {
 			if (sched_debug()) {
 				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
 					cpumask_pr_args(cpu_mask));




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
  2025-05-06 20:34 ` [PATCH v2 1/7] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
@ 2025-05-06 20:37 ` Rafael J. Wysocki
       [not found]   ` <CGME20250509234915eucas1p2846ecef88f268d78ab2e96d4a67002b0@eucas1p2.samsung.com>
  2025-05-06 20:39 ` [PATCH v2 3/7] PM: EM: Move CPU capacity check to em_adjust_new_capacity() Rafael J. Wysocki
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:37 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Doing cpufreq-specific EAS checks that require accessing policy
internals directly from sched_is_eas_possible() is a bit unfortunate,
so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
into that new function and make sched_is_eas_possible() call it.

While at it, address a possible race between the EAS governor check
and governor change by doing the former under the policy rwsem.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

v1 -> v2:
   * Add missing newline characters in two places (Christian).
   * Pick up tags.

---
 drivers/cpufreq/cpufreq.c |   32 ++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |    2 ++
 kernel/sched/topology.c   |   25 +++++--------------------
 3 files changed, 39 insertions(+), 20 deletions(-)

--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -3056,6 +3056,38 @@
 
 	return 0;
 }
+
+static bool cpufreq_policy_is_good_for_eas(unsigned int cpu)
+{
+	struct cpufreq_policy *policy __free(put_cpufreq_policy);
+
+	policy = cpufreq_cpu_get(cpu);
+	if (!policy) {
+		pr_debug("cpufreq policy not set for CPU: %d\n", cpu);
+		return false;
+	}
+
+	guard(cpufreq_policy_read)(policy);
+
+	return sugov_is_governor(policy);
+}
+
+bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
+{
+	unsigned int cpu;
+
+	/* Do not attempt EAS if schedutil is not being used. */
+	for_each_cpu(cpu, cpu_mask) {
+		if (!cpufreq_policy_is_good_for_eas(cpu)) {
+			pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
+				 cpumask_pr_args(cpu_mask));
+			return false;
+		}
+	}
+
+	return true;
+}
+
 module_param(off, int, 0444);
 module_param_string(default_governor, default_governor, CPUFREQ_NAME_LEN, 0444);
 core_initcall(cpufreq_core_init);
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1237,6 +1237,8 @@
 		struct cpufreq_frequency_table *table,
 		unsigned int transition_latency);
 
+bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask);
+
 static inline void cpufreq_register_em_with_opp(struct cpufreq_policy *policy)
 {
 	dev_pm_opp_of_register_em(get_cpu_device(policy->cpu),
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -212,8 +212,6 @@
 static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
 {
 	bool any_asym_capacity = false;
-	struct cpufreq_policy *policy;
-	bool policy_is_ready;
 	int i;
 
 	/* EAS is enabled for asymmetric CPU capacity topologies. */
@@ -248,25 +246,12 @@
 		return false;
 	}
 
-	/* Do not attempt EAS if schedutil is not being used. */
-	for_each_cpu(i, cpu_mask) {
-		policy = cpufreq_cpu_get(i);
-		if (!policy) {
-			if (sched_debug()) {
-				pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
-					cpumask_pr_args(cpu_mask), i);
-			}
-			return false;
-		}
-		policy_is_ready = sugov_is_governor(policy);
-		cpufreq_cpu_put(policy);
-		if (!policy_is_ready) {
-			if (sched_debug()) {
-				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
-					cpumask_pr_args(cpu_mask));
-			}
-			return false;
+	if (!cpufreq_ready_for_eas(cpu_mask)) {
+		if (sched_debug()) {
+			pr_info("rd %*pbl: Checking EAS: cpufreq is not ready\n",
+				cpumask_pr_args(cpu_mask));
 		}
+		return false;
 	}
 
 	return true;




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 3/7] PM: EM: Move CPU capacity check to em_adjust_new_capacity()
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
  2025-05-06 20:34 ` [PATCH v2 1/7] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
  2025-05-06 20:37 ` [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq Rafael J. Wysocki
@ 2025-05-06 20:39 ` Rafael J. Wysocki
  2025-05-06 20:41 ` [PATCH v2 4/7] PM: EM: Introduce em_adjust_cpu_capacity() Rafael J. Wysocki
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:39 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Move the check of the CPU capacity currently stored in the energy model
against the arch_scale_cpu_capacity() value to em_adjust_new_capacity()
so it will be done regardless of where the latter is called from.

This will be useful when a new em_adjust_new_capacity() caller is added
subsequently.

While at it, move the pd local variable declaration in
em_check_capacity_update() into the loop in which it is used.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

v1 -> v2: Pick up tags

---
 kernel/power/energy_model.c |   40 +++++++++++++++++-----------------------
 1 file changed, 17 insertions(+), 23 deletions(-)

--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -725,10 +725,24 @@
  * Adjustment of CPU performance values after boot, when all CPUs capacites
  * are correctly calculated.
  */
-static void em_adjust_new_capacity(struct device *dev,
+static void em_adjust_new_capacity(unsigned int cpu, struct device *dev,
 				   struct em_perf_domain *pd)
 {
+	unsigned long cpu_capacity = arch_scale_cpu_capacity(cpu);
 	struct em_perf_table *em_table;
+	struct em_perf_state *table;
+	unsigned long em_max_perf;
+
+	rcu_read_lock();
+	table = em_perf_state_from_pd(pd);
+	em_max_perf = table[pd->nr_perf_states - 1].performance;
+	rcu_read_unlock();
+
+	if (em_max_perf == cpu_capacity)
+		return;
+
+	pr_debug("updating cpu%d cpu_cap=%lu old capacity=%lu\n", cpu,
+		 cpu_capacity, em_max_perf);
 
 	em_table = em_table_dup(pd);
 	if (!em_table) {
@@ -744,9 +758,6 @@
 static void em_check_capacity_update(void)
 {
 	cpumask_var_t cpu_done_mask;
-	struct em_perf_state *table;
-	struct em_perf_domain *pd;
-	unsigned long cpu_capacity;
 	int cpu;
 
 	if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
@@ -757,7 +768,7 @@
 	/* Check if CPUs capacity has changed than update EM */
 	for_each_possible_cpu(cpu) {
 		struct cpufreq_policy *policy;
-		unsigned long em_max_perf;
+		struct em_perf_domain *pd;
 		struct device *dev;
 
 		if (cpumask_test_cpu(cpu, cpu_done_mask))
@@ -780,24 +791,7 @@
 		cpumask_or(cpu_done_mask, cpu_done_mask,
 			   em_span_cpus(pd));
 
-		cpu_capacity = arch_scale_cpu_capacity(cpu);
-
-		rcu_read_lock();
-		table = em_perf_state_from_pd(pd);
-		em_max_perf = table[pd->nr_perf_states - 1].performance;
-		rcu_read_unlock();
-
-		/*
-		 * Check if the CPU capacity has been adjusted during boot
-		 * and trigger the update for new performance values.
-		 */
-		if (em_max_perf == cpu_capacity)
-			continue;
-
-		pr_debug("updating cpu%d cpu_cap=%lu old capacity=%lu\n",
-			 cpu, cpu_capacity, em_max_perf);
-
-		em_adjust_new_capacity(dev, pd);
+		em_adjust_new_capacity(cpu, dev, pd);
 	}
 
 	free_cpumask_var(cpu_done_mask);




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 4/7] PM: EM: Introduce em_adjust_cpu_capacity()
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2025-05-06 20:39 ` [PATCH v2 3/7] PM: EM: Move CPU capacity check to em_adjust_new_capacity() Rafael J. Wysocki
@ 2025-05-06 20:41 ` Rafael J. Wysocki
  2025-05-06 20:44 ` [PATCH v2 5/7] cpufreq: intel_pstate: EAS support for hybrid platforms Rafael J. Wysocki
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:41 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a function for updating the Energy Model for a CPU after its
capacity has changed, which subsequently will be used by the
intel_pstate driver.

An EM_PERF_DOMAIN_ARTIFICIAL check is added to em_recalc_and_update()
to prevent it from calling em_compute_costs() for an "artificial" perf
domain with a NULL cb parameter which would cause it to crash.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

v1 -> v2:
   * Use em_is_artificial() in the new check (Dietmar).
   * Pick up tags.

---
 include/linux/energy_model.h |    2 ++
 kernel/power/energy_model.c  |   28 ++++++++++++++++++++++++----
 2 files changed, 26 insertions(+), 4 deletions(-)

--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -179,6 +179,7 @@
 int em_dev_update_chip_binning(struct device *dev);
 int em_update_performance_limits(struct em_perf_domain *pd,
 		unsigned long freq_min_khz, unsigned long freq_max_khz);
+void em_adjust_cpu_capacity(unsigned int cpu);
 void em_rebuild_sched_domains(void);
 
 /**
@@ -403,6 +404,7 @@
 {
 	return -EINVAL;
 }
+static inline void em_adjust_cpu_capacity(unsigned int cpu) {}
 static inline void em_rebuild_sched_domains(void) {}
 #endif
 
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -702,10 +702,12 @@
 {
 	int ret;
 
-	ret = em_compute_costs(dev, em_table->state, NULL, pd->nr_perf_states,
-			       pd->flags);
-	if (ret)
-		goto free_em_table;
+	if (!em_is_artificial(pd)) {
+		ret = em_compute_costs(dev, em_table->state, NULL,
+				       pd->nr_perf_states, pd->flags);
+		if (ret)
+			goto free_em_table;
+	}
 
 	ret = em_dev_update_perf_domain(dev, em_table);
 	if (ret)
@@ -755,6 +757,24 @@
 	em_recalc_and_update(dev, pd, em_table);
 }
 
+/**
+ * em_adjust_cpu_capacity() - Adjust the EM for a CPU after a capacity update.
+ * @cpu: Target CPU.
+ *
+ * Adjust the existing EM for @cpu after a capacity update under the assumption
+ * that the capacity has been updated in the same way for all of the CPUs in
+ * the same perf domain.
+ */
+void em_adjust_cpu_capacity(unsigned int cpu)
+{
+	struct device *dev = get_cpu_device(cpu);
+	struct em_perf_domain *pd;
+
+	pd = em_pd_get(dev);
+	if (pd)
+		em_adjust_new_capacity(cpu, dev, pd);
+}
+
 static void em_check_capacity_update(void)
 {
 	cpumask_var_t cpu_done_mask;




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 5/7] cpufreq: intel_pstate: EAS support for hybrid platforms
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2025-05-06 20:41 ` [PATCH v2 4/7] PM: EM: Introduce em_adjust_cpu_capacity() Rafael J. Wysocki
@ 2025-05-06 20:44 ` Rafael J. Wysocki
  2025-05-06 20:47 ` [PATCH v2 6/7] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache Rafael J. Wysocki
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:44 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Modify intel_pstate to register EM perf domains for CPUs on hybrid
platforms without SMT which causes EAS to be enabled on them when
schedutil is used as the cpufreq governor (which requires intel_pstate
to operate in the passive mode).

This change is targeting platforms (for example, Lunar Lake) where the
"little" CPUs (E-cores) are always more energy-efficient than the "big"
or "performance" CPUs (P-cores) when run at the same HWP performance
level, so it is sufficient to tell EAS that E-cores are always preferred
(so long as there is enough spare capacity on one of them to run the
given task).  However, migrating tasks between CPUs of the same type
too often is not desirable because it may hurt both performance and
energy efficiency due to leaving warm caches behind.

For this reason, register a separate perf domain for each CPU and choose
the cost values for them so that the cost mostly depends on the CPU type,
but there is also a small component of it depending on the performance
level (utilization) which helps to balance the load between CPUs of the
same type.

The cost component related to the CPU type is computed with the help of
the observation that the IPC metric value for a given CPU is inversely
proportional to its performance-to-frequency scaling factor and the cost
of running code on it can be assumed to be roughly proportional to that
IPC ratio (in principle, the higher the IPC ratio, the more resources
are utilized when running at a given frequency, so the cost should be
higher).

For all CPUs that are online at the system initialization time, EM perf
domains are registered when the driver starts up, after asymmetric
capacity support has been enabled.  For the CPUs that become online
later, EM perf domains are registered after setting the asymmetric
capacity for them.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

v1 -> v2:
   * Drop changes that were needed for EAS to work without schedutil.
   * Rename em_registered to pd_registered.
   * Adjust the changelog.
   * Pick up tags.

---
 drivers/cpufreq/intel_pstate.c |  115 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 113 insertions(+), 2 deletions(-)

--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -221,6 +221,7 @@
  * @sched_flags:	Store scheduler flags for possible cross CPU update
  * @hwp_boost_min:	Last HWP boosted min performance
  * @suspended:		Whether or not the driver has been suspended.
+ * @pd_registered:	Set when a perf domain is registered for this CPU.
  * @hwp_notify_work:	workqueue for HWP notifications.
  *
  * This structure stores per CPU instance data for all CPUs.
@@ -260,6 +261,9 @@
 	unsigned int sched_flags;
 	u32 hwp_boost_min;
 	bool suspended;
+#ifdef CONFIG_ENERGY_MODEL
+	bool pd_registered;
+#endif
 	struct delayed_work hwp_notify_work;
 };
 
@@ -303,6 +307,7 @@
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;
 
+#define INTEL_PSTATE_CORE_SCALING	100000
 #define HYBRID_SCALING_FACTOR_ADL	78741
 #define HYBRID_SCALING_FACTOR_MTL	80000
 #define HYBRID_SCALING_FACTOR_LNL	86957
@@ -311,7 +316,7 @@
 
 static inline int core_get_scaling(void)
 {
-	return 100000;
+	return INTEL_PSTATE_CORE_SCALING;
 }
 
 #ifdef CONFIG_ACPI
@@ -948,12 +953,105 @@
  */
 static DEFINE_MUTEX(hybrid_capacity_lock);
 
+#ifdef CONFIG_ENERGY_MODEL
+#define HYBRID_EM_STATE_COUNT	4
+
+static int hybrid_active_power(struct device *dev, unsigned long *power,
+			       unsigned long *freq)
+{
+	/*
+	 * Create "utilization bins" of 0-40%, 40%-60%, 60%-80%, and 80%-100%
+	 * of the maximum capacity such that two CPUs of the same type will be
+	 * regarded as equally attractive if the utilization of each of them
+	 * falls into the same bin, which should prevent tasks from being
+	 * migrated between them too often.
+	 *
+	 * For this purpose, return the "frequency" of 2 for the first
+	 * performance level and otherwise leave the value set by the caller.
+	 */
+	if (!*freq)
+		*freq = 2;
+
+	/* No power information. */
+	*power = EM_MAX_POWER;
+
+	return 0;
+}
+
+static int hybrid_get_cost(struct device *dev, unsigned long freq,
+			   unsigned long *cost)
+{
+	struct pstate_data *pstate = &all_cpu_data[dev->id]->pstate;
+
+	/*
+	 * The smaller the perf-to-frequency scaling factor, the larger the IPC
+	 * ratio between the given CPU and the least capable CPU in the system.
+	 * Regard that IPC ratio as the primary cost component and assume that
+	 * the scaling factors for different CPU types will differ by at least
+	 * 5% and they will not be above INTEL_PSTATE_CORE_SCALING.
+	 *
+	 * Add the freq value to the cost, so that the cost of running on CPUs
+	 * of the same type in different "utilization bins" is different.
+	 */
+	*cost = div_u64(100ULL * INTEL_PSTATE_CORE_SCALING, pstate->scaling) + freq;
+
+	return 0;
+}
+
+static bool hybrid_register_perf_domain(unsigned int cpu)
+{
+	static const struct em_data_callback cb
+			= EM_ADV_DATA_CB(hybrid_active_power, hybrid_get_cost);
+	struct cpudata *cpudata = all_cpu_data[cpu];
+	struct device *cpu_dev;
+
+	/*
+	 * Registering EM perf domains without enabling asymmetric CPU capacity
+	 * support is not really useful and one domain should not be registered
+	 * more than once.
+	 */
+	if (!hybrid_max_perf_cpu || cpudata->pd_registered)
+		return false;
+
+	cpu_dev = get_cpu_device(cpu);
+	if (!cpu_dev)
+		return false;
+
+	if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
+					cpumask_of(cpu), false))
+		return false;
+
+	cpudata->pd_registered = true;
+
+	return true;
+}
+
+static void hybrid_register_all_perf_domains(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		hybrid_register_perf_domain(cpu);
+}
+
+static void hybrid_update_perf_domain(struct cpudata *cpu)
+{
+	if (cpu->pd_registered)
+		em_adjust_cpu_capacity(cpu->cpu);
+}
+#else /* !CONFIG_ENERGY_MODEL */
+static inline bool hybrid_register_perf_domain(unsigned int cpu) { return false; }
+static inline void hybrid_register_all_perf_domains(void) {}
+static inline void hybrid_update_perf_domain(struct cpudata *cpu) {}
+#endif /* CONFIG_ENERGY_MODEL */
+
 static void hybrid_set_cpu_capacity(struct cpudata *cpu)
 {
 	arch_set_cpu_capacity(cpu->cpu, cpu->capacity_perf,
 			      hybrid_max_perf_cpu->capacity_perf,
 			      cpu->capacity_perf,
 			      cpu->pstate.max_pstate_physical);
+	hybrid_update_perf_domain(cpu);
 
 	pr_debug("CPU%d: perf = %u, max. perf = %u, base perf = %d\n", cpu->cpu,
 		 cpu->capacity_perf, hybrid_max_perf_cpu->capacity_perf,
@@ -1042,6 +1140,11 @@
 	guard(mutex)(&hybrid_capacity_lock);
 
 	__hybrid_refresh_cpu_capacity_scaling();
+	/*
+	 * Perf domains are not registered before setting hybrid_max_perf_cpu,
+	 * so register them all after setting up CPU capacity scaling.
+	 */
+	hybrid_register_all_perf_domains();
 }
 
 static void hybrid_init_cpu_capacity_scaling(bool refresh)
@@ -1069,7 +1172,7 @@
 		hybrid_refresh_cpu_capacity_scaling();
 		/*
 		 * Disabling ITMT causes sched domains to be rebuilt to disable asym
-		 * packing and enable asym capacity.
+		 * packing and enable asym capacity and EAS.
 		 */
 		sched_clear_itmt_support();
 	}
@@ -1147,6 +1250,14 @@
 	}
 
 	hybrid_set_cpu_capacity(cpu);
+	/*
+	 * If the CPU was offline to start with and it is going online for the
+	 * first time, a perf domain needs to be registered for it if hybrid
+	 * capacity scaling has been enabled already.  In that case, sched
+	 * domains need to be rebuilt to take the new perf domain into account.
+	 */
+	if (hybrid_register_perf_domain(cpu->cpu))
+		em_rebuild_sched_domains();
 
 unlock:
 	mutex_unlock(&hybrid_capacity_lock);




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 6/7] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2025-05-06 20:44 ` [PATCH v2 5/7] cpufreq: intel_pstate: EAS support for hybrid platforms Rafael J. Wysocki
@ 2025-05-06 20:47 ` Rafael J. Wysocki
  2025-05-06 20:48 ` [PATCH v2 7/7] cpufreq: intel_pstate: Document hybrid processor support Rafael J. Wysocki
  2025-05-13 12:51 ` [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
  7 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:47 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

On some hybrid platforms some efficient CPUs (E-cores) are not connected
to the L3 cache, but there are no other differences between them and the
other E-cores that use L3.  In that case, it is generally more efficient
to run "light" workloads on the E-cores that do not use L3 and allow all
of the cores using L3, including P-cores, to go into idle states.

For this reason, slightly increase the cost for all CPUs sharing the L3
cache to make EAS prefer CPUs that do not use it to the other CPUs of
the same type (if any).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2:
   * Change L3 cache lookup method (the previous one doesn't always work).
   * Adjust the new comment.
   * Increase the cost for CPUs using L3 by 2 (instead of increasing it by 1)
     to make the scheduler select the CPUs without L3 more often.
   * Adjust the changelog.

---
 drivers/cpufreq/intel_pstate.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -982,6 +982,7 @@
 			   unsigned long *cost)
 {
 	struct pstate_data *pstate = &all_cpu_data[dev->id]->pstate;
+	struct cpu_cacheinfo *cacheinfo = get_cpu_cacheinfo(dev->id);
 
 	/*
 	 * The smaller the perf-to-frequency scaling factor, the larger the IPC
@@ -994,6 +995,22 @@
 	 * of the same type in different "utilization bins" is different.
 	 */
 	*cost = div_u64(100ULL * INTEL_PSTATE_CORE_SCALING, pstate->scaling) + freq;
+	/*
+	 * Increase the cost slightly for CPUs able to access L3 to avoid
+	 * touching it in case some other CPUs of the same type can do the work
+	 * without it.
+	 */
+	if (cacheinfo) {
+		unsigned int i;
+
+		/* Check if L3 cache is there. */
+		for (i = 0; i < cacheinfo->num_leaves; i++) {
+			if (cacheinfo->info_list[i].level == 3) {
+				*cost += 2;
+				break;
+			}
+		}
+	}
 
 	return 0;
 }




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 7/7] cpufreq: intel_pstate: Document hybrid processor support
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (5 preceding siblings ...)
  2025-05-06 20:47 ` [PATCH v2 6/7] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache Rafael J. Wysocki
@ 2025-05-06 20:48 ` Rafael J. Wysocki
  2025-05-13 12:51 ` [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
  7 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-06 20:48 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Describe the support for hybrid processors in intel_pstate, including
the CAS and EAS support, in the admin-guide documentation.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2: New patch

---
 Documentation/admin-guide/pm/intel_pstate.rst |  104 +++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 2 deletions(-)

--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -329,6 +329,106 @@
 HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
+Support for Hybrid Processors
+=============================
+
+Some processors supported by ``intel_pstate`` contain two or more types of CPU
+cores differing by the maximum turbo P-state, performance vs power characteristics,
+cache sizes, and possibly other properties.  They are commonly referred to as
+hybrid processors.  To support them, ``intel_pstate`` requires HWP to be enabled
+and it assumes the HWP performance units to be the same for all CPUs in the
+system, so a given HWP performance level always represents approximately the
+same physical performance regardless of the core (CPU) type.
+
+Hybrid Processors with SMT
+--------------------------
+
+On systems where SMT (Simultaneous Multithreading), also referred to as
+HyperThreading (HT) in the context of Intel processors, is enabled on at least
+one core, ``intel_pstate`` assigns performance-based priorities to CPUs.  Namely,
+the priority of a given CPU reflects its highest HWP performance level which
+causes the CPU scheduler to generally prefer more performant CPUs, so the less
+performant CPUs are used when the other ones are fully loaded.  However, SMT
+siblings (that is, logical CPUs sharing one physical core) are treated in a
+special way such that if one of them is in use, the effective priority of the
+other ones is lowered below the priorities of the CPUs located in the other
+physical cores.
+
+This approach maximizes performance in the majority of cases, but unfortunately
+it also leads to excessive energy usage in some important scenarios, like video
+playback, which is not generally desirable.  While there is no other viable
+choice with SMT enabled because the effective capacity and utilization of SMT
+siblings are hard to determine, hybrid processors without SMT can be handled in
+more energy-efficient ways.
+
+.. _CAS:
+
+Capacity-Aware Scheduling Support
+---------------------------------
+
+The capacity-aware scheduling (CAS) support in the CPU scheduler is enabled by
+``intel_pstate`` by default on hybrid processors without SMT.  CAS generally
+causes the scheduler to put tasks on a CPU so long as there is a sufficient
+amount of spare capacity on it, and if the utilization of a given task is too
+high for it, the task will need to go somewhere else.
+
+Since CAS takes CPU capacities into account, it does not require CPU
+prioritization and it allows tasks to be distributed more symmetrically among
+the more performant and less performant CPUs.  Once placed on a CPU with enough
+capacity to accommodate it, a task may just continue to run there regardless of
+whether or not the other CPUs are fully loaded, so on average CAS reduces the
+utilization of the more performant CPUs which causes the energy usage to be more
+balanced because the more performant CPUs are generally less energy-efficient
+than the less performant ones.
+
+In order to use CAS, the scheduler needs to know the capacity of each CPU in
+the system and it needs to be able to compute scale-invariant utilization of
+CPUs, so ``intel_pstate`` provides it with the requisite information.
+
+First of all, the capacity of each CPU is represented by the ratio of its highest
+HWP performance level, multiplied by 1024, to the highest HWP performance level
+of the most performant CPU in the system, which works because the HWP performance
+units are the same for all CPUs.  Second, the frequency-invariance computations,
+carried out by the scheduler to always express CPU utilization in the same units
+regardless of the frequency it is currently running at, are adjusted to take the
+CPU capacity into account.  All of this happens when ``intel_pstate`` has
+registered itself with the ``CPUFreq`` core and it has figured out that it is
+running on a hybrid processor without SMT.
+
+Energy-Aware Scheduling Support
+-------------------------------
+
+If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and
+``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling
+`CAS <CAS_>`_ it registers an Energy Model for the processor.  This allows the
+Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if
+``schedutil`` is used as the  ``CPUFreq`` governor which requires ``intel_pstate``
+to operate in the `passive mode <Passive Mode_>`_.
+
+The Energy Model registered by ``intel_pstate`` is artificial (that is, it is
+based on abstract cost values and it does not include any real power numbers)
+and it is relatively simple to avoid unnecessary computations in the scheduler.
+There is a performance domain in it for every CPU in the system and the cost
+values for these performance domains have been chosen so that running a task on
+a less performant (small) CPU appears to be always cheaper than running that
+task on a more performant (big) CPU.  However, for two CPUs of the same type,
+the cost difference depends on their current utilization, and the CPU whose
+current utilization is higher generally appears to be a more expensive
+destination for a given task.  This helps to balance the load among CPUs of the
+same type.
+
+Since EAS works on top of CAS, high-utilization tasks are always migrated to
+CPUs with enough capacity to accommodate them, but thanks to EAS, low-utilization
+tasks tend to be placed on the CPUs that look less expensive to the scheduler.
+Effectively, this causes the less performant and less loaded CPUs to be
+preferred as long as they have enough spare capacity to run the given task
+which generally leads to reduced energy usage.
+
+The Energy Model created by ``intel_pstate`` can be inspected by looking at
+the ``energy_model`` directory in ``debugfs`` (typlically mounted on
+``/sys/kernel/debug/``).
+
+
 User Space Interface in ``sysfs``
 =================================
 
@@ -697,8 +797,8 @@
 	Limits`_ for details).
 
 ``no_cas``
-	Do not enable capacity-aware scheduling (CAS) which is enabled by
-	default on hybrid systems.
+	Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by
+	default on hybrid systems without SMT.
 
 Diagnostics and Tuning
 ======================




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
       [not found]   ` <CGME20250509234915eucas1p2846ecef88f268d78ab2e96d4a67002b0@eucas1p2.samsung.com>
@ 2025-05-09 23:49     ` Marek Szyprowski
  2025-05-10 11:31       ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Marek Szyprowski @ 2025-05-09 23:49 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle

On 06.05.2025 22:37, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Doing cpufreq-specific EAS checks that require accessing policy
> internals directly from sched_is_eas_possible() is a bit unfortunate,
> so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
> into that new function and make sched_is_eas_possible() call it.
>
> While at it, address a possible race between the EAS governor check
> and governor change by doing the former under the policy rwsem.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

In my tests I've noticed that this patch, merged as commit 4854649b1fb4 
("cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq"), causes a 
regression on ARM64 Amlogic Meson SoC based OdroidN2 board. The board 
finally lockups. Reverting $subject on top of next-20250509 fixes this 
issue. Here is the lockdep warning observed before the lockup:

======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc5-next-20250509-dirty #10335 Tainted: G         C
cpufreq: cpufreq_policy_online: CPU2: Running at unlisted initial 
frequency: 999999 kHz, changing to: 1000000 kHz
------------------------------------------------------
kworker/3:1/79 is trying to acquire lock:
ffff00000494b380 (&policy->rwsem){++++}-{4:4}, at: 
cpufreq_ready_for_eas+0x60/0xbc

but task is already holding lock:
ffff8000832887a0 (sched_domains_mutex){+.+.}-{4:4}, at: 
partition_sched_domains+0x54/0x938

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (sched_domains_mutex){+.+.}-{4:4}:
        __mutex_lock+0xa8/0x598
        mutex_lock_nested+0x24/0x30
        partition_sched_domains+0x54/0x938
        rebuild_sched_domains_locked+0x2d4/0x900
        rebuild_sched_domains+0x2c/0x48
        rebuild_sched_domains_energy+0x3c/0x58
        rebuild_sd_workfn+0x10/0x1c
        process_one_work+0x208/0x604
        worker_thread+0x244/0x388
        kthread+0x150/0x228
        ret_from_fork+0x10/0x20

-> #1 (cpuset_mutex){+.+.}-{4:4}:
        __mutex_lock+0xa8/0x598
        mutex_lock_nested+0x24/0x30
        cpuset_lock+0x1c/0x28
        __sched_setscheduler+0x31c/0x830
        sched_setattr_nocheck+0x18/0x24
        sugov_init+0x1b4/0x388
        cpufreq_init_governor.part.0+0x58/0xd4
        cpufreq_set_policy+0x2c8/0x3ec
        cpufreq_online+0x520/0xb20
        cpufreq_add_dev+0x80/0x98
        subsys_interface_register+0xfc/0x118
        cpufreq_register_driver+0x150/0x238
        dt_cpufreq_probe+0x148/0x488
        platform_probe+0x68/0xdc
        really_probe+0xbc/0x298
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0xdc/0x164
        __device_attach_driver+0xb8/0x138
        bus_for_each_drv+0x80/0xdc
        __device_attach+0xa8/0x1b0
        device_initial_probe+0x14/0x20
        bus_probe_device+0xb0/0xb4
        deferred_probe_work_func+0x8c/0xc8
        process_one_work+0x208/0x604
        worker_thread+0x244/0x388
        kthread+0x150/0x228
        ret_from_fork+0x10/0x20

-> #0 (&policy->rwsem){++++}-{4:4}:
        __lock_acquire+0x1408/0x2254
        lock_acquire+0x1c8/0x354
        down_read+0x60/0x180
        cpufreq_ready_for_eas+0x60/0xbc
        sched_is_eas_possible+0x144/0x170
        partition_sched_domains+0x504/0x938
        rebuild_sched_domains_locked+0x2d4/0x900
        rebuild_sched_domains+0x2c/0x48
        rebuild_sched_domains_energy+0x3c/0x58
        rebuild_sd_workfn+0x10/0x1c
        process_one_work+0x208/0x604
        worker_thread+0x244/0x388
        kthread+0x150/0x228
        ret_from_fork+0x10/0x20

other info that might help us debug this:

Chain exists of:
   &policy->rwsem --> cpuset_mutex --> sched_domains_mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(sched_domains_mutex);
                                lock(cpuset_mutex);
                                lock(sched_domains_mutex);
   rlock(&policy->rwsem);

  *** DEADLOCK ***

6 locks held by kworker/3:1/79:
  #0: ffff00000000a748 ((wq_completion)events){+.+.}-{0:0}, at: 
process_one_work+0x18c/0x604
  #1: ffff800084a83dd0 (rebuild_sd_work){+.+.}-{0:0}, at: 
process_one_work+0x1b4/0x604
  #2: ffff800083288908 (sched_energy_mutex){+.+.}-{4:4}, at: 
rebuild_sched_domains_energy+0x30/0x58
  #3: ffff80008327c2d8 (cpu_hotplug_lock){++++}-{0:0}, at: 
cpus_read_lock+0x10/0x1c
  #4: ffff80008331b098 (cpuset_mutex){+.+.}-{4:4}, at: 
rebuild_sched_domains+0x28/0x48
  #5: ffff8000832887a0 (sched_domains_mutex){+.+.}-{4:4}, at: 
partition_sched_domains+0x54/0x938

stack backtrace:
CPU: 3 UID: 0 PID: 79 Comm: kworker/3:1 Tainted: G C          
6.15.0-rc5-next-20250509-dirty #10335 PREEMPT
Tainted: [C]=CRAP
Hardware name: Hardkernel ODROID-N2 (DT)
Workqueue: events rebuild_sd_workfn
Call trace:
  show_stack+0x18/0x24 (C)
  dump_stack_lvl+0x90/0xd0
  dump_stack+0x18/0x24
  print_circular_bug+0x298/0x37c
  check_noncircular+0x15c/0x170
  __lock_acquire+0x1408/0x2254
  lock_acquire+0x1c8/0x354
  down_read+0x60/0x180
  cpufreq_ready_for_eas+0x60/0xbc
  sched_is_eas_possible+0x144/0x170
  partition_sched_domains+0x504/0x938
  rebuild_sched_domains_locked+0x2d4/0x900
  rebuild_sched_domains+0x2c/0x48
  rebuild_sched_domains_energy+0x3c/0x58
  rebuild_sd_workfn+0x10/0x1c
  process_one_work+0x208/0x604
  worker_thread+0x244/0x388
  kthread+0x150/0x228
  ret_from_fork+0x10/0x20


> ---
>
> v1 -> v2:
>     * Add missing newline characters in two places (Christian).
>     * Pick up tags.
>
> ---
>   drivers/cpufreq/cpufreq.c |   32 ++++++++++++++++++++++++++++++++
>   include/linux/cpufreq.h   |    2 ++
>   kernel/sched/topology.c   |   25 +++++--------------------
>   3 files changed, 39 insertions(+), 20 deletions(-)
>
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -3056,6 +3056,38 @@
>   
>   	return 0;
>   }
> +
> +static bool cpufreq_policy_is_good_for_eas(unsigned int cpu)
> +{
> +	struct cpufreq_policy *policy __free(put_cpufreq_policy);
> +
> +	policy = cpufreq_cpu_get(cpu);
> +	if (!policy) {
> +		pr_debug("cpufreq policy not set for CPU: %d\n", cpu);
> +		return false;
> +	}
> +
> +	guard(cpufreq_policy_read)(policy);
> +
> +	return sugov_is_governor(policy);
> +}
> +
> +bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask)
> +{
> +	unsigned int cpu;
> +
> +	/* Do not attempt EAS if schedutil is not being used. */
> +	for_each_cpu(cpu, cpu_mask) {
> +		if (!cpufreq_policy_is_good_for_eas(cpu)) {
> +			pr_debug("rd %*pbl: schedutil is mandatory for EAS\n",
> +				 cpumask_pr_args(cpu_mask));
> +			return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
>   module_param(off, int, 0444);
>   module_param_string(default_governor, default_governor, CPUFREQ_NAME_LEN, 0444);
>   core_initcall(cpufreq_core_init);
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -1237,6 +1237,8 @@
>   		struct cpufreq_frequency_table *table,
>   		unsigned int transition_latency);
>   
> +bool cpufreq_ready_for_eas(const struct cpumask *cpu_mask);
> +
>   static inline void cpufreq_register_em_with_opp(struct cpufreq_policy *policy)
>   {
>   	dev_pm_opp_of_register_em(get_cpu_device(policy->cpu),
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -212,8 +212,6 @@
>   static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
>   {
>   	bool any_asym_capacity = false;
> -	struct cpufreq_policy *policy;
> -	bool policy_is_ready;
>   	int i;
>   
>   	/* EAS is enabled for asymmetric CPU capacity topologies. */
> @@ -248,25 +246,12 @@
>   		return false;
>   	}
>   
> -	/* Do not attempt EAS if schedutil is not being used. */
> -	for_each_cpu(i, cpu_mask) {
> -		policy = cpufreq_cpu_get(i);
> -		if (!policy) {
> -			if (sched_debug()) {
> -				pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
> -					cpumask_pr_args(cpu_mask), i);
> -			}
> -			return false;
> -		}
> -		policy_is_ready = sugov_is_governor(policy);
> -		cpufreq_cpu_put(policy);
> -		if (!policy_is_ready) {
> -			if (sched_debug()) {
> -				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> -					cpumask_pr_args(cpu_mask));
> -			}
> -			return false;
> +	if (!cpufreq_ready_for_eas(cpu_mask)) {
> +		if (sched_debug()) {
> +			pr_info("rd %*pbl: Checking EAS: cpufreq is not ready\n",
> +				cpumask_pr_args(cpu_mask));
>   		}
> +		return false;
>   	}
>   
>   	return true;
>
>
>
>
Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  2025-05-09 23:49     ` Marek Szyprowski
@ 2025-05-10 11:31       ` Rafael J. Wysocki
  2025-05-12  6:48         ` Marek Szyprowski
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-10 11:31 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
	Vincent Guittot, Ricardo Neri, Pierre Gondois, Christian Loehle

[-- Attachment #1: Type: text/plain, Size: 5008 bytes --]

On Sat, May 10, 2025 at 1:49 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 06.05.2025 22:37, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Doing cpufreq-specific EAS checks that require accessing policy
> > internals directly from sched_is_eas_possible() is a bit unfortunate,
> > so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
> > into that new function and make sched_is_eas_possible() call it.
> >
> > While at it, address a possible race between the EAS governor check
> > and governor change by doing the former under the policy rwsem.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> In my tests I've noticed that this patch, merged as commit 4854649b1fb4
> ("cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq"), causes a
> regression on ARM64 Amlogic Meson SoC based OdroidN2 board. The board
> finally lockups. Reverting $subject on top of next-20250509 fixes this
> issue. Here is the lockdep warning observed before the lockup:

Thanks for the report!

> ======================================================
> WARNING: possible circular locking dependency detected
> 6.15.0-rc5-next-20250509-dirty #10335 Tainted: G         C
> cpufreq: cpufreq_policy_online: CPU2: Running at unlisted initial
> frequency: 999999 kHz, changing to: 1000000 kHz
> ------------------------------------------------------
> kworker/3:1/79 is trying to acquire lock:
> ffff00000494b380 (&policy->rwsem){++++}-{4:4}, at:
> cpufreq_ready_for_eas+0x60/0xbc
>
> but task is already holding lock:
> ffff8000832887a0 (sched_domains_mutex){+.+.}-{4:4}, at:
> partition_sched_domains+0x54/0x938
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (sched_domains_mutex){+.+.}-{4:4}:
>         __mutex_lock+0xa8/0x598
>         mutex_lock_nested+0x24/0x30
>         partition_sched_domains+0x54/0x938
>         rebuild_sched_domains_locked+0x2d4/0x900
>         rebuild_sched_domains+0x2c/0x48
>         rebuild_sched_domains_energy+0x3c/0x58
>         rebuild_sd_workfn+0x10/0x1c
>         process_one_work+0x208/0x604
>         worker_thread+0x244/0x388
>         kthread+0x150/0x228
>         ret_from_fork+0x10/0x20
>
> -> #1 (cpuset_mutex){+.+.}-{4:4}:
>         __mutex_lock+0xa8/0x598
>         mutex_lock_nested+0x24/0x30
>         cpuset_lock+0x1c/0x28
>         __sched_setscheduler+0x31c/0x830
>         sched_setattr_nocheck+0x18/0x24
>         sugov_init+0x1b4/0x388
>         cpufreq_init_governor.part.0+0x58/0xd4
>         cpufreq_set_policy+0x2c8/0x3ec
>         cpufreq_online+0x520/0xb20
>         cpufreq_add_dev+0x80/0x98
>         subsys_interface_register+0xfc/0x118
>         cpufreq_register_driver+0x150/0x238
>         dt_cpufreq_probe+0x148/0x488
>         platform_probe+0x68/0xdc
>         really_probe+0xbc/0x298
>         __driver_probe_device+0x78/0x12c
>         driver_probe_device+0xdc/0x164
>         __device_attach_driver+0xb8/0x138
>         bus_for_each_drv+0x80/0xdc
>         __device_attach+0xa8/0x1b0
>         device_initial_probe+0x14/0x20
>         bus_probe_device+0xb0/0xb4
>         deferred_probe_work_func+0x8c/0xc8
>         process_one_work+0x208/0x604
>         worker_thread+0x244/0x388
>         kthread+0x150/0x228
>         ret_from_fork+0x10/0x20
>
> -> #0 (&policy->rwsem){++++}-{4:4}:
>         __lock_acquire+0x1408/0x2254
>         lock_acquire+0x1c8/0x354
>         down_read+0x60/0x180
>         cpufreq_ready_for_eas+0x60/0xbc
>         sched_is_eas_possible+0x144/0x170
>         partition_sched_domains+0x504/0x938
>         rebuild_sched_domains_locked+0x2d4/0x900
>         rebuild_sched_domains+0x2c/0x48
>         rebuild_sched_domains_energy+0x3c/0x58
>         rebuild_sd_workfn+0x10/0x1c
>         process_one_work+0x208/0x604
>         worker_thread+0x244/0x388
>         kthread+0x150/0x228
>         ret_from_fork+0x10/0x20
>
> other info that might help us debug this:
>
> Chain exists of:
>    &policy->rwsem --> cpuset_mutex --> sched_domains_mutex
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(sched_domains_mutex);
>                                 lock(cpuset_mutex);
>                                 lock(sched_domains_mutex);
>    rlock(&policy->rwsem);
>
>   *** DEADLOCK ***

Well, it turns out that trying to acquire policy->rwsem under
sched_domains_mutex is a bad idea.  It was added to
cpufreq_policy_is_good_for_eas() to address a theoretical race, so it
can be dropped safely.  A theoretical race is better than a real
deadlock.

Please test the attached patch.

[-- Attachment #2: cpufreq-eas-policy-locking.patch --]
[-- Type: text/x-patch, Size: 264 bytes --]

---
 drivers/cpufreq/cpufreq.c |    2 --
 1 file changed, 2 deletions(-)

--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -3067,8 +3067,6 @@
 		return false;
 	}
 
-	guard(cpufreq_policy_read)(policy);
-
 	return sugov_is_governor(policy);
 }
 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  2025-05-10 11:31       ` Rafael J. Wysocki
@ 2025-05-12  6:48         ` Marek Szyprowski
  2025-05-12 12:53           ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Marek Szyprowski @ 2025-05-12  6:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
	Vincent Guittot, Ricardo Neri, Pierre Gondois, Christian Loehle

On 10.05.2025 13:31, Rafael J. Wysocki wrote:
> On Sat, May 10, 2025 at 1:49 AM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> On 06.05.2025 22:37, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Doing cpufreq-specific EAS checks that require accessing policy
>>> internals directly from sched_is_eas_possible() is a bit unfortunate,
>>> so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
>>> into that new function and make sched_is_eas_possible() call it.
>>>
>>> While at it, address a possible race between the EAS governor check
>>> and governor change by doing the former under the policy rwsem.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
>>> Tested-by: Christian Loehle <christian.loehle@arm.com>
>>> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> In my tests I've noticed that this patch, merged as commit 4854649b1fb4
>> ("cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq"), causes a
>> regression on ARM64 Amlogic Meson SoC based OdroidN2 board. The board
>> finally lockups. Reverting $subject on top of next-20250509 fixes this
>> issue. Here is the lockdep warning observed before the lockup:
> Thanks for the report!
>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.15.0-rc5-next-20250509-dirty #10335 Tainted: G         C
>> cpufreq: cpufreq_policy_online: CPU2: Running at unlisted initial
>> frequency: 999999 kHz, changing to: 1000000 kHz
>> ------------------------------------------------------
>> kworker/3:1/79 is trying to acquire lock:
>> ffff00000494b380 (&policy->rwsem){++++}-{4:4}, at:
>> cpufreq_ready_for_eas+0x60/0xbc
>>
>> but task is already holding lock:
>> ffff8000832887a0 (sched_domains_mutex){+.+.}-{4:4}, at:
>> partition_sched_domains+0x54/0x938
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #2 (sched_domains_mutex){+.+.}-{4:4}:
>>          __mutex_lock+0xa8/0x598
>>          mutex_lock_nested+0x24/0x30
>>          partition_sched_domains+0x54/0x938
>>          rebuild_sched_domains_locked+0x2d4/0x900
>>          rebuild_sched_domains+0x2c/0x48
>>          rebuild_sched_domains_energy+0x3c/0x58
>>          rebuild_sd_workfn+0x10/0x1c
>>          process_one_work+0x208/0x604
>>          worker_thread+0x244/0x388
>>          kthread+0x150/0x228
>>          ret_from_fork+0x10/0x20
>>
>> -> #1 (cpuset_mutex){+.+.}-{4:4}:
>>          __mutex_lock+0xa8/0x598
>>          mutex_lock_nested+0x24/0x30
>>          cpuset_lock+0x1c/0x28
>>          __sched_setscheduler+0x31c/0x830
>>          sched_setattr_nocheck+0x18/0x24
>>          sugov_init+0x1b4/0x388
>>          cpufreq_init_governor.part.0+0x58/0xd4
>>          cpufreq_set_policy+0x2c8/0x3ec
>>          cpufreq_online+0x520/0xb20
>>          cpufreq_add_dev+0x80/0x98
>>          subsys_interface_register+0xfc/0x118
>>          cpufreq_register_driver+0x150/0x238
>>          dt_cpufreq_probe+0x148/0x488
>>          platform_probe+0x68/0xdc
>>          really_probe+0xbc/0x298
>>          __driver_probe_device+0x78/0x12c
>>          driver_probe_device+0xdc/0x164
>>          __device_attach_driver+0xb8/0x138
>>          bus_for_each_drv+0x80/0xdc
>>          __device_attach+0xa8/0x1b0
>>          device_initial_probe+0x14/0x20
>>          bus_probe_device+0xb0/0xb4
>>          deferred_probe_work_func+0x8c/0xc8
>>          process_one_work+0x208/0x604
>>          worker_thread+0x244/0x388
>>          kthread+0x150/0x228
>>          ret_from_fork+0x10/0x20
>>
>> -> #0 (&policy->rwsem){++++}-{4:4}:
>>          __lock_acquire+0x1408/0x2254
>>          lock_acquire+0x1c8/0x354
>>          down_read+0x60/0x180
>>          cpufreq_ready_for_eas+0x60/0xbc
>>          sched_is_eas_possible+0x144/0x170
>>          partition_sched_domains+0x504/0x938
>>          rebuild_sched_domains_locked+0x2d4/0x900
>>          rebuild_sched_domains+0x2c/0x48
>>          rebuild_sched_domains_energy+0x3c/0x58
>>          rebuild_sd_workfn+0x10/0x1c
>>          process_one_work+0x208/0x604
>>          worker_thread+0x244/0x388
>>          kthread+0x150/0x228
>>          ret_from_fork+0x10/0x20
>>
>> other info that might help us debug this:
>>
>> Chain exists of:
>>     &policy->rwsem --> cpuset_mutex --> sched_domains_mutex
>>
>>    Possible unsafe locking scenario:
>>
>>          CPU0                    CPU1
>>          ----                    ----
>>     lock(sched_domains_mutex);
>>                                  lock(cpuset_mutex);
>>                                  lock(sched_domains_mutex);
>>     rlock(&policy->rwsem);
>>
>>    *** DEADLOCK ***
> Well, it turns out that trying to acquire policy->rwsem under
> sched_domains_mutex is a bad idea.  It was added to
> cpufreq_policy_is_good_for_eas() to address a theoretical race, so it
> can be dropped safely.  A theoretical race is better than a real
> deadlock.
>
> Please test the attached patch.

This fixed the observed issue. Thanks!

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>


Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  2025-05-12  6:48         ` Marek Szyprowski
@ 2025-05-12 12:53           ` Rafael J. Wysocki
  2025-05-12 17:24             ` Dietmar Eggemann
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-12 12:53 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba,
	Peter Zijlstra, Srinivas Pandruvada, Dietmar Eggemann,
	Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois,
	Christian Loehle

On Mon, May 12, 2025 at 8:48 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 10.05.2025 13:31, Rafael J. Wysocki wrote:
> > On Sat, May 10, 2025 at 1:49 AM Marek Szyprowski
> > <m.szyprowski@samsung.com> wrote:
> >> On 06.05.2025 22:37, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Doing cpufreq-specific EAS checks that require accessing policy
> >>> internals directly from sched_is_eas_possible() is a bit unfortunate,
> >>> so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
> >>> into that new function and make sched_is_eas_possible() call it.
> >>>
> >>> While at it, address a possible race between the EAS governor check
> >>> and governor change by doing the former under the policy rwsem.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> >>> Tested-by: Christian Loehle <christian.loehle@arm.com>
> >>> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> In my tests I've noticed that this patch, merged as commit 4854649b1fb4
> >> ("cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq"), causes a
> >> regression on ARM64 Amlogic Meson SoC based OdroidN2 board. The board
> >> finally lockups. Reverting $subject on top of next-20250509 fixes this
> >> issue. Here is the lockdep warning observed before the lockup:
> > Thanks for the report!
> >
> >> ======================================================
> >> WARNING: possible circular locking dependency detected
> >> 6.15.0-rc5-next-20250509-dirty #10335 Tainted: G         C
> >> cpufreq: cpufreq_policy_online: CPU2: Running at unlisted initial
> >> frequency: 999999 kHz, changing to: 1000000 kHz
> >> ------------------------------------------------------
> >> kworker/3:1/79 is trying to acquire lock:
> >> ffff00000494b380 (&policy->rwsem){++++}-{4:4}, at:
> >> cpufreq_ready_for_eas+0x60/0xbc
> >>
> >> but task is already holding lock:
> >> ffff8000832887a0 (sched_domains_mutex){+.+.}-{4:4}, at:
> >> partition_sched_domains+0x54/0x938
> >>
> >> which lock already depends on the new lock.
> >>
> >> the existing dependency chain (in reverse order) is:
> >>
> >> -> #2 (sched_domains_mutex){+.+.}-{4:4}:
> >>          __mutex_lock+0xa8/0x598
> >>          mutex_lock_nested+0x24/0x30
> >>          partition_sched_domains+0x54/0x938
> >>          rebuild_sched_domains_locked+0x2d4/0x900
> >>          rebuild_sched_domains+0x2c/0x48
> >>          rebuild_sched_domains_energy+0x3c/0x58
> >>          rebuild_sd_workfn+0x10/0x1c
> >>          process_one_work+0x208/0x604
> >>          worker_thread+0x244/0x388
> >>          kthread+0x150/0x228
> >>          ret_from_fork+0x10/0x20
> >>
> >> -> #1 (cpuset_mutex){+.+.}-{4:4}:
> >>          __mutex_lock+0xa8/0x598
> >>          mutex_lock_nested+0x24/0x30
> >>          cpuset_lock+0x1c/0x28
> >>          __sched_setscheduler+0x31c/0x830
> >>          sched_setattr_nocheck+0x18/0x24
> >>          sugov_init+0x1b4/0x388
> >>          cpufreq_init_governor.part.0+0x58/0xd4
> >>          cpufreq_set_policy+0x2c8/0x3ec
> >>          cpufreq_online+0x520/0xb20
> >>          cpufreq_add_dev+0x80/0x98
> >>          subsys_interface_register+0xfc/0x118
> >>          cpufreq_register_driver+0x150/0x238
> >>          dt_cpufreq_probe+0x148/0x488
> >>          platform_probe+0x68/0xdc
> >>          really_probe+0xbc/0x298
> >>          __driver_probe_device+0x78/0x12c
> >>          driver_probe_device+0xdc/0x164
> >>          __device_attach_driver+0xb8/0x138
> >>          bus_for_each_drv+0x80/0xdc
> >>          __device_attach+0xa8/0x1b0
> >>          device_initial_probe+0x14/0x20
> >>          bus_probe_device+0xb0/0xb4
> >>          deferred_probe_work_func+0x8c/0xc8
> >>          process_one_work+0x208/0x604
> >>          worker_thread+0x244/0x388
> >>          kthread+0x150/0x228
> >>          ret_from_fork+0x10/0x20
> >>
> >> -> #0 (&policy->rwsem){++++}-{4:4}:
> >>          __lock_acquire+0x1408/0x2254
> >>          lock_acquire+0x1c8/0x354
> >>          down_read+0x60/0x180
> >>          cpufreq_ready_for_eas+0x60/0xbc
> >>          sched_is_eas_possible+0x144/0x170
> >>          partition_sched_domains+0x504/0x938
> >>          rebuild_sched_domains_locked+0x2d4/0x900
> >>          rebuild_sched_domains+0x2c/0x48
> >>          rebuild_sched_domains_energy+0x3c/0x58
> >>          rebuild_sd_workfn+0x10/0x1c
> >>          process_one_work+0x208/0x604
> >>          worker_thread+0x244/0x388
> >>          kthread+0x150/0x228
> >>          ret_from_fork+0x10/0x20
> >>
> >> other info that might help us debug this:
> >>
> >> Chain exists of:
> >>     &policy->rwsem --> cpuset_mutex --> sched_domains_mutex
> >>
> >>    Possible unsafe locking scenario:
> >>
> >>          CPU0                    CPU1
> >>          ----                    ----
> >>     lock(sched_domains_mutex);
> >>                                  lock(cpuset_mutex);
> >>                                  lock(sched_domains_mutex);
> >>     rlock(&policy->rwsem);
> >>
> >>    *** DEADLOCK ***
> > Well, it turns out that trying to acquire policy->rwsem under
> > sched_domains_mutex is a bad idea.  It was added to
> > cpufreq_policy_is_good_for_eas() to address a theoretical race, so it
> > can be dropped safely.  A theoretical race is better than a real
> > deadlock.
> >
> > Please test the attached patch.
>
> This fixed the observed issue. Thanks!
>
> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

Thanks for the confirmation!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  2025-05-12 12:53           ` Rafael J. Wysocki
@ 2025-05-12 17:24             ` Dietmar Eggemann
  2025-05-12 17:47               ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Dietmar Eggemann @ 2025-05-12 17:24 UTC (permalink / raw)
  To: Rafael J. Wysocki, Marek Szyprowski
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
	Ricardo Neri, Pierre Gondois, Christian Loehle

On 12/05/2025 14:53, Rafael J. Wysocki wrote:
> On Mon, May 12, 2025 at 8:48 AM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>>
>> On 10.05.2025 13:31, Rafael J. Wysocki wrote:
>>> On Sat, May 10, 2025 at 1:49 AM Marek Szyprowski
>>> <m.szyprowski@samsung.com> wrote:
>>>> On 06.05.2025 22:37, Rafael J. Wysocki wrote:
>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

[...]

>>>>    *** DEADLOCK ***
>>> Well, it turns out that trying to acquire policy->rwsem under
>>> sched_domains_mutex is a bad idea.  It was added to
>>> cpufreq_policy_is_good_for_eas() to address a theoretical race, so it
>>> can be dropped safely.  A theoretical race is better than a real
>>> deadlock.
>>>
>>> Please test the attached patch.
>>
>> This fixed the observed issue. Thanks!
>>
>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> 
> Thanks for the confirmation!

See this on my Hikey 960 as well. I was wondering why Christian L. and I
didn't catch this with RFT v1.0 (*) on i7-13700K.

But it looks like that

https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended

mentioned in the patch-header of (*) didn't have this line in its

commit 9ad047cade6b ("cpufreq/sched: Move cpufreq-specific EAS checks to
cpufreq")

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  2025-05-12 17:24             ` Dietmar Eggemann
@ 2025-05-12 17:47               ` Rafael J. Wysocki
  0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-12 17:47 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Marek Szyprowski, Rafael J. Wysocki, Linux PM,
	LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois,
	Christian Loehle

On Mon, May 12, 2025 at 7:24 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 12/05/2025 14:53, Rafael J. Wysocki wrote:
> > On Mon, May 12, 2025 at 8:48 AM Marek Szyprowski
> > <m.szyprowski@samsung.com> wrote:
> >>
> >> On 10.05.2025 13:31, Rafael J. Wysocki wrote:
> >>> On Sat, May 10, 2025 at 1:49 AM Marek Szyprowski
> >>> <m.szyprowski@samsung.com> wrote:
> >>>> On 06.05.2025 22:37, Rafael J. Wysocki wrote:
> >>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> [...]
>
> >>>>    *** DEADLOCK ***
> >>> Well, it turns out that trying to acquire policy->rwsem under
> >>> sched_domains_mutex is a bad idea.  It was added to
> >>> cpufreq_policy_is_good_for_eas() to address a theoretical race, so it
> >>> can be dropped safely.  A theoretical race is better than a real
> >>> deadlock.
> >>>
> >>> Please test the attached patch.
> >>
> >> This fixed the observed issue. Thanks!
> >>
> >> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> >> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> >
> > Thanks for the confirmation!
>
> See this on my Hikey 960 as well. I was wondering why Christian L. and I
> didn't catch this with RFT v1.0 (*) on i7-13700K.
>
> But it looks like that
>
> https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-take2-extended
>
> mentioned in the patch-header of (*) didn't have this line in its
>
> commit 9ad047cade6b ("cpufreq/sched: Move cpufreq-specific EAS checks to
> cpufreq")

Yeah, I made this change later.  My bad.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
  2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (6 preceding siblings ...)
  2025-05-06 20:48 ` [PATCH v2 7/7] cpufreq: intel_pstate: Document hybrid processor support Rafael J. Wysocki
@ 2025-05-13 12:51 ` Rafael J. Wysocki
  2025-05-13 14:01   ` Rafael J. Wysocki
  7 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-13 12:51 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle, Rafael J. Wysocki

On Tue, May 6, 2025 at 10:49 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> Hi Everyone,
>
> This is a new (and most likely final) version of
>
> https://lore.kernel.org/linux-pm/3344336.aeNJFYEL58@rjwysocki.net/
>
> The most significant difference between it and the above is that schedutil is
> now required for EAS to be enabled, like on the other platforms using EAS,
> which means that intel_pstate needs to operate in the passive mode in order
> to use it (the most straightforward way to switch it over to the passive mode
> is to write "passive" to /sys/devices/system/cpu/intel_pstate/status).
>
> Accordingly, the changes that were needed for EAS to work without schedutil are
> not needed any more, so there is one patch less in the series because of that
> and patch [5/7] is simpler than its previous version because some changes made
> by it are not necessary any more.
>
> Another patch that has been dropped is
>
> https://lore.kernel.org/linux-pm/1964444.taCxCBeP46@rjwysocki.net/
>
> because it didn't take CPU online into account properly and it is not essential
> for the current hardware anyway.
>
> There is a new patch, [7/7], which adds CAS/EAS/hybrid support description to
> the intel_pstate admin-guide documentation.
>
> The following paragraph from the original cover letter still applies:
>
> "The underlying observation is that on the platforms targeted by these changes,
> Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
> the same performance level, are always more energy-efficient than the "big" or
> "performance" CPUs (P-cores).  This means that, regardless of the scale-
> invariant utilization of a task, as long as there is enough spare capacity on
> E-cores, the relative cost of running it there is always lower."
>
> The first 2 patches depend on the current cpufreq material queued up for 6.16
> in linux-pm.git/linux-next (and in linux-next proper) and are not really
> depended on by the rest of the series, but I've decided to include them into
> this series because they have been slightly updated since the previous version,
> mostly to take review feedback into account (I'm going to queue them up for
> 6.16 shortly because they don't appear to be objectionable).
>
> The next 2 patches (Energy Model code changes) were reviewed previously, but
> they are only needed because of patch [5/7].
>
> Patch [5/7] has not changed much except that some changes made by the previous
> version have been dropped from it.  Also its changelog has been updated.  It
> causes perf domains to be registered per CPU and in addition to the primary cost
> component, which is related to the CPU type, there is a small component
> proportional to performance whose role is to help balance the load between CPUs
> of the same type.
>
> The expected effect is still that the CPUs of the "low-cost" type will be
> preferred so long as there is enough spare capacity on any of them.
>
> Patch [6/7] has been updated to walk all of the cache leaves and look for
> the ones with level equal to 3 because the check used in the previous version
> does not always work.
>
> The documentation patch, [7/7], is new.
>
> Please refer to the individual patch changelogs for details.

This series along with the fix at

https://lore.kernel.org/linux-pm/2806514.mvXUDI8C0e@rjwysocki.net/

is now present in the experimental/intel_pstate/eas-final brangh in
linux-pm.git:

https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-final

and it has been added to the bleeding-edge branch for 0-day build testing.

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
  2025-05-13 12:51 ` [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
@ 2025-05-13 14:01   ` Rafael J. Wysocki
  2025-06-29 12:28     ` Ibrahim Ansari
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-05-13 14:01 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle, Rafael J. Wysocki

On Tue, May 13, 2025 at 2:51 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, May 6, 2025 at 10:49 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > Hi Everyone,
> >
> > This is a new (and most likely final) version of
> >
> > https://lore.kernel.org/linux-pm/3344336.aeNJFYEL58@rjwysocki.net/
> >
> > The most significant difference between it and the above is that schedutil is
> > now required for EAS to be enabled, like on the other platforms using EAS,
> > which means that intel_pstate needs to operate in the passive mode in order
> > to use it (the most straightforward way to switch it over to the passive mode
> > is to write "passive" to /sys/devices/system/cpu/intel_pstate/status).
> >
> > Accordingly, the changes that were needed for EAS to work without schedutil are
> > not needed any more, so there is one patch less in the series because of that
> > and patch [5/7] is simpler than its previous version because some changes made
> > by it are not necessary any more.
> >
> > Another patch that has been dropped is
> >
> > https://lore.kernel.org/linux-pm/1964444.taCxCBeP46@rjwysocki.net/
> >
> > because it didn't take CPU online into account properly and it is not essential
> > for the current hardware anyway.
> >
> > There is a new patch, [7/7], which adds CAS/EAS/hybrid support description to
> > the intel_pstate admin-guide documentation.
> >
> > The following paragraph from the original cover letter still applies:
> >
> > "The underlying observation is that on the platforms targeted by these changes,
> > Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
> > the same performance level, are always more energy-efficient than the "big" or
> > "performance" CPUs (P-cores).  This means that, regardless of the scale-
> > invariant utilization of a task, as long as there is enough spare capacity on
> > E-cores, the relative cost of running it there is always lower."
> >
> > The first 2 patches depend on the current cpufreq material queued up for 6.16
> > in linux-pm.git/linux-next (and in linux-next proper) and are not really
> > depended on by the rest of the series, but I've decided to include them into
> > this series because they have been slightly updated since the previous version,
> > mostly to take review feedback into account (I'm going to queue them up for
> > 6.16 shortly because they don't appear to be objectionable).
> >
> > The next 2 patches (Energy Model code changes) were reviewed previously, but
> > they are only needed because of patch [5/7].
> >
> > Patch [5/7] has not changed much except that some changes made by the previous
> > version have been dropped from it.  Also its changelog has been updated.  It
> > causes perf domains to be registered per CPU and in addition to the primary cost
> > component, which is related to the CPU type, there is a small component
> > proportional to performance whose role is to help balance the load between CPUs
> > of the same type.
> >
> > The expected effect is still that the CPUs of the "low-cost" type will be
> > preferred so long as there is enough spare capacity on any of them.
> >
> > Patch [6/7] has been updated to walk all of the cache leaves and look for
> > the ones with level equal to 3 because the check used in the previous version
> > does not always work.
> >
> > The documentation patch, [7/7], is new.
> >
> > Please refer to the individual patch changelogs for details.
>
> This series along with the fix at
>
> https://lore.kernel.org/linux-pm/2806514.mvXUDI8C0e@rjwysocki.net/
>
> is now present in the experimental/intel_pstate/eas-final brangh in
> linux-pm.git:
>
> https://web.git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate/eas-final
>
> and it has been added to the bleeding-edge branch for 0-day build testing.

In order to test it, the kernel needs to be compiled with
CONFIG_ENERGY_MODEL set.

If at least some cores in the system are SMT-capable, SMT needs to be
turned off (either in the BIOS setup or by passing nosmt to the kernel
in the command line).

Finally, schedutil needs to be the cpufreq governor which requires
intel_pstate to operate in the passive mode (schedutil is the default
governor in that case).  The most straightforward way to switch it
into the passive mode is to write "passive" to
/sys/devices/system/cpu/intel_pstate/status (it may also be started in
the passive mode as described in
https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html).

Note that you can compare the default non-EAS configuration of
intel_pstate to the one with EAS enabled by switching it between the
"active" and "passive" modes via sysfs (no reboots needed).

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
  2025-05-13 14:01   ` Rafael J. Wysocki
@ 2025-06-29 12:28     ` Ibrahim Ansari
  2025-06-30 18:51       ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Ibrahim Ansari @ 2025-06-29 12:28 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois, Christian Loehle, Rafael J. Wysocki

Hi,

On 5/13/25 19:31, Rafael J. Wysocki wrote:

> Finally, schedutil needs to be the cpufreq governor which requires
> intel_pstate to operate in the passive mode (schedutil is the default
> governor in that case).  The most straightforward way to switch it
> into the passive mode is to write "passive" to
> /sys/devices/system/cpu/intel_pstate/status (it may also be started in
> the passive mode as described in
> https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html).

I'm curious if you intend to bring back support for EAS with 
intel_pstate in active mode down the line?

That would get this working out of the box across distros, since 
`intel_pstate=active` is the default setup everywhere (and typically 
what users should prefer? as I understand from the documentation.)

Thanks for your work!


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
  2025-06-29 12:28     ` Ibrahim Ansari
@ 2025-06-30 18:51       ` Rafael J. Wysocki
  0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2025-06-30 18:51 UTC (permalink / raw)
  To: Ibrahim Ansari
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
	Vincent Guittot, Ricardo Neri, Pierre Gondois, Christian Loehle,
	Rafael J. Wysocki

Hi,

On Sun, Jun 29, 2025 at 2:28 PM Ibrahim Ansari
<ansari.ibrahim1@gmail.com> wrote:
>
> Hi,
>
> On 5/13/25 19:31, Rafael J. Wysocki wrote:
>
> > Finally, schedutil needs to be the cpufreq governor which requires
> > intel_pstate to operate in the passive mode (schedutil is the default
> > governor in that case).  The most straightforward way to switch it
> > into the passive mode is to write "passive" to
> > /sys/devices/system/cpu/intel_pstate/status (it may also be started in
> > the passive mode as described in
> > https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html).
>
> I'm curious if you intend to bring back support for EAS with
> intel_pstate in active mode down the line?

No plans as of today and this is somewhat technically questionable
because EAS requires the schedutil governor for cpufreq which is only
available in the passive mode.

It may be revisited in the future, though, if there's sufficient demand.

> That would get this working out of the box across distros, since
> `intel_pstate=active` is the default setup everywhere (and typically
> what users should prefer? as I understand from the documentation.)

It's generally hard to say what users should prefer because it depends
on what they need.

The active mode involves less complexity in the kernel code and so it
is the default.  It is also somewhat performance-oriented relative to
EAS.

Users who prefer EAS can append intel_pstate=passive to the default
kernel command line.

> Thanks for your work!

Anytime!

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-06-30 18:51 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-06 20:32 [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
2025-05-06 20:34 ` [PATCH v2 1/7] cpufreq/sched: schedutil: Add helper for governor checks Rafael J. Wysocki
2025-05-06 20:37 ` [PATCH v2 2/7] cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq Rafael J. Wysocki
     [not found]   ` <CGME20250509234915eucas1p2846ecef88f268d78ab2e96d4a67002b0@eucas1p2.samsung.com>
2025-05-09 23:49     ` Marek Szyprowski
2025-05-10 11:31       ` Rafael J. Wysocki
2025-05-12  6:48         ` Marek Szyprowski
2025-05-12 12:53           ` Rafael J. Wysocki
2025-05-12 17:24             ` Dietmar Eggemann
2025-05-12 17:47               ` Rafael J. Wysocki
2025-05-06 20:39 ` [PATCH v2 3/7] PM: EM: Move CPU capacity check to em_adjust_new_capacity() Rafael J. Wysocki
2025-05-06 20:41 ` [PATCH v2 4/7] PM: EM: Introduce em_adjust_cpu_capacity() Rafael J. Wysocki
2025-05-06 20:44 ` [PATCH v2 5/7] cpufreq: intel_pstate: EAS support for hybrid platforms Rafael J. Wysocki
2025-05-06 20:47 ` [PATCH v2 6/7] cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache Rafael J. Wysocki
2025-05-06 20:48 ` [PATCH v2 7/7] cpufreq: intel_pstate: Document hybrid processor support Rafael J. Wysocki
2025-05-13 12:51 ` [PATCH v2 0/7] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
2025-05-13 14:01   ` Rafael J. Wysocki
2025-06-29 12:28     ` Ibrahim Ansari
2025-06-30 18:51       ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).