[RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
@ 2024-11-29 15:55 Rafael J. Wysocki
  2024-11-29 15:56 ` [RFC][PATCH v021 1/9] cpufreq: intel_pstate: Use CPPC to get scaling factors Rafael J. Wysocki
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 15:55 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

Hi Everyone,

This is a new iteration of the "EAS for intel_pstate" work:

https://lore.kernel.org/linux-pm/3607404.iIbC2pHGDl@rjwysocki.net/

It contains a few new patches and almost all of the patches sent previously
have been updated.

The following paragraph from the original cover letter still applies:

"The underlying observation is that on the platforms targeted by these changes,
Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
the same performance level, are always more energy-efficient than the "big" or
"performance" CPUs (P-cores).  This means that, regardless of the scale-
invariant utilization of a task, as long as there is enough spare capacity on
E-cores, the relative cost of running it there is always lower."

Thus the idea is still to register a perf domain per CPU type, but this time
there may be more than just two of them because of the first patch.

The states table in each of these perf domains is still one-element and that
element only contains the cost value, but this time the costs are computed
and not prescribed (see the last patch).  Nevertheless, the expected effect
is still that the perf domains (or CPU types) with lower cost values will
be preferred so long as there is enough spare capacity in them.

The first two patches are not really RFC, but they are included here because
patches [8-9/9] depend on patch [1/9].  They will be resent next week as
non-RFC 6.14-candidate material.

The difference made by them is significant because it is now not known in
advance how many CPU types will be there and the cost values for each of
them cannot be prescribed.

Patch [3/9] is also a change that I'd like to make regardless of what
happens to the rest of the series because it effectively moves EM code
from the schedutil governor to EM where it belongs.  Of course, it is also
depended on by patch [9/9].

Patch [4/9] differs from its previous version,

https://lore.kernel.org/linux-pm/1889415.atdPhlSkOF@rjwysocki.net/

because gov is NULL not only when it is not used at all, but also during the
cpufreq policy init and exit, so the check in the patch had to be adjusted
to match the former case only.  [As a side note, I don't think that the code
modified by patch [4/9] belongs to sched/topology as it messes around the
cpufreq internals.  At least, it should be moved to cpufreq and called by
sched_is_eas_possible(), but I'm also not convinced that it is necessary
at all.  This is not directly related to the $subject series, though.]

Patch [5/9] adds a new function needed by patch [9/9] and it is the same as
its previous version:

https://lore.kernel.org/linux-pm/2223963.Mh6RI2rZIc@rjwysocki.net/

Patch [6/9] is almost the same as its previous version:

https://lore.kernel.org/linux-pm/1821040.VLH7GnMWUR@rjwysocki.net/

but its changelog has been expanded a bit as suggested by Dietmar.  It
simply rearranges the EM code without changing its functionality, so the
next patch looks more straightforward.

Patch [7/9] is a somewhat updated counterpart of

https://lore.kernel.org/linux-pm/2017201.usQuhbGJ8B@rjwysocki.net/

It still changes the EM code to allow a perf domains with one-element states
table to be registered without providing the :active_power() callback (which
is then done in the last patch), but it is somewhat simpler.  It also
contains some discussion regarding the requirement that the capacity of
all CPUs in a perf domain must be the same.  In a short summary, I'm not
convinced that it is actually valid.

Patches [8-9/9] modify intel_pstate.  The first one is preparatory, but it
is useful for explaining the basic concept, which is "hybrid domains" that
each contain CPUs of the same type.

The last patch is just the registration of EM perf domains (one for each hybrid
domain), expanding them when needed and rebuilding sched domains in some corner
cases.  It also contains some discussion that doesn't technically belong to the
changelog, but is useful for explaining the background for some decisions.

Please refer to the individual patch changelogs for details.

For easier access, the series is available on the experimental/intel_ostate
branch in linux-pm.git:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate

Thanks!

or

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 1/9] cpufreq: intel_pstate: Use CPPC to get scaling factors
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
@ 2024-11-29 15:56 ` Rafael J. Wysocki
  2024-11-29 15:56 ` [RFC][PATCH v021 2/9] cpufreq: intel_pstate: Drop Arrow Lake from "scaling factor" list Rafael J. Wysocki
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 15:56 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If the P-core hybrid scaling factor for the given processor model is not
known, use CPPC to compute hybrid scaling factors for all CPUs.

Since the current default hybrid scaling factor is only suitable for a
few early hybrid platforms, add intel_hybrid_scaling_factor[] entries
for them and initialize the scaling factor to zero ("unknown") by
default.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/intel_pstate.c |   59 +++++++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 25 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -28,6 +28,7 @@
 #include <linux/pm_qos.h>
 #include <linux/bitfield.h>
 #include <trace/events/power.h>
+#include <linux/units.h>
 
 #include <asm/cpu.h>
 #include <asm/div64.h>
@@ -302,11 +303,11 @@ static bool hwp_is_hybrid;
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;
 
-#define HYBRID_SCALING_FACTOR		78741
+#define HYBRID_SCALING_FACTOR_ADL	78741
 #define HYBRID_SCALING_FACTOR_MTL	80000
 #define HYBRID_SCALING_FACTOR_LNL	86957
 
-static int hybrid_scaling_factor = HYBRID_SCALING_FACTOR;
+static int hybrid_scaling_factor;
 
 static inline int core_get_scaling(void)
 {
@@ -414,18 +415,15 @@ static int intel_pstate_get_cppc_guarant
 static int intel_pstate_cppc_get_scaling(int cpu)
 {
 	struct cppc_perf_caps cppc_perf;
-	int ret;
-
-	ret = cppc_get_perf_caps(cpu, &cppc_perf);
 
 	/*
-	 * If the nominal frequency and the nominal performance are not
-	 * zero and the ratio between them is not 100, return the hybrid
-	 * scaling factor.
-	 */
-	if (!ret && cppc_perf.nominal_perf && cppc_perf.nominal_freq &&
-	    cppc_perf.nominal_perf * 100 != cppc_perf.nominal_freq)
-		return hybrid_scaling_factor;
+	 * Compute the perf-to-frequency scaling factor for the given CPU if
+	 * possible, unless it would be 0.
+	 */
+	if (!cppc_get_perf_caps(cpu, &cppc_perf) &&
+	    cppc_perf.nominal_perf && cppc_perf.nominal_freq)
+		return div_u64(cppc_perf.nominal_freq * KHZ_PER_MHZ,
+			       cppc_perf.nominal_perf);
 
 	return core_get_scaling();
 }
@@ -2211,24 +2209,30 @@ static void hybrid_get_type(void *data)
 
 static int hwp_get_cpu_scaling(int cpu)
 {
-	u8 cpu_type = 0;
+	if (hybrid_scaling_factor) {
+		u8 cpu_type = 0;
+
+		smp_call_function_single(cpu, hybrid_get_type, &cpu_type, 1);
 
-	smp_call_function_single(cpu, hybrid_get_type, &cpu_type, 1);
-	/* P-cores have a smaller perf level-to-freqency scaling factor. */
-	if (cpu_type == 0x40)
-		return hybrid_scaling_factor;
+		/*
+		 * Return the hybrid scaling factor for P-cores and use the
+		 * default core scaling for E-cores.
+		 */
+		if (cpu_type == 0x40)
+			return hybrid_scaling_factor;
 
-	/* Use default core scaling for E-cores */
-	if (cpu_type == 0x20)
+		if (cpu_type == 0x20)
+			return core_get_scaling();
+	}
+
+	/* Use core scaling on non-hybrid systems. */
+	if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
 		return core_get_scaling();
 
 	/*
-	 * If reached here, this system is either non-hybrid (like Tiger
-	 * Lake) or hybrid-capable (like Alder Lake or Raptor Lake) with
-	 * no E cores (in which case CPUID for hybrid support is 0).
-	 *
-	 * The CPPC nominal_frequency field is 0 for non-hybrid systems,
-	 * so the default core scaling will be used for them.
+	 * The system is hybrid, but the hybrid scaling factor is not known or
+	 * the CPU type is not one of the above, so use CPPC to compute the
+	 * scaling factor for this CPU.
 	 */
 	return intel_pstate_cppc_get_scaling(cpu);
 }
@@ -3665,6 +3669,11 @@ static const struct x86_cpu_id intel_epp
 };
 
 static const struct x86_cpu_id intel_hybrid_scaling_factor[] = {
+	X86_MATCH_VFM(INTEL_ALDERLAKE, HYBRID_SCALING_FACTOR_ADL),
+	X86_MATCH_VFM(INTEL_ALDERLAKE_L, HYBRID_SCALING_FACTOR_ADL),
+	X86_MATCH_VFM(INTEL_RAPTORLAKE, HYBRID_SCALING_FACTOR_ADL),
+	X86_MATCH_VFM(INTEL_RAPTORLAKE_P, HYBRID_SCALING_FACTOR_ADL),
+	X86_MATCH_VFM(INTEL_RAPTORLAKE_S, HYBRID_SCALING_FACTOR_ADL),
 	X86_MATCH_VFM(INTEL_METEORLAKE_L, HYBRID_SCALING_FACTOR_MTL),
 	X86_MATCH_VFM(INTEL_ARROWLAKE, HYBRID_SCALING_FACTOR_MTL),
 	X86_MATCH_VFM(INTEL_LUNARLAKE_M, HYBRID_SCALING_FACTOR_LNL),




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 2/9] cpufreq: intel_pstate: Drop Arrow Lake from "scaling factor" list
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
  2024-11-29 15:56 ` [RFC][PATCH v021 1/9] cpufreq: intel_pstate: Use CPPC to get scaling factors Rafael J. Wysocki
@ 2024-11-29 15:56 ` Rafael J. Wysocki
  2024-11-29 15:59 ` [RFC][PATCH v021 3/9] PM: EM: Move perf rebuilding function from schedutil to EM Rafael J. Wysocki
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 15:56 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Since HYBRID_SCALING_FACTOR_MTL is not going to be suitable for Arrow
Lake in general, drop it from the "hybrid scaling factor" list of
platforms, so the scaling factor for it will be determined with the
help of information provided by the platform firmware via CPPC.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/intel_pstate.c |    1 -
 1 file changed, 1 deletion(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -3675,7 +3675,6 @@ static const struct x86_cpu_id intel_hyb
 	X86_MATCH_VFM(INTEL_RAPTORLAKE_P, HYBRID_SCALING_FACTOR_ADL),
 	X86_MATCH_VFM(INTEL_RAPTORLAKE_S, HYBRID_SCALING_FACTOR_ADL),
 	X86_MATCH_VFM(INTEL_METEORLAKE_L, HYBRID_SCALING_FACTOR_MTL),
-	X86_MATCH_VFM(INTEL_ARROWLAKE, HYBRID_SCALING_FACTOR_MTL),
 	X86_MATCH_VFM(INTEL_LUNARLAKE_M, HYBRID_SCALING_FACTOR_LNL),
 	{}
 };




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 3/9] PM: EM: Move perf rebuilding function from schedutil to EM
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
  2024-11-29 15:56 ` [RFC][PATCH v021 1/9] cpufreq: intel_pstate: Use CPPC to get scaling factors Rafael J. Wysocki
  2024-11-29 15:56 ` [RFC][PATCH v021 2/9] cpufreq: intel_pstate: Drop Arrow Lake from "scaling factor" list Rafael J. Wysocki
@ 2024-11-29 15:59 ` Rafael J. Wysocki
  2024-12-11 10:32   ` Christian Loehle
  2024-11-29 16:00 ` [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS Rafael J. Wysocki
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 15:59 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The sugov_eas_rebuild_sd() function defined in the schedutil cpufreq
governor implements generic functionality that may be useful in other
places.  In particular, going forward it will be used in the intel_pstate
driver.

For this reason, move it from schedutil to the energy model code and
rename it to em_rebuild_perf_domains().

This also helps to get rid of some #ifdeffery in schedutil which is a
plus.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v0.1 -> v0.2:
     * Update the comment regarding :register_em() in cpufreq.
     * Changelog edits.

---
 drivers/cpufreq/cpufreq.c        |    2 +-
 include/linux/energy_model.h     |    2 ++
 kernel/power/energy_model.c      |   17 +++++++++++++++++
 kernel/sched/cpufreq_schedutil.c |   33 ++++++---------------------------
 4 files changed, 26 insertions(+), 28 deletions(-)

Index: linux-pm/kernel/power/energy_model.c
===================================================================
--- linux-pm.orig/kernel/power/energy_model.c
+++ linux-pm/kernel/power/energy_model.c
@@ -908,3 +908,20 @@ int em_update_performance_limits(struct
 	return 0;
 }
 EXPORT_SYMBOL_GPL(em_update_performance_limits);
+
+static void rebuild_sd_workfn(struct work_struct *work)
+{
+	rebuild_sched_domains_energy();
+}
+
+static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
+
+void em_rebuild_perf_domains(void)
+{
+	/*
+	 * When called from the cpufreq_register_driver() path, the
+	 * cpu_hotplug_lock is already held, so use a work item to
+	 * avoid nested locking in rebuild_sched_domains().
+	 */
+	schedule_work(&rebuild_sd_work);
+}
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -604,31 +604,6 @@ static const struct kobj_type sugov_tuna
 
 /********************** cpufreq governor interface *********************/
 
-#ifdef CONFIG_ENERGY_MODEL
-static void rebuild_sd_workfn(struct work_struct *work)
-{
-	rebuild_sched_domains_energy();
-}
-
-static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
-
-/*
- * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
- * on governor changes to make sure the scheduler knows about it.
- */
-static void sugov_eas_rebuild_sd(void)
-{
-	/*
-	 * When called from the cpufreq_register_driver() path, the
-	 * cpu_hotplug_lock is already held, so use a work item to
-	 * avoid nested locking in rebuild_sched_domains().
-	 */
-	schedule_work(&rebuild_sd_work);
-}
-#else
-static inline void sugov_eas_rebuild_sd(void) { };
-#endif
-
 struct cpufreq_governor schedutil_gov;
 
 static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
@@ -784,7 +759,11 @@ static int sugov_init(struct cpufreq_pol
 		goto fail;
 
 out:
-	sugov_eas_rebuild_sd();
+	/*
+	 * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
+	 * on governor changes to make sure the scheduler knows about them.
+	 */
+	em_rebuild_perf_domains();
 	mutex_unlock(&global_tunables_lock);
 	return 0;
 
@@ -826,7 +805,7 @@ static void sugov_exit(struct cpufreq_po
 	sugov_policy_free(sg_policy);
 	cpufreq_disable_fast_switch(policy);
 
-	sugov_eas_rebuild_sd();
+	em_rebuild_perf_domains();
 }
 
 static int sugov_start(struct cpufreq_policy *policy)
Index: linux-pm/include/linux/energy_model.h
===================================================================
--- linux-pm.orig/include/linux/energy_model.h
+++ linux-pm/include/linux/energy_model.h
@@ -179,6 +179,7 @@ int em_dev_compute_costs(struct device *
 int em_dev_update_chip_binning(struct device *dev);
 int em_update_performance_limits(struct em_perf_domain *pd,
 		unsigned long freq_min_khz, unsigned long freq_max_khz);
+void em_rebuild_perf_domains(void);
 
 /**
  * em_pd_get_efficient_state() - Get an efficient performance state from the EM
@@ -404,6 +405,7 @@ int em_update_performance_limits(struct
 {
 	return -EINVAL;
 }
+static inline void em_rebuild_perf_domains(void) {}
 #endif
 
 #endif
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -1538,7 +1538,7 @@ static int cpufreq_online(unsigned int c
 
 		/*
 		 * Register with the energy model before
-		 * sugov_eas_rebuild_sd() is called, which will result
+		 * em_rebuild_perf_domains() is called, which will result
 		 * in rebuilding of the sched domains, which should only be done
 		 * once the energy model is properly initialized for the policy
 		 * first.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2024-11-29 15:59 ` [RFC][PATCH v021 3/9] PM: EM: Move perf rebuilding function from schedutil to EM Rafael J. Wysocki
@ 2024-11-29 16:00 ` Rafael J. Wysocki
  2024-12-11 10:33   ` Christian Loehle
  2024-12-16 14:49   ` Dietmar Eggemann
  2024-11-29 16:02 ` [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain() Rafael J. Wysocki
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 16:00 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make it possible to use EAS with cpufreq drivers that implement the
:setpolicy() callback instead of using generic cpufreq governors.

This is going to be necessary for using EAS with intel_pstate in its
default configuration.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is the minimum of what's needed, but I'd really prefer to move
the cpufreq vs EAS checks into cpufreq because messing around cpufreq
internals in topology.c feels like a butcher shop kind of exercise.

Besides, as I said before, I remain unconvinced about the usefulness
of these checks at all.  Yes, one is supposed to get the best results
from EAS when running schedutil, but what if they just want to try
something else with EAS?  What if they can get better results with
that other thing, surprisingly enough?

---
 kernel/sched/topology.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Index: linux-pm/kernel/sched/topology.c
===================================================================
--- linux-pm.orig/kernel/sched/topology.c
+++ linux-pm/kernel/sched/topology.c
@@ -217,6 +217,7 @@ static bool sched_is_eas_possible(const
 	bool any_asym_capacity = false;
 	struct cpufreq_policy *policy;
 	struct cpufreq_governor *gov;
+	bool cpufreq_ok;
 	int i;
 
 	/* EAS is enabled for asymmetric CPU capacity topologies. */
@@ -251,7 +252,7 @@ static bool sched_is_eas_possible(const
 		return false;
 	}
 
-	/* Do not attempt EAS if schedutil is not being used. */
+	/* Do not attempt EAS if cpufreq is not configured adequately */
 	for_each_cpu(i, cpu_mask) {
 		policy = cpufreq_cpu_get(i);
 		if (!policy) {
@@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
 			}
 			return false;
 		}
+		/* Require schedutil or a "setpolicy" driver */
 		gov = policy->governor;
+		cpufreq_ok = gov == &schedutil_gov ||
+				(!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
 		cpufreq_cpu_put(policy);
-		if (gov != &schedutil_gov) {
+		if (!cpufreq_ok) {
 			if (sched_debug()) {
-				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
+				pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
 					cpumask_pr_args(cpu_mask));
 			}
 			return false;




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain()
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2024-11-29 16:00 ` [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS Rafael J. Wysocki
@ 2024-11-29 16:02 ` Rafael J. Wysocki
  2024-12-17  9:38   ` Dietmar Eggemann
  2024-11-29 16:06 ` [RFC][PATCH v021 6/9] PM: EM: Call em_compute_costs() from em_create_perf_table() Rafael J. Wysocki
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 16:02 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Introduce a helper function for adding a CPU to an existing EM perf
domain.

Subsequently, this will be used by the intel_pstate driver to add new
CPUs to existing perf domains when those CPUs go online for the first
time after the initialization of the driver.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v0.1 -> v0.2: No changes

---
 include/linux/energy_model.h |    5 +++++
 kernel/power/energy_model.c  |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

Index: linux-pm/kernel/power/energy_model.c
===================================================================
--- linux-pm.orig/kernel/power/energy_model.c
+++ linux-pm/kernel/power/energy_model.c
@@ -676,6 +676,38 @@ void em_dev_unregister_perf_domain(struc
 }
 EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain);
 
+/**
+ * em_dev_expand_perf_domain() - Expand CPU perf domain
+ * @dev: CPU device of a CPU in the perf domain.
+ * @new_cpu: CPU to add to the perf domain.
+ */
+int em_dev_expand_perf_domain(struct device *dev, int new_cpu)
+{
+	struct device *new_cpu_dev;
+	struct em_perf_domain *pd;
+
+	if (IS_ERR_OR_NULL(dev) || !_is_cpu_device(dev))
+		return -EINVAL;
+
+	new_cpu_dev = get_cpu_device(new_cpu);
+	if (!new_cpu_dev)
+		return -EINVAL;
+
+	guard(mutex)(&em_pd_mutex);
+
+	if (em_pd_get(new_cpu_dev))
+		return -EEXIST;
+
+	pd = em_pd_get(dev);
+	if (!pd)
+		return -EINVAL;
+
+	cpumask_set_cpu(new_cpu, em_span_cpus(pd));
+	new_cpu_dev->em_pd = pd;
+
+	return 0;
+}
+
 static struct em_perf_table __rcu *em_table_dup(struct em_perf_domain *pd)
 {
 	struct em_perf_table __rcu *em_table;
Index: linux-pm/include/linux/energy_model.h
===================================================================
--- linux-pm.orig/include/linux/energy_model.h
+++ linux-pm/include/linux/energy_model.h
@@ -172,6 +172,7 @@ int em_dev_register_perf_domain(struct d
 				struct em_data_callback *cb, cpumask_t *span,
 				bool microwatts);
 void em_dev_unregister_perf_domain(struct device *dev);
+int em_dev_expand_perf_domain(struct device *dev, int new_cpu);
 struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);
 void em_table_free(struct em_perf_table __rcu *table);
 int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
@@ -354,6 +355,10 @@ int em_dev_register_perf_domain(struct d
 static inline void em_dev_unregister_perf_domain(struct device *dev)
 {
 }
+static inline int em_dev_expand_perf_domain(struct device *dev, int new_cpu)
+{
+	return -EINVAL;
+}
 static inline struct em_perf_domain *em_cpu_get(int cpu)
 {
 	return NULL;




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 6/9] PM: EM: Call em_compute_costs() from em_create_perf_table()
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2024-11-29 16:02 ` [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain() Rafael J. Wysocki
@ 2024-11-29 16:06 ` Rafael J. Wysocki
  2024-11-29 16:15 ` [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power() callbacks Rafael J. Wysocki
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 16:06 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In preparation for subsequent changes, move the em_compute_costs()
invocation from em_create_perf_table() to em_create_pd() which is its
only caller.

This helps to prepare the EM registration code for handling the case in
which the :active_power() EM callback is NULL.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v0.1 -> v0.2: Update changelog to explain what subsequent change depends
              on this patch.

---
 kernel/power/energy_model.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/power/energy_model.c
===================================================================
--- linux-pm.orig/kernel/power/energy_model.c
+++ linux-pm/kernel/power/energy_model.c
@@ -388,10 +388,6 @@ static int em_create_perf_table(struct d
 
 	em_init_performance(dev, pd, table, nr_states);
 
-	ret = em_compute_costs(dev, table, cb, nr_states, flags);
-	if (ret)
-		return -EINVAL;
-
 	return 0;
 }
 
@@ -434,6 +430,10 @@ static int em_create_pd(struct device *d
 	if (ret)
 		goto free_pd_table;
 
+	ret = em_compute_costs(dev, em_table->state, cb, nr_states, flags);
+	if (ret)
+		goto free_pd_table;
+
 	rcu_assign_pointer(pd->em_table, em_table);
 
 	if (_is_cpu_device(dev))




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power() callbacks
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (5 preceding siblings ...)
  2024-11-29 16:06 ` [RFC][PATCH v021 6/9] PM: EM: Call em_compute_costs() from em_create_perf_table() Rafael J. Wysocki
@ 2024-11-29 16:15 ` Rafael J. Wysocki
  2024-12-16 10:59   ` Dietmar Eggemann
  2024-11-29 16:21 ` [RFC][PATCH v021 8/9] cpufreq: intel_pstate: Introduce hybrid domains Rafael J. Wysocki
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 16:15 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow em_dev_register_perf_domain() to register a cost-only perf domain
with a one-element states table if the :active_power() callback is not
provided.

Subsequently, this will be used by the intel_pstate driver to register
such perf domains for CPUs on hybrid systems.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v0.1 -> v0.2:
     * Do not add one more local variable to em_dev_register_perf_domain()
       and make it skip the capacity check if the :active_power() callback
       is NULL.  Add a comment explaining why it is correct to do so.

For the time being, intel_pstate will only need to be able to say
something like "this group of CPUs is preferred to that other group
of CPUs for energy-efficiency reasons regardless of the required
performance level" (and there may be multiple groups).

For this purpose, it only needs a one-element states table in a perf
domain for each of the groups and it only needs to set the cost value
for each of them to reflect the preference.

However, it may need to put CPUs of different capacity into the same
group because of favored cores and this doesn't mean that one
group will contain CPUs of different mincro-architectures.  Favored
cores essentially follow the same power-performance curve as the
other CPUs of the same type up to a certain point, but they can get
beyond that point on the curve which the other CPUs of the same type
cannot do.  Thus it would make sense to use the same states table
for favored cores and the other CPUs of the same type even if there
were multiple states in it, and em_pd_get_efficient_state() could
easily take that into account by only using the states where
"performance" is at most equal to the current CPU capacity.

Unfortunately, this doesn't match the em_create_perf_table() code
design and in particular em_init_performance() which scales the
"performance" values the the current capacity of the "domain leader"
CPU represented by "dev".

IMV, it should instead scale them to the maximum possible capacity
of the highest-capacity CPU in the domain (without requiring equal
capacity at all).

That would also help to handle the cases when CPU capacity changes,
at least on Intel platforms, if it's necessary to worry about this some
day.  Namely, when CPU capacity is reduced, say by the platform firmware
(for thermal or power distribution reasons, for example), this effectively
means chopping off the top-most section of the power-performance curve,
but the rest of it remains intact, so the same states table as before
can be used, only the current CPU capacity needs to be changed to
prevent the top-most performance states from being taken into
consideration (and analogously when CPU capacity increases).

Turbo works pretty much in the same way: When it's enabled, the
power-performance curve is effectively extended by adding some states to
it and when turbo is disabled, the top-most part of the power-performance
curve effectively gets chopped off.

This observation may help to avoid having to update the energy model for
a chip.

---
 kernel/power/energy_model.c |   23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

Index: linux-pm/kernel/power/energy_model.c
===================================================================
--- linux-pm.orig/kernel/power/energy_model.c
+++ linux-pm/kernel/power/energy_model.c
@@ -426,9 +426,11 @@ static int em_create_pd(struct device *d
 	if (!em_table)
 		goto free_pd;
 
-	ret = em_create_perf_table(dev, pd, em_table->state, cb, flags);
-	if (ret)
-		goto free_pd_table;
+	if (cb->active_power) {
+		ret = em_create_perf_table(dev, pd, em_table->state, cb, flags);
+		if (ret)
+			goto free_pd_table;
+	}
 
 	ret = em_compute_costs(dev, em_table->state, cb, nr_states, flags);
 	if (ret)
@@ -566,6 +568,9 @@ int em_dev_register_perf_domain(struct d
 	if (!dev || !nr_states || !cb)
 		return -EINVAL;
 
+	if (!cb->active_power && (!cb->get_cost || nr_states > 1 || microwatts))
+		return -EINVAL;
+
 	/*
 	 * Use a mutex to serialize the registration of performance domains and
 	 * let the driver-defined callback functions sleep.
@@ -594,7 +599,19 @@ int em_dev_register_perf_domain(struct d
 			 * All CPUs of a domain must have the same
 			 * micro-architecture since they all share the same
 			 * table.
+			 *
+			 * If the :active_power() callback is present, the
+			 * performance values for different states in the table
+			 * computed by em_create_perf_table() depend on the CPU
+			 * capacity which therefore must be the same for all
+			 * CPUs in the domain.  However, if the :active_power()
+			 * callback is not NULL, em_create_perf_table() doesn't
+			 * run and there is no requirement regarding CPU
+			 * capacity any more.
 			 */
+			if (!cb->active_power)
+				continue;
+
 			cap = arch_scale_cpu_capacity(cpu);
 			if (prev_cap && prev_cap != cap) {
 				dev_err(dev, "EM: CPUs of %*pbl must have the same capacity\n",




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 8/9] cpufreq: intel_pstate: Introduce hybrid domains
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (6 preceding siblings ...)
  2024-11-29 16:15 ` [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power() callbacks Rafael J. Wysocki
@ 2024-11-29 16:21 ` Rafael J. Wysocki
  2024-12-12 17:04   ` Christian Loehle
  2024-11-29 16:28 ` [RFC][PATCH v021 9/9] cpufreq: intel_pstate: Add basic EAS support on hybrid platforms Rafael J. Wysocki
  2025-01-25 11:18 ` [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Dietmar Eggemann
  9 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 16:21 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Hybrid platforms contain different types of CPUs.  They may differ
by micro-architecture, by cache topology, by manufacturing process, by
the interconnect access design etc.  Of course, this means that power-
performance curves for CPUs of different types are generally different.

Because of these differences, CPUs of different types need to be handled
differently in certain situations and so it is convenient to operate
groups of CPUs that each contain CPUs of the same type.  In intel_pstate,
each of them will be represented by a struct hybrid_domain object and
referred to as a hybrid domain.

A key problem is how to identify the type of a CPUs so as to know which
hybrid domain it belongs to.  In principle, there are a few ways to do
it, but none of them is perfectly reliable.

From the computational perspective, an important factor is how many
instructions (on average) can be executed by the given CPU when it is
running at a specific frequency, often referred to as the IPC
(instructions per cycle) ratio of the given CPU to the least-capable
CPU in the system.  In intel_pstate this ratio is represented by the
performance-to-frequency scaling factor which needs to be used to get
a frequency in kHz for a given HWP performance level of the given CPU.
Since HWP performance levels are in the same units for all CPUs in a
hybrid system, the smaller the scaling factor, the larger the IPC ratio
for the given CPU.

Of course, the performance-to-frequency scaling factor must be the
same for all CPUs of the same type.  While it may be the same for CPUs
of different types, there is only one case in which that actually
happens (Meteor Lake platforms with two types of E-cores) and it is not
expected to happen again in the future.  Moreover, when it happens,
there is no straightforward way to distinguish CPUs of different types
with the same scaling factor in general.

For this reason, the scaling factor is as good as it gets for CPU
type identification and so it is used for building hybrid domains in
intel_pstate.

On hybrid systems, every CPU is added to a hybrid domain at the
initialization time.  If a hybrid domain with a matching scaling
factor is already present at that point, the CPU will be added to it.
Otherwise, a new hybrid domain will be created and the CPU will be
put into it.  The domain's scaling factor will then be set to the
one of the new CPU.

So far, the new code doesn't do much beyond printing debud messages,
but subsequently the EAS support for intel_pstate will be based on it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/intel_pstate.c |   57 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -943,6 +943,62 @@ static struct cpudata *hybrid_max_perf_c
  */
 static DEFINE_MUTEX(hybrid_capacity_lock);
 
+#ifdef CONFIG_ENERGY_MODEL
+/*
+ * A hybrid domain is a collection of CPUs with the same perf-to-frequency
+ * scaling factor.
+ */
+struct hybrid_domain {
+	struct hybrid_domain *next;
+	cpumask_t cpumask;
+	int scaling;
+};
+
+static struct hybrid_domain *hybrid_domains;
+
+static void hybrid_add_to_domain(struct cpudata *cpudata)
+{
+	int scaling = cpudata->pstate.scaling;
+	int cpu = cpudata->cpu;
+	struct hybrid_domain *hd;
+
+	/* Do this only on hubrid platforms. */
+	if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
+		return;
+
+	guard(mutex)(&hybrid_capacity_lock);
+
+	/* Look for an existing hybrid domain matching this CPU. */
+	for (hd = hybrid_domains; hd; hd = hd->next) {
+		if (hd->scaling == scaling) {
+			if (cpumask_test_cpu(cpu, &hd->cpumask))
+				return;
+
+			cpumask_set_cpu(cpu, &hd->cpumask);
+
+			pr_debug("CPU %d added to hybrid domain %*pbl\n", cpu,
+				 cpumask_pr_args(&hd->cpumask));
+			return;
+		}
+	}
+
+	/* No match.  Add a new one. */
+	hd = kzalloc(sizeof(*hd), GFP_KERNEL);
+	if (!hd)
+		return;
+
+	cpumask_set_cpu(cpu, &hd->cpumask);
+	hd->scaling = scaling;
+	hd->next = hybrid_domains;
+	hybrid_domains = hd;
+
+	pr_debug("New hybrid domain %*pbl: scaling = %d\n",
+		 cpumask_pr_args(&hd->cpumask), hd->scaling);
+}
+#else /* CONFIG_ENERGY_MODEL */
+static inline void hybrid_add_to_domain(struct cpudata *cpudata) {}
+#endif /* !CONFIG_ENERGY_MODEL */
+
 static void hybrid_set_cpu_capacity(struct cpudata *cpu)
 {
 	arch_set_cpu_capacity(cpu->cpu, cpu->capacity_perf,
@@ -2273,6 +2329,7 @@ static void intel_pstate_get_cpu_pstates
 				intel_pstate_hybrid_hwp_adjust(cpu);
 				hwp_is_hybrid = true;
 			}
+			hybrid_add_to_domain(cpu);
 		} else {
 			cpu->pstate.scaling = perf_ctl_scaling;
 		}




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH v021 9/9] cpufreq: intel_pstate: Add basic EAS support on hybrid platforms
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (7 preceding siblings ...)
  2024-11-29 16:21 ` [RFC][PATCH v021 8/9] cpufreq: intel_pstate: Introduce hybrid domains Rafael J. Wysocki
@ 2024-11-29 16:28 ` Rafael J. Wysocki
  2025-01-25 11:18 ` [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Dietmar Eggemann
  9 siblings, 0 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-11-29 16:28 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Modify intel_pstate to register EM perf domains for hybrid domains,
introduced previously, via em_dev_register_perf_domain(), and to use
em_dev_expand_perf_domain(), also introduced previously, for adding
new CPUs to existing EM perf domains when those CPUs become online for
the first time after driver initialization.

This change is targeting platforms (for example, Lunar Lake) where the
"little" CPUs (E-cores) are always more energy-efficient than the "big"
or "performance" CPUs (P-cores) when run at the same HWP performance
level, so it is sufficient to tell EAS that E-cores are always preferred
(so long as there is enough spare capacity on one of them to run the
given task).

Accordingly, an EM perf domain with 1-element states array is registered
for each hybrid domain in the system and each of them is assigned a
single cost value.

The observation used here is that the IPC ratio between the CPUs in a
given hybrid domain is inversely proportional to their performance-to-
frequency scaling factor (which must be the same for all of them) and
the cost of running on one of them can be assumed to be proportional
to that IPC ratio (in principle, the higher the IPC ratio, the more
resources are utilized when running at a given frequency, so the cost
should be higher).

EM perf domains for all CPUs that are online during system startup are
registered at the driver initialization time, after asymmetric capacity
support has been enabled.  For the CPUs that become online later and
do not belong to any existing hybrid domain, an EM perf domain is
registered when the hybrid domain for the given CPU is created.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v0.1 -> v0.2:
     * Some changes have been dropped because patch [8/9] takes
       care of what they did now.
     * Some corner cases related to changing driver operation modes
       are now handled.

For now, I'm sticking to the idea of having one "effective OPP" per
EM perf domain because (a) it should be sufficient for the target
use case IIUC and (b) anything more complex would need to be tied to
the capacity updates carried out by intel_pstate and that would be
a much heavier lifting.

Besides, I'd generally prefer to avoid updating the energy model on
every CPU capacity update because that might need to happen too often.

If my understanding of this thread is correct:

https://lore.kernel.org/lkml/20240830130309.2141697-1-vincent.guittot@linaro.org/

if there is only one "OPP" per perf domain, within a perf domain EAS will
look for a CPU with the highest spare capacity which is exactly what I
want and it will generally prefer domains with the lower cost values.

So far, I'm not aware of any upcoming hybrid platforms with significant
cross-over points between power-performance curves for different types
of CPUs, so the current approach (possibly with the addition of some kind
of load balancing) should be sufficient for them.

If such cross-over points become an issue the future, more complete energy
models will need to be used for systems where they are present, but that
will require solving the problem with scaling the performance numbers in
the states table to the current CPU capacity at init time.

---
 drivers/cpufreq/intel_pstate.c |   83 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 79 insertions(+), 4 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -950,12 +950,67 @@ static DEFINE_MUTEX(hybrid_capacity_lock
  */
 struct hybrid_domain {
 	struct hybrid_domain *next;
+	struct device *dev;
 	cpumask_t cpumask;
 	int scaling;
 };

 static struct hybrid_domain *hybrid_domains;

+static int hybrid_get_cost(struct device *dev, unsigned long freq_not_used,
+			   unsigned long *cost)
+{
+	struct pstate_data *pd = &all_cpu_data[dev->id]->pstate;
+
+	/*
+	 * The smaller the perf-to-frequency scaling factor, the larger the IPC
+	 * ratio between the CPUs in the given domain and the least capable CPU
+	 * in the system.  Assume the cost to be proportional to that IPC ratio
+	 * (and note that perf_ctl_scaling is always the same for all CPUs in
+	 * the system and it is equal to the perf-to-frequency scaling factor
+	 * for one CPU type).
+	 */
+	*cost = div_u64(100ULL * pd->perf_ctl_scaling, pd->scaling);
+
+	return 0;
+}
+
+static bool hybrid_register_perf_domain(struct hybrid_domain *hd)
+{
+	static struct em_data_callback em_cb = EM_ADV_DATA_CB(NULL, hybrid_get_cost);
+
+	/*
+	 * Registering EM perf domains without enabling asymmetric CPU capacity
+	 * support is not really useful and one domain should not be registered
+	 * more than once.
+	 */
+	if (!hybrid_max_perf_cpu || hd->dev)
+		return false;
+
+	hd->dev = get_cpu_device(cpumask_any(&hd->cpumask));
+	if (!hd->dev)
+		return false;
+
+	/*
+	 * Use one EM state in each domain to indicate that the cost associated
+	 * with it applies to all CPUs in it regardless of the frequency.
+	 */
+	if (em_dev_register_perf_domain(hd->dev, 1, &em_cb, &hd->cpumask, false)) {
+		hd->dev = NULL;
+		return false;
+	}
+
+	return true;
+}
+
+static void hybrid_register_all_perf_domains(void)
+{
+	struct hybrid_domain *hd;
+
+	for (hd = hybrid_domains; hd; hd = hd->next)
+		hybrid_register_perf_domain(hd);
+}
+
 static void hybrid_add_to_domain(struct cpudata *cpudata)
 {
 	int scaling = cpudata->pstate.scaling;
@@ -975,6 +1030,8 @@ static void hybrid_add_to_domain(struct
 				return;

 			cpumask_set_cpu(cpu, &hd->cpumask);
+			if (hd->dev)
+				em_dev_expand_perf_domain(hd->dev, cpu);

 			pr_debug("CPU %d added to hybrid domain %*pbl\n", cpu,
 				 cpumask_pr_args(&hd->cpumask));
@@ -991,11 +1048,14 @@ static void hybrid_add_to_domain(struct
 	hd->scaling = scaling;
 	hd->next = hybrid_domains;
 	hybrid_domains = hd;
+	if (hybrid_register_perf_domain(hd))
+		em_rebuild_perf_domains();

 	pr_debug("New hybrid domain %*pbl: scaling = %d\n",
 		 cpumask_pr_args(&hd->cpumask), hd->scaling);
 }
 #else /* CONFIG_ENERGY_MODEL */
+static inline void hybrid_register_all_perf_domains(void) {}
 static inline void hybrid_add_to_domain(struct cpudata *cpudata) {}
 #endif /* !CONFIG_ENERGY_MODEL */

@@ -1093,6 +1153,11 @@ static void hybrid_refresh_cpu_capacity_
 	guard(mutex)(&hybrid_capacity_lock);

 	__hybrid_refresh_cpu_capacity_scaling();
+	/*
+	 * Perf domains are not registered before setting hybrid_max_perf_cpu,
+	 * so register them all after setting up CPU capacity scaling.
+	 */
+	hybrid_register_all_perf_domains();
 }

 static void hybrid_init_cpu_capacity_scaling(bool refresh)
@@ -1116,7 +1181,7 @@ static void hybrid_init_cpu_capacity_sca
 		hybrid_refresh_cpu_capacity_scaling();
 		/*
 		 * Disabling ITMT causes sched domains to be rebuilt to disable asym
-		 * packing and enable asym capacity.
+		 * packing and enable asym capacity and EAS.
 		 */
 		sched_clear_itmt_support();
 	}
@@ -3467,6 +3532,8 @@ static ssize_t intel_pstate_show_status(

 static int intel_pstate_update_status(const char *buf, size_t size)
 {
+	int ret = -EINVAL;
+
 	if (size == 3 && !strncmp(buf, "off", size)) {
 		if (!intel_pstate_driver)
 			return -EINVAL;
@@ -3476,6 +3543,8 @@ static int intel_pstate_update_status(co

 		cpufreq_unregister_driver(intel_pstate_driver);
 		intel_pstate_driver_cleanup();
+		/* Trigger EAS support reconfiguration in case it was used. */
+		rebuild_sched_domains_energy();
 		return 0;
 	}

@@ -3487,7 +3556,13 @@ static int intel_pstate_update_status(co
 			cpufreq_unregister_driver(intel_pstate_driver);
 		}

-		return intel_pstate_register_driver(&intel_pstate);
+		ret = intel_pstate_register_driver(&intel_pstate);
+		/*
+		 * If the previous status had been "passive" and the schedutil
+		 * governor had been used, it disabled EAS on exit, so trigger
+		 * sched domains rebuild in case EAS needs to be enabled again.
+		 */
+		rebuild_sched_domains_energy();
 	}

 	if (size == 7 && !strncmp(buf, "passive", size)) {
@@ -3499,10 +3574,10 @@ static int intel_pstate_update_status(co
 			intel_pstate_sysfs_hide_hwp_dynamic_boost();
 		}

-		return intel_pstate_register_driver(&intel_cpufreq);
+		ret = intel_pstate_register_driver(&intel_cpufreq);
 	}

-	return -EINVAL;
+	return ret;
 }

 static int no_load __initdata;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 3/9] PM: EM: Move perf rebuilding function from schedutil to EM
  2024-11-29 15:59 ` [RFC][PATCH v021 3/9] PM: EM: Move perf rebuilding function from schedutil to EM Rafael J. Wysocki
@ 2024-12-11 10:32   ` Christian Loehle
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2024-12-11 10:32 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

On 11/29/24 15:59, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The sugov_eas_rebuild_sd() function defined in the schedutil cpufreq
> governor implements generic functionality that may be useful in other
> places.  In particular, going forward it will be used in the intel_pstate
> driver.
> 
> For this reason, move it from schedutil to the energy model code and
> rename it to em_rebuild_perf_domains().
> 
> This also helps to get rid of some #ifdeffery in schedutil which is a
> plus.
> 
> No intentional functional impact.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> v0.1 -> v0.2:
>      * Update the comment regarding :register_em() in cpufreq.
>      * Changelog edits.
> 
> ---
>  drivers/cpufreq/cpufreq.c        |    2 +-
>  include/linux/energy_model.h     |    2 ++
>  kernel/power/energy_model.c      |   17 +++++++++++++++++
>  kernel/sched/cpufreq_schedutil.c |   33 ++++++---------------------------
>  4 files changed, 26 insertions(+), 28 deletions(-)
> 
> Index: linux-pm/kernel/power/energy_model.c
> ===================================================================
> --- linux-pm.orig/kernel/power/energy_model.c
> +++ linux-pm/kernel/power/energy_model.c
> @@ -908,3 +908,20 @@ int em_update_performance_limits(struct
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(em_update_performance_limits);
> +
> +static void rebuild_sd_workfn(struct work_struct *work)
> +{
> +	rebuild_sched_domains_energy();
> +}
> +
> +static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
> +
> +void em_rebuild_perf_domains(void)
> +{
> +	/*
> +	 * When called from the cpufreq_register_driver() path, the
> +	 * cpu_hotplug_lock is already held, so use a work item to
> +	 * avoid nested locking in rebuild_sched_domains().
> +	 */
> +	schedule_work(&rebuild_sd_work);
> +}
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -604,31 +604,6 @@ static const struct kobj_type sugov_tuna
>  
>  /********************** cpufreq governor interface *********************/
>  
> -#ifdef CONFIG_ENERGY_MODEL
> -static void rebuild_sd_workfn(struct work_struct *work)
> -{
> -	rebuild_sched_domains_energy();
> -}
> -
> -static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
> -
> -/*
> - * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
> - * on governor changes to make sure the scheduler knows about it.
> - */
> -static void sugov_eas_rebuild_sd(void)
> -{
> -	/*
> -	 * When called from the cpufreq_register_driver() path, the
> -	 * cpu_hotplug_lock is already held, so use a work item to
> -	 * avoid nested locking in rebuild_sched_domains().
> -	 */
> -	schedule_work(&rebuild_sd_work);
> -}
> -#else
> -static inline void sugov_eas_rebuild_sd(void) { };
> -#endif
> -
>  struct cpufreq_governor schedutil_gov;
>  
>  static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
> @@ -784,7 +759,11 @@ static int sugov_init(struct cpufreq_pol
>  		goto fail;
>  
>  out:
> -	sugov_eas_rebuild_sd();
> +	/*
> +	 * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
> +	 * on governor changes to make sure the scheduler knows about them.
> +	 */
> +	em_rebuild_perf_domains();

The sugov mention might be considered stale after the next patch?
Apart from that LGTM.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-11-29 16:00 ` [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS Rafael J. Wysocki
@ 2024-12-11 10:33   ` Christian Loehle
  2024-12-11 11:29     ` Rafael J. Wysocki
  2024-12-16 14:49   ` Dietmar Eggemann
  1 sibling, 1 reply; 31+ messages in thread
From: Christian Loehle @ 2024-12-11 10:33 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

On 11/29/24 16:00, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Make it possible to use EAS with cpufreq drivers that implement the
> :setpolicy() callback instead of using generic cpufreq governors.
> 
> This is going to be necessary for using EAS with intel_pstate in its
> default configuration.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> This is the minimum of what's needed, but I'd really prefer to move
> the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> internals in topology.c feels like a butcher shop kind of exercise.

Makes sense, something like cpufreq_eas_capable().

> 
> Besides, as I said before, I remain unconvinced about the usefulness
> of these checks at all.  Yes, one is supposed to get the best results
> from EAS when running schedutil, but what if they just want to try
> something else with EAS?  What if they can get better results with
> that other thing, surprisingly enough?

How do you imagine this to work then?
I assume we don't make any 'resulting-OPP-guesses' like
sugov_effective_cpu_perf() for any of the setpolicy governors.
Neither for dbs and I guess userspace.
What about standard powersave and performance?
Do we just have a cpufreq callback to ask which OPP to use for
the energy calculation? Assume lowest/highest?
(I don't think there is hardware where lowest/highest makes a
difference, so maybe not bothering with the complexity could
be an option, too.)

> 
> ---
>  kernel/sched/topology.c |   10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> Index: linux-pm/kernel/sched/topology.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/topology.c
> +++ linux-pm/kernel/sched/topology.c
> @@ -217,6 +217,7 @@ static bool sched_is_eas_possible(const
>  	bool any_asym_capacity = false;
>  	struct cpufreq_policy *policy;
>  	struct cpufreq_governor *gov;
> +	bool cpufreq_ok;
>  	int i;
>  
>  	/* EAS is enabled for asymmetric CPU capacity topologies. */
> @@ -251,7 +252,7 @@ static bool sched_is_eas_possible(const
>  		return false;
>  	}
>  
> -	/* Do not attempt EAS if schedutil is not being used. */
> +	/* Do not attempt EAS if cpufreq is not configured adequately */
>  	for_each_cpu(i, cpu_mask) {
>  		policy = cpufreq_cpu_get(i);
>  		if (!policy) {
> @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
>  			}
>  			return false;
>  		}
> +		/* Require schedutil or a "setpolicy" driver */
>  		gov = policy->governor;
> +		cpufreq_ok = gov == &schedutil_gov ||
> +				(!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
>  		cpufreq_cpu_put(policy);
> -		if (gov != &schedutil_gov) {
> +		if (!cpufreq_ok) {
>  			if (sched_debug()) {
> -				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> +				pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
>  					cpumask_pr_args(cpu_mask));
>  			}
>  			return false;

The logic here looks fine to me FWIW.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 10:33   ` Christian Loehle
@ 2024-12-11 11:29     ` Rafael J. Wysocki
  2024-12-11 11:43       ` Christian Loehle
  2024-12-11 13:25       ` Vincent Guittot
  0 siblings, 2 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-12-11 11:29 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
	Vincent Guittot, Ricardo Neri, Pierre Gondois

On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Make it possible to use EAS with cpufreq drivers that implement the
> > :setpolicy() callback instead of using generic cpufreq governors.
> >
> > This is going to be necessary for using EAS with intel_pstate in its
> > default configuration.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > This is the minimum of what's needed, but I'd really prefer to move
> > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > internals in topology.c feels like a butcher shop kind of exercise.
>
> Makes sense, something like cpufreq_eas_capable().
>
> >
> > Besides, as I said before, I remain unconvinced about the usefulness
> > of these checks at all.  Yes, one is supposed to get the best results
> > from EAS when running schedutil, but what if they just want to try
> > something else with EAS?  What if they can get better results with
> > that other thing, surprisingly enough?
>
> How do you imagine this to work then?
> I assume we don't make any 'resulting-OPP-guesses' like
> sugov_effective_cpu_perf() for any of the setpolicy governors.
> Neither for dbs and I guess userspace.
> What about standard powersave and performance?
> Do we just have a cpufreq callback to ask which OPP to use for
> the energy calculation? Assume lowest/highest?
> (I don't think there is hardware where lowest/highest makes a
> difference, so maybe not bothering with the complexity could
> be an option, too.)

In the "setpolicy" case there is no way to reliably predict the OPP
that is going to be used, so why bother?

In the other cases, and if the OPPs are actually known, EAS may still
make assumptions regarding which of them will be used that will match
the schedutil selection rules, but if the cpufreq governor happens to
choose a different OPP, this is not the end of the world.

Yes, you could have been more energy-efficient had you chosen to use
schedutil, but you chose otherwise and that's what you get.

> >
> > ---
> >  kernel/sched/topology.c |   10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > Index: linux-pm/kernel/sched/topology.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/topology.c
> > +++ linux-pm/kernel/sched/topology.c
> > @@ -217,6 +217,7 @@ static bool sched_is_eas_possible(const
> >       bool any_asym_capacity = false;
> >       struct cpufreq_policy *policy;
> >       struct cpufreq_governor *gov;
> > +     bool cpufreq_ok;
> >       int i;
> >
> >       /* EAS is enabled for asymmetric CPU capacity topologies. */
> > @@ -251,7 +252,7 @@ static bool sched_is_eas_possible(const
> >               return false;
> >       }
> >
> > -     /* Do not attempt EAS if schedutil is not being used. */
> > +     /* Do not attempt EAS if cpufreq is not configured adequately */
> >       for_each_cpu(i, cpu_mask) {
> >               policy = cpufreq_cpu_get(i);
> >               if (!policy) {
> > @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
> >                       }
> >                       return false;
> >               }
> > +             /* Require schedutil or a "setpolicy" driver */
> >               gov = policy->governor;
> > +             cpufreq_ok = gov == &schedutil_gov ||
> > +                             (!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
> >               cpufreq_cpu_put(policy);
> > -             if (gov != &schedutil_gov) {
> > +             if (!cpufreq_ok) {
> >                       if (sched_debug()) {
> > -                             pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> > +                             pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
> >                                       cpumask_pr_args(cpu_mask));
> >                       }
> >                       return false;
>
> The logic here looks fine to me FWIW.
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 11:29     ` Rafael J. Wysocki
@ 2024-12-11 11:43       ` Christian Loehle
  2024-12-11 11:59         ` Rafael J. Wysocki
  2024-12-11 13:25       ` Vincent Guittot
  1 sibling, 1 reply; 31+ messages in thread
From: Christian Loehle @ 2024-12-11 11:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Dietmar Eggemann, Morten Rasmussen,
	Vincent Guittot, Ricardo Neri, Pierre Gondois

On 12/11/24 11:29, Rafael J. Wysocki wrote:
> On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 11/29/24 16:00, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Make it possible to use EAS with cpufreq drivers that implement the
>>> :setpolicy() callback instead of using generic cpufreq governors.
>>>
>>> This is going to be necessary for using EAS with intel_pstate in its
>>> default configuration.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> This is the minimum of what's needed, but I'd really prefer to move
>>> the cpufreq vs EAS checks into cpufreq because messing around cpufreq
>>> internals in topology.c feels like a butcher shop kind of exercise.
>>
>> Makes sense, something like cpufreq_eas_capable().
>>
>>>
>>> Besides, as I said before, I remain unconvinced about the usefulness
>>> of these checks at all.  Yes, one is supposed to get the best results
>>> from EAS when running schedutil, but what if they just want to try
>>> something else with EAS?  What if they can get better results with
>>> that other thing, surprisingly enough?
>>
>> How do you imagine this to work then?
>> I assume we don't make any 'resulting-OPP-guesses' like
>> sugov_effective_cpu_perf() for any of the setpolicy governors.
>> Neither for dbs and I guess userspace.
>> What about standard powersave and performance?
>> Do we just have a cpufreq callback to ask which OPP to use for
>> the energy calculation? Assume lowest/highest?
>> (I don't think there is hardware where lowest/highest makes a
>> difference, so maybe not bothering with the complexity could
>> be an option, too.)
> 
> In the "setpolicy" case there is no way to reliably predict the OPP
> that is going to be used, so why bother?
> 
> In the other cases, and if the OPPs are actually known, EAS may still
> make assumptions regarding which of them will be used that will match
> the schedutil selection rules, but if the cpufreq governor happens to
> choose a different OPP, this is not the end of the world.

"Not the end of the world" as in the model making incorrect assumptions.
With the significant power-performance overlaps we see in mobile systems
taking sugov's guess while using powersave/performance (the !setpolicy
case) at least will make worse decisions.
See here for reference, first slide.
https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf

What about the config space, are you fine with everything relying on
CONFIG_CPU_FREQ_GOV_SCHEDUTIL?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 11:43       ` Christian Loehle
@ 2024-12-11 11:59         ` Rafael J. Wysocki
  0 siblings, 0 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-12-11 11:59 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba,
	Peter Zijlstra, Srinivas Pandruvada, Dietmar Eggemann,
	Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois

On Wed, Dec 11, 2024 at 12:44 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 12/11/24 11:29, Rafael J. Wysocki wrote:
> > On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 11/29/24 16:00, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Make it possible to use EAS with cpufreq drivers that implement the
> >>> :setpolicy() callback instead of using generic cpufreq governors.
> >>>
> >>> This is going to be necessary for using EAS with intel_pstate in its
> >>> default configuration.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>
> >>> This is the minimum of what's needed, but I'd really prefer to move
> >>> the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> >>> internals in topology.c feels like a butcher shop kind of exercise.
> >>
> >> Makes sense, something like cpufreq_eas_capable().
> >>
> >>>
> >>> Besides, as I said before, I remain unconvinced about the usefulness
> >>> of these checks at all.  Yes, one is supposed to get the best results
> >>> from EAS when running schedutil, but what if they just want to try
> >>> something else with EAS?  What if they can get better results with
> >>> that other thing, surprisingly enough?
> >>
> >> How do you imagine this to work then?
> >> I assume we don't make any 'resulting-OPP-guesses' like
> >> sugov_effective_cpu_perf() for any of the setpolicy governors.
> >> Neither for dbs and I guess userspace.
> >> What about standard powersave and performance?
> >> Do we just have a cpufreq callback to ask which OPP to use for
> >> the energy calculation? Assume lowest/highest?
> >> (I don't think there is hardware where lowest/highest makes a
> >> difference, so maybe not bothering with the complexity could
> >> be an option, too.)
> >
> > In the "setpolicy" case there is no way to reliably predict the OPP
> > that is going to be used, so why bother?
> >
> > In the other cases, and if the OPPs are actually known, EAS may still
> > make assumptions regarding which of them will be used that will match
> > the schedutil selection rules, but if the cpufreq governor happens to
> > choose a different OPP, this is not the end of the world.
>
> "Not the end of the world" as in the model making incorrect assumptions.
> With the significant power-performance overlaps we see in mobile systems
> taking sugov's guess while using powersave/performance (the !setpolicy
> case) at least will make worse decisions.
> See here for reference, first slide.
> https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf

I've never said it won't make worse decisions, but whoever decides
which governor to use should be able to check which one is better.

> What about the config space, are you fine with everything relying on
> CONFIG_CPU_FREQ_GOV_SCHEDUTIL?

Yes, that's fine.

I think that schedultil should be the default governor for EAS, but I
don't see why it should be regarded as the only one possible and so
enforced.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 11:29     ` Rafael J. Wysocki
  2024-12-11 11:43       ` Christian Loehle
@ 2024-12-11 13:25       ` Vincent Guittot
  2024-12-11 16:37         ` Rafael J. Wysocki
  1 sibling, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2024-12-11 13:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba,
	Peter Zijlstra, Srinivas Pandruvada, Dietmar Eggemann,
	Morten Rasmussen, Ricardo Neri, Pierre Gondois

On Wed, 11 Dec 2024 at 12:29, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > Make it possible to use EAS with cpufreq drivers that implement the
> > > :setpolicy() callback instead of using generic cpufreq governors.
> > >
> > > This is going to be necessary for using EAS with intel_pstate in its
> > > default configuration.
> > >
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >
> > > This is the minimum of what's needed, but I'd really prefer to move
> > > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > > internals in topology.c feels like a butcher shop kind of exercise.
> >
> > Makes sense, something like cpufreq_eas_capable().
> >
> > >
> > > Besides, as I said before, I remain unconvinced about the usefulness
> > > of these checks at all.  Yes, one is supposed to get the best results
> > > from EAS when running schedutil, but what if they just want to try
> > > something else with EAS?  What if they can get better results with
> > > that other thing, surprisingly enough?
> >
> > How do you imagine this to work then?
> > I assume we don't make any 'resulting-OPP-guesses' like
> > sugov_effective_cpu_perf() for any of the setpolicy governors.
> > Neither for dbs and I guess userspace.
> > What about standard powersave and performance?
> > Do we just have a cpufreq callback to ask which OPP to use for
> > the energy calculation? Assume lowest/highest?
> > (I don't think there is hardware where lowest/highest makes a
> > difference, so maybe not bothering with the complexity could
> > be an option, too.)
>
> In the "setpolicy" case there is no way to reliably predict the OPP
> that is going to be used, so why bother?
>
> In the other cases, and if the OPPs are actually known, EAS may still
> make assumptions regarding which of them will be used that will match
> the schedutil selection rules, but if the cpufreq governor happens to
> choose a different OPP, this is not the end of the world.

Should we add a new cpufreq governor fops to return the guest estimate
of the compute capacity selection ? something like
cpufreq_effective_cpu_perf(cpu, actual, min, max)
EAS needs to estimate what would be the next OPP; schedutil uses
sugov_effective_cpu_perf() and other governor could provide their own

>
> Yes, you could have been more energy-efficient had you chosen to use
> schedutil, but you chose otherwise and that's what you get.

Calling sugov_effective_cpu_perf() for another governor than schedutil
doesn't make sense. and do we handle the case when
CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not selected

>
> > >
> > > ---
> > >  kernel/sched/topology.c |   10 +++++++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > Index: linux-pm/kernel/sched/topology.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/topology.c
> > > +++ linux-pm/kernel/sched/topology.c
> > > @@ -217,6 +217,7 @@ static bool sched_is_eas_possible(const
> > >       bool any_asym_capacity = false;
> > >       struct cpufreq_policy *policy;
> > >       struct cpufreq_governor *gov;
> > > +     bool cpufreq_ok;
> > >       int i;
> > >
> > >       /* EAS is enabled for asymmetric CPU capacity topologies. */
> > > @@ -251,7 +252,7 @@ static bool sched_is_eas_possible(const
> > >               return false;
> > >       }
> > >
> > > -     /* Do not attempt EAS if schedutil is not being used. */
> > > +     /* Do not attempt EAS if cpufreq is not configured adequately */
> > >       for_each_cpu(i, cpu_mask) {
> > >               policy = cpufreq_cpu_get(i);
> > >               if (!policy) {
> > > @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
> > >                       }
> > >                       return false;
> > >               }
> > > +             /* Require schedutil or a "setpolicy" driver */
> > >               gov = policy->governor;
> > > +             cpufreq_ok = gov == &schedutil_gov ||
> > > +                             (!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
> > >               cpufreq_cpu_put(policy);
> > > -             if (gov != &schedutil_gov) {
> > > +             if (!cpufreq_ok) {
> > >                       if (sched_debug()) {
> > > -                             pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> > > +                             pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
> > >                                       cpumask_pr_args(cpu_mask));
> > >                       }
> > >                       return false;
> >
> > The logic here looks fine to me FWIW.
> >
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 13:25       ` Vincent Guittot
@ 2024-12-11 16:37         ` Rafael J. Wysocki
  2024-12-11 17:07           ` Vincent Guittot
  0 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-12-11 16:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Christian Loehle, Rafael J. Wysocki, Linux PM,
	LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Ricardo Neri, Pierre Gondois

On Wed, Dec 11, 2024 at 2:25 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 11 Dec 2024 at 12:29, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> > >
> > > On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > Make it possible to use EAS with cpufreq drivers that implement the
> > > > :setpolicy() callback instead of using generic cpufreq governors.
> > > >
> > > > This is going to be necessary for using EAS with intel_pstate in its
> > > > default configuration.
> > > >
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > ---
> > > >
> > > > This is the minimum of what's needed, but I'd really prefer to move
> > > > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > > > internals in topology.c feels like a butcher shop kind of exercise.
> > >
> > > Makes sense, something like cpufreq_eas_capable().
> > >
> > > >
> > > > Besides, as I said before, I remain unconvinced about the usefulness
> > > > of these checks at all.  Yes, one is supposed to get the best results
> > > > from EAS when running schedutil, but what if they just want to try
> > > > something else with EAS?  What if they can get better results with
> > > > that other thing, surprisingly enough?
> > >
> > > How do you imagine this to work then?
> > > I assume we don't make any 'resulting-OPP-guesses' like
> > > sugov_effective_cpu_perf() for any of the setpolicy governors.
> > > Neither for dbs and I guess userspace.
> > > What about standard powersave and performance?
> > > Do we just have a cpufreq callback to ask which OPP to use for
> > > the energy calculation? Assume lowest/highest?
> > > (I don't think there is hardware where lowest/highest makes a
> > > difference, so maybe not bothering with the complexity could
> > > be an option, too.)
> >
> > In the "setpolicy" case there is no way to reliably predict the OPP
> > that is going to be used, so why bother?
> >
> > In the other cases, and if the OPPs are actually known, EAS may still
> > make assumptions regarding which of them will be used that will match
> > the schedutil selection rules, but if the cpufreq governor happens to
> > choose a different OPP, this is not the end of the world.
>
> Should we add a new cpufreq governor fops to return the guest estimate
> of the compute capacity selection ? something like
> cpufreq_effective_cpu_perf(cpu, actual, min, max)
> EAS needs to estimate what would be the next OPP; schedutil uses
> sugov_effective_cpu_perf() and other governor could provide their own

Generally, yes.  And documented for that matter.

But it doesn't really tell you the OPP, but the performance level that
is going to be set for the given list of arguments IIUC.  An energy
model is needed to find an OPP for the given perf level.  Or generally
the cost of it for that matter.

> > Yes, you could have been more energy-efficient had you chosen to use
> > schedutil, but you chose otherwise and that's what you get.
>
> Calling sugov_effective_cpu_perf() for another governor than schedutil
> doesn't make sense.

It will work for intel_pstate in the "setpolicy" mode to a reasonable
approximation AFAICS.

> and do we handle the case when
> CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not selected

I don't think it's necessary to handle it.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 16:37         ` Rafael J. Wysocki
@ 2024-12-11 17:07           ` Vincent Guittot
  2024-12-11 17:55             ` Rafael J. Wysocki
  0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2024-12-11 17:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba,
	Peter Zijlstra, Srinivas Pandruvada, Dietmar Eggemann,
	Morten Rasmussen, Ricardo Neri, Pierre Gondois

On Wed, 11 Dec 2024 at 17:38, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Dec 11, 2024 at 2:25 PM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Wed, 11 Dec 2024 at 12:29, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> > > <christian.loehle@arm.com> wrote:
> > > >
> > > > On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > >
> > > > > Make it possible to use EAS with cpufreq drivers that implement the
> > > > > :setpolicy() callback instead of using generic cpufreq governors.
> > > > >
> > > > > This is going to be necessary for using EAS with intel_pstate in its
> > > > > default configuration.
> > > > >
> > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > ---
> > > > >
> > > > > This is the minimum of what's needed, but I'd really prefer to move
> > > > > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > > > > internals in topology.c feels like a butcher shop kind of exercise.
> > > >
> > > > Makes sense, something like cpufreq_eas_capable().
> > > >
> > > > >
> > > > > Besides, as I said before, I remain unconvinced about the usefulness
> > > > > of these checks at all.  Yes, one is supposed to get the best results
> > > > > from EAS when running schedutil, but what if they just want to try
> > > > > something else with EAS?  What if they can get better results with
> > > > > that other thing, surprisingly enough?
> > > >
> > > > How do you imagine this to work then?
> > > > I assume we don't make any 'resulting-OPP-guesses' like
> > > > sugov_effective_cpu_perf() for any of the setpolicy governors.
> > > > Neither for dbs and I guess userspace.
> > > > What about standard powersave and performance?
> > > > Do we just have a cpufreq callback to ask which OPP to use for
> > > > the energy calculation? Assume lowest/highest?
> > > > (I don't think there is hardware where lowest/highest makes a
> > > > difference, so maybe not bothering with the complexity could
> > > > be an option, too.)
> > >
> > > In the "setpolicy" case there is no way to reliably predict the OPP
> > > that is going to be used, so why bother?
> > >
> > > In the other cases, and if the OPPs are actually known, EAS may still
> > > make assumptions regarding which of them will be used that will match
> > > the schedutil selection rules, but if the cpufreq governor happens to
> > > choose a different OPP, this is not the end of the world.
> >
> > Should we add a new cpufreq governor fops to return the guest estimate
> > of the compute capacity selection ? something like
> > cpufreq_effective_cpu_perf(cpu, actual, min, max)
> > EAS needs to estimate what would be the next OPP; schedutil uses
> > sugov_effective_cpu_perf() and other governor could provide their own
>
> Generally, yes.  And documented for that matter.
>
> But it doesn't really tell you the OPP, but the performance level that
> is going to be set for the given list of arguments IIUC.  An energy

Yes, the governor return what performance level it will select and asl
to the cpufreq driver so EAS can directly map it to an OPP and a cost

> model is needed to find an OPP for the given perf level.  Or generally
> the cost of it for that matter.
>
> > > Yes, you could have been more energy-efficient had you chosen to use
> > > schedutil, but you chose otherwise and that's what you get.
> >
> > Calling sugov_effective_cpu_perf() for another governor than schedutil
> > doesn't make sense.
>
> It will work for intel_pstate in the "setpolicy" mode to a reasonable
> approximation AFAICS.
>
> > and do we handle the case when
> > CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not selected
>
> I don't think it's necessary to handle it.

I don't think that CI and others will be happy to possibly get an
undeclared function. Or you put a dependency of other governors with
CPU_FREQ_DEFAULT_GOV_SCHEDUTIL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 17:07           ` Vincent Guittot
@ 2024-12-11 17:55             ` Rafael J. Wysocki
  2024-12-12  8:34               ` Vincent Guittot
  0 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-12-11 17:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Christian Loehle, Rafael J. Wysocki, Linux PM,
	LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Ricardo Neri, Pierre Gondois

On Wed, Dec 11, 2024 at 6:08 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 11 Dec 2024 at 17:38, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Wed, Dec 11, 2024 at 2:25 PM Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Wed, 11 Dec 2024 at 12:29, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> > > > <christian.loehle@arm.com> wrote:
> > > > >
> > > > > On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > >
> > > > > > Make it possible to use EAS with cpufreq drivers that implement the
> > > > > > :setpolicy() callback instead of using generic cpufreq governors.
> > > > > >
> > > > > > This is going to be necessary for using EAS with intel_pstate in its
> > > > > > default configuration.
> > > > > >
> > > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > ---
> > > > > >
> > > > > > This is the minimum of what's needed, but I'd really prefer to move
> > > > > > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > > > > > internals in topology.c feels like a butcher shop kind of exercise.
> > > > >
> > > > > Makes sense, something like cpufreq_eas_capable().
> > > > >
> > > > > >
> > > > > > Besides, as I said before, I remain unconvinced about the usefulness
> > > > > > of these checks at all.  Yes, one is supposed to get the best results
> > > > > > from EAS when running schedutil, but what if they just want to try
> > > > > > something else with EAS?  What if they can get better results with
> > > > > > that other thing, surprisingly enough?
> > > > >
> > > > > How do you imagine this to work then?
> > > > > I assume we don't make any 'resulting-OPP-guesses' like
> > > > > sugov_effective_cpu_perf() for any of the setpolicy governors.
> > > > > Neither for dbs and I guess userspace.
> > > > > What about standard powersave and performance?
> > > > > Do we just have a cpufreq callback to ask which OPP to use for
> > > > > the energy calculation? Assume lowest/highest?
> > > > > (I don't think there is hardware where lowest/highest makes a
> > > > > difference, so maybe not bothering with the complexity could
> > > > > be an option, too.)
> > > >
> > > > In the "setpolicy" case there is no way to reliably predict the OPP
> > > > that is going to be used, so why bother?
> > > >
> > > > In the other cases, and if the OPPs are actually known, EAS may still
> > > > make assumptions regarding which of them will be used that will match
> > > > the schedutil selection rules, but if the cpufreq governor happens to
> > > > choose a different OPP, this is not the end of the world.
> > >
> > > Should we add a new cpufreq governor fops to return the guest estimate
> > > of the compute capacity selection ? something like
> > > cpufreq_effective_cpu_perf(cpu, actual, min, max)
> > > EAS needs to estimate what would be the next OPP; schedutil uses
> > > sugov_effective_cpu_perf() and other governor could provide their own
> >
> > Generally, yes.  And documented for that matter.
> >
> > But it doesn't really tell you the OPP, but the performance level that
> > is going to be set for the given list of arguments IIUC.  An energy
>
> Yes, the governor return what performance level it will select and asl
> to the cpufreq driver so EAS can directly map it to an OPP and a cost
>
> > model is needed to find an OPP for the given perf level.  Or generally
> > the cost of it for that matter.
> >
> > > > Yes, you could have been more energy-efficient had you chosen to use
> > > > schedutil, but you chose otherwise and that's what you get.
> > >
> > > Calling sugov_effective_cpu_perf() for another governor than schedutil
> > > doesn't make sense.
> >
> > It will work for intel_pstate in the "setpolicy" mode to a reasonable
> > approximation AFAICS.
> >
> > > and do we handle the case when
> > > CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not selected
> >
> > I don't think it's necessary to handle it.
>
> I don't think that CI and others will be happy to possibly get an
> undeclared function. Or you put a dependency of other governors with
> CPU_FREQ_DEFAULT_GOV_SCHEDUTIL

Do you mean CONFIG_CPU_FREQ_GOV_SCHEDUTIL?  Because
CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is only about whether or not schedutil
is the default governor.

I think that it is fine to require CONFIG_CPU_FREQ_GOV_SCHEDUTIL for EAS.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-11 17:55             ` Rafael J. Wysocki
@ 2024-12-12  8:34               ` Vincent Guittot
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2024-12-12  8:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba,
	Peter Zijlstra, Srinivas Pandruvada, Dietmar Eggemann,
	Morten Rasmussen, Ricardo Neri, Pierre Gondois

On Wed, 11 Dec 2024 at 18:55, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Dec 11, 2024 at 6:08 PM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Wed, 11 Dec 2024 at 17:38, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 2:25 PM Vincent Guittot
> > > <vincent.guittot@linaro.org> wrote:
> > > >
> > > > On Wed, 11 Dec 2024 at 12:29, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >
> > > > > On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> > > > > <christian.loehle@arm.com> wrote:
> > > > > >
> > > > > > On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > >
> > > > > > > Make it possible to use EAS with cpufreq drivers that implement the
> > > > > > > :setpolicy() callback instead of using generic cpufreq governors.
> > > > > > >
> > > > > > > This is going to be necessary for using EAS with intel_pstate in its
> > > > > > > default configuration.
> > > > > > >
> > > > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > > ---
> > > > > > >
> > > > > > > This is the minimum of what's needed, but I'd really prefer to move
> > > > > > > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > > > > > > internals in topology.c feels like a butcher shop kind of exercise.
> > > > > >
> > > > > > Makes sense, something like cpufreq_eas_capable().
> > > > > >
> > > > > > >
> > > > > > > Besides, as I said before, I remain unconvinced about the usefulness
> > > > > > > of these checks at all.  Yes, one is supposed to get the best results
> > > > > > > from EAS when running schedutil, but what if they just want to try
> > > > > > > something else with EAS?  What if they can get better results with
> > > > > > > that other thing, surprisingly enough?
> > > > > >
> > > > > > How do you imagine this to work then?
> > > > > > I assume we don't make any 'resulting-OPP-guesses' like
> > > > > > sugov_effective_cpu_perf() for any of the setpolicy governors.
> > > > > > Neither for dbs and I guess userspace.
> > > > > > What about standard powersave and performance?
> > > > > > Do we just have a cpufreq callback to ask which OPP to use for
> > > > > > the energy calculation? Assume lowest/highest?
> > > > > > (I don't think there is hardware where lowest/highest makes a
> > > > > > difference, so maybe not bothering with the complexity could
> > > > > > be an option, too.)
> > > > >
> > > > > In the "setpolicy" case there is no way to reliably predict the OPP
> > > > > that is going to be used, so why bother?
> > > > >
> > > > > In the other cases, and if the OPPs are actually known, EAS may still
> > > > > make assumptions regarding which of them will be used that will match
> > > > > the schedutil selection rules, but if the cpufreq governor happens to
> > > > > choose a different OPP, this is not the end of the world.
> > > >
> > > > Should we add a new cpufreq governor fops to return the guest estimate
> > > > of the compute capacity selection ? something like
> > > > cpufreq_effective_cpu_perf(cpu, actual, min, max)
> > > > EAS needs to estimate what would be the next OPP; schedutil uses
> > > > sugov_effective_cpu_perf() and other governor could provide their own
> > >
> > > Generally, yes.  And documented for that matter.
> > >
> > > But it doesn't really tell you the OPP, but the performance level that
> > > is going to be set for the given list of arguments IIUC.  An energy
> >
> > Yes, the governor return what performance level it will select and asl
> > to the cpufreq driver so EAS can directly map it to an OPP and a cost
> >
> > > model is needed to find an OPP for the given perf level.  Or generally
> > > the cost of it for that matter.
> > >
> > > > > Yes, you could have been more energy-efficient had you chosen to use
> > > > > schedutil, but you chose otherwise and that's what you get.
> > > >
> > > > Calling sugov_effective_cpu_perf() for another governor than schedutil
> > > > doesn't make sense.
> > >
> > > It will work for intel_pstate in the "setpolicy" mode to a reasonable
> > > approximation AFAICS.
> > >
> > > > and do we handle the case when
> > > > CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not selected
> > >
> > > I don't think it's necessary to handle it.
> >
> > I don't think that CI and others will be happy to possibly get an
> > undeclared function. Or you put a dependency of other governors with
> > CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
>
> Do you mean CONFIG_CPU_FREQ_GOV_SCHEDUTIL?  Because
> CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is only about whether or not schedutil
> is the default governor.

Yes, I mean CONFIG_CPU_FREQ_GOV_SCHEDUTIL

>
> I think that it is fine to require CONFIG_CPU_FREQ_GOV_SCHEDUTIL for EAS.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 8/9] cpufreq: intel_pstate: Introduce hybrid domains
  2024-11-29 16:21 ` [RFC][PATCH v021 8/9] cpufreq: intel_pstate: Introduce hybrid domains Rafael J. Wysocki
@ 2024-12-12 17:04   ` Christian Loehle
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2024-12-12 17:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Dietmar Eggemann, Morten Rasmussen, Vincent Guittot, Ricardo Neri,
	Pierre Gondois

On 11/29/24 16:21, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Hybrid platforms contain different types of CPUs.  They may differ
> by micro-architecture, by cache topology, by manufacturing process, by
> the interconnect access design etc.  Of course, this means that power-
> performance curves for CPUs of different types are generally different.
> 
> Because of these differences, CPUs of different types need to be handled
> differently in certain situations and so it is convenient to operate
> groups of CPUs that each contain CPUs of the same type.  In intel_pstate,
> each of them will be represented by a struct hybrid_domain object and
> referred to as a hybrid domain.
> 
> A key problem is how to identify the type of a CPUs so as to know which
> hybrid domain it belongs to.  In principle, there are a few ways to do
> it, but none of them is perfectly reliable.
> 
> From the computational perspective, an important factor is how many
> instructions (on average) can be executed by the given CPU when it is
> running at a specific frequency, often referred to as the IPC
> (instructions per cycle) ratio of the given CPU to the least-capable
> CPU in the system.  In intel_pstate this ratio is represented by the
> performance-to-frequency scaling factor which needs to be used to get
> a frequency in kHz for a given HWP performance level of the given CPU.
> Since HWP performance levels are in the same units for all CPUs in a
> hybrid system, the smaller the scaling factor, the larger the IPC ratio
> for the given CPU.
> 
> Of course, the performance-to-frequency scaling factor must be the
> same for all CPUs of the same type.  While it may be the same for CPUs
> of different types, there is only one case in which that actually
> happens (Meteor Lake platforms with two types of E-cores) and it is not
> expected to happen again in the future.  Moreover, when it happens,
> there is no straightforward way to distinguish CPUs of different types
> with the same scaling factor in general.
> 
> For this reason, the scaling factor is as good as it gets for CPU
> type identification and so it is used for building hybrid domains in
> intel_pstate.
> 
> On hybrid systems, every CPU is added to a hybrid domain at the
> initialization time.  If a hybrid domain with a matching scaling
> factor is already present at that point, the CPU will be added to it.
> Otherwise, a new hybrid domain will be created and the CPU will be
> put into it.  The domain's scaling factor will then be set to the
> one of the new CPU.

Just two irrelevant typos below, although for the unfamiliar maybe an
example debug message output from any Arrow Lake would make this more
concrete?

> 
> So far, the new code doesn't do much beyond printing debud messages,

s/debud/debug

> but subsequently the EAS support for intel_pstate will be based on it.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c |   57 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -943,6 +943,62 @@ static struct cpudata *hybrid_max_perf_c
>   */
>  static DEFINE_MUTEX(hybrid_capacity_lock);
>  
> +#ifdef CONFIG_ENERGY_MODEL
> +/*
> + * A hybrid domain is a collection of CPUs with the same perf-to-frequency
> + * scaling factor.
> + */
> +struct hybrid_domain {
> +	struct hybrid_domain *next;
> +	cpumask_t cpumask;
> +	int scaling;
> +};
> +
> +static struct hybrid_domain *hybrid_domains;
> +
> +static void hybrid_add_to_domain(struct cpudata *cpudata)
> +{
> +	int scaling = cpudata->pstate.scaling;
> +	int cpu = cpudata->cpu;
> +	struct hybrid_domain *hd;
> +
> +	/* Do this only on hubrid platforms. */

s/hubrid/hybrid


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power() callbacks
  2024-11-29 16:15 ` [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power() callbacks Rafael J. Wysocki
@ 2024-12-16 10:59   ` Dietmar Eggemann
  2024-12-16 11:58     ` Rafael J. Wysocki
  0 siblings, 1 reply; 31+ messages in thread
From: Dietmar Eggemann @ 2024-12-16 10:59 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois

On 29/11/2024 17:15, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In Subject:

with ho :active_power() - without ?
^^^^^^^

[...]

> @@ -594,7 +599,19 @@ int em_dev_register_perf_domain(struct d
>  			 * All CPUs of a domain must have the same
>  			 * micro-architecture since they all share the same
>  			 * table.
> +			 *
> +			 * If the :active_power() callback is present, the
> +			 * performance values for different states in the table
> +			 * computed by em_create_perf_table() depend on the CPU
> +			 * capacity which therefore must be the same for all
> +			 * CPUs in the domain.  However, if the :active_power()
> +			 * callback is not NULL, em_create_perf_table() doesn't

s/is not NULL/is NULL ?

[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power() callbacks
  2024-12-16 10:59   ` Dietmar Eggemann
@ 2024-12-16 11:58     ` Rafael J. Wysocki
  0 siblings, 0 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 11:58 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
	Ricardo Neri, Pierre Gondois

On Mon, Dec 16, 2024 at 11:59 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 29/11/2024 17:15, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> In Subject:
>
> with ho :active_power() - without ?
> ^^^^^^^

Sure, thanks!

> [...]
>
> > @@ -594,7 +599,19 @@ int em_dev_register_perf_domain(struct d
> >                        * All CPUs of a domain must have the same
> >                        * micro-architecture since they all share the same
> >                        * table.
> > +                      *
> > +                      * If the :active_power() callback is present, the
> > +                      * performance values for different states in the table
> > +                      * computed by em_create_perf_table() depend on the CPU
> > +                      * capacity which therefore must be the same for all
> > +                      * CPUs in the domain.  However, if the :active_power()
> > +                      * callback is not NULL, em_create_perf_table() doesn't
>
> s/is not NULL/is NULL ?
>
> [...]

Yeah, one negation too many, thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-11-29 16:00 ` [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS Rafael J. Wysocki
  2024-12-11 10:33   ` Christian Loehle
@ 2024-12-16 14:49   ` Dietmar Eggemann
  2024-12-16 14:58     ` Rafael J. Wysocki
  1 sibling, 1 reply; 31+ messages in thread
From: Dietmar Eggemann @ 2024-12-16 14:49 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois

On 29/11/2024 17:00, Rafael J. Wysocki wrote:

[...]

> @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
>  			}
>  			return false;
>  		}
> +		/* Require schedutil or a "setpolicy" driver */
>  		gov = policy->governor;
> +		cpufreq_ok = gov == &schedutil_gov ||
> +				(!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
>  		cpufreq_cpu_put(policy);
> -		if (gov != &schedutil_gov) {
> +		if (!cpufreq_ok) {
>  			if (sched_debug()) {
> -				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> +				pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
>  					cpumask_pr_args(cpu_mask));
>  			}
>  			return false;

build_perf_domains() which calls sched_is_eas_possible() has schedutil
(4) mentioned in the function header as well:

/*
 * EAS can be used on a root domain if it meets all the following
conditions:
 *    1. an Energy Model (EM) is available;
 *    2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
 *    3. no SMT is detected.
 *    4. schedutil is driving the frequency of all CPUs of the rd; <-- !
 *    5. frequency invariance support is present;
 */

IMHO, his patch should remove the function header since the conditions
in sched_is_eas_possible() have comments already or are self-explanatory.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS
  2024-12-16 14:49   ` Dietmar Eggemann
@ 2024-12-16 14:58     ` Rafael J. Wysocki
  0 siblings, 0 replies; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 14:58 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
	Ricardo Neri, Pierre Gondois

On Mon, Dec 16, 2024 at 3:49 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 29/11/2024 17:00, Rafael J. Wysocki wrote:
>
> [...]
>
> > @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
> >                       }
> >                       return false;
> >               }
> > +             /* Require schedutil or a "setpolicy" driver */
> >               gov = policy->governor;
> > +             cpufreq_ok = gov == &schedutil_gov ||
> > +                             (!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
> >               cpufreq_cpu_put(policy);
> > -             if (gov != &schedutil_gov) {
> > +             if (!cpufreq_ok) {
> >                       if (sched_debug()) {
> > -                             pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> > +                             pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
> >                                       cpumask_pr_args(cpu_mask));
> >                       }
> >                       return false;
>
> build_perf_domains() which calls sched_is_eas_possible() has schedutil
> (4) mentioned in the function header as well:
>
> /*
>  * EAS can be used on a root domain if it meets all the following
> conditions:
>  *    1. an Energy Model (EM) is available;
>  *    2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
>  *    3. no SMT is detected.
>  *    4. schedutil is driving the frequency of all CPUs of the rd; <-- !
>  *    5. frequency invariance support is present;
>  */
>
> IMHO, his patch should remove the function header since the conditions
> in sched_is_eas_possible() have comments already or are self-explanatory.

Fair enough.

I'm considering a patch moving these checks to cpufreq, then I'll
change all of that.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain()
  2024-11-29 16:02 ` [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain() Rafael J. Wysocki
@ 2024-12-17  9:38   ` Dietmar Eggemann
  2024-12-17 20:40     ` Rafael J. Wysocki
  0 siblings, 1 reply; 31+ messages in thread
From: Dietmar Eggemann @ 2024-12-17  9:38 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois

On 29/11/2024 17:02, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Introduce a helper function for adding a CPU to an existing EM perf
> domain.
> 
> Subsequently, this will be used by the intel_pstate driver to add new
> CPUs to existing perf domains when those CPUs go online for the first
> time after the initialization of the driver.
> 
> No intentional functional impact.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> v0.1 -> v0.2: No changes

Could you add information why this new EM interface is needed?

IIRC, you can't use the existing way (cpufreq_driver::register_em) since
it gets called to early (3) for the PD cpumasks to be ready. This issue
will be there for any system in which uarch domains are not congruent
with clock domains which we hadn't have to deal with Arm's heterogeneous
CPUs so far.

__init intel_pstate_init()

  intel_pstate_register_driver()

    cpufreq_register_driver()

      subsys_interface_register()

        sif->add_dev() -> cpufreq_add_dev()

          cpufreq_online()

            if (!new_policy && cpufreq_driver->online)

            else

              cpufreq_driver->init() -> intel_pstate_cpu_init()

                __intel_pstate_cpu_init()

                  intel_pstate_init_cpu()

                    intel_pstate_get_cpu_pstates()

                      hybrid_add_to_domain()

                        em_dev_expand_perf_domain()              <-- (1)

                  intel_pstate_init_acpi_perf_limits()

                    intel_pstate_set_itmt_prio()                 <-- (2)

            if (new_policy)

              cpufreq_driver->register_em()                      <-- (3)

    hybrid_init_cpu_capacity_scaling()

      hybrid_refresh_cpu_capacity_scaling()

        __hybrid_refresh_cpu_capacity_scaling()                  <-- (4)

        hybrid_register_all_perf_domains()

          hybrid_register_perf_domain()	

            em_dev_register_perf_domain()                        <-- (5)

      /* Enable EAS */
      sched_clear_itmt_support()                                 <-- (6)

Debugging this on a 'nosmt' i7-13700K (online mask =
[0,2,4,6,8,10,12,14,16-23]

(1) Add CPU to existing hybrid PD or create new hybrid PD.
(2) Triggers sched domain rebuild (+ enabling EAS) already here during
    startup ?
    IMHO, reason is that max_highest_perf > min_highest_perf because of
    different itmt prio
    Happens for CPU8 on my machine (after CPU8 is added to hybrid PD
    0,2,4,6,8) (itmt prio for CPU8=69 (1024) instead of 68 (1009)).
    So it looks like EAS is enabled before (6) ?	
(3) ARM's way to do (5)
(4) Setting hybrid_max_perf_cpu
(5) Register EM here
(6) Actual call to initially triggers sched domain rebuild (+ enable
    EAS) (done already in (2) on my machine)

So (3) is not possible for Intel hybrid since the policy's cpumask(s)
contain only one CPUs, i.e. CPUs are not sharing clock.
And those cpumasks have to be build under (1) to be used in (5)?

[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain()
  2024-12-17  9:38   ` Dietmar Eggemann
@ 2024-12-17 20:40     ` Rafael J. Wysocki
  2025-01-06 12:59       ` Dietmar Eggemann
  0 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2024-12-17 20:40 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
	Ricardo Neri, Pierre Gondois

On Tue, Dec 17, 2024 at 10:38 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 29/11/2024 17:02, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Introduce a helper function for adding a CPU to an existing EM perf
> > domain.
> >
> > Subsequently, this will be used by the intel_pstate driver to add new
> > CPUs to existing perf domains when those CPUs go online for the first
> > time after the initialization of the driver.
> >
> > No intentional functional impact.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > v0.1 -> v0.2: No changes
>
> Could you add information why this new EM interface is needed?

There is some of it in the changelog already.

In fact, it is only needed in a corner case when the system starts
with some CPUs offline and they only go online later (as a result of
an explicit request from user space).  That is the only case when a
new CPU may need to be added to an existing perf domain.

> IIRC, you can't use the existing way (cpufreq_driver::register_em) since
> it gets called to early (3) for the PD cpumasks to be ready. This issue
> will be there for any system in which uarch domains are not congruent
> with clock domains which we hadn't have to deal with Arm's heterogeneous
> CPUs so far.

Basically, yes.

> __init intel_pstate_init()
>
>   intel_pstate_register_driver()
>
>     cpufreq_register_driver()
>
>       subsys_interface_register()
>
>         sif->add_dev() -> cpufreq_add_dev()
>
>           cpufreq_online()
>
>             if (!new_policy && cpufreq_driver->online)
>
>             else
>
>               cpufreq_driver->init() -> intel_pstate_cpu_init()
>
>                 __intel_pstate_cpu_init()
>
>                   intel_pstate_init_cpu()
>
>                     intel_pstate_get_cpu_pstates()
>
>                       hybrid_add_to_domain()
>
>                         em_dev_expand_perf_domain()              <-- (1)
>
>                   intel_pstate_init_acpi_perf_limits()
>
>                     intel_pstate_set_itmt_prio()                 <-- (2)
>
>             if (new_policy)
>
>               cpufreq_driver->register_em()                      <-- (3)
>
>     hybrid_init_cpu_capacity_scaling()
>
>       hybrid_refresh_cpu_capacity_scaling()
>
>         __hybrid_refresh_cpu_capacity_scaling()                  <-- (4)
>
>         hybrid_register_all_perf_domains()
>
>           hybrid_register_perf_domain()
>
>             em_dev_register_perf_domain()                        <-- (5)
>
>       /* Enable EAS */
>       sched_clear_itmt_support()                                 <-- (6)
>
> Debugging this on a 'nosmt' i7-13700K (online mask =
> [0,2,4,6,8,10,12,14,16-23]
>
> (1) Add CPU to existing hybrid PD or create new hybrid PD.

Not exactly.

(1) is just to expand an existing perf domain if the CPU is new (was
offline all the time previously).

Likewise, the direct hybrid_register_perf_domain() call in
hybrid_add_to_domain() is to add a new perf domain if the given CPU is
new (was offline all the time previously) and is the first one of the
given type (say, the system is starting with all E-cores offline).
It won't succeed before (4).

For CPUs that are online to start with, hybrid_add_to_domain() assigns
them to hybrid domains and PDs are created for them in
hybrid_register_all_perf_domains().

> (2) Triggers sched domain rebuild (+ enabling EAS) already here during
>     startup ?

This is just to enable ITMT which is the default mechanism for Intel
hybrid platforms.  It also requires a rebuild of sched domains to be
enabled.

>     IMHO, reason is that max_highest_perf > min_highest_perf because of
>     different itmt prio

Yes (which means that the platform is at least not homogenous).

This really has been introduced for the handling of favored cores on
otherwise non-hybrid platforms (say Tiger Lake).

>     Happens for CPU8 on my machine (after CPU8 is added to hybrid PD
>     0,2,4,6,8) (itmt prio for CPU8=69 (1024) instead of 68 (1009)).
>     So it looks like EAS is enabled before (6) ?

No, it is ITMT because CPU8 is a favored core.

> (3) ARM's way to do (5)
> (4) Setting hybrid_max_perf_cpu
> (5) Register EM here
> (6) Actual call to initially triggers sched domain rebuild (+ enable
>     EAS) (done already in (2) on my machine)

This is the second rebuild of sched domains to turn off ITMT and
enable EAS.  The previous one is to enable ITMT.

The earlier enabling of ITMT could be avoided I think, but that would
be a complication on platforms that contain favored cores but
otherwise are not hybrid.

> So (3) is not possible for Intel hybrid since the policy's cpumask(s)

It is possible in principle, but not particularly convenient because
at that point it is not clear whether or not the platform is really
hybrid and SMT is off and so whether or not EAS is to be used.

> contain only one CPUs, i.e. CPUs are not sharing clock.
> And those cpumasks have to be build under (1) to be used in (5)?

They are built by the caller of (1) to be precise, but yes.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain()
  2024-12-17 20:40     ` Rafael J. Wysocki
@ 2025-01-06 12:59       ` Dietmar Eggemann
  0 siblings, 0 replies; 31+ messages in thread
From: Dietmar Eggemann @ 2025-01-06 12:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
	Ricardo Neri, Pierre Gondois

On 17/12/2024 21:40, Rafael J. Wysocki wrote:
> On Tue, Dec 17, 2024 at 10:38 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 29/11/2024 17:02, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Introduce a helper function for adding a CPU to an existing EM perf
>>> domain.
>>>
>>> Subsequently, this will be used by the intel_pstate driver to add new
>>> CPUs to existing perf domains when those CPUs go online for the first
>>> time after the initialization of the driver.
>>>
>>> No intentional functional impact.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> v0.1 -> v0.2: No changes
>>
>> Could you add information why this new EM interface is needed?
> 
> There is some of it in the changelog already.
> 
> In fact, it is only needed in a corner case when the system starts
> with some CPUs offline and they only go online later (as a result of
> an explicit request from user space).  That is the only case when a
> new CPU may need to be added to an existing perf domain.

OK, I see. Arm doesn't need this since we derive the masks from the
CPUfreq policies so far.

I just verified, we both keep hotplugged-out CPU within the PD. That's
why we mask the PD cpus with cpu_online_mask in:

find_energy_efficient_cpu()

  ...

  for (; pd; pd = pd->next)

    cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

>> IIRC, you can't use the existing way (cpufreq_driver::register_em) since
>> it gets called to early (3) for the PD cpumasks to be ready. This issue
>> will be there for any system in which uarch domains are not congruent
>> with clock domains which we hadn't have to deal with Arm's heterogeneous
>> CPUs so far.
> 
> Basically, yes.
> 
>> __init intel_pstate_init()
>>
>>   intel_pstate_register_driver()
>>
>>     cpufreq_register_driver()
>>
>>       subsys_interface_register()
>>
>>         sif->add_dev() -> cpufreq_add_dev()
>>
>>           cpufreq_online()
>>
>>             if (!new_policy && cpufreq_driver->online)
>>
>>             else
>>
>>               cpufreq_driver->init() -> intel_pstate_cpu_init()
>>
>>                 __intel_pstate_cpu_init()
>>
>>                   intel_pstate_init_cpu()
>>
>>                     intel_pstate_get_cpu_pstates()
>>
>>                       hybrid_add_to_domain()
>>
>>                         em_dev_expand_perf_domain()              <-- (1)
>>
>>                   intel_pstate_init_acpi_perf_limits()
>>
>>                     intel_pstate_set_itmt_prio()                 <-- (2)
>>
>>             if (new_policy)
>>
>>               cpufreq_driver->register_em()                      <-- (3)
>>
>>     hybrid_init_cpu_capacity_scaling()
>>
>>       hybrid_refresh_cpu_capacity_scaling()
>>
>>         __hybrid_refresh_cpu_capacity_scaling()                  <-- (4)
>>
>>         hybrid_register_all_perf_domains()
>>
>>           hybrid_register_perf_domain()
>>
>>             em_dev_register_perf_domain()                        <-- (5)
>>
>>       /* Enable EAS */
>>       sched_clear_itmt_support()                                 <-- (6)
>>
>> Debugging this on a 'nosmt' i7-13700K (online mask =
>> [0,2,4,6,8,10,12,14,16-23]
>>
>> (1) Add CPU to existing hybrid PD or create new hybrid PD.
> 
> Not exactly.
> 
> (1) is just to expand an existing perf domain if the CPU is new (was
> offline all the time previously).

OK.

> 
> Likewise, the direct hybrid_register_perf_domain() call in
> hybrid_add_to_domain() is to add a new perf domain if the given CPU is
> new (was offline all the time previously) and is the first one of the
> given type (say, the system is starting with all E-cores offline).
> It won't succeed before (4).
> 
> For CPUs that are online to start with, hybrid_add_to_domain() assigns
> them to hybrid domains and PDs are created for them in
> hybrid_register_all_perf_domains().

Understood.

> 
>> (2) Triggers sched domain rebuild (+ enabling EAS) already here during
>>     startup ?
> 
> This is just to enable ITMT which is the default mechanism for Intel
> hybrid platforms.  It also requires a rebuild of sched domains to be
> enabled.
> 
>>     IMHO, reason is that max_highest_perf > min_highest_perf because of
>>     different itmt prio
> 
> Yes (which means that the platform is at least not homogenous).
> 
> This really has been introduced for the handling of favored cores on
> otherwise non-hybrid platforms (say Tiger Lake).
> 
>>     Happens for CPU8 on my machine (after CPU8 is added to hybrid PD
>>     0,2,4,6,8) (itmt prio for CPU8=69 (1024) instead of 68 (1009)).
>>     So it looks like EAS is enabled before (6) ?
> 
> No, it is ITMT because CPU8 is a favored core.
> 
>> (3) ARM's way to do (5)
>> (4) Setting hybrid_max_perf_cpu
>> (5) Register EM here
>> (6) Actual call to initially triggers sched domain rebuild (+ enable
>>     EAS) (done already in (2) on my machine)
> 
> This is the second rebuild of sched domains to turn off ITMT and
> enable EAS.  The previous one is to enable ITMT.
> 
> The earlier enabling of ITMT could be avoided I think, but that would
> be a complication on platforms that contain favored cores but
> otherwise are not hybrid.

OK, the fact that EAS will already be enabled in (2) is not an issue IMHO.

> 
>> So (3) is not possible for Intel hybrid since the policy's cpumask(s)
> 
> It is possible in principle, but not particularly convenient because
> at that point it is not clear whether or not the platform is really
> hybrid and SMT is off and so whether or not EAS is to be used.
> 
>> contain only one CPUs, i.e. CPUs are not sharing clock.
>> And those cpumasks have to be build under (1) to be used in (5)?
> 
> They are built by the caller of (1) to be precise, but yes.

OK.

I still see an issue in putting all performance CPUs in one PD.

find_energy_efficient_cpu() assumes that all PD CPUs has the same CPU
capacity value.

  find_energy_efficient_cpu()

    ...

    for (; pd; pd = pd->next) {

      ...
      /* Account external pressure for the energy estimation */
      cpu = cpumask_first(cpus);
      cpu_actual_cap = get_actual_cpu_capacity(cpu); --> (1)


Even though X86 does not implement hw_load_avg() or
cpufreq_get_pressure() we would still assume that all PD CPUs have the
same cpu_capacity:


  get_actual_cpu_capacity() <-- (1)

    capacity = arch_scale_cpu_capacity(cpu)


Now the error introduced is small (1024 versus 1009) on my i7-13700K but
it's there. How big can those diffs based om itmt prio be?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
  2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
                   ` (8 preceding siblings ...)
  2024-11-29 16:28 ` [RFC][PATCH v021 9/9] cpufreq: intel_pstate: Add basic EAS support on hybrid platforms Rafael J. Wysocki
@ 2025-01-25 11:18 ` Dietmar Eggemann
  2025-01-27 13:57   ` Rafael J. Wysocki
  9 siblings, 1 reply; 31+ messages in thread
From: Dietmar Eggemann @ 2025-01-25 11:18 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: LKML, Lukasz Luba, Peter Zijlstra, Srinivas Pandruvada,
	Morten Rasmussen, Vincent Guittot, Ricardo Neri, Pierre Gondois

On 29/11/2024 16:55, Rafael J. Wysocki wrote:

[...]

> For easier access, the series is available on the experimental/intel_ostate
> branch in linux-pm.git:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate

I was wondering how we can test the EAS behaviour (power/perf) on Intel
hybrid machines.

I have system-wide RAPL 'power/energy-{cores,pkg}' events for power
(energy) on my i7-13700K (nosmt) so I can run an rt-app workload
(e.g. 30 5% tasks (0.8ms/16ms)) with:

perf stat -e power/energy-cores/,power/energy-pkg/ --repeat 10 ./rt-app.sh

Plus I can check for negative slack for those rt-app-test tasks (perf)
and do ftrace-based task placement evaluation.

base:

 Performance counter stats for 'system wide' (10 runs):

       52.67 Joules power/energy-cores/       ( +-  1.24% )
       85.09 Joules power/energy-pkg/         ( +-  0.83% )

   34.922801 +- 0.000736 seconds time elapsed ( +-  0.00% )


EAS:

 Performance counter stats for 'system wide' (10 runs):

       45.55 Joules power/energy-cores/       ( +-  1.07% )
       75.73 Joules power/energy-pkg/         ( +-  0.67% )

   34.93183 +- 0.00514 seconds time elapsed  ( +-  0.01% )

Do you have another (maybe more sophisticated) test methodology?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
  2025-01-25 11:18 ` [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Dietmar Eggemann
@ 2025-01-27 13:57   ` Rafael J. Wysocki
  2025-02-01 12:43     ` Christian Loehle
  0 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2025-01-27 13:57 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
	Ricardo Neri, Pierre Gondois, Zhang, Rui

On Sat, Jan 25, 2025 at 12:18 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 29/11/2024 16:55, Rafael J. Wysocki wrote:
>
> [...]
>
> > For easier access, the series is available on the experimental/intel_ostate
> > branch in linux-pm.git:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate
>
> I was wondering how we can test the EAS behaviour (power/perf) on Intel
> hybrid machines.

Thanks a lot for looking into this, much appreciated!

> I have system-wide RAPL 'power/energy-{cores,pkg}' events for power
> (energy) on my i7-13700K (nosmt) so I can run an rt-app workload
> (e.g. 30 5% tasks (0.8ms/16ms)) with:
>
> perf stat -e power/energy-cores/,power/energy-pkg/ --repeat 10 ./rt-app.sh
>
> Plus I can check for negative slack for those rt-app-test tasks (perf)
> and do ftrace-based task placement evaluation.
>
> base:
>
>  Performance counter stats for 'system wide' (10 runs):
>
>        52.67 Joules power/energy-cores/       ( +-  1.24% )
>        85.09 Joules power/energy-pkg/         ( +-  0.83% )
>
>    34.922801 +- 0.000736 seconds time elapsed ( +-  0.00% )
>
>
> EAS:
>
>  Performance counter stats for 'system wide' (10 runs):
>
>        45.55 Joules power/energy-cores/       ( +-  1.07% )
>        75.73 Joules power/energy-pkg/         ( +-  0.67% )
>
>    34.93183 +- 0.00514 seconds time elapsed  ( +-  0.01% )
>
> Do you have another (maybe more sophisticated) test methodology?

Not really more sophisticated, but we cast a wider net, so to speak.

For taks placement testing we use an internal utility that can create
arbitrary synthetic workloads and plot CPU utilization (and other
things) while they are running.  It is kind of similar to rt-app
AFAICS.

We also run various benchmarks and measure energy usage during these
runs, first in order to check if EAS helps in the cases when it is
expected to help, but also to see how it affects the benchmark scores
in general (because we don't want it to make too much of a "negative"
difference for "performance" workloads).

The above results are basically in-line with what we are observing,
but we often see less of a difference in terms of energy usage between
the baseline and EAS enabled.

We also see a lot of task migrations between CPUs in the "low-cost"
PD, mostly in the utilization range where we would expect EAS to make
a difference.  Those migrations are a bit of a concern although they
don't seem to affect benchmark scores.

We think that those migrations are related to the preference of CPUs
with the largest spare capacity, so I'm working on an alternative
approach to enabling EAS that will use per-CPU PDs to see if the
migrations can be reduced this way.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
  2025-01-27 13:57   ` Rafael J. Wysocki
@ 2025-02-01 12:43     ` Christian Loehle
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2025-02-01 12:43 UTC (permalink / raw)
  To: Rafael J. Wysocki, Dietmar Eggemann
  Cc: Rafael J. Wysocki, Linux PM, LKML, Lukasz Luba, Peter Zijlstra,
	Srinivas Pandruvada, Morten Rasmussen, Vincent Guittot,
	Ricardo Neri, Pierre Gondois, Zhang, Rui

On 1/27/25 13:57, Rafael J. Wysocki wrote:
> On Sat, Jan 25, 2025 at 12:18 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 29/11/2024 16:55, Rafael J. Wysocki wrote:
>>
>> [...]
>>
>>> For easier access, the series is available on the experimental/intel_ostate
>>> branch in linux-pm.git:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate
>>
>> I was wondering how we can test the EAS behaviour (power/perf) on Intel
>> hybrid machines.
> 
> Thanks a lot for looking into this, much appreciated!
> 
>> I have system-wide RAPL 'power/energy-{cores,pkg}' events for power
>> (energy) on my i7-13700K (nosmt) so I can run an rt-app workload
>> (e.g. 30 5% tasks (0.8ms/16ms)) with:
>>
>> perf stat -e power/energy-cores/,power/energy-pkg/ --repeat 10 ./rt-app.sh
>>
>> Plus I can check for negative slack for those rt-app-test tasks (perf)
>> and do ftrace-based task placement evaluation.
>>
>> base:
>>
>>  Performance counter stats for 'system wide' (10 runs):
>>
>>        52.67 Joules power/energy-cores/       ( +-  1.24% )
>>        85.09 Joules power/energy-pkg/         ( +-  0.83% )
>>
>>    34.922801 +- 0.000736 seconds time elapsed ( +-  0.00% )
>>
>>
>> EAS:
>>
>>  Performance counter stats for 'system wide' (10 runs):
>>
>>        45.55 Joules power/energy-cores/       ( +-  1.07% )
>>        75.73 Joules power/energy-pkg/         ( +-  0.67% )
>>
>>    34.93183 +- 0.00514 seconds time elapsed  ( +-  0.01% )
>>
>> Do you have another (maybe more sophisticated) test methodology?
> 
> Not really more sophisticated, but we cast a wider net, so to speak.
> 
> For taks placement testing we use an internal utility that can create
> arbitrary synthetic workloads and plot CPU utilization (and other
> things) while they are running.  It is kind of similar to rt-app
> AFAICS.
> 
> We also run various benchmarks and measure energy usage during these
> runs, first in order to check if EAS helps in the cases when it is
> expected to help, but also to see how it affects the benchmark scores
> in general (because we don't want it to make too much of a "negative"
> difference for "performance" workloads).

Any insights always appreciated.
I have an OSPM talk accepted about the recent EAS overutilized
proposals, which does touch upon being able to switch out of EAS
quickly enough, too. I will be including some x86 results from our
test machine, too.

> 
> The above results are basically in-line with what we are observing,
> but we often see less of a difference in terms of energy usage between
> the baseline and EAS enabled.
> 
> We also see a lot of task migrations between CPUs in the "low-cost"
> PD, mostly in the utilization range where we would expect EAS to make
> a difference.  Those migrations are a bit of a concern although they
> don't seem to affect benchmark scores.
> 
> We think that those migrations are related to the preference of CPUs
> with the largest spare capacity, so I'm working on an alternative
> approach to enabling EAS that will use per-CPU PDs to see if the
> migrations can be reduced this way.

We've had something like this actually, you might be interested [1].
You'd want something more flexible in terms of the margins (or a
non-energy-based approach based on e.g. spare-cap [2]), but just
sidestepping the CPU selection within the cluster?

Is there anything specifically worrying you about frequent e-core
wakeup migrations? A few things come to mind immediately like:
Idle state latency, cache, DVFS, per-core internals like branch
predictor training, maybe turbo states would also favor the same
core(s) to be active?
(I've played with the series, too and still have lots of questions
on how this interact with turbo states, but given that we can't
really deterministically trigger them, trying to experiment/measure
anything seems rather futile?)

Interestingly if anything we were more interested in reducing CPU
wakeups in the big cores, because of their higher static leakage,
while little cores have low static leakage and low cpuidle wakeup
cost and latency.

[1]
It should be noted that we were always more concerned between the
uArch differences instead of breaking ties between intra-cluster CPUs,
simply because that's where the big efficiency gains are.

https://lore.kernel.org/lkml/20220412134220.1588482-1-vincent.donnefort@arm.com/

[2]
Vincent Guittot's is currently proposing this, I don't think it would
work well ootb because of the single-OPP approach you took, but maybe
going from "same OPP" to e.g. "5% cap diff" remedies that?
https://lore.kernel.org/lkml/20241217160720.2397239-4-vincent.guittot@linaro.org/

Anyway to provide something useful on this thread as well, testing
on our Raptor Lake with nosmt (=8+8) (note that this doesn't necessarily
translate into the lunar lake series that is the focus here).
I can reproduce the same efficiency gains of around 20-25% on
common workloads, e.g. 20 iterations of 5mins Firefox 4K youtube
video playback (Acquired by RAPL power/energy-cores/ in Joules):

EAS:
628.6145 +-30.4479693342421
CAS:
829.172 +-29.422507961369337
(-24.2% energy used with EAS)

FWIW Dietmar's patch of adding cpu_capacity sysfs for the intel_pstate
setup path is pretty handy for testing at least, maybe it could still
be considered for upstream:
https://lore.kernel.org/lkml/91b37d34-6d9a-4623-87d8-0ff648ea2415@arm.com/

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-02-01 12:43 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-29 15:55 [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Rafael J. Wysocki
2024-11-29 15:56 ` [RFC][PATCH v021 1/9] cpufreq: intel_pstate: Use CPPC to get scaling factors Rafael J. Wysocki
2024-11-29 15:56 ` [RFC][PATCH v021 2/9] cpufreq: intel_pstate: Drop Arrow Lake from "scaling factor" list Rafael J. Wysocki
2024-11-29 15:59 ` [RFC][PATCH v021 3/9] PM: EM: Move perf rebuilding function from schedutil to EM Rafael J. Wysocki
2024-12-11 10:32   ` Christian Loehle
2024-11-29 16:00 ` [RFC][PATCH v021 4/9] sched/topology: Adjust cpufreq checks for EAS Rafael J. Wysocki
2024-12-11 10:33   ` Christian Loehle
2024-12-11 11:29     ` Rafael J. Wysocki
2024-12-11 11:43       ` Christian Loehle
2024-12-11 11:59         ` Rafael J. Wysocki
2024-12-11 13:25       ` Vincent Guittot
2024-12-11 16:37         ` Rafael J. Wysocki
2024-12-11 17:07           ` Vincent Guittot
2024-12-11 17:55             ` Rafael J. Wysocki
2024-12-12  8:34               ` Vincent Guittot
2024-12-16 14:49   ` Dietmar Eggemann
2024-12-16 14:58     ` Rafael J. Wysocki
2024-11-29 16:02 ` [RFC][PATCH v021 5/9] PM: EM: Introduce em_dev_expand_perf_domain() Rafael J. Wysocki
2024-12-17  9:38   ` Dietmar Eggemann
2024-12-17 20:40     ` Rafael J. Wysocki
2025-01-06 12:59       ` Dietmar Eggemann
2024-11-29 16:06 ` [RFC][PATCH v021 6/9] PM: EM: Call em_compute_costs() from em_create_perf_table() Rafael J. Wysocki
2024-11-29 16:15 ` [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power() callbacks Rafael J. Wysocki
2024-12-16 10:59   ` Dietmar Eggemann
2024-12-16 11:58     ` Rafael J. Wysocki
2024-11-29 16:21 ` [RFC][PATCH v021 8/9] cpufreq: intel_pstate: Introduce hybrid domains Rafael J. Wysocki
2024-12-12 17:04   ` Christian Loehle
2024-11-29 16:28 ` [RFC][PATCH v021 9/9] cpufreq: intel_pstate: Add basic EAS support on hybrid platforms Rafael J. Wysocki
2025-01-25 11:18 ` [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT Dietmar Eggemann
2025-01-27 13:57   ` Rafael J. Wysocki
2025-02-01 12:43     ` Christian Loehle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox