linux-pm.vger.kernel.org archive mirror
* [PATCH v3 0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems
@ 2024-08-28 11:45 Rafael J. Wysocki
  2024-08-28 11:47 ` [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling Rafael J. Wysocki
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-08-28 11:45 UTC (permalink / raw)
  To: x86 Maintainers
  Cc: LKML, Linux PM, Thomas Gleixner, Peter Zijlstra,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

Hi Everyone,

This is an update of

https://lore.kernel.org/linux-pm/4941491.31r3eYUQgx@rjwysocki.net/

which was an update of

https://lore.kernel.org/linux-pm/4908113.GXAFRqVoOG@rjwysocki.net/

It addresses Ricardo's review comments and fixes an issue with intel_pstate
operation mode changes that would cause it to attempt to enable hybrid CPU
capacity scaling after it had already been enabled during initialization.

The most visible difference from the previous version is that patch [1/3]
has been dropped, because it is no longer needed in light of the observation
that sched_clear_itmt_support() causes sched domains to be rebuilt.

Other than that, there are cosmetic differences in patch [1/2] (previously [2/3]),
and the new code in intel_pstate_register_driver() in patch [2/2] (previously
[3/3]) has been squashed into hybrid_init_cpu_scaling(), which now checks
whether or not to enable hybrid CPU capacity scaling (as it may have been
enabled already).

This series is available from the following git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=intel_pstate-testing

(with an extra debug commit on top).

The original cover letter quoted below still applies:

The purpose of this series is to provide the scheduler with asymmetric CPU
capacity information on x86 hybrid systems based on Intel hardware.

The asymmetric CPU capacity information is important on hybrid systems as it
allows utilization to be computed for tasks in a consistent way across all
CPUs in the system, regardless of their capacity.  This, in turn, allows
the schedutil cpufreq governor to set CPU performance levels consistently
in the cases when tasks migrate between CPUs of different capacities.  It
should also help to improve task placement and load balancing decisions on
hybrid systems and it is key for anything along the lines of EAS.

The information in question comes from the MSR_HWP_CAPABILITIES register and
is provided to the scheduler by the intel_pstate driver, as per the changelog
of patch [3/3].  Patch [2/3] introduces the arch infrastructure needed for
that (in the form of a per-CPU capacity variable) and patch [1/3] is a
preliminary code adjustment.

This is based on an RFC posted previously

https://lore.kernel.org/linux-pm/7663799.EvYhyI6sBW@kreacher/

but differs from it quite a bit (except for the first patch).  The most
significant difference is based on the observation that frequency
invariance needs to be adjusted to the capacity scaling on hybrid systems
for the complete scale-invariance to work as expected.

Thank you!




^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling
  2024-08-28 11:45 [PATCH v3 0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
@ 2024-08-28 11:47 ` Rafael J. Wysocki
  2024-09-03 13:57   ` Peter Zijlstra
  2024-09-04  6:22   ` Ricardo Neri
  2024-08-28 11:48 ` [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
  2024-09-04  7:25 ` [PATCH v3 0/2] x86 / " Ricardo Neri
  2 siblings, 2 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-08-28 11:47 UTC (permalink / raw)
  To: x86 Maintainers
  Cc: LKML, Linux PM, Thomas Gleixner, Peter Zijlstra,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order to be able to compute the sizes of tasks consistently across all
CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
information to the scheduler via arch_scale_cpu_capacity().  Moreover,
the value returned by arch_scale_freq_capacity() for the given CPU must
correspond to the arch_scale_cpu_capacity() return value for it, or
utilization computations will be inaccurate.

Add support for it through per-CPU variables holding the capacity and
maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
to compute arch_freq_scale for the current CPU, respectively.

In order to avoid adding measurable overhead for non-hybrid x86 systems,
which are the vast majority in the field, whether or not the new hybrid
CPU capacity scaling will be in effect is controlled by a static key.
This static key is set by calling arch_enable_hybrid_capacity_scale()
which also allocates memory for the per-CPU data and initializes it.
Next, arch_set_cpu_capacity() is used to set the per-CPU variables
mentioned above for each CPU and arch_rebuild_sched_domains() needs
to be called for the scheduler to realize that capacity-aware
scheduling can be used going forward.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3:
   * Rebase after dropping patch [1/3].
   * Rename arch_set_cpu_capacity() arguments.
   * Add empty line to arch_enable_hybrid_capacity_scale().
   * Declare local variables in scale_freq_tick() on one line.

v1 -> v2:
   * Replace WARN_ON_ONCE() with WARN_ONCE() (2 places).
   * Fix arch_enable_hybrid_capacity_scale() return value when hybrid
     capacity scaling is already enabled.
   * Allow arch_enable_hybrid_capacity_scale() to succeed when
     frequency-invariance is not enabled.
   * Fix arch_set_cpu_capacity() kerneldoc comment
   * Do not disable capacity scaling in disable_freq_invariance_workfn().
   * Relocate arch_hybrid_cap_scale_key definition.

---
 arch/x86/include/asm/topology.h  |   13 +++++
 arch/x86/kernel/cpu/aperfmperf.c |   89 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 100 insertions(+), 2 deletions(-)

Index: linux-pm/arch/x86/include/asm/topology.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/topology.h
+++ linux-pm/arch/x86/include/asm/topology.h
@@ -282,9 +282,22 @@ static inline long arch_scale_freq_capac
 }
 #define arch_scale_freq_capacity arch_scale_freq_capacity
 
+bool arch_enable_hybrid_capacity_scale(void);
+void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long max_cap,
+			   unsigned long cap_freq, unsigned long base_freq);
+
+unsigned long arch_scale_cpu_capacity(int cpu);
+#define arch_scale_cpu_capacity arch_scale_cpu_capacity
+
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
 extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
 #else
+static inline bool arch_enable_hybrid_capacity_scale(void) { return false; }
+static inline void arch_set_cpu_capacity(int cpu, unsigned long cap,
+					 unsigned long max_cap,
+					 unsigned long cap_freq,
+					 unsigned long base_freq) { }
+
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
 static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
 #endif
Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
+++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
@@ -349,9 +349,89 @@ static DECLARE_WORK(disable_freq_invaria
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
 
+static DEFINE_STATIC_KEY_FALSE(arch_hybrid_cap_scale_key);
+
+struct arch_hybrid_cpu_scale {
+	unsigned long capacity;
+	unsigned long freq_ratio;
+};
+
+static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;
+
+/**
+ * arch_enable_hybrid_capacity_scale - Enable hybrid CPU capacity scaling
+ *
+ * Allocate memory for per-CPU data used by hybrid CPU capacity scaling,
+ * initialize it and set the static key controlling its code paths.
+ *
+ * Must be called before arch_set_cpu_capacity().
+ */
+bool arch_enable_hybrid_capacity_scale(void)
+{
+	int cpu;
+
+	if (static_branch_unlikely(&arch_hybrid_cap_scale_key)) {
+		WARN_ONCE(1, "Hybrid CPU capacity scaling already enabled");
+		return true;
+	}
+
+	arch_cpu_scale = alloc_percpu(struct arch_hybrid_cpu_scale);
+	if (!arch_cpu_scale)
+		return false;
+
+	for_each_possible_cpu(cpu) {
+		per_cpu_ptr(arch_cpu_scale, cpu)->capacity = SCHED_CAPACITY_SCALE;
+		per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio = arch_max_freq_ratio;
+	}
+
+	static_branch_enable(&arch_hybrid_cap_scale_key);
+
+	pr_info("Hybrid CPU capacity scaling enabled\n");
+
+	return true;
+}
+
+/**
+ * arch_set_cpu_capacity - Set scale-invariance parameters for a CPU
+ * @cpu: Target CPU.
+ * @cap: Capacity of @cpu at its maximum frequency, relative to @max_cap.
+ * @max_cap: System-wide maximum CPU capacity.
+ * @cap_freq: Frequency of @cpu corresponding to @cap.
+ * @base_freq: Frequency of @cpu at which MPERF counts.
+ *
+ * The units in which @cap and @max_cap are expressed do not matter, so long
+ * as they are consistent, because the former is effectively divided by the
+ * latter.  Analogously for @cap_freq and @base_freq.
+ *
+ * After calling this function for all CPUs, call arch_rebuild_sched_domains()
+ * to let the scheduler know that capacity-aware scheduling can be used going
+ * forward.
+ */
+void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long max_cap,
+			   unsigned long cap_freq, unsigned long base_freq)
+{
+	if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
+		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
+			   div_u64(cap << SCHED_CAPACITY_SHIFT, max_cap));
+		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
+			   div_u64(cap_freq << SCHED_CAPACITY_SHIFT, base_freq));
+	} else {
+		WARN_ONCE(1, "Hybrid CPU capacity scaling not enabled");
+	}
+}
+
+unsigned long arch_scale_cpu_capacity(int cpu)
+{
+	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
+		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);
+
+	return SCHED_CAPACITY_SCALE;
+}
+EXPORT_SYMBOL_GPL(arch_scale_cpu_capacity);
+
 static void scale_freq_tick(u64 acnt, u64 mcnt)
 {
-	u64 freq_scale;
+	u64 freq_scale, freq_ratio;
 
 	if (!arch_scale_freq_invariant())
 		return;
@@ -359,7 +439,12 @@ static void scale_freq_tick(u64 acnt, u6
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
 
-	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
+	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
+		freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
+	else
+		freq_ratio = arch_max_freq_ratio;
+
+	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
 		goto error;
 
 	freq_scale = div64_u64(acnt, mcnt);





* [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-08-28 11:45 [PATCH v3 0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
  2024-08-28 11:47 ` [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling Rafael J. Wysocki
@ 2024-08-28 11:48 ` Rafael J. Wysocki
  2024-08-30  8:30   ` Shrikanth Hegde
                     ` (2 more replies)
  2024-09-04  7:25 ` [PATCH v3 0/2] x86 / " Ricardo Neri
  2 siblings, 3 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-08-28 11:48 UTC (permalink / raw)
  To: x86 Maintainers
  Cc: LKML, Linux PM, Thomas Gleixner, Peter Zijlstra,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make intel_pstate use the HWP_HIGHEST_PERF values from
MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
via the previously introduced arch_set_cpu_capacity() on hybrid
systems without SMT.

Setting asymmetric CPU capacity is generally necessary to allow the
scheduler to compute task sizes in a consistent way across all CPUs
in a system where they differ by capacity.  That, in turn, should help
to improve scheduling decisions.  It is also necessary for the schedutil
cpufreq governor to operate as expected on hybrid systems where tasks
migrate between CPUs of different capacities.

The underlying observation is that intel_pstate already uses
MSR_HWP_CAPABILITIES to get CPU performance information, which it
exposes via sysfs and bases CPU performance scaling on.  Thus using
this information for setting asymmetric CPU capacity is consistent
with what the driver has been doing already.  Moreover,
HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU, including
both the instructions-per-cycle (IPC) factor and the maximum turbo
frequency, and the units in which that value is expressed are the same
for all CPUs in the system, so the maximum capacity ratio between two
CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
values.  Of course, in principle that capacity ratio need not be
directly applicable at lower frequencies, so using it for providing the
asymmetric CPU capacity information to the scheduler is a rough
approximation, but it is as good as it gets.  Also, measurements
indicate that this approximation is not too bad in practice.

If the given system is hybrid and non-SMT, the new code disables ITMT
support in the scheduler (because it may get in the way of the asymmetric
CPU capacity code in the scheduler that is automatically enabled by setting
asymmetric CPU capacity) after initializing all online CPUs and finding
the one with the maximum HWP_HIGHEST_PERF value.  Next, it computes the
capacity number for each (online) CPU by dividing the product of its
HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.

When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
and if it is the one with the maximum HWP_HIGHEST_PERF value, the
capacity numbers for all of the other online CPUs are recomputed.  This
also takes care of a cleanup during driver operation mode changes.

Analogously, when a new CPU goes online, its capacity number is updated
and if its HWP_HIGHEST_PERF value is greater than the current maximum
one, the capacity numbers for all of the other online CPUs are
recomputed.

The case when the driver is notified of a CPU capacity change, either
through the HWP interrupt or through an ACPI notification, is handled
similarly to the CPU online case above, except that if the target CPU
is the current highest-capacity one and its capacity is reduced, the
capacity numbers for all of the other online CPUs need to be recomputed
as well.

If the driver's "no_trubo" sysfs attribute is updated, all of the CPU
capacity information is computed from scratch to reflect the new turbo
status.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3:
   * Do not enable hybrid capacity scaling again on driver mode changes.
   * Call sched_clear_itmt_support() after __hybrid_init_cpu_scaling()
     to avoid unnecessary rebuilding of sched domains.

v1 -> v2:
   * Check hybrid_max_perf_cpu in intel_pstate_update_limits_for_all().

---
 drivers/cpufreq/intel_pstate.c |  232 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 228 insertions(+), 4 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -16,6 +16,7 @@
 #include <linux/tick.h>
 #include <linux/slab.h>
 #include <linux/sched/cpufreq.h>
+#include <linux/sched/smt.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
@@ -215,6 +216,7 @@ struct global_params {
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
+ * @capacity_perf:	Highest perf used for scale invariance
  * @sched_flags:	Store scheduler flags for possible cross CPU update
  * @hwp_boost_min:	Last HWP boosted min performance
  * @suspended:		Whether or not the driver has been suspended.
@@ -253,6 +255,7 @@ struct cpudata {
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
+	unsigned int capacity_perf;
 	unsigned int sched_flags;
 	u32 hwp_boost_min;
 	bool suspended;
@@ -295,6 +298,7 @@ static int hwp_mode_bdw __ro_after_init;
 static bool per_cpu_limits __ro_after_init;
 static bool hwp_forced __ro_after_init;
 static bool hwp_boost __read_mostly;
+static bool hwp_is_hybrid;
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;
 
@@ -934,6 +938,135 @@ static struct freq_attr *hwp_cpufreq_att
 	NULL,
 };
 
+static struct cpudata *hybrid_max_perf_cpu __read_mostly;
+/*
+ * Protects hybrid_max_perf_cpu, the capacity_perf fields in struct cpudata,
+ * and the x86 arch scale-invariance information from concurrent updates.
+ */
+static DEFINE_MUTEX(hybrid_capacity_lock);
+
+static void hybrid_set_cpu_capacity(struct cpudata *cpu)
+{
+	arch_set_cpu_capacity(cpu->cpu, cpu->capacity_perf,
+			      hybrid_max_perf_cpu->capacity_perf,
+			      cpu->capacity_perf,
+			      cpu->pstate.max_pstate_physical);
+
+	pr_debug("CPU%d: perf = %u, max. perf = %u, base perf = %d\n", cpu->cpu,
+		 cpu->capacity_perf, hybrid_max_perf_cpu->capacity_perf,
+		 cpu->pstate.max_pstate_physical);
+}
+
+static void hybrid_clear_cpu_capacity(unsigned int cpunum)
+{
+	arch_set_cpu_capacity(cpunum, 1, 1, 1, 1);
+}
+
+static void hybrid_get_capacity_perf(struct cpudata *cpu)
+{
+	if (READ_ONCE(global.no_turbo)) {
+		cpu->capacity_perf = cpu->pstate.max_pstate_physical;
+		return;
+	}
+
+	cpu->capacity_perf = HWP_HIGHEST_PERF(READ_ONCE(cpu->hwp_cap_cached));
+}
+
+static void hybrid_set_capacity_of_cpus(void)
+{
+	int cpunum;
+
+	for_each_online_cpu(cpunum) {
+		struct cpudata *cpu = all_cpu_data[cpunum];
+
+		if (cpu)
+			hybrid_set_cpu_capacity(cpu);
+	}
+}
+
+static void hybrid_update_cpu_scaling(void)
+{
+	struct cpudata *max_perf_cpu = NULL;
+	unsigned int max_cap_perf = 0;
+	int cpunum;
+
+	for_each_online_cpu(cpunum) {
+		struct cpudata *cpu = all_cpu_data[cpunum];
+
+		if (!cpu)
+			continue;
+
+		/*
+		 * During initialization, CPU performance at full capacity needs
+		 * to be determined.
+		 */
+		if (!hybrid_max_perf_cpu)
+			hybrid_get_capacity_perf(cpu);
+
+		/*
+		 * If hybrid_max_perf_cpu is not NULL at this point, it is
+		 * being replaced, so don't take it into account when looking
+		 * for the new one.
+		 */
+		if (cpu == hybrid_max_perf_cpu)
+			continue;
+
+		if (cpu->capacity_perf > max_cap_perf) {
+			max_cap_perf = cpu->capacity_perf;
+			max_perf_cpu = cpu;
+		}
+	}
+
+	if (max_perf_cpu) {
+		hybrid_max_perf_cpu = max_perf_cpu;
+		hybrid_set_capacity_of_cpus();
+	} else {
+		pr_info("Found no CPUs with nonzero maximum performance\n");
+		/* Revert to the flat CPU capacity structure. */
+		for_each_online_cpu(cpunum)
+			hybrid_clear_cpu_capacity(cpunum);
+	}
+}
+
+static void __hybrid_init_cpu_scaling(void)
+{
+	hybrid_max_perf_cpu = NULL;
+	hybrid_update_cpu_scaling();
+}
+
+static void hybrid_init_cpu_scaling(void)
+{
+	bool disable_itmt = false;
+
+	mutex_lock(&hybrid_capacity_lock);
+
+	/*
+	 * If hybrid_max_perf_cpu is set at this point, the hybrid CPU capacity
+	 * scaling has been enabled already and the driver is just changing the
+	 * operation mode.
+	 */
+	if (hybrid_max_perf_cpu) {
+		__hybrid_init_cpu_scaling();
+		goto unlock;
+	}
+
+	/*
+	 * On hybrid systems, use asym capacity instead of ITMT, but because
+	 * the capacity of SMT threads is not deterministic even approximately,
+	 * do not do that when SMT is in use.
+	 */
+	if (hwp_is_hybrid && !sched_smt_active() && arch_enable_hybrid_capacity_scale()) {
+		__hybrid_init_cpu_scaling();
+		disable_itmt = true;
+	}
+
+unlock:
+	mutex_unlock(&hybrid_capacity_lock);
+
+	if (disable_itmt)
+		sched_clear_itmt_support();
+}
+
 static void __intel_pstate_get_hwp_cap(struct cpudata *cpu)
 {
 	u64 cap;
@@ -962,6 +1095,43 @@ static void intel_pstate_get_hwp_cap(str
 	}
 }
 
+static void hybrid_update_capacity(struct cpudata *cpu)
+{
+	unsigned int max_cap_perf;
+
+	mutex_lock(&hybrid_capacity_lock);
+
+	if (!hybrid_max_perf_cpu)
+		goto unlock;
+
+	/*
+	 * The maximum performance of the CPU may have changed, but assume
+	 * that the performance of the other CPUs has not changed.
+	 */
+	max_cap_perf = hybrid_max_perf_cpu->capacity_perf;
+
+	intel_pstate_get_hwp_cap(cpu);
+
+	hybrid_get_capacity_perf(cpu);
+	/* Should hybrid_max_perf_cpu be replaced by this CPU? */
+	if (cpu->capacity_perf > max_cap_perf) {
+		hybrid_max_perf_cpu = cpu;
+		hybrid_set_capacity_of_cpus();
+		goto unlock;
+	}
+
+	/* If this CPU is hybrid_max_perf_cpu, should it be replaced? */
+	if (cpu == hybrid_max_perf_cpu && cpu->capacity_perf < max_cap_perf) {
+		hybrid_update_cpu_scaling();
+		goto unlock;
+	}
+
+	hybrid_set_cpu_capacity(cpu);
+
+unlock:
+	mutex_unlock(&hybrid_capacity_lock);
+}
+
 static void intel_pstate_hwp_set(unsigned int cpu)
 {
 	struct cpudata *cpu_data = all_cpu_data[cpu];
@@ -1070,6 +1240,22 @@ static void intel_pstate_hwp_offline(str
 		value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE);
 
 	wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+
+	mutex_lock(&hybrid_capacity_lock);
+
+	if (!hybrid_max_perf_cpu) {
+		mutex_unlock(&hybrid_capacity_lock);
+
+		return;
+	}
+
+	if (hybrid_max_perf_cpu == cpu)
+		hybrid_update_cpu_scaling();
+
+	mutex_unlock(&hybrid_capacity_lock);
+
+	/* Reset the capacity of the CPU going offline to the initial value. */
+	hybrid_clear_cpu_capacity(cpu->cpu);
 }
 
 #define POWER_CTL_EE_ENABLE	1
@@ -1165,21 +1351,46 @@ static void __intel_pstate_update_max_fr
 static void intel_pstate_update_limits(unsigned int cpu)
 {
 	struct cpufreq_policy *policy = cpufreq_cpu_acquire(cpu);
+	struct cpudata *cpudata;
 
 	if (!policy)
 		return;
 
-	__intel_pstate_update_max_freq(all_cpu_data[cpu], policy);
+	cpudata = all_cpu_data[cpu];
+
+	__intel_pstate_update_max_freq(cpudata, policy);
+
+	/* Prevent the driver from being unregistered now. */
+	mutex_lock(&intel_pstate_driver_lock);
 
 	cpufreq_cpu_release(policy);
+
+	hybrid_update_capacity(cpudata);
+
+	mutex_unlock(&intel_pstate_driver_lock);
 }
 
 static void intel_pstate_update_limits_for_all(void)
 {
 	int cpu;
 
-	for_each_possible_cpu(cpu)
-		intel_pstate_update_limits(cpu);
+	for_each_possible_cpu(cpu) {
+		struct cpufreq_policy *policy = cpufreq_cpu_acquire(cpu);
+
+		if (!policy)
+			continue;
+
+		__intel_pstate_update_max_freq(all_cpu_data[cpu], policy);
+
+		cpufreq_cpu_release(policy);
+	}
+
+	mutex_lock(&hybrid_capacity_lock);
+
+	if (hybrid_max_perf_cpu)
+		__hybrid_init_cpu_scaling();
+
+	mutex_unlock(&hybrid_capacity_lock);
 }
 
 /************************** sysfs begin ************************/
@@ -1618,6 +1829,13 @@ static void intel_pstate_notify_work(str
 		__intel_pstate_update_max_freq(cpudata, policy);
 
 		cpufreq_cpu_release(policy);
+
+		/*
+		 * The driver will not be unregistered while this function is
+		 * running, so update the capacity without acquiring the driver
+		 * lock.
+		 */
+		hybrid_update_capacity(cpudata);
 	}
 
 	wrmsrl_on_cpu(cpudata->cpu, MSR_HWP_STATUS, 0);
@@ -2034,8 +2252,10 @@ static void intel_pstate_get_cpu_pstates
 
 		if (pstate_funcs.get_cpu_scaling) {
 			cpu->pstate.scaling = pstate_funcs.get_cpu_scaling(cpu->cpu);
-			if (cpu->pstate.scaling != perf_ctl_scaling)
+			if (cpu->pstate.scaling != perf_ctl_scaling) {
 				intel_pstate_hybrid_hwp_adjust(cpu);
+				hwp_is_hybrid = true;
+			}
 		} else {
 			cpu->pstate.scaling = perf_ctl_scaling;
 		}
@@ -2707,6 +2927,8 @@ static int intel_pstate_cpu_online(struc
 		 */
 		intel_pstate_hwp_reenable(cpu);
 		cpu->suspended = false;
+
+		hybrid_update_capacity(cpu);
 	}
 
 	return 0;
@@ -3147,6 +3369,8 @@ static int intel_pstate_register_driver(
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	hybrid_init_cpu_scaling();
+
 	return 0;
 }
 





* Re: [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-08-28 11:48 ` [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
@ 2024-08-30  8:30   ` Shrikanth Hegde
  2024-08-30 11:10     ` Rafael J. Wysocki
  2024-09-03 13:55   ` Peter Zijlstra
  2024-09-04  6:39   ` Ricardo Neri
  2 siblings, 1 reply; 14+ messages in thread
From: Shrikanth Hegde @ 2024-08-30  8:30 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: x86 Maintainers, LKML, Linux PM, Thomas Gleixner, Peter Zijlstra,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen



On 8/28/24 17:18, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
[...]
>   
> +static void hybrid_update_capacity(struct cpudata *cpu)
> +{
> +	unsigned int max_cap_perf;
> +
> +	mutex_lock(&hybrid_capacity_lock);
> +
> +	if (!hybrid_max_perf_cpu)
> +		goto unlock;
> +
> +	/*
> +	 * The maximum performance of the CPU may have changed, but assume
> +	 * that the performance of the other CPUs has not changed.
> +	 */
> +	max_cap_perf = hybrid_max_perf_cpu->capacity_perf;
> +
> +	intel_pstate_get_hwp_cap(cpu);
> +
> +	hybrid_get_capacity_perf(cpu);
> +	/* Should hybrid_max_perf_cpu be replaced by this CPU? */
> +	if (cpu->capacity_perf > max_cap_perf) {
> +		hybrid_max_perf_cpu = cpu;
> +		hybrid_set_capacity_of_cpus();
> +		goto unlock;
> +	}
> +
> +	/* If this CPU is hybrid_max_perf_cpu, should it be replaced? */
> +	if (cpu == hybrid_max_perf_cpu && cpu->capacity_perf < max_cap_perf) {
> +		hybrid_update_cpu_scaling();
> +		goto unlock;
> +	}

I assume this CPU capacity is based on freq. It doesn't change based on
irq or any upper scheduler classes such as dl and rt, right?

Can capacity_perf change slightly, or does it always change to the next
possible level? The reason for asking: if it can change slightly while the
CPU is still hybrid_max_perf_cpu, it would end up accessing all of the
per-CPU structures and changing them, which would be costly on larger
systems.


> +
> +	hybrid_set_cpu_capacity(cpu);
> +
> +unlock:
> +	mutex_unlock(&hybrid_capacity_lock);
> +}
> +
>   static void intel_pstate_hwp_set(unsigned int cpu)
>   {
>   	struct cpudata *cpu_data = all_cpu_data[cpu];
> @@ -1070,6 +1240,22 @@ static void intel_pstate_hwp_offline(str
>   		value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE);
>   
>   	wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +
> +	mutex_lock(&hybrid_capacity_lock);
> +
> +	if (!hybrid_max_perf_cpu) {
> +		mutex_unlock(&hybrid_capacity_lock);
> +
> +		return;
> +	}
> +
> +	if (hybrid_max_perf_cpu == cpu)
> +		hybrid_update_cpu_scaling();
> +
> +	mutex_unlock(&hybrid_capacity_lock);
> +
> +	/* Reset the capacity of the CPU going offline to the initial value. */
> +	hybrid_clear_cpu_capacity(cpu->cpu);
> 


* Re: [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-08-30  8:30   ` Shrikanth Hegde
@ 2024-08-30 11:10     ` Rafael J. Wysocki
  0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-08-30 11:10 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM,
	Thomas Gleixner, Peter Zijlstra, Srinivas Pandruvada,
	Rafael J. Wysocki, Dietmar Eggemann, Ricardo Neri, Tim Chen

On Fri, Aug 30, 2024 at 10:30 AM Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
>
>
> On 8/28/24 17:18, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> [...]
> >
> > +static void hybrid_update_capacity(struct cpudata *cpu)
> > +{
> > +     unsigned int max_cap_perf;
> > +
> > +     mutex_lock(&hybrid_capacity_lock);
> > +
> > +     if (!hybrid_max_perf_cpu)
> > +             goto unlock;
> > +
> > +     /*
> > +      * The maximum performance of the CPU may have changed, but assume
> > +      * that the performance of the other CPUs has not changed.
> > +      */
> > +     max_cap_perf = hybrid_max_perf_cpu->capacity_perf;
> > +
> > +     intel_pstate_get_hwp_cap(cpu);
> > +
> > +     hybrid_get_capacity_perf(cpu);
> > +     /* Should hybrid_max_perf_cpu be replaced by this CPU? */
> > +     if (cpu->capacity_perf > max_cap_perf) {
> > +             hybrid_max_perf_cpu = cpu;
> > +             hybrid_set_capacity_of_cpus();
> > +             goto unlock;
> > +     }
> > +
> > +     /* If this CPU is hybrid_max_perf_cpu, should it be replaced? */
> > +     if (cpu == hybrid_max_perf_cpu && cpu->capacity_perf < max_cap_perf) {
> > +             hybrid_update_cpu_scaling();
> > +             goto unlock;
> > +     }
>
> I assume this CPU capacity is based on freq. It doesnt change based on
> irq, any upper scheduler classes such dl, rt right?

Right.

It is based on how many instructions per cycle the CPU can execute,
which does not change at all, and what its maximum possible frequency
is.

> can capacity_perf change slightly or it can change such that it always
> changes to next possible level? The reason, if it can change slightly,
> but cpu is still hybrid_max_perf_cpu, it would end up accessing all the
> percpu structures and change it, that would be costly on larger systems.

This should only change when the maximum voltage that can be supplied
to the CPU changes, so always to the next level.


* Re: [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-08-28 11:48 ` [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
  2024-08-30  8:30   ` Shrikanth Hegde
@ 2024-09-03 13:55   ` Peter Zijlstra
  2024-09-03 14:02     ` Rafael J. Wysocki
  2024-09-04  6:39   ` Ricardo Neri
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2024-09-03 13:55 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: x86 Maintainers, LKML, Linux PM, Thomas Gleixner,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

On Wed, Aug 28, 2024 at 01:48:10PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Make intel_pstate use the HWP_HIGHEST_PERF values from
> MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> via the previously introduced arch_set_cpu_capacity() on hybrid
> systems without SMT.
> 
> Setting asymmetric CPU capacity is generally necessary to allow the
> scheduler to compute task sizes in a consistent way across all CPUs
> in a system where they differ by capacity.  That, in turn, should help
> to improve scheduling decisions.  It is also necessary for the schedutil
> cpufreq governor to operate as expected on hybrid systems where tasks
> migrate between CPUs of different capacities.
> 
> The underlying observation is that intel_pstate already uses
> MSR_HWP_CAPABILITIES to get CPU performance information which is
> exposed by it via sysfs and CPU performance scaling is based on it.
> Thus using this information for setting asymmetric CPU capacity is
> consistent with what the driver has been doing already.  Moreover,
> HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
> both the instructions-per-cycle (IPC) factor and the maximum turbo
> frequency and the units in which that value is expressed are the same
> for all CPUs in the system, so the maximum capacity ratio between two
> CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
> values.  Of course, in principle that capacity ratio need not be
> directly applicable at lower frequencies, so using it for providing the
> asymmetric CPU capacity information to the scheduler is a rough
> approximation, but it is as good as it gets.  Also, measurements
> indicate that this approximation is not too bad in practice.
> 
> If the given system is hybrid and non-SMT, the new code disables ITMT
> support in the scheduler (because it may get in the way of asymmetric CPU
> capacity code in the scheduler that automatically gets enabled by setting
> asymmetric CPU capacity) after initializing all online CPUs and finding
> the one with the maximum HWP_HIGHEST_PERF value.  Next, it computes the
> capacity number for each (online) CPU by dividing the product of its
> HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
> 
> When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
> and if it is the one with the maximum HWP_HIGHEST_PERF value, the
> capacity numbers for all of the other online CPUs are recomputed.  This
> also takes care of a cleanup during driver operation mode changes.
> 
> Analogously, when a new CPU goes online, its capacity number is updated
> and if its HWP_HIGHEST_PERF value is greater than the current maximum
> one, the capacity numbers for all of the other online CPUs are
> recomputed.
> 
> The case when the driver is notified of a CPU capacity change, either
> through the HWP interrupt or through an ACPI notification, is handled
> similarly to the CPU online case above, except that if the target CPU
> is the current highest-capacity one and its capacity is reduced, the
> capacity numbers for all of the other online CPUs need to be recomputed
> as well.
> 
> If the driver's "no_trubo" sysfs attribute is updated, all of the CPU

Trubo :-)

> capacity information is computed from scratch to reflect the new turbo
> status.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling
  2024-08-28 11:47 ` [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling Rafael J. Wysocki
@ 2024-09-03 13:57   ` Peter Zijlstra
  2024-09-04  6:22   ` Ricardo Neri
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2024-09-03 13:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: x86 Maintainers, LKML, Linux PM, Thomas Gleixner,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

On Wed, Aug 28, 2024 at 01:47:25PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> In order to be able to compute the sizes of tasks consistently across all
> CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
> information to the scheduler via arch_scale_cpu_capacity().  Moreover,
> the value returned by arch_scale_freq_capacity() for the given CPU must
> correspond to the arch_scale_cpu_capacity() return value for it, or
> utilization computations will be inaccurate.
> 
> Add support for it through per-CPU variables holding the capacity and
> maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
> be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
> to compute arch_freq_scale for the current CPU, respectively.
> 
> In order to avoid adding measurable overhead for non-hybrid x86 systems,
> which are the vast majority in the field, whether or not the new hybrid
> CPU capacity scaling will be in effect is controlled by a static key.
> This static key is set by calling arch_enable_hybrid_capacity_scale()
> which also allocates memory for the per-CPU data and initializes it.
> Next, arch_set_cpu_capacity() is used to set the per-CPU variables
> mentioned above for each CPU and arch_rebuild_sched_domains() needs
> to be called for the scheduler to realize that capacity-aware
> scheduling can be used going forward.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Looks about right; would be good to hear from the AMD folks if they can
use it as is, but if not, it should be simple enough to fix up later.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  arch/x86/include/asm/topology.h  |   13 +++++
>  arch/x86/kernel/cpu/aperfmperf.c |   89 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 100 insertions(+), 2 deletions(-)
> 
> Index: linux-pm/arch/x86/include/asm/topology.h
> ===================================================================
> --- linux-pm.orig/arch/x86/include/asm/topology.h
> +++ linux-pm/arch/x86/include/asm/topology.h
> @@ -282,9 +282,22 @@ static inline long arch_scale_freq_capac
>  }
>  #define arch_scale_freq_capacity arch_scale_freq_capacity
>  
> +bool arch_enable_hybrid_capacity_scale(void);
> +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long max_cap,
> +			   unsigned long cap_freq, unsigned long base_freq);
> +
> +unsigned long arch_scale_cpu_capacity(int cpu);
> +#define arch_scale_cpu_capacity arch_scale_cpu_capacity
> +
>  extern void arch_set_max_freq_ratio(bool turbo_disabled);
>  extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
>  #else
> +static inline bool arch_enable_hybrid_capacity_scale(void) { return false; }
> +static inline void arch_set_cpu_capacity(int cpu, unsigned long cap,
> +					 unsigned long max_cap,
> +					 unsigned long cap_freq,
> +					 unsigned long base_freq) { }
> +
>  static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
>  static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
>  #endif
> Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> ===================================================================
> --- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
> +++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> @@ -349,9 +349,89 @@ static DECLARE_WORK(disable_freq_invaria
>  DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
>  EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
>  
> +static DEFINE_STATIC_KEY_FALSE(arch_hybrid_cap_scale_key);
> +
> +struct arch_hybrid_cpu_scale {
> +	unsigned long capacity;
> +	unsigned long freq_ratio;
> +};
> +
> +static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;
> +
> +/**
> + * arch_enable_hybrid_capacity_scale - Enable hybrid CPU capacity scaling
> + *
> + * Allocate memory for per-CPU data used by hybrid CPU capacity scaling,
> + * initialize it and set the static key controlling its code paths.
> + *
> + * Must be called before arch_set_cpu_capacity().
> + */
> +bool arch_enable_hybrid_capacity_scale(void)
> +{
> +	int cpu;
> +
> +	if (static_branch_unlikely(&arch_hybrid_cap_scale_key)) {
> +		WARN_ONCE(1, "Hybrid CPU capacity scaling already enabled");
> +		return true;
> +	}
> +
> +	arch_cpu_scale = alloc_percpu(struct arch_hybrid_cpu_scale);
> +	if (!arch_cpu_scale)
> +		return false;
> +
> +	for_each_possible_cpu(cpu) {
> +		per_cpu_ptr(arch_cpu_scale, cpu)->capacity = SCHED_CAPACITY_SCALE;
> +		per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio = arch_max_freq_ratio;
> +	}
> +
> +	static_branch_enable(&arch_hybrid_cap_scale_key);
> +
> +	pr_info("Hybrid CPU capacity scaling enabled\n");
> +
> +	return true;
> +}
> +
> +/**
> + * arch_set_cpu_capacity - Set scale-invariance parameters for a CPU
> + * @cpu: Target CPU.
> + * @cap: Capacity of @cpu at its maximum frequency, relative to @max_cap.
> + * @max_cap: System-wide maximum CPU capacity.
> + * @cap_freq: Frequency of @cpu corresponding to @cap.
> + * @base_freq: Frequency of @cpu at which MPERF counts.
> + *
> + * The units in which @cap and @max_cap are expressed do not matter, so long
> + * as they are consistent, because the former is effectively divided by the
> + * latter.  Analogously for @cap_freq and @base_freq.
> + *
> + * After calling this function for all CPUs, call arch_rebuild_sched_domains()
> + * to let the scheduler know that capacity-aware scheduling can be used going
> + * forward.
> + */
> +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long max_cap,
> +			   unsigned long cap_freq, unsigned long base_freq)
> +{
> +	if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
> +			   div_u64(cap << SCHED_CAPACITY_SHIFT, max_cap));
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
> +			   div_u64(cap_freq << SCHED_CAPACITY_SHIFT, base_freq));
> +	} else {
> +		WARN_ONCE(1, "Hybrid CPU capacity scaling not enabled");
> +	}
> +}
> +
> +unsigned long arch_scale_cpu_capacity(int cpu)
> +{
> +	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
> +		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);
> +
> +	return SCHED_CAPACITY_SCALE;
> +}
> +EXPORT_SYMBOL_GPL(arch_scale_cpu_capacity);
> +
>  static void scale_freq_tick(u64 acnt, u64 mcnt)
>  {
> -	u64 freq_scale;
> +	u64 freq_scale, freq_ratio;
>  
>  	if (!arch_scale_freq_invariant())
>  		return;
> @@ -359,7 +439,12 @@ static void scale_freq_tick(u64 acnt, u6
>  	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
>  		goto error;
>  
> -	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
> +	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
> +		freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
> +	else
> +		freq_ratio = arch_max_freq_ratio;
> +
> +	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
>  		goto error;
>  
>  	freq_scale = div64_u64(acnt, mcnt);
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-09-03 13:55   ` Peter Zijlstra
@ 2024-09-03 14:02     ` Rafael J. Wysocki
  0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-09-03 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM,
	Thomas Gleixner, Srinivas Pandruvada, Rafael J. Wysocki,
	Dietmar Eggemann, Ricardo Neri, Tim Chen

On Tue, Sep 3, 2024 at 3:55 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Aug 28, 2024 at 01:48:10PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Make intel_pstate use the HWP_HIGHEST_PERF values from
> > MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> > via the previously introduced arch_set_cpu_capacity() on hybrid
> > systems without SMT.
> >
> > Setting asymmetric CPU capacity is generally necessary to allow the
> > scheduler to compute task sizes in a consistent way across all CPUs
> > in a system where they differ by capacity.  That, in turn, should help
> > to improve scheduling decisions.  It is also necessary for the schedutil
> > cpufreq governor to operate as expected on hybrid systems where tasks
> > migrate between CPUs of different capacities.
> >
> > The underlying observation is that intel_pstate already uses
> > MSR_HWP_CAPABILITIES to get CPU performance information which is
> > exposed by it via sysfs and CPU performance scaling is based on it.
> > Thus using this information for setting asymmetric CPU capacity is
> > consistent with what the driver has been doing already.  Moreover,
> > HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
> > both the instructions-per-cycle (IPC) factor and the maximum turbo
> > frequency and the units in which that value is expressed are the same
> > for all CPUs in the system, so the maximum capacity ratio between two
> > CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
> > values.  Of course, in principle that capacity ratio need not be
> > directly applicable at lower frequencies, so using it for providing the
> > asymmetric CPU capacity information to the scheduler is a rough
> > approximation, but it is as good as it gets.  Also, measurements
> > indicate that this approximation is not too bad in practice.
> >
> > If the given system is hybrid and non-SMT, the new code disables ITMT
> > support in the scheduler (because it may get in the way of asymmetric CPU
> > capacity code in the scheduler that automatically gets enabled by setting
> > asymmetric CPU capacity) after initializing all online CPUs and finding
> > the one with the maximum HWP_HIGHEST_PERF value.  Next, it computes the
> > capacity number for each (online) CPU by dividing the product of its
> > HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
> >
> > When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
> > and if it is the one with the maximum HWP_HIGHEST_PERF value, the
> > capacity numbers for all of the other online CPUs are recomputed.  This
> > also takes care of a cleanup during driver operation mode changes.
> >
> > Analogously, when a new CPU goes online, its capacity number is updated
> > and if its HWP_HIGHEST_PERF value is greater than the current maximum
> > one, the capacity numbers for all of the other online CPUs are
> > recomputed.
> >
> > The case when the driver is notified of a CPU capacity change, either
> > through the HWP interrupt or through an ACPI notification, is handled
> > similarly to the CPU online case above, except that if the target CPU
> > is the current highest-capacity one and its capacity is reduced, the
> > capacity numbers for all of the other online CPUs need to be recomputed
> > as well.
> >
> > If the driver's "no_trubo" sysfs attribute is updated, all of the CPU
>
> Trubo :-)

Thanks, will fix before applying.

> > capacity information is computed from scratch to reflect the new turbo
> > status.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling
  2024-08-28 11:47 ` [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling Rafael J. Wysocki
  2024-09-03 13:57   ` Peter Zijlstra
@ 2024-09-04  6:22   ` Ricardo Neri
  2024-09-04 11:24     ` Rafael J. Wysocki
  1 sibling, 1 reply; 14+ messages in thread
From: Ricardo Neri @ 2024-09-04  6:22 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: x86 Maintainers, LKML, Linux PM, Thomas Gleixner, Peter Zijlstra,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

On Wed, Aug 28, 2024 at 01:47:25PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> In order to be able to compute the sizes of tasks consistently across all
> CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
> information to the scheduler via arch_scale_cpu_capacity().  Moreover,
> the value returned by arch_scale_freq_capacity() for the given CPU must
> correspond to the arch_scale_cpu_capacity() return value for it, or
> utilization computations will be inaccurate.
> 
> Add support for it through per-CPU variables holding the capacity and
> maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
> be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
> to compute arch_freq_scale for the current CPU, respectively.
> 
> In order to avoid adding measurable overhead for non-hybrid x86 systems,
> which are the vast majority in the field, whether or not the new hybrid
> CPU capacity scaling will be in effect is controlled by a static key.
> This static key is set by calling arch_enable_hybrid_capacity_scale()
> which also allocates memory for the per-CPU data and initializes it.
> Next, arch_set_cpu_capacity() is used to set the per-CPU variables
> mentioned above for each CPU and arch_rebuild_sched_domains() needs
> to be called for the scheduler to realize that capacity-aware
> scheduling can be used going forward.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> v2 -> v3:
>    * Rebase after dropping patch [1/3].
>    * Rename arch_set_cpu_capacity() arguments.
>    * Add empty line to arch_enable_hybrid_capacity_scale().
>    * Declare local variables in scale_freq_tick() on one line.
> 
> v1 -> v2:
>    * Replaces WARN_ON_ONCE() with WARN_ONCE() (2 places)
>    * Fix arch_enable_hybrid_capacity_scale() return value when hybrid
>      capacity scaling is already enabled.
>    * Allow arch_enable_hybrid_capacity_scale() to succeed when
>      frequency-invariance is not enabled.
>    * Fix arch_set_cpu_capacity() kerneldoc comment
>    * Do not disable capacity scaling in disable_freq_invariance_workfn().
>    * Relocate arch_hybrid_cap_scale_key definition.
> 
> ---

Only one minor comment below...

FWIW:
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance

[...]

> +
> +static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;
> +
> +/**
> + * arch_enable_hybrid_capacity_scale - Enable hybrid CPU capacity scaling

This looks to me like a kernel-doc comment. The function name should have ().

[...]
> +/**
> + * arch_set_cpu_capacity - Set scale-invariance parameters for a CPU

Same here.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-08-28 11:48 ` [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
  2024-08-30  8:30   ` Shrikanth Hegde
  2024-09-03 13:55   ` Peter Zijlstra
@ 2024-09-04  6:39   ` Ricardo Neri
  2024-09-04 11:29     ` Rafael J. Wysocki
  2 siblings, 1 reply; 14+ messages in thread
From: Ricardo Neri @ 2024-09-04  6:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: x86 Maintainers, LKML, Linux PM, Thomas Gleixner, Peter Zijlstra,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

On Wed, Aug 28, 2024 at 01:48:10PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Make intel_pstate use the HWP_HIGHEST_PERF values from
> MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> via the previously introduced arch_set_cpu_capacity() on hybrid
> systems without SMT.
> 
> Setting asymmetric CPU capacity is generally necessary to allow the
> scheduler to compute task sizes in a consistent way across all CPUs
> in a system where they differ by capacity.  That, in turn, should help
> to improve scheduling decisions.  It is also necessary for the schedutil
> cpufreq governor to operate as expected on hybrid systems where tasks
> migrate between CPUs of different capacities.
> 
> The underlying observation is that intel_pstate already uses
> MSR_HWP_CAPABILITIES to get CPU performance information which is
> exposed by it via sysfs and CPU performance scaling is based on it.
> Thus using this information for setting asymmetric CPU capacity is
> consistent with what the driver has been doing already.  Moreover,
> HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
> both the instructions-per-cycle (IPC) factor and the maximum turbo
> frequency and the units in which that value is expressed are the same
> for all CPUs in the system, so the maximum capacity ratio between two
> CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
> values.  Of course, in principle that capacity ratio need not be
> directly applicable at lower frequencies, so using it for providing the
> asymmetric CPU capacity information to the scheduler is a rough
> approximation, but it is as good as it gets.  Also, measurements
> indicate that this approximation is not too bad in practice.
> 
> If the given system is hybrid and non-SMT, the new code disables ITMT
> support in the scheduler (because it may get in the way of asymmetric CPU
> capacity code in the scheduler that automatically gets enabled by setting
> asymmetric CPU capacity) after initializing all online CPUs and finding
> the one with the maximum HWP_HIGHEST_PERF value.  Next, it computes the
> capacity number for each (online) CPU by dividing the product of its
> HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
> 
> When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
> and if it is the one with the maximum HWP_HIGHEST_PERF value, the
> capacity numbers for all of the other online CPUs are recomputed.  This
> also takes care of a cleanup during driver operation mode changes.
> 
> Analogously, when a new CPU goes online, its capacity number is updated
> and if its HWP_HIGHEST_PERF value is greater than the current maximum
> one, the capacity numbers for all of the other online CPUs are
> recomputed.
> 
> The case when the driver is notified of a CPU capacity change, either
> through the HWP interrupt or through an ACPI notification, is handled
> similarly to the CPU online case above, except that if the target CPU
> is the current highest-capacity one and its capacity is reduced, the
> capacity numbers for all of the other online CPUs need to be recomputed
> as well.
> 
> If the driver's "no_trubo" sysfs attribute is updated, all of the CPU
> capacity information is computed from scratch to reflect the new turbo
> status.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

A few minor comments below...

FWIW,

Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance

[...]

> +
> +static void hybrid_init_cpu_scaling(void)

Maybe renaming hybrid_init_cpu_scaling() as hybrid_init_cpu_capacity_scaling(),
__hybrid_init_cpu_scaling() as __hybrid_init_cpu_capacity_scaling(), and
hybrid_update_cpu_scaling() as hybrid_update_cpu_capacity_scaling()?

It would make the code easier to read.

> +{
> +	bool disable_itmt = false;
> +
> +	mutex_lock(&hybrid_capacity_lock);
> +
> +	/*
> +	 * If hybrid_max_perf_cpu is set at this point, the hybrid CPU capacity
> +	 * scaling has been enabled already and the driver is just changing the
> +	 * operation mode.
> +	 */
> +	if (hybrid_max_perf_cpu) {
> +		__hybrid_init_cpu_scaling();
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * On hybrid systems, use asym capacity instead of ITMT, but because
> +	 * the capacity of SMT threads is not deterministic even approximately,
> +	 * do not do that when SMT is in use.
> +	 */
> +	if (hwp_is_hybrid && !sched_smt_active() && arch_enable_hybrid_capacity_scale()) {
> +		__hybrid_init_cpu_scaling();
> +		disable_itmt = true;
> +	}
> +
> +unlock:
> +	mutex_unlock(&hybrid_capacity_lock);
> +
> +	if (disable_itmt)
> +		sched_clear_itmt_support();

It may be worth adding a comment here saying that the sched domains will
be rebuilt to disable asym packing and enable asym capacity.
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-08-28 11:45 [PATCH v3 0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
  2024-08-28 11:47 ` [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling Rafael J. Wysocki
  2024-08-28 11:48 ` [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
@ 2024-09-04  7:25 ` Ricardo Neri
  2024-09-04 11:30   ` Rafael J. Wysocki
  2 siblings, 1 reply; 14+ messages in thread
From: Ricardo Neri @ 2024-09-04  7:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: x86 Maintainers, LKML, Linux PM, Thomas Gleixner, Peter Zijlstra,
	Srinivas Pandruvada, Rafael J. Wysocki, Dietmar Eggemann,
	Ricardo Neri, Tim Chen

On Wed, Aug 28, 2024 at 01:45:00PM +0200, Rafael J. Wysocki wrote:
> Hi Everyone,
> 
> This is an update of
> 
> https://lore.kernel.org/linux-pm/4941491.31r3eYUQgx@rjwysocki.net/
> 
> which was an update of
> 
> https://lore.kernel.org/linux-pm/4908113.GXAFRqVoOG@rjwysocki.net/
> 
> It addresses Ricardo's review comments and fixes an issue with intel_pstate
> operation mode changes that would cause it to attempt to enable hybrid CPU
> capacity scaling after it has been already enabled during initialization.
> 
> The most visible difference with respect to the previous version is that
> patch [1/3] has been dropped because it is not needed any more after using
> the observation that sched_clear_itmt_support() would cause sched domains
> to be rebuilt.
> 
> Other than this, there are cosmetic differences in patch [1/2] (previously [2/3])
> and the new code in intel_pstate_register_driver() in patch [2/2] (previously [3/3])
> has been squashed into hybrid_init_cpu_scaling() which now checks whether or
> not to enable hybrid CPU capacity scaling (as it may have been enabled already).
> 
> This series is available from the following git branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=intel_pstate-testing
> 
> (with an extra debug commit on top).
> 
> The original cover letter quoted below still applies:
> 
> The purpose of this series is to provide the scheduler with asymmetric CPU
> capacity information on x86 hybrid systems based on Intel hardware.
> 
> The asymmetric CPU capacity information is important on hybrid systems as it
> allows utilization to be computed for tasks in a consistent way across all
> CPUs in the system, regardless of their capacity.  This, in turn, allows
> the schedutil cpufreq governor to set CPU performance levels consistently
> in the cases when tasks migrate between CPUs of different capacities.  It
> should also help to improve task placement and load balancing decisions on
> hybrid systems and it is key for anything along the lines of EAS.
> 
> The information in question comes from the MSR_HWP_CAPABILITIES register and
> is provided to the scheduler by the intel_pstate driver, as per the changelog
> of patch [3/3].  Patch [2/3] introduces the arch infrastructure needed for
> that (in the form of a per-CPU capacity variable) and patch [1/3] is a
> preliminary code adjustment.
> 
> This is based on an RFC posted previously
> 
> https://lore.kernel.org/linux-pm/7663799.EvYhyI6sBW@kreacher/
> 
> but differs from it quite a bit (except for the first patch).  The most
> significant difference is based on the observation that frequency-
> invariance needs to be adjusted to the capacity scaling on hybrid systems
> for the complete scale-invariance to work as expected.
> 
> Thank you!

Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance

You can look at the scaling invariance these patches achieve here

https://pasteboard.co/dhBAUjfr36Tx.png

I tested these patches on a Meteor Lake system. It has CPUs with three
levels of capacity (Pcore, Ecore, and Lcore).

The "Requested work" plot shows a sawtooth pattern of the amount of work
requested as a percentage of the maximum amount of work that can be
obtained from the biggest CPU running at its maximum frequency. The workload
continuously calls getcpu() in a time window of constant duration with
varying percentages of work.

The "Achieved work" plot shows that the Ecore and Lcore cannot complete
as much work as the PCore even when fully busy (see the "Busy %" plot).
Also, bigger CPUs have more idle time.

The "Scale freq capacity" plot shows that the current frequency of each CPU
is now scaled to 1024 by its respective max frequency. It no longer
uses the single arch_max_freq_ratio value. Capacity now scales correctly:
when running at its maximum frequency, the current capacity (see
"Current capacity" plot and refer to cap_scale()) now matches the value
from arch_scale_cpu_capacity() (see "CPU capacity" plot).

The "Task utilization" plot shows that task->util_avg is now invariant
across CPUs.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling
  2024-09-04  6:22   ` Ricardo Neri
@ 2024-09-04 11:24     ` Rafael J. Wysocki
  0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-09-04 11:24 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM,
	Thomas Gleixner, Peter Zijlstra, Srinivas Pandruvada,
	Rafael J. Wysocki, Dietmar Eggemann, Ricardo Neri, Tim Chen

On Wed, Sep 4, 2024 at 8:17 AM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> On Wed, Aug 28, 2024 at 01:47:25PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > In order to be able to compute the sizes of tasks consistently across all
> > CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
> > information to the scheduler via arch_scale_cpu_capacity().  Moreover,
> > the value returned by arch_scale_freq_capacity() for the given CPU must
> > correspond to the arch_scale_cpu_capacity() return value for it, or
> > utilization computations will be inaccurate.
> >
> > Add support for it through per-CPU variables holding the capacity and
> > maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
> > be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
> > to compute arch_freq_scale for the current CPU, respectively.
> >
> > In order to avoid adding measurable overhead for non-hybrid x86 systems,
> > which are the vast majority in the field, whether or not the new hybrid
> > CPU capacity scaling will be in effect is controlled by a static key.
> > This static key is set by calling arch_enable_hybrid_capacity_scale()
> > which also allocates memory for the per-CPU data and initializes it.
> > Next, arch_set_cpu_capacity() is used to set the per-CPU variables
> > mentioned above for each CPU and arch_rebuild_sched_domains() needs
> > to be called for the scheduler to realize that capacity-aware
> > scheduling can be used going forward.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > v2 -> v3:
> >    * Rebase after dropping patch [1/3].
> >    * Rename arch_set_cpu_capacity() arguments.
> >    * Add empty line to arch_enable_hybrid_capacity_scale().
> >    * Declare local variables in scale_freq_tick() on one line.
> >
> > v1 -> v2:
> >    * Replaces WARN_ON_ONCE() with WARN_ONCE() (2 places)
> >    * Fix arch_enable_hybrid_capacity_scale() return value when hybrid
> >      capacity scaling is already enabled.
> >    * Allow arch_enable_hybrid_capacity_scale() to succeed when
> >      frequency-invariance is not enabled.
> >    * Fix arch_set_cpu_capacity() kerneldoc comment
> >    * Do not disable capacity scaling in disable_freq_invariance_workfn().
> >    * Relocate arch_hybrid_cap_scale_key definition.
> >
> > ---
>
> Only one minor comment below...
>
> FWIW:
> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance

Thanks!

> [...]
>
> > +
> > +static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;
> > +
> > +/**
> > + * arch_enable_hybrid_capacity_scale - Enable hybrid CPU capacity scaling
>
> This looks to me like a kernel-doc comment. The function name should have ().

Well, there are quite a few function kerneldoc comments without the
parens at the end of the name, but sure.

> [...]
> > +/**
> > + * arch_set_cpu_capacity - Set scale-invariance parameters for a CPU
>
> Same here.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-09-04  6:39   ` Ricardo Neri
@ 2024-09-04 11:29     ` Rafael J. Wysocki
  0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-09-04 11:29 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM,
	Thomas Gleixner, Peter Zijlstra, Srinivas Pandruvada,
	Rafael J. Wysocki, Dietmar Eggemann, Ricardo Neri, Tim Chen

On Wed, Sep 4, 2024 at 8:33 AM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> On Wed, Aug 28, 2024 at 01:48:10PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Make intel_pstate use the HWP_HIGHEST_PERF values from
> > MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> > via the previously introduced arch_set_cpu_capacity() on hybrid
> > systems without SMT.
> >
> > Setting asymmetric CPU capacity is generally necessary to allow the
> > scheduler to compute task sizes in a consistent way across all CPUs
> > in a system where they differ by capacity.  That, in turn, should help
> > to improve scheduling decisions.  It is also necessary for the schedutil
> > cpufreq governor to operate as expected on hybrid systems where tasks
> > migrate between CPUs of different capacities.
> >
> > The underlying observation is that intel_pstate already uses
> > MSR_HWP_CAPABILITIES to get CPU performance information which is
> > exposed by it via sysfs and CPU performance scaling is based on it.
> > Thus using this information for setting asymmetric CPU capacity is
> > consistent with what the driver has been doing already.  Moreover,
> > HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
> > both the instructions-per-cycle (IPC) factor and the maximum turbo
> > frequency and the units in which that value is expressed are the same
> > for all CPUs in the system, so the maximum capacity ratio between two
> > CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
> > values.  Of course, in principle that capacity ratio need not be
> > directly applicable at lower frequencies, so using it for providing the
> > asymmetric CPU capacity information to the scheduler is a rough
> > approximation, but it is as good as it gets.  Also, measurements
> > indicate that this approximation is not too bad in practice.
> >
> > If the given system is hybrid and non-SMT, the new code initializes all
> > online CPUs, finds the one with the maximum HWP_HIGHEST_PERF value, and
> > disables ITMT support in the scheduler (because it may get in the way of
> > the asymmetric CPU capacity code in the scheduler that is enabled
> > automatically by setting asymmetric CPU capacity).  Next, it computes the
> > capacity number for each (online) CPU by dividing the product of its
> > HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
> >
> > When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
> > and if it is the one with the maximum HWP_HIGHEST_PERF value, the
> > capacity numbers for all of the other online CPUs are recomputed.  This
> > also takes care of a cleanup during driver operation mode changes.
> >
> > Analogously, when a new CPU goes online, its capacity number is updated
> > and if its HWP_HIGHEST_PERF value is greater than the current maximum
> > one, the capacity numbers for all of the other online CPUs are
> > recomputed.
> >
> > The case when the driver is notified of a CPU capacity change, either
> > through the HWP interrupt or through an ACPI notification, is handled
> > similarly to the CPU online case above, except that if the target CPU
> > is the current highest-capacity one and its capacity is reduced, the
> > capacity numbers for all of the other online CPUs need to be recomputed
> > as well.
> >
> > If the driver's "no_turbo" sysfs attribute is updated, all of the CPU
> > capacity information is computed from scratch to reflect the new turbo
> > status.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> A few minor comments below...
>
> FWIW,
>
> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance
>
> [...]
>
> > +
> > +static void hybrid_init_cpu_scaling(void)
>
> Maybe renaming hybrid_init_cpu_scaling() as hybrid_init_cpu_capacity_scaling(),
> __hybrid_init_cpu_scaling() as __hybrid_init_cpu_capacity_scaling(), and
> hybrid_update_cpu_scaling() as hybrid_update_cpu_capacity_scaling()?
>
> It would make the code easier to read.

Sure, if that helps.

> > +{
> > +     bool disable_itmt = false;
> > +
> > +     mutex_lock(&hybrid_capacity_lock);
> > +
> > +     /*
> > +      * If hybrid_max_perf_cpu is set at this point, the hybrid CPU capacity
> > +      * scaling has been enabled already and the driver is just changing the
> > +      * operation mode.
> > +      */
> > +     if (hybrid_max_perf_cpu) {
> > +             __hybrid_init_cpu_scaling();
> > +             goto unlock;
> > +     }
> > +
> > +     /*
> > +      * On hybrid systems, use asym capacity instead of ITMT, but because
> > +      * the capacity of SMT threads is not deterministic even approximately,
> > +      * do not do that when SMT is in use.
> > +      */
> > +     if (hwp_is_hybrid && !sched_smt_active() && arch_enable_hybrid_capacity_scale()) {
> > +             __hybrid_init_cpu_scaling();
> > +             disable_itmt = true;
> > +     }
> > +
> > +unlock:
> > +     mutex_unlock(&hybrid_capacity_lock);
> > +
> > +     if (disable_itmt)
> > +             sched_clear_itmt_support();
>
> It may be worth adding a comment here saying that the sched domains will
> be rebuilt to disable asym packing and enable asym capacity.

Won't hurt I suppose.


* Re: [PATCH v3 0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems
  2024-09-04  7:25 ` [PATCH v3 0/2] x86 / " Ricardo Neri
@ 2024-09-04 11:30   ` Rafael J. Wysocki
  0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2024-09-04 11:30 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM,
	Thomas Gleixner, Peter Zijlstra, Srinivas Pandruvada,
	Rafael J. Wysocki, Dietmar Eggemann, Ricardo Neri, Tim Chen

On Wed, Sep 4, 2024 at 9:19 AM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> On Wed, Aug 28, 2024 at 01:45:00PM +0200, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This is an update of
> >
> > https://lore.kernel.org/linux-pm/4941491.31r3eYUQgx@rjwysocki.net/
> >
> > which was an update of
> >
> > https://lore.kernel.org/linux-pm/4908113.GXAFRqVoOG@rjwysocki.net/
> >
> > It addresses Ricardo's review comments and fixes an issue with intel_pstate
> > operation mode changes that would cause it to attempt to enable hybrid CPU
> > capacity scaling after it has been already enabled during initialization.
> >
> > The most visible difference with respect to the previous version is that
> > patch [1/3] has been dropped because it is not needed any more after using
> > the observation that sched_clear_itmt_support() would cause sched domains
> > to be rebuilt.
> >
> > Other than this, there are cosmetic differences in patch [1/2] (previously [2/3])
> > and the new code in intel_pstate_register_driver() in patch [2/2] (previously [3/3])
> > has been squashed into hybrid_init_cpu_scaling() which now checks whether or
> > not to enable hybrid CPU capacity scaling (as it may have been enabled already).
> >
> > This series is available from the following git branch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=intel_pstate-testing
> >
> > (with an extra debug commit on top).
> >
> > The original cover letter quoted below still applies:
> >
> > The purpose of this series is to provide the scheduler with asymmetric CPU
> > capacity information on x86 hybrid systems based on Intel hardware.
> >
> > The asymmetric CPU capacity information is important on hybrid systems as it
> > allows utilization to be computed for tasks in a consistent way across all
> > CPUs in the system, regardless of their capacity.  This, in turn, allows
> > the schedutil cpufreq governor to set CPU performance levels consistently
> > in the cases when tasks migrate between CPUs of different capacities.  It
> > should also help to improve task placement and load balancing decisions on
> > hybrid systems and it is key for anything along the lines of EAS.
> >
> > The information in question comes from the MSR_HWP_CAPABILITIES register and
> > is provided to the scheduler by the intel_pstate driver, as per the changelog
> > of patch [3/3].  Patch [2/3] introduces the arch infrastructure needed for
> > that (in the form of a per-CPU capacity variable) and patch [1/3] is a
> > preliminary code adjustment.
> >
> > This is based on an RFC posted previously
> >
> > https://lore.kernel.org/linux-pm/7663799.EvYhyI6sBW@kreacher/
> >
> > but differs from it quite a bit (except for the first patch).  The most
> > significant difference is based on the observation that frequency
> > invariance needs to be adjusted to the capacity scaling on hybrid systems
> > for complete scale invariance to work as expected.
> >
> > Thank you!
>
> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance
>
> You can look at the scaling invariance these patches achieve here
>
> https://pasteboard.co/dhBAUjfr36Tx.png
>
> I tested these patches on a Meteor Lake system. It has CPUs with three
> levels of capacity (Pcore, Ecore, and Lcore).
>
> The "Requested work" plot shows a sawtooth pattern of the amount of work
> requested, as a percentage of the maximum amount of work that can be
> obtained from the biggest CPU running at its maximum frequency. The
> workload continuously calls getcpu() in time windows of constant
> duration with varying percentages of work.
>
> The "Achieved work" plot shows that the Ecore and Lcore cannot complete
> as much work as the Pcore even when fully busy (see the "Busy %" plot).
> Also, bigger CPUs have more idle time.
>
> The "Scale freq capacity" plot shows that the current frequency of each
> CPU is now scaled to 1024 relative to its own maximum frequency. It no
> longer uses the single arch_max_freq_ratio value. Capacity now scales
> correctly:
> when running at its maximum frequency, the current capacity (see
> "Current capacity" plot and refer to cap_scale()) now matches the value
> from arch_scale_cpu_capacity() (see "CPU capacity" plot).
>
> The "Task utilization" plot shows that task->util_avg is now invariant
> across CPUs.

Thank you!


end of thread, other threads:[~2024-09-04 11:30 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-28 11:45 [PATCH v3 0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
2024-08-28 11:47 ` [PATCH v3 1/2] x86/sched: Add basic support for CPU capacity scaling Rafael J. Wysocki
2024-09-03 13:57   ` Peter Zijlstra
2024-09-04  6:22   ` Ricardo Neri
2024-09-04 11:24     ` Rafael J. Wysocki
2024-08-28 11:48 ` [PATCH v3 2/2] cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems Rafael J. Wysocki
2024-08-30  8:30   ` Shrikanth Hegde
2024-08-30 11:10     ` Rafael J. Wysocki
2024-09-03 13:55   ` Peter Zijlstra
2024-09-03 14:02     ` Rafael J. Wysocki
2024-09-04  6:39   ` Ricardo Neri
2024-09-04 11:29     ` Rafael J. Wysocki
2024-09-04  7:25 ` [PATCH v3 0/2] x86 / " Ricardo Neri
2024-09-04 11:30   ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).