linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Rework system pressure interface to the scheduler
@ 2023-12-12 14:27 Vincent Guittot
  2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Vincent Guittot @ 2023-12-12 14:27 UTC (permalink / raw)
  To: catalin.marinas, will, sudeep.holla, rafael, viresh.kumar, agross,
	andersson, konrad.dybcio, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	lukasz.luba, rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel
  Cc: Vincent Guittot

Following the consolidation and cleanup of CPU capacity in [1], this serie
reworks how the scheduler gets the pressures on CPUs. We need to take into
account all pressures applied by cpufreq on the compute capacity of a CPU
for dozens of ms or more and not only cpufreq cooling device or HW
mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
- one from cpufreq and freq_qos
- one from HW high freq mitigiation.

The next step will be to add a dedicated interface for long standing
capping of the CPU capacity (i.e. for seconds or more) like the
scaling_max_freq of cpufreq sysfs. The latter is already taken into
account by this serie but as a temporary pressure which is not always the
best choice when we know that it will happen for seconds or more.

[1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/

Vincent Guittot (4):
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched: Take cpufreq feedback into account
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched: Rename arch_update_thermal_pressure into
    arch_update_hw_pressure

 arch/arm/include/asm/topology.h               |  6 +--
 arch/arm64/include/asm/topology.h             |  6 +--
 drivers/base/arch_topology.c                  | 26 ++++-----
 drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
 drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
 drivers/thermal/cpufreq_cooling.c             |  3 --
 include/linux/arch_topology.h                 |  8 +--
 include/linux/cpufreq.h                       | 10 ++++
 include/linux/sched/topology.h                |  8 +--
 .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
 include/trace/events/sched.h                  |  2 +-
 init/Kconfig                                  | 12 ++---
 kernel/sched/core.c                           |  8 +--
 kernel/sched/fair.c                           | 53 ++++++++++---------
 kernel/sched/pelt.c                           | 18 +++----
 kernel/sched/pelt.h                           | 16 +++---
 kernel/sched/sched.h                          |  4 +-
 17 files changed, 152 insertions(+), 94 deletions(-)
 rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)

-- 
2.34.1
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-12 14:27 [PATCH 0/5] Rework system pressure interface to the scheduler Vincent Guittot
@ 2023-12-12 14:27 ` Vincent Guittot
  2023-12-13  7:17   ` Viresh Kumar
                     ` (3 more replies)
  2023-12-12 14:27 ` [PATCH 2/4] sched: Take cpufreq feedback into account Vincent Guittot
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 26+ messages in thread
From: Vincent Guittot @ 2023-12-12 14:27 UTC (permalink / raw)
  To: catalin.marinas, will, sudeep.holla, rafael, viresh.kumar, agross,
	andersson, konrad.dybcio, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	lukasz.luba, rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel
  Cc: Vincent Guittot

Provide to the scheduler a feedback about the temporary max available
capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
filtered as the pressure will happen for dozens ms or more.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   | 10 ++++++++
 2 files changed, 58 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 44db4f59c4cc..7d5f71be8d29 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
 }
 EXPORT_SYMBOL(cpufreq_get_policy);
 
+DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
+EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
+
+/**
+ * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
+ * @cpus        : The related CPUs for which max capacity has been reduced
+ * @capped_freq : The maximum allowed frequency that CPUs can run at
+ *
+ * Update the value of cpufreq pressure for all @cpus in the mask. The
+ * cpumask should include all (online+offline) affected CPUs, to avoid
+ * operating on stale data when hot-plug is used for some CPUs. The
+ * @capped_freq reflects the currently allowed max CPUs frequency due to
+ * freq_qos capping. It might be also a boost frequency value, which is bigger
+ * than the internal 'capacity_freq_ref' max frequency. In such case the
+ * pressure value should simply be removed, since this is an indication that
+ * there is no capping. The @capped_freq must be provided in kHz.
+ */
+static void cpufreq_update_pressure(const struct cpumask *cpus,
+				      unsigned long capped_freq)
+{
+	unsigned long max_capacity, capacity, pressure;
+	u32 max_freq;
+	int cpu;
+
+	cpu = cpumask_first(cpus);
+	max_capacity = arch_scale_cpu_capacity(cpu);
+	max_freq = arch_scale_freq_ref(cpu);
+
+	/*
+	 * Handle properly the boost frequencies, which should simply clean
+	 * the thermal pressure value.
+	 */
+	if (max_freq <= capped_freq)
+		capacity = max_capacity;
+	else
+		capacity = mult_frac(max_capacity, capped_freq, max_freq);
+
+	pressure = max_capacity - capacity;
+
+
+	for_each_cpu(cpu, cpus)
+		WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
+}
+
 /**
  * cpufreq_set_policy - Modify cpufreq policy parameters.
  * @policy: Policy object to modify.
@@ -2584,6 +2628,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 {
 	struct cpufreq_policy_data new_data;
 	struct cpufreq_governor *old_gov;
+	struct cpumask *cpus;
 	int ret;
 
 	memcpy(&new_data.cpuinfo, &policy->cpuinfo, sizeof(policy->cpuinfo));
@@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 	policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
 	trace_cpu_frequency_limits(policy);
 
+	cpus = policy->related_cpus;
+	cpufreq_update_pressure(cpus, policy->max);
+
 	policy->cached_target_freq = UINT_MAX;
 
 	pr_debug("new min and max freqs are %u - %u kHz\n",
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index afda5f24d3dd..b1d97edd3253 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
 void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
 bool has_target_index(void);
+
+DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+	return per_cpu(cpufreq_pressure, cpu);
+}
 #else
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
@@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
 	return false;
 }
 static inline void disable_cpufreq(void) { }
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_CPU_FREQ_STAT
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/4] sched: Take cpufreq feedback into account
  2023-12-12 14:27 [PATCH 0/5] Rework system pressure interface to the scheduler Vincent Guittot
  2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
@ 2023-12-12 14:27 ` Vincent Guittot
  2023-12-15 16:03   ` Lukasz Luba
  2023-12-12 14:27 ` [PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure() Vincent Guittot
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2023-12-12 14:27 UTC (permalink / raw)
  To: catalin.marinas, will, sudeep.holla, rafael, viresh.kumar, agross,
	andersson, konrad.dybcio, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	lukasz.luba, rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel
  Cc: Vincent Guittot

Aggregate the different pressures applied on the capacity of CPUs and
create a new function that returns the actual capacity of the CPU:
  get_actual_cpu_capacity()

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 43 +++++++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..11d3be829302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4932,12 +4932,20 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	trace_sched_util_est_se_tp(&p->se);
 }
 
+static inline unsigned long get_actual_cpu_capacity(int cpu)
+{
+	unsigned long capacity = arch_scale_cpu_capacity(cpu);
+
+	capacity -= max(thermal_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
+
+	return capacity;
+}
 static inline int util_fits_cpu(unsigned long util,
 				unsigned long uclamp_min,
 				unsigned long uclamp_max,
 				int cpu)
 {
-	unsigned long capacity_orig, capacity_orig_thermal;
+	unsigned long capacity_orig;
 	unsigned long capacity = capacity_of(cpu);
 	bool fits, uclamp_max_fits;
 
@@ -4970,7 +4978,6 @@ static inline int util_fits_cpu(unsigned long util,
 	 * goal is to cap the task. So it's okay if it's getting less.
 	 */
 	capacity_orig = arch_scale_cpu_capacity(cpu);
-	capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
 
 	/*
 	 * We want to force a task to fit a cpu as implied by uclamp_max.
@@ -5045,7 +5052,7 @@ static inline int util_fits_cpu(unsigned long util,
 	 * handle the case uclamp_min > uclamp_max.
 	 */
 	uclamp_min = min(uclamp_min, uclamp_max);
-	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
+	if (fits && (util < uclamp_min) && (uclamp_min > get_actual_cpu_capacity(cpu)))
 		return -1;
 
 	return fits;
@@ -7426,7 +7433,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		 * Look for the CPU with best capacity.
 		 */
 		else if (fits < 0)
-			cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));
+			cpu_cap = get_actual_cpu_capacity(cpu);
 
 		/*
 		 * First, select CPU which fits better (-1 being better than 0).
@@ -7919,8 +7926,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	struct root_domain *rd = this_rq()->rd;
 	int cpu, best_energy_cpu, target = -1;
 	int prev_fits = -1, best_fits = -1;
-	unsigned long best_thermal_cap = 0;
-	unsigned long prev_thermal_cap = 0;
+	unsigned long best_actual_cap = 0;
+	unsigned long prev_actual_cap = 0;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
 	struct energy_env eenv;
@@ -7950,7 +7957,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 	for (; pd; pd = pd->next) {
 		unsigned long util_min = p_util_min, util_max = p_util_max;
-		unsigned long cpu_cap, cpu_thermal_cap, util;
+		unsigned long cpu_cap, cpu_actual_cap, util;
 		long prev_spare_cap = -1, max_spare_cap = -1;
 		unsigned long rq_util_min, rq_util_max;
 		unsigned long cur_delta, base_energy;
@@ -7962,18 +7969,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		if (cpumask_empty(cpus))
 			continue;
 
-		/* Account thermal pressure for the energy estimation */
+		/* Account external pressure for the energy estimation */
 		cpu = cpumask_first(cpus);
-		cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
-		cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+		cpu_actual_cap = get_actual_cpu_capacity(cpu);
 
-		eenv.cpu_cap = cpu_thermal_cap;
+		eenv.cpu_cap = cpu_actual_cap;
 		eenv.pd_cap = 0;
 
 		for_each_cpu(cpu, cpus) {
 			struct rq *rq = cpu_rq(cpu);
 
-			eenv.pd_cap += cpu_thermal_cap;
+			eenv.pd_cap += cpu_actual_cap;
 
 			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
 				continue;
@@ -8044,7 +8050,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (prev_delta < base_energy)
 				goto unlock;
 			prev_delta -= base_energy;
-			prev_thermal_cap = cpu_thermal_cap;
+			prev_actual_cap = cpu_actual_cap;
 			best_delta = min(best_delta, prev_delta);
 		}
 
@@ -8059,7 +8065,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			 * but best energy cpu has better capacity.
 			 */
 			if ((max_fits < 0) &&
-			    (cpu_thermal_cap <= best_thermal_cap))
+			    (cpu_actual_cap <= best_actual_cap))
 				continue;
 
 			cur_delta = compute_energy(&eenv, pd, cpus, p,
@@ -8080,14 +8086,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			best_delta = cur_delta;
 			best_energy_cpu = max_spare_cap_cpu;
 			best_fits = max_fits;
-			best_thermal_cap = cpu_thermal_cap;
+			best_actual_cap = cpu_actual_cap;
 		}
 	}
 	rcu_read_unlock();
 
 	if ((best_fits > prev_fits) ||
 	    ((best_fits > 0) && (best_delta < prev_delta)) ||
-	    ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
+	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
 		target = best_energy_cpu;
 
 	return target;
@@ -9466,7 +9472,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long max = arch_scale_cpu_capacity(cpu);
+	unsigned long max = get_actual_cpu_capacity(cpu);
 	unsigned long used, free;
 	unsigned long irq;
 
@@ -9478,12 +9484,9 @@ static unsigned long scale_rt_capacity(int cpu)
 	/*
 	 * avg_rt.util_avg and avg_dl.util_avg track binary signals
 	 * (running and not running) with weights 0 and 1024 respectively.
-	 * avg_thermal.load_avg tracks thermal pressure and the weighted
-	 * average uses the actual delta max capacity(load).
 	 */
 	used = READ_ONCE(rq->avg_rt.util_avg);
 	used += READ_ONCE(rq->avg_dl.util_avg);
-	used += thermal_load_avg(rq);
 
 	if (unlikely(used >= max))
 		return 1;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure()
  2023-12-12 14:27 [PATCH 0/5] Rework system pressure interface to the scheduler Vincent Guittot
  2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
  2023-12-12 14:27 ` [PATCH 2/4] sched: Take cpufreq feedback into account Vincent Guittot
@ 2023-12-12 14:27 ` Vincent Guittot
  2023-12-15 15:38   ` Lukasz Luba
  2023-12-12 14:27 ` [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure Vincent Guittot
  2023-12-14  8:22 ` [PATCH 0/5] Rework system pressure interface to the scheduler Lukasz Luba
  4 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2023-12-12 14:27 UTC (permalink / raw)
  To: catalin.marinas, will, sudeep.holla, rafael, viresh.kumar, agross,
	andersson, konrad.dybcio, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	lukasz.luba, rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel
  Cc: Vincent Guittot

arch_update_thermal_pressure() aims to update fast changing signal which
should be averaged using PELT filtering before being provided to the
scheduler which can't make smart use of fast changing signal.
cpufreq now provides the maximum freq_qos pressure on the capacity to the
scheduler, which includes cpufreq cooling device. Remove the call to
arch_update_thermal_pressure() in cpufreq cooling device as this is
handled by cpufreq_get_pressure().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 drivers/thermal/cpufreq_cooling.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index e2cc7bd30862..e77d3b44903e 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -448,7 +448,6 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 				 unsigned long state)
 {
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
-	struct cpumask *cpus;
 	unsigned int frequency;
 	int ret;
 
@@ -465,8 +464,6 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 	ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
 	if (ret >= 0) {
 		cpufreq_cdev->cpufreq_state = state;
-		cpus = cpufreq_cdev->policy->related_cpus;
-		arch_update_thermal_pressure(cpus, frequency);
 		ret = 0;
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure
  2023-12-12 14:27 [PATCH 0/5] Rework system pressure interface to the scheduler Vincent Guittot
                   ` (2 preceding siblings ...)
  2023-12-12 14:27 ` [PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure() Vincent Guittot
@ 2023-12-12 14:27 ` Vincent Guittot
  2023-12-14  8:31   ` Lukasz Luba
  2023-12-14  8:22 ` [PATCH 0/5] Rework system pressure interface to the scheduler Lukasz Luba
  4 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2023-12-12 14:27 UTC (permalink / raw)
  To: catalin.marinas, will, sudeep.holla, rafael, viresh.kumar, agross,
	andersson, konrad.dybcio, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	lukasz.luba, rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel
  Cc: Vincent Guittot

Now that cpufreq provides a pressure value to the scheduler, rename
arch_update_thermal_pressure into hw pressure to reflect that it returns
a pressure applied by HW with a high frequency and which needs filtering.
This pressure is not always related to thermal mitigation but can also be
generated by max current limitation as an example.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/include/asm/topology.h               |  6 ++---
 arch/arm64/include/asm/topology.h             |  6 ++---
 drivers/base/arch_topology.c                  | 26 +++++++++----------
 drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +--
 include/linux/arch_topology.h                 |  8 +++---
 include/linux/sched/topology.h                |  8 +++---
 .../{thermal_pressure.h => hw_pressure.h}     | 14 +++++-----
 include/trace/events/sched.h                  |  2 +-
 init/Kconfig                                  | 12 ++++-----
 kernel/sched/core.c                           |  8 +++---
 kernel/sched/fair.c                           | 12 ++++-----
 kernel/sched/pelt.c                           | 18 ++++++-------
 kernel/sched/pelt.h                           | 16 ++++++------
 kernel/sched/sched.h                          |  4 +--
 14 files changed, 72 insertions(+), 72 deletions(-)
 rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 853c4f81ba4a..e175e8596b5d 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -22,9 +22,9 @@
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure	topology_update_thermal_pressure
+/* Replace task scheduler's default hw pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressure	topology_update_hw_pressure
 
 #else
 
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a323b109b9c4..a427650bdfba 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -35,9 +35,9 @@ void update_freq_counters_refs(void);
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure	topology_update_thermal_pressure
+/* Replace task scheduler's default hw pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressure	topology_update_hw_pressure
 
 #include <asm-generic/topology.h>
 
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 0906114963ff..3d8dc9d5c3ad 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -22,7 +22,7 @@
 #include <linux/units.h>
 
 #define CREATE_TRACE_POINTS
-#include <trace/events/thermal_pressure.h>
+#include <trace/events/hw_pressure.h>
 
 static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
 static struct cpumask scale_freq_counters_mask;
@@ -160,26 +160,26 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
 	per_cpu(cpu_scale, cpu) = capacity;
 }
 
-DEFINE_PER_CPU(unsigned long, thermal_pressure);
+DEFINE_PER_CPU(unsigned long, hw_pressure);
 
 /**
- * topology_update_thermal_pressure() - Update thermal pressure for CPUs
+ * topology_update_hw_pressure() - Update hw pressure for CPUs
  * @cpus        : The related CPUs for which capacity has been reduced
  * @capped_freq : The maximum allowed frequency that CPUs can run at
  *
- * Update the value of thermal pressure for all @cpus in the mask. The
+ * Update the value of hw pressure for all @cpus in the mask. The
  * cpumask should include all (online+offline) affected CPUs, to avoid
  * operating on stale data when hot-plug is used for some CPUs. The
  * @capped_freq reflects the currently allowed max CPUs frequency due to
- * thermal capping. It might be also a boost frequency value, which is bigger
+ * hw capping. It might be also a boost frequency value, which is bigger
  * than the internal 'capacity_freq_ref' max frequency. In such case the
  * pressure value should simply be removed, since this is an indication that
- * there is no thermal throttling. The @capped_freq must be provided in kHz.
+ * there is no hw throttling. The @capped_freq must be provided in kHz.
  */
-void topology_update_thermal_pressure(const struct cpumask *cpus,
+void topology_update_hw_pressure(const struct cpumask *cpus,
 				      unsigned long capped_freq)
 {
-	unsigned long max_capacity, capacity, th_pressure;
+	unsigned long max_capacity, capacity, hw_pressure;
 	u32 max_freq;
 	int cpu;
 
@@ -189,21 +189,21 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
 
 	/*
 	 * Handle properly the boost frequencies, which should simply clean
-	 * the thermal pressure value.
+	 * the hw pressure value.
 	 */
 	if (max_freq <= capped_freq)
 		capacity = max_capacity;
 	else
 		capacity = mult_frac(max_capacity, capped_freq, max_freq);
 
-	th_pressure = max_capacity - capacity;
+	hw_pressure = max_capacity - capacity;
 
-	trace_thermal_pressure_update(cpu, th_pressure);
+	trace_hw_pressure_update(cpu, hw_pressure);
 
 	for_each_cpu(cpu, cpus)
-		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+		WRITE_ONCE(per_cpu(hw_pressure, cpu), hw_pressure);
 }
-EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);
+EXPORT_SYMBOL_GPL(topology_update_hw_pressure);
 
 static ssize_t cpu_capacity_show(struct device *dev,
 				 struct device_attribute *attr,
diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
index 70b0f21968a0..a619202ba81d 100644
--- a/drivers/cpufreq/qcom-cpufreq-hw.c
+++ b/drivers/cpufreq/qcom-cpufreq-hw.c
@@ -347,8 +347,8 @@ static void qcom_lmh_dcvs_notify(struct qcom_cpufreq_data *data)
 
 	throttled_freq = freq_hz / HZ_PER_KHZ;
 
-	/* Update thermal pressure (the boost frequencies are accepted) */
-	arch_update_thermal_pressure(policy->related_cpus, throttled_freq);
+	/* Update hw pressure (the boost frequencies are accepted) */
+	arch_update_hw_pressure(policy->related_cpus, throttled_freq);
 
 	/*
 	 * In the unlikely case policy is unregistered do not enable
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index a63d61ca55af..b721f360d759 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -60,14 +60,14 @@ void topology_scale_freq_tick(void);
 void topology_set_scale_freq_source(struct scale_freq_data *data, const struct cpumask *cpus);
 void topology_clear_scale_freq_source(enum scale_freq_source source, const struct cpumask *cpus);
 
-DECLARE_PER_CPU(unsigned long, thermal_pressure);
+DECLARE_PER_CPU(unsigned long, hw_pressure);
 
-static inline unsigned long topology_get_thermal_pressure(int cpu)
+static inline unsigned long topology_get_hw_pressure(int cpu)
 {
-	return per_cpu(thermal_pressure, cpu);
+	return per_cpu(hw_pressure, cpu);
 }
 
-void topology_update_thermal_pressure(const struct cpumask *cpus,
+void topology_update_hw_pressure(const struct cpumask *cpus,
 				      unsigned long capped_freq);
 
 struct cpu_topology {
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a6e04b4a21d7..e3b2cf7de018 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -264,17 +264,17 @@ unsigned long arch_scale_cpu_capacity(int cpu)
 }
 #endif
 
-#ifndef arch_scale_thermal_pressure
+#ifndef arch_scale_hw_pressure
 static __always_inline
-unsigned long arch_scale_thermal_pressure(int cpu)
+unsigned long arch_scale_hw_pressure(int cpu)
 {
 	return 0;
 }
 #endif
 
-#ifndef arch_update_thermal_pressure
+#ifndef arch_update_hw_pressure
 static __always_inline
-void arch_update_thermal_pressure(const struct cpumask *cpus,
+void arch_update_hw_pressure(const struct cpumask *cpus,
 				  unsigned long capped_frequency)
 { }
 #endif
diff --git a/include/trace/events/thermal_pressure.h b/include/trace/events/hw_pressure.h
similarity index 55%
rename from include/trace/events/thermal_pressure.h
rename to include/trace/events/hw_pressure.h
index b68680201360..b9cd68854128 100644
--- a/include/trace/events/thermal_pressure.h
+++ b/include/trace/events/hw_pressure.h
@@ -1,27 +1,27 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #undef TRACE_SYSTEM
-#define TRACE_SYSTEM thermal_pressure
+#define TRACE_SYSTEM hw_pressure
 
 #if !defined(_TRACE_THERMAL_PRESSURE_H) || defined(TRACE_HEADER_MULTI_READ)
 #define _TRACE_THERMAL_PRESSURE_H
 
 #include <linux/tracepoint.h>
 
-TRACE_EVENT(thermal_pressure_update,
-	TP_PROTO(int cpu, unsigned long thermal_pressure),
-	TP_ARGS(cpu, thermal_pressure),
+TRACE_EVENT(hw_pressure_update,
+	TP_PROTO(int cpu, unsigned long hw_pressure),
+	TP_ARGS(cpu, hw_pressure),
 
 	TP_STRUCT__entry(
-		__field(unsigned long, thermal_pressure)
+		__field(unsigned long, hw_pressure)
 		__field(int, cpu)
 	),
 
 	TP_fast_assign(
-		__entry->thermal_pressure = thermal_pressure;
+		__entry->hw_pressure = hw_pressure;
 		__entry->cpu = cpu;
 	),
 
-	TP_printk("cpu=%d thermal_pressure=%lu", __entry->cpu, __entry->thermal_pressure)
+	TP_printk("cpu=%d hw_pressure=%lu", __entry->cpu, __entry->hw_pressure)
 );
 #endif /* _TRACE_THERMAL_PRESSURE_H */
 
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..d115d64c4011 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -752,7 +752,7 @@ DECLARE_TRACE(pelt_dl_tp,
 	TP_PROTO(struct rq *rq),
 	TP_ARGS(rq));
 
-DECLARE_TRACE(pelt_thermal_tp,
+DECLARE_TRACE(pelt_hw_tp,
 	TP_PROTO(struct rq *rq),
 	TP_ARGS(rq));
 
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927..37ceeb67e01c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -538,24 +538,24 @@ config HAVE_SCHED_AVG_IRQ
 	depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
 	depends on SMP
 
-config SCHED_THERMAL_PRESSURE
+config SCHED_HW_PRESSURE
 	bool
 	default y if ARM && ARM_CPU_TOPOLOGY
 	default y if ARM64
 	depends on SMP
 	depends on CPU_FREQ_THERMAL
 	help
-	  Select this option to enable thermal pressure accounting in the
-	  scheduler. Thermal pressure is the value conveyed to the scheduler
+	  Select this option to enable HW pressure accounting in the
+	  scheduler. HW pressure is the value conveyed to the scheduler
 	  that reflects the reduction in CPU compute capacity resulted from
-	  thermal throttling. Thermal throttling occurs when the performance of
-	  a CPU is capped due to high operating temperatures.
+	  HW throttling. HW throttling occurs when the performance of
+	  a CPU is capped due to high operating temperatures as an example.
 
 	  If selected, the scheduler will be able to balance tasks accordingly,
 	  i.e. put less load on throttled CPUs than on non/less throttled ones.
 
 	  This requires the architecture to implement
-	  arch_update_thermal_pressure() and arch_scale_thermal_pressure().
+	  arch_update_hw_pressure() and arch_scale_thermal_pressure().
 
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..c68e47bfd5ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -107,7 +107,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_rt_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_dl_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_se_tp);
-EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_thermal_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_hw_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_cpu_capacity_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
@@ -5658,7 +5658,7 @@ void scheduler_tick(void)
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *curr = rq->curr;
 	struct rq_flags rf;
-	unsigned long thermal_pressure;
+	unsigned long hw_pressure;
 	u64 resched_latency;
 
 	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
@@ -5669,8 +5669,8 @@ void scheduler_tick(void)
 	rq_lock(rq, &rf);
 
 	update_rq_clock(rq);
-	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
-	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
+	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
+	update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
 	curr->sched_class->task_tick(rq, curr, 0);
 	if (sched_feat(LATENCY_WARN))
 		resched_latency = cpu_resched_latency(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11d3be829302..07050955d6e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4936,7 +4936,7 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
 {
 	unsigned long capacity = arch_scale_cpu_capacity(cpu);
 
-	capacity -= max(thermal_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
+	capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
 
 	return capacity;
 }
@@ -4968,7 +4968,7 @@ static inline int util_fits_cpu(unsigned long util,
 	 * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
 	 * should fit a little cpu even if there's some pressure.
 	 *
-	 * Only exception is for thermal pressure since it has a direct impact
+	 * Only exception is for hw or cpufreq pressure since it has a direct impact
 	 * on available OPP of the system.
 	 *
 	 * We honour it for uclamp_min only as a drop in performance level
@@ -9224,7 +9224,7 @@ static inline bool others_have_blocked(struct rq *rq)
 	if (READ_ONCE(rq->avg_dl.util_avg))
 		return true;
 
-	if (thermal_load_avg(rq))
+	if (hw_load_avg(rq))
 		return true;
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
@@ -9256,7 +9256,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
 {
 	const struct sched_class *curr_class;
 	u64 now = rq_clock_pelt(rq);
-	unsigned long thermal_pressure;
+	unsigned long hw_pressure;
 	bool decayed;
 
 	/*
@@ -9265,11 +9265,11 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
 	 */
 	curr_class = rq->curr->sched_class;
 
-	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
+	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
 
 	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
 		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
-		  update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
+		  update_hw_load_avg(now, rq, hw_pressure) |
 		  update_irq_load_avg(rq, 0);
 
 	if (others_have_blocked(rq))
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 63b6cf898220..cd599a5185c4 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -384,30 +384,30 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 	return 0;
 }
 
-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
+#ifdef CONFIG_SCHED_HW_PRESSURE
 /*
- * thermal:
+ * hardware:
  *
  *   load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
  *
  *   util_avg and runnable_load_avg are not supported and meaningless.
  *
  * Unlike rt/dl utilization tracking that track time spent by a cpu
- * running a rt/dl task through util_avg, the average thermal pressure is
- * tracked through load_avg. This is because thermal pressure signal is
+ * running a rt/dl task through util_avg, the average hw pressure is
+ * tracked through load_avg. This is because hw pressure signal is
  * time weighted "delta" capacity unlike util_avg which is binary.
  * "delta capacity" =  actual capacity  -
- *			capped capacity a cpu due to a thermal event.
+ *			capped capacity a cpu due to a hw event.
  */
 
-int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+int update_hw_load_avg(u64 now, struct rq *rq, u64 capacity)
 {
-	if (___update_load_sum(now, &rq->avg_thermal,
+	if (___update_load_sum(now, &rq->avg_hw,
 			       capacity,
 			       capacity,
 			       capacity)) {
-		___update_load_avg(&rq->avg_thermal, 1);
-		trace_pelt_thermal_tp(rq);
+		___update_load_avg(&rq->avg_hw, 1);
+		trace_pelt_hw_tp(rq);
 		return 1;
 	}
 
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 3a0e0dc28721..e3ab44c4a2ef 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -7,21 +7,21 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
 
-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
-int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
+#ifdef CONFIG_SCHED_HW_PRESSURE
+int update_hw_load_avg(u64 now, struct rq *rq, u64 capacity);
 
-static inline u64 thermal_load_avg(struct rq *rq)
+static inline u64 hw_load_avg(struct rq *rq)
 {
-	return READ_ONCE(rq->avg_thermal.load_avg);
+	return READ_ONCE(rq->avg_hw.load_avg);
 }
 #else
 static inline int
-update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+update_hw_load_avg(u64 now, struct rq *rq, u64 capacity)
 {
 	return 0;
 }
 
-static inline u64 thermal_load_avg(struct rq *rq)
+static inline u64 hw_load_avg(struct rq *rq)
 {
 	return 0;
 }
@@ -202,12 +202,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 }
 
 static inline int
-update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+update_hw_load_avg(u64 now, struct rq *rq, u64 capacity)
 {
 	return 0;
 }
 
-static inline u64 thermal_load_avg(struct rq *rq)
+static inline u64 hw_load_avg(struct rq *rq)
 {
 	return 0;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..7eaf186d071e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,8 +1078,8 @@ struct rq {
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 	struct sched_avg	avg_irq;
 #endif
-#ifdef CONFIG_SCHED_THERMAL_PRESSURE
-	struct sched_avg	avg_thermal;
+#ifdef CONFIG_SCHED_HW_PRESSURE
+	struct sched_avg	avg_hw;
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
@ 2023-12-13  7:17   ` Viresh Kumar
  2023-12-13  8:05     ` Vincent Guittot
  2023-12-14  0:41   ` Tim Chen
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Viresh Kumar @ 2023-12-13  7:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, sudeep.holla, rafael, agross, andersson,
	konrad.dybcio, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, lukasz.luba,
	rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel

On 12-12-23, 15:27, Vincent Guittot wrote:
> Provide to the scheduler a feedback about the temporary max available
> capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> filtered as the pressure will happen for dozens ms or more.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
>  include/linux/cpufreq.h   | 10 ++++++++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 44db4f59c4cc..7d5f71be8d29 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
>  }
>  EXPORT_SYMBOL(cpufreq_get_policy);
>  
> +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
> +
> +/**
> + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> + * @cpus        : The related CPUs for which max capacity has been reduced
> + * @capped_freq : The maximum allowed frequency that CPUs can run at
> + *
> + * Update the value of cpufreq pressure for all @cpus in the mask. The
> + * cpumask should include all (online+offline) affected CPUs, to avoid
> + * operating on stale data when hot-plug is used for some CPUs. The
> + * @capped_freq reflects the currently allowed max CPUs frequency due to
> + * freq_qos capping. It might be also a boost frequency value, which is bigger
> + * than the internal 'capacity_freq_ref' max frequency. In such case the
> + * pressure value should simply be removed, since this is an indication that
> + * there is no capping. The @capped_freq must be provided in kHz.
> + */
> +static void cpufreq_update_pressure(const struct cpumask *cpus,

Since this is defined as 'static', why not just pass policy here ?

> +				      unsigned long capped_freq)
> +{
> +	unsigned long max_capacity, capacity, pressure;
> +	u32 max_freq;
> +	int cpu;
> +
> +	cpu = cpumask_first(cpus);
> +	max_capacity = arch_scale_cpu_capacity(cpu);

This anyway expects all of them to be from the same policy ..

> +	max_freq = arch_scale_freq_ref(cpu);
> +
> +	/*
> +	 * Handle properly the boost frequencies, which should simply clean
> +	 * the thermal pressure value.
> +	 */
> +	if (max_freq <= capped_freq)
> +		capacity = max_capacity;
> +	else
> +		capacity = mult_frac(max_capacity, capped_freq, max_freq);
> +
> +	pressure = max_capacity - capacity;
> +

Extra blank line here.

> +
> +	for_each_cpu(cpu, cpus)
> +		WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> +}
> +
>  /**
>   * cpufreq_set_policy - Modify cpufreq policy parameters.
>   * @policy: Policy object to modify.
> @@ -2584,6 +2628,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>  {
>  	struct cpufreq_policy_data new_data;
>  	struct cpufreq_governor *old_gov;
> +	struct cpumask *cpus;
>  	int ret;
>  
>  	memcpy(&new_data.cpuinfo, &policy->cpuinfo, sizeof(policy->cpuinfo));
> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>  	policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
>  	trace_cpu_frequency_limits(policy);
>  
> +	cpus = policy->related_cpus;

You don't need the extra variable anyway, but lets just pass policy
instead to the routine.

> +	cpufreq_update_pressure(cpus, policy->max);
> +
>  	policy->cached_target_freq = UINT_MAX;
>  
>  	pr_debug("new min and max freqs are %u - %u kHz\n",
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index afda5f24d3dd..b1d97edd3253 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
>  void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
>  void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
>  bool has_target_index(void);
> +
> +DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> +	return per_cpu(cpufreq_pressure, cpu);
> +}
>  #else
>  static inline unsigned int cpufreq_get(unsigned int cpu)
>  {
> @@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
>  	return false;
>  }
>  static inline void disable_cpufreq(void) { }
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> +	return 0;
> +}
>  #endif
>  
>  #ifdef CONFIG_CPU_FREQ_STAT
> -- 
> 2.34.1

-- 
viresh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-13  7:17   ` Viresh Kumar
@ 2023-12-13  8:05     ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2023-12-13  8:05 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: catalin.marinas, will, sudeep.holla, rafael, agross, andersson,
	konrad.dybcio, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, lukasz.luba,
	rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel

On Wed, 13 Dec 2023 at 08:17, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 12-12-23, 15:27, Vincent Guittot wrote:
> > Provide to the scheduler a feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> > filtered as the pressure will happen for dozens ms or more.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
> >  include/linux/cpufreq.h   | 10 ++++++++
> >  2 files changed, 58 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 44db4f59c4cc..7d5f71be8d29 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
> >  }
> >  EXPORT_SYMBOL(cpufreq_get_policy);
> >
> > +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> > +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
> > +
> > +/**
> > + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> > + * @cpus        : The related CPUs for which max capacity has been reduced
> > + * @capped_freq : The maximum allowed frequency that CPUs can run at
> > + *
> > + * Update the value of cpufreq pressure for all @cpus in the mask. The
> > + * cpumask should include all (online+offline) affected CPUs, to avoid
> > + * operating on stale data when hot-plug is used for some CPUs. The
> > + * @capped_freq reflects the currently allowed max CPUs frequency due to
> > + * freq_qos capping. It might be also a boost frequency value, which is bigger
> > + * than the internal 'capacity_freq_ref' max frequency. In such case the
> > + * pressure value should simply be removed, since this is an indication that
> > + * there is no capping. The @capped_freq must be provided in kHz.
> > + */
> > +static void cpufreq_update_pressure(const struct cpumask *cpus,
>
> Since this is defined as 'static', why not just pass policy here ?

Mainly because we only need the cpumask and also because this follows
the same pattern as other place like arch_topology.c

>
> > +                                   unsigned long capped_freq)
> > +{
> > +     unsigned long max_capacity, capacity, pressure;
> > +     u32 max_freq;
> > +     int cpu;
> > +
> > +     cpu = cpumask_first(cpus);
> > +     max_capacity = arch_scale_cpu_capacity(cpu);
>
> This anyway expects all of them to be from the same policy ..
>
> > +     max_freq = arch_scale_freq_ref(cpu);
> > +
> > +     /*
> > +      * Handle properly the boost frequencies, which should simply clean
> > +      * the thermal pressure value.
> > +      */
> > +     if (max_freq <= capped_freq)
> > +             capacity = max_capacity;
> > +     else
> > +             capacity = mult_frac(max_capacity, capped_freq, max_freq);
> > +
> > +     pressure = max_capacity - capacity;
> > +
>
> Extra blank line here.
>
> > +
> > +     for_each_cpu(cpu, cpus)
> > +             WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> > +}
> > +
> >  /**
> >   * cpufreq_set_policy - Modify cpufreq policy parameters.
> >   * @policy: Policy object to modify.
> > @@ -2584,6 +2628,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> >  {
> >       struct cpufreq_policy_data new_data;
> >       struct cpufreq_governor *old_gov;
> > +     struct cpumask *cpus;
> >       int ret;
> >
> >       memcpy(&new_data.cpuinfo, &policy->cpuinfo, sizeof(policy->cpuinfo));
> > @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> >       policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> >       trace_cpu_frequency_limits(policy);
> >
> > +     cpus = policy->related_cpus;
>
> You don't need the extra variable anyway, but lets just pass policy
> instead to the routine.

In fact I have followed what was done in cpufreq_cooling.c with
arch_update_thermal_pressure().

Will remove it

>
> > +     cpufreq_update_pressure(cpus, policy->max);
> > +
> >       policy->cached_target_freq = UINT_MAX;
> >
> >       pr_debug("new min and max freqs are %u - %u kHz\n",
> > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> > index afda5f24d3dd..b1d97edd3253 100644
> > --- a/include/linux/cpufreq.h
> > +++ b/include/linux/cpufreq.h
> > @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
> >  void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
> >  void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
> >  bool has_target_index(void);
> > +
> > +DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
> > +static inline unsigned long cpufreq_get_pressure(int cpu)
> > +{
> > +     return per_cpu(cpufreq_pressure, cpu);
> > +}
> >  #else
> >  static inline unsigned int cpufreq_get(unsigned int cpu)
> >  {
> > @@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
> >       return false;
> >  }
> >  static inline void disable_cpufreq(void) { }
> > +static inline unsigned long cpufreq_get_pressure(int cpu)
> > +{
> > +     return 0;
> > +}
> >  #endif
> >
> >  #ifdef CONFIG_CPU_FREQ_STAT
> > --
> > 2.34.1
>
> --
> viresh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
  2023-12-13  7:17   ` Viresh Kumar
@ 2023-12-14  0:41   ` Tim Chen
  2023-12-14  5:36     ` Viresh Kumar
  2023-12-14  5:43   ` Viresh Kumar
  2023-12-14  9:21   ` Lukasz Luba
  3 siblings, 1 reply; 26+ messages in thread
From: Tim Chen @ 2023-12-14  0:41 UTC (permalink / raw)
  To: Vincent Guittot, catalin.marinas, will, sudeep.holla, rafael,
	viresh.kumar, agross, andersson, konrad.dybcio, mingo, peterz,
	juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman, bristot,
	vschneid, lukasz.luba, rui.zhang, mhiramat, daniel.lezcano,
	amit.kachhap, linux-arm-kernel, linux-kernel, linux-pm,
	linux-arm-msm, linux-trace-kernel

On Tue, 2023-12-12 at 15:27 +0100, Vincent Guittot wrote:
> Provide to the scheduler a feedback about the temporary max available
> capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> filtered as the pressure will happen for dozens ms or more.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
>  include/linux/cpufreq.h   | 10 ++++++++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 44db4f59c4cc..7d5f71be8d29 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
>  }
>  EXPORT_SYMBOL(cpufreq_get_policy);
>  
> +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
> +
> +/**
> + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> + * @cpus        : The related CPUs for which max capacity has been reduced
> + * @capped_freq : The maximum allowed frequency that CPUs can run at
> + *
> + * Update the value of cpufreq pressure for all @cpus in the mask. The
> + * cpumask should include all (online+offline) affected CPUs, to avoid
> + * operating on stale data when hot-plug is used for some CPUs. The
> + * @capped_freq reflects the currently allowed max CPUs frequency due to
> + * freq_qos capping. It might be also a boost frequency value, which is bigger
> + * than the internal 'capacity_freq_ref' max frequency. In such case the
> + * pressure value should simply be removed, since this is an indication that
> + * there is no capping. The @capped_freq must be provided in kHz.
> + */
> +static void cpufreq_update_pressure(const struct cpumask *cpus,
> +				      unsigned long capped_freq)
> +{
> +	unsigned long max_capacity, capacity, pressure;
> +	u32 max_freq;
> +	int cpu;
> +
> +	cpu = cpumask_first(cpus);
> +	max_capacity = arch_scale_cpu_capacity(cpu);
> +	max_freq = arch_scale_freq_ref(cpu);
> +
> +	/*
> +	 * Handle properly the boost frequencies, which should simply clean
> +	 * the thermal pressure value.
> +	 */
> +	if (max_freq <= capped_freq)
> +		capacity = max_capacity;
> +	else
> +		capacity = mult_frac(max_capacity, capped_freq, max_freq);
> +
> +	pressure = max_capacity - capacity;
> +
> +
> +	for_each_cpu(cpu, cpus)
> +		WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);

Seems like the pressure value computed from the first CPU applies to all CPU.
Will this be valid for non-homogeneous CPUs that could have different
max_freq and max_capacity?

Tim

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-14  0:41   ` Tim Chen
@ 2023-12-14  5:36     ` Viresh Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Viresh Kumar @ 2023-12-14  5:36 UTC (permalink / raw)
  To: Tim Chen
  Cc: Vincent Guittot, catalin.marinas, will, sudeep.holla, rafael,
	agross, andersson, konrad.dybcio, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	lukasz.luba, rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel

On 13-12-23, 16:41, Tim Chen wrote:
> Seems like the pressure value computed from the first CPU applies to all CPU.
> Will this be valid for non-homogeneous CPUs that could have different
> max_freq and max_capacity?

The will be part of different cpufreq policies and so it will work
fine.

-- 
viresh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
  2023-12-13  7:17   ` Viresh Kumar
  2023-12-14  0:41   ` Tim Chen
@ 2023-12-14  5:43   ` Viresh Kumar
  2023-12-14  7:57     ` Vincent Guittot
  2023-12-14  9:21   ` Lukasz Luba
  3 siblings, 1 reply; 26+ messages in thread
From: Viresh Kumar @ 2023-12-14  5:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, sudeep.holla, rafael, agross, andersson,
	konrad.dybcio, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, lukasz.luba,
	rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel

On 12-12-23, 15:27, Vincent Guittot wrote:
> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>  	policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
>  	trace_cpu_frequency_limits(policy);
>  
> +	cpus = policy->related_cpus;
> +	cpufreq_update_pressure(cpus, policy->max);
> +
>  	policy->cached_target_freq = UINT_MAX;

One more question, why are you doing this from cpufreq_set_policy ? If
due to cpufreq cooling or from userspace, we end up limiting the
maximum possible frequency, will this routine always get called ?

-- 
viresh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-14  5:43   ` Viresh Kumar
@ 2023-12-14  7:57     ` Vincent Guittot
  2023-12-14  9:08       ` Lukasz Luba
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2023-12-14  7:57 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: catalin.marinas, will, sudeep.holla, rafael, agross, andersson,
	konrad.dybcio, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, lukasz.luba,
	rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel

On Thu, 14 Dec 2023 at 06:43, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 12-12-23, 15:27, Vincent Guittot wrote:
> > @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> >       policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> >       trace_cpu_frequency_limits(policy);
> >
> > +     cpus = policy->related_cpus;
> > +     cpufreq_update_pressure(cpus, policy->max);
> > +
> >       policy->cached_target_freq = UINT_MAX;
>
> One more question, why are you doing this from cpufreq_set_policy ? If
> due to cpufreq cooling or from userspace, we end up limiting the
> maximum possible frequency, will this routine always get called ?

Yes, any update of a FREQ_QOS_MAX ends up calling cpufreq_set_policy()
to update the policy->max


>
> --
> viresh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] Rework system pressure interface to the scheduler
  2023-12-12 14:27 [PATCH 0/5] Rework system pressure interface to the scheduler Vincent Guittot
                   ` (3 preceding siblings ...)
  2023-12-12 14:27 ` [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure Vincent Guittot
@ 2023-12-14  8:22 ` Lukasz Luba
  2023-12-14  8:29   ` Vincent Guittot
  4 siblings, 1 reply; 26+ messages in thread
From: Lukasz Luba @ 2023-12-14  8:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, linux-trace-kernel, amit.kachhap,
	daniel.lezcano, mhiramat, vschneid, bristot, mgorman, bsegall,
	juri.lelli, peterz, mingo, linux-pm, linux-kernel, konrad.dybcio,
	andersson, agross, rui.zhang, viresh.kumar, rafael, sudeep.holla,
	dietmar.eggemann, rostedt, linux-arm-kernel, linux-arm-msm

Hi Vincent,

I've been waiting for this feature, thanks!


On 12/12/23 14:27, Vincent Guittot wrote:
> Following the consolidation and cleanup of CPU capacity in [1], this serie
> reworks how the scheduler gets the pressures on CPUs. We need to take into
> account all pressures applied by cpufreq on the compute capacity of a CPU
> for dozens of ms or more and not only cpufreq cooling device or HW
> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
> - one from cpufreq and freq_qos
> - one from HW high freq mitigiation.
> 
> The next step will be to add a dedicated interface for long standing
> capping of the CPU capacity (i.e. for seconds or more) like the
> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> account by this serie but as a temporary pressure which is not always the
> best choice when we know that it will happen for seconds or more.
> 
> [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
> 
> Vincent Guittot (4):
>    cpufreq: Add a cpufreq pressure feedback for the scheduler
>    sched: Take cpufreq feedback into account
>    thermal/cpufreq: Remove arch_update_thermal_pressure()
>    sched: Rename arch_update_thermal_pressure into
>      arch_update_hw_pressure
> 
>   arch/arm/include/asm/topology.h               |  6 +--
>   arch/arm64/include/asm/topology.h             |  6 +--
>   drivers/base/arch_topology.c                  | 26 ++++-----
>   drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
>   drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
>   drivers/thermal/cpufreq_cooling.c             |  3 --
>   include/linux/arch_topology.h                 |  8 +--
>   include/linux/cpufreq.h                       | 10 ++++
>   include/linux/sched/topology.h                |  8 +--
>   .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
>   include/trace/events/sched.h                  |  2 +-
>   init/Kconfig                                  | 12 ++---
>   kernel/sched/core.c                           |  8 +--
>   kernel/sched/fair.c                           | 53 ++++++++++---------
>   kernel/sched/pelt.c                           | 18 +++----
>   kernel/sched/pelt.h                           | 16 +++---
>   kernel/sched/sched.h                          |  4 +-
>   17 files changed, 152 insertions(+), 94 deletions(-)
>   rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
> 

I would like to test it, but something worries me. Why there is 0/5 in
this subject and only 4 patches?

Could you tell me your base branch that I can apply this, please?

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] Rework system pressure interface to the scheduler
  2023-12-14  8:22 ` [PATCH 0/5] Rework system pressure interface to the scheduler Lukasz Luba
@ 2023-12-14  8:29   ` Vincent Guittot
  2023-12-14  8:32     ` Lukasz Luba
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2023-12-14  8:29 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: catalin.marinas, will, linux-trace-kernel, amit.kachhap,
	daniel.lezcano, mhiramat, vschneid, bristot, mgorman, bsegall,
	juri.lelli, peterz, mingo, linux-pm, linux-kernel, konrad.dybcio,
	andersson, agross, rui.zhang, viresh.kumar, rafael, sudeep.holla,
	dietmar.eggemann, rostedt, linux-arm-kernel, linux-arm-msm

On Thu, 14 Dec 2023 at 09:21, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Vincent,
>
> I've been waiting for this feature, thanks!
>
>
> On 12/12/23 14:27, Vincent Guittot wrote:
> > Following the consolidation and cleanup of CPU capacity in [1], this serie
> > reworks how the scheduler gets the pressures on CPUs. We need to take into
> > account all pressures applied by cpufreq on the compute capacity of a CPU
> > for dozens of ms or more and not only cpufreq cooling device or HW
> > mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
> > - one from cpufreq and freq_qos
> > - one from HW high freq mitigiation.
> >
> > The next step will be to add a dedicated interface for long standing
> > capping of the CPU capacity (i.e. for seconds or more) like the
> > scaling_max_freq of cpufreq sysfs. The latter is already taken into
> > account by this serie but as a temporary pressure which is not always the
> > best choice when we know that it will happen for seconds or more.
> >
> > [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
> >
> > Vincent Guittot (4):
> >    cpufreq: Add a cpufreq pressure feedback for the scheduler
> >    sched: Take cpufreq feedback into account
> >    thermal/cpufreq: Remove arch_update_thermal_pressure()
> >    sched: Rename arch_update_thermal_pressure into
> >      arch_update_hw_pressure
> >
> >   arch/arm/include/asm/topology.h               |  6 +--
> >   arch/arm64/include/asm/topology.h             |  6 +--
> >   drivers/base/arch_topology.c                  | 26 ++++-----
> >   drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
> >   drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
> >   drivers/thermal/cpufreq_cooling.c             |  3 --
> >   include/linux/arch_topology.h                 |  8 +--
> >   include/linux/cpufreq.h                       | 10 ++++
> >   include/linux/sched/topology.h                |  8 +--
> >   .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
> >   include/trace/events/sched.h                  |  2 +-
> >   init/Kconfig                                  | 12 ++---
> >   kernel/sched/core.c                           |  8 +--
> >   kernel/sched/fair.c                           | 53 ++++++++++---------
> >   kernel/sched/pelt.c                           | 18 +++----
> >   kernel/sched/pelt.h                           | 16 +++---
> >   kernel/sched/sched.h                          |  4 +-
> >   17 files changed, 152 insertions(+), 94 deletions(-)
> >   rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
> >
>
> I would like to test it, but something worries me. Why there is 0/5 in
> this subject and only 4 patches?

I removed a patch from the series but copied/pasted the cover letter
subject without noticing the /5 instead of /4

>
> Could you tell me your base branch that I can apply this, please?

It applies on top of tip/sched/core + [1]
and you can find it here:
https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/system-pressure

>
> Regards,
> Lukasz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure
  2023-12-12 14:27 ` [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure Vincent Guittot
@ 2023-12-14  8:31   ` Lukasz Luba
  2023-12-14  8:36     ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Lukasz Luba @ 2023-12-14  8:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, bristot, linux-pm, juri.lelli, agross,
	viresh.kumar, rafael, linux-kernel, rui.zhang, dietmar.eggemann,
	mgorman, linux-trace-kernel, mingo, peterz, konrad.dybcio,
	andersson, rostedt, linux-arm-msm, linux-arm-kernel, bsegall,
	vschneid, will, sudeep.holla, daniel.lezcano, mhiramat,
	amit.kachhap


On 12/12/23 14:27, Vincent Guittot wrote:
> Now that cpufreq provides a pressure value to the scheduler, rename
> arch_update_thermal_pressure into hw pressure to reflect that it returns
> a pressure applied by HW with a high frequency and which needs filtering.

I would elaborte this meaning 'filtering' here. Something like:
'... high frequency and which needs filtering to smooth the singal and
get an average value. That reflects available capacity of the CPU in
longer period'

> This pressure is not always related to thermal mitigation but can also be
> generated by max current limitation as an example.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   arch/arm/include/asm/topology.h               |  6 ++---
>   arch/arm64/include/asm/topology.h             |  6 ++---
>   drivers/base/arch_topology.c                  | 26 +++++++++----------
>   drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +--
>   include/linux/arch_topology.h                 |  8 +++---
>   include/linux/sched/topology.h                |  8 +++---
>   .../{thermal_pressure.h => hw_pressure.h}     | 14 +++++-----
>   include/trace/events/sched.h                  |  2 +-
>   init/Kconfig                                  | 12 ++++-----
>   kernel/sched/core.c                           |  8 +++---
>   kernel/sched/fair.c                           | 12 ++++-----
>   kernel/sched/pelt.c                           | 18 ++++++-------
>   kernel/sched/pelt.h                           | 16 ++++++------
>   kernel/sched/sched.h                          |  4 +--
>   14 files changed, 72 insertions(+), 72 deletions(-)
>   rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
> 
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 853c4f81ba4a..e175e8596b5d 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -22,9 +22,9 @@
>   /* Enable topology flag updates */
>   #define arch_update_cpu_topology topology_update_cpu_topology
>   
> -/* Replace task scheduler's default thermal pressure API */
> -#define arch_scale_thermal_pressure topology_get_thermal_pressure
> -#define arch_update_thermal_pressure	topology_update_thermal_pressure
> +/* Replace task scheduler's default hw pressure API */
> +#define arch_scale_hw_pressure topology_get_hw_pressure
> +#define arch_update_hw_pressure	topology_update_hw_pressure
>   
>   #else
>   
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index a323b109b9c4..a427650bdfba 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -35,9 +35,9 @@ void update_freq_counters_refs(void);
>   /* Enable topology flag updates */
>   #define arch_update_cpu_topology topology_update_cpu_topology
>   
> -/* Replace task scheduler's default thermal pressure API */
> -#define arch_scale_thermal_pressure topology_get_thermal_pressure
> -#define arch_update_thermal_pressure	topology_update_thermal_pressure
> +/* Replace task scheduler's default hw pressure API */

s/hw/HW/ ?

> +#define arch_scale_hw_pressure topology_get_hw_pressure
> +#define arch_update_hw_pressure	topology_update_hw_pressure
>   
>   #include <asm-generic/topology.h>
>   
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 0906114963ff..3d8dc9d5c3ad 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -22,7 +22,7 @@
>   #include <linux/units.h>
>   
>   #define CREATE_TRACE_POINTS
> -#include <trace/events/thermal_pressure.h>
> +#include <trace/events/hw_pressure.h>
>   
>   static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
>   static struct cpumask scale_freq_counters_mask;
> @@ -160,26 +160,26 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
>   	per_cpu(cpu_scale, cpu) = capacity;
>   }
>   
> -DEFINE_PER_CPU(unsigned long, thermal_pressure);
> +DEFINE_PER_CPU(unsigned long, hw_pressure);
>   
>   /**
> - * topology_update_thermal_pressure() - Update thermal pressure for CPUs
> + * topology_update_hw_pressure() - Update hw pressure for CPUs

same here: HW?

>    * @cpus        : The related CPUs for which capacity has been reduced
>    * @capped_freq : The maximum allowed frequency that CPUs can run at
>    *
> - * Update the value of thermal pressure for all @cpus in the mask. The
> + * Update the value of hw pressure for all @cpus in the mask. The

HW?

>    * cpumask should include all (online+offline) affected CPUs, to avoid
>    * operating on stale data when hot-plug is used for some CPUs. The
>    * @capped_freq reflects the currently allowed max CPUs frequency due to
> - * thermal capping. It might be also a boost frequency value, which is bigger
> + * hw capping. It might be also a boost frequency value, which is bigger

HW?

>    * than the internal 'capacity_freq_ref' max frequency. In such case the
>    * pressure value should simply be removed, since this is an indication that
> - * there is no thermal throttling. The @capped_freq must be provided in kHz.
> + * there is no hw throttling. The @capped_freq must be provided in kHz.

HW?

>    */
> -void topology_update_thermal_pressure(const struct cpumask *cpus,
> +void topology_update_hw_pressure(const struct cpumask *cpus,
>   				      unsigned long capped_freq)
>   {
> -	unsigned long max_capacity, capacity, th_pressure;
> +	unsigned long max_capacity, capacity, hw_pressure;
>   	u32 max_freq;
>   	int cpu;
>   
> @@ -189,21 +189,21 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
>   
>   	/*
>   	 * Handle properly the boost frequencies, which should simply clean
> -	 * the thermal pressure value.
> +	 * the hw pressure value.

HW?

>   	 */
>   	if (max_freq <= capped_freq)
>   		capacity = max_capacity;
>   	else
>   		capacity = mult_frac(max_capacity, capped_freq, max_freq);
>   
> -	th_pressure = max_capacity - capacity;
> +	hw_pressure = max_capacity - capacity;
>   
> -	trace_thermal_pressure_update(cpu, th_pressure);
> +	trace_hw_pressure_update(cpu, hw_pressure);
>   
>   	for_each_cpu(cpu, cpus)
> -		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
> +		WRITE_ONCE(per_cpu(hw_pressure, cpu), hw_pressure);
>   }
> -EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);
> +EXPORT_SYMBOL_GPL(topology_update_hw_pressure);
>   
>   static ssize_t cpu_capacity_show(struct device *dev,
>   				 struct device_attribute *attr,
> diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
> index 70b0f21968a0..a619202ba81d 100644
> --- a/drivers/cpufreq/qcom-cpufreq-hw.c
> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c
> @@ -347,8 +347,8 @@ static void qcom_lmh_dcvs_notify(struct qcom_cpufreq_data *data)
>   
>   	throttled_freq = freq_hz / HZ_PER_KHZ;
>   
> -	/* Update thermal pressure (the boost frequencies are accepted) */
> -	arch_update_thermal_pressure(policy->related_cpus, throttled_freq);
> +	/* Update hw pressure (the boost frequencies are accepted) */

HW?

> +	arch_update_hw_pressure(policy->related_cpus, throttled_freq);
>   

[snip]

>   
>   	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
> @@ -5669,8 +5669,8 @@ void scheduler_tick(void)
>   	rq_lock(rq, &rf);
>   
>   	update_rq_clock(rq);
> -	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> -	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> +	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> +	update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);

We switch to task clock here, could you tell me why?
Don't we have to maintain the boot command parameter for the shift?

>   	curr->sched_class->task_tick(rq, curr, 0);
>   	if (sched_feat(LATENCY_WARN))
>   		resched_latency = cpu_resched_latency(rq);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 11d3be829302..07050955d6e0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4936,7 +4936,7 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
>   {
>   	unsigned long capacity = arch_scale_cpu_capacity(cpu);
>   
> -	capacity -= max(thermal_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
> +	capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
>   
>   	return capacity;
>   }
> @@ -4968,7 +4968,7 @@ static inline int util_fits_cpu(unsigned long util,
>   	 * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
>   	 * should fit a little cpu even if there's some pressure.
>   	 *
> -	 * Only exception is for thermal pressure since it has a direct impact
> +	 * Only exception is for hw or cpufreq pressure since it has a direct impact

HW?

>   	 * on available OPP of the system.
>   	 *
>   	 * We honour it for uclamp_min only as a drop in performance level
> @@ -9224,7 +9224,7 @@ static inline bool others_have_blocked(struct rq *rq)
>   	if (READ_ONCE(rq->avg_dl.util_avg))
>   		return true;
>   
> -	if (thermal_load_avg(rq))
> +	if (hw_load_avg(rq))
>   		return true;
>   
>   #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> @@ -9256,7 +9256,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>   {
>   	const struct sched_class *curr_class;
>   	u64 now = rq_clock_pelt(rq);
> -	unsigned long thermal_pressure;
> +	unsigned long hw_pressure;
>   	bool decayed;
>   
>   	/*
> @@ -9265,11 +9265,11 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>   	 */
>   	curr_class = rq->curr->sched_class;
>   
> -	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> +	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
>   
>   	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
>   		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> -		  update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
> +		  update_hw_load_avg(now, rq, hw_pressure) |

Here also the rq_clock_thermal() is not used anymore. Shouldn't we
remove the rq_clock_thermal() if not used anymore (I cannot se this in
the patch)?

>   		  update_irq_load_avg(rq, 0);
>   
>   	if (others_have_blocked(rq))

[snip]

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e58a54bda77d..7eaf186d071e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1078,8 +1078,8 @@ struct rq {
>   #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>   	struct sched_avg	avg_irq;
>   #endif
> -#ifdef CONFIG_SCHED_THERMAL_PRESSURE
> -	struct sched_avg	avg_thermal;
> +#ifdef CONFIG_SCHED_HW_PRESSURE
> +	struct sched_avg	avg_hw;
>   #endif
>   	u64			idle_stamp;
>   	u64			avg_idle;

I don't see that rq_clock_thermal() removal in this header.
Maybe I miss some patch?

BTW, I like the name 'HW pressure' for this information/signal.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] Rework system pressure interface to the scheduler
  2023-12-14  8:29   ` Vincent Guittot
@ 2023-12-14  8:32     ` Lukasz Luba
  2023-12-15 15:54       ` Lukasz Luba
  0 siblings, 1 reply; 26+ messages in thread
From: Lukasz Luba @ 2023-12-14  8:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, linux-trace-kernel, amit.kachhap,
	daniel.lezcano, mhiramat, vschneid, bristot, mgorman, bsegall,
	juri.lelli, peterz, mingo, linux-pm, linux-kernel, konrad.dybcio,
	andersson, agross, rui.zhang, viresh.kumar, rafael, sudeep.holla,
	dietmar.eggemann, rostedt, linux-arm-kernel, linux-arm-msm



On 12/14/23 08:29, Vincent Guittot wrote:
> On Thu, 14 Dec 2023 at 09:21, Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Vincent,
>>
>> I've been waiting for this feature, thanks!
>>
>>
>> On 12/12/23 14:27, Vincent Guittot wrote:
>>> Following the consolidation and cleanup of CPU capacity in [1], this serie
>>> reworks how the scheduler gets the pressures on CPUs. We need to take into
>>> account all pressures applied by cpufreq on the compute capacity of a CPU
>>> for dozens of ms or more and not only cpufreq cooling device or HW
>>> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
>>> - one from cpufreq and freq_qos
>>> - one from HW high freq mitigiation.
>>>
>>> The next step will be to add a dedicated interface for long standing
>>> capping of the CPU capacity (i.e. for seconds or more) like the
>>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
>>> account by this serie but as a temporary pressure which is not always the
>>> best choice when we know that it will happen for seconds or more.
>>>
>>> [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
>>>
>>> Vincent Guittot (4):
>>>     cpufreq: Add a cpufreq pressure feedback for the scheduler
>>>     sched: Take cpufreq feedback into account
>>>     thermal/cpufreq: Remove arch_update_thermal_pressure()
>>>     sched: Rename arch_update_thermal_pressure into
>>>       arch_update_hw_pressure
>>>
>>>    arch/arm/include/asm/topology.h               |  6 +--
>>>    arch/arm64/include/asm/topology.h             |  6 +--
>>>    drivers/base/arch_topology.c                  | 26 ++++-----
>>>    drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
>>>    drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
>>>    drivers/thermal/cpufreq_cooling.c             |  3 --
>>>    include/linux/arch_topology.h                 |  8 +--
>>>    include/linux/cpufreq.h                       | 10 ++++
>>>    include/linux/sched/topology.h                |  8 +--
>>>    .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
>>>    include/trace/events/sched.h                  |  2 +-
>>>    init/Kconfig                                  | 12 ++---
>>>    kernel/sched/core.c                           |  8 +--
>>>    kernel/sched/fair.c                           | 53 ++++++++++---------
>>>    kernel/sched/pelt.c                           | 18 +++----
>>>    kernel/sched/pelt.h                           | 16 +++---
>>>    kernel/sched/sched.h                          |  4 +-
>>>    17 files changed, 152 insertions(+), 94 deletions(-)
>>>    rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
>>>
>>
>> I would like to test it, but something worries me. Why there is 0/5 in
>> this subject and only 4 patches?
> 
> I removed a patch from the series but copied/pasted the cover letter
> subject without noticing the /5 instead of /4

OK

> 
>>
>> Could you tell me your base branch that I can apply this, please?
> 
> It applies on top of tip/sched/core + [1]
> and you can find it here:
> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/system-pressure

Thanks for the info and handy link.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure
  2023-12-14  8:31   ` Lukasz Luba
@ 2023-12-14  8:36     ` Vincent Guittot
  2023-12-14  8:54       ` Lukasz Luba
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2023-12-14  8:36 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: catalin.marinas, bristot, linux-pm, juri.lelli, agross,
	viresh.kumar, rafael, linux-kernel, rui.zhang, dietmar.eggemann,
	mgorman, linux-trace-kernel, mingo, peterz, konrad.dybcio,
	andersson, rostedt, linux-arm-msm, linux-arm-kernel, bsegall,
	vschneid, will, sudeep.holla, daniel.lezcano, mhiramat,
	amit.kachhap

On Thu, 14 Dec 2023 at 09:30, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
> On 12/12/23 14:27, Vincent Guittot wrote:
> > Now that cpufreq provides a pressure value to the scheduler, rename
> > arch_update_thermal_pressure into hw pressure to reflect that it returns
> > a pressure applied by HW with a high frequency and which needs filtering.
>
> I would elaborte this meaning 'filtering' here. Something like:
> '... high frequency and which needs filtering to smooth the singal and
> get an average value. That reflects available capacity of the CPU in
> longer period'

Ok I will update the commit message to provide more details

>
> > This pressure is not always related to thermal mitigation but can also be
> > generated by max current limitation as an example.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   arch/arm/include/asm/topology.h               |  6 ++---
> >   arch/arm64/include/asm/topology.h             |  6 ++---
> >   drivers/base/arch_topology.c                  | 26 +++++++++----------
> >   drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +--
> >   include/linux/arch_topology.h                 |  8 +++---
> >   include/linux/sched/topology.h                |  8 +++---
> >   .../{thermal_pressure.h => hw_pressure.h}     | 14 +++++-----
> >   include/trace/events/sched.h                  |  2 +-
> >   init/Kconfig                                  | 12 ++++-----
> >   kernel/sched/core.c                           |  8 +++---
> >   kernel/sched/fair.c                           | 12 ++++-----
> >   kernel/sched/pelt.c                           | 18 ++++++-------
> >   kernel/sched/pelt.h                           | 16 ++++++------
> >   kernel/sched/sched.h                          |  4 +--
> >   14 files changed, 72 insertions(+), 72 deletions(-)
> >   rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
> >
> > diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> > index 853c4f81ba4a..e175e8596b5d 100644
> > --- a/arch/arm/include/asm/topology.h
> > +++ b/arch/arm/include/asm/topology.h
> > @@ -22,9 +22,9 @@
> >   /* Enable topology flag updates */
> >   #define arch_update_cpu_topology topology_update_cpu_topology
> >
> > -/* Replace task scheduler's default thermal pressure API */
> > -#define arch_scale_thermal_pressure topology_get_thermal_pressure
> > -#define arch_update_thermal_pressure topology_update_thermal_pressure
> > +/* Replace task scheduler's default hw pressure API */
> > +#define arch_scale_hw_pressure topology_get_hw_pressure
> > +#define arch_update_hw_pressure      topology_update_hw_pressure
> >
> >   #else
> >
> > diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> > index a323b109b9c4..a427650bdfba 100644
> > --- a/arch/arm64/include/asm/topology.h
> > +++ b/arch/arm64/include/asm/topology.h
> > @@ -35,9 +35,9 @@ void update_freq_counters_refs(void);
> >   /* Enable topology flag updates */
> >   #define arch_update_cpu_topology topology_update_cpu_topology
> >
> > -/* Replace task scheduler's default thermal pressure API */
> > -#define arch_scale_thermal_pressure topology_get_thermal_pressure
> > -#define arch_update_thermal_pressure topology_update_thermal_pressure
> > +/* Replace task scheduler's default hw pressure API */
>
> s/hw/HW/ ?
>
> > +#define arch_scale_hw_pressure topology_get_hw_pressure
> > +#define arch_update_hw_pressure      topology_update_hw_pressure
> >
> >   #include <asm-generic/topology.h>
> >
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 0906114963ff..3d8dc9d5c3ad 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -22,7 +22,7 @@
> >   #include <linux/units.h>
> >
> >   #define CREATE_TRACE_POINTS
> > -#include <trace/events/thermal_pressure.h>
> > +#include <trace/events/hw_pressure.h>
> >
> >   static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
> >   static struct cpumask scale_freq_counters_mask;
> > @@ -160,26 +160,26 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
> >       per_cpu(cpu_scale, cpu) = capacity;
> >   }
> >
> > -DEFINE_PER_CPU(unsigned long, thermal_pressure);
> > +DEFINE_PER_CPU(unsigned long, hw_pressure);
> >
> >   /**
> > - * topology_update_thermal_pressure() - Update thermal pressure for CPUs
> > + * topology_update_hw_pressure() - Update hw pressure for CPUs
>
> same here: HW?
>
> >    * @cpus        : The related CPUs for which capacity has been reduced
> >    * @capped_freq : The maximum allowed frequency that CPUs can run at
> >    *
> > - * Update the value of thermal pressure for all @cpus in the mask. The
> > + * Update the value of hw pressure for all @cpus in the mask. The
>
> HW?
>
> >    * cpumask should include all (online+offline) affected CPUs, to avoid
> >    * operating on stale data when hot-plug is used for some CPUs. The
> >    * @capped_freq reflects the currently allowed max CPUs frequency due to
> > - * thermal capping. It might be also a boost frequency value, which is bigger
> > + * hw capping. It might be also a boost frequency value, which is bigger
>
> HW?
>
> >    * than the internal 'capacity_freq_ref' max frequency. In such case the
> >    * pressure value should simply be removed, since this is an indication that
> > - * there is no thermal throttling. The @capped_freq must be provided in kHz.
> > + * there is no hw throttling. The @capped_freq must be provided in kHz.
>
> HW?
>
> >    */
> > -void topology_update_thermal_pressure(const struct cpumask *cpus,
> > +void topology_update_hw_pressure(const struct cpumask *cpus,
> >                                     unsigned long capped_freq)
> >   {
> > -     unsigned long max_capacity, capacity, th_pressure;
> > +     unsigned long max_capacity, capacity, hw_pressure;
> >       u32 max_freq;
> >       int cpu;
> >
> > @@ -189,21 +189,21 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
> >
> >       /*
> >        * Handle properly the boost frequencies, which should simply clean
> > -      * the thermal pressure value.
> > +      * the hw pressure value.
>
> HW?
>
> >        */
> >       if (max_freq <= capped_freq)
> >               capacity = max_capacity;
> >       else
> >               capacity = mult_frac(max_capacity, capped_freq, max_freq);
> >
> > -     th_pressure = max_capacity - capacity;
> > +     hw_pressure = max_capacity - capacity;
> >
> > -     trace_thermal_pressure_update(cpu, th_pressure);
> > +     trace_hw_pressure_update(cpu, hw_pressure);
> >
> >       for_each_cpu(cpu, cpus)
> > -             WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
> > +             WRITE_ONCE(per_cpu(hw_pressure, cpu), hw_pressure);
> >   }
> > -EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);
> > +EXPORT_SYMBOL_GPL(topology_update_hw_pressure);
> >
> >   static ssize_t cpu_capacity_show(struct device *dev,
> >                                struct device_attribute *attr,
> > diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
> > index 70b0f21968a0..a619202ba81d 100644
> > --- a/drivers/cpufreq/qcom-cpufreq-hw.c
> > +++ b/drivers/cpufreq/qcom-cpufreq-hw.c
> > @@ -347,8 +347,8 @@ static void qcom_lmh_dcvs_notify(struct qcom_cpufreq_data *data)
> >
> >       throttled_freq = freq_hz / HZ_PER_KHZ;
> >
> > -     /* Update thermal pressure (the boost frequencies are accepted) */
> > -     arch_update_thermal_pressure(policy->related_cpus, throttled_freq);
> > +     /* Update hw pressure (the boost frequencies are accepted) */
>
> HW?
>
> > +     arch_update_hw_pressure(policy->related_cpus, throttled_freq);
> >
>
> [snip]
>
> >
> >       if (housekeeping_cpu(cpu, HK_TYPE_TICK))
> > @@ -5669,8 +5669,8 @@ void scheduler_tick(void)
> >       rq_lock(rq, &rf);
> >
> >       update_rq_clock(rq);
> > -     thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> > -     update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> > +     hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> > +     update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
>
> We switch to task clock here, could you tell me why?
> Don't we have to maintain the boot command parameter for the shift?

This should have been part of the patch5 that I finally removed. IMO,
the additional time shift with rq_clock_thermal is no more needed now
that we have 2 separates signals

>
> >       curr->sched_class->task_tick(rq, curr, 0);
> >       if (sched_feat(LATENCY_WARN))
> >               resched_latency = cpu_resched_latency(rq);
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 11d3be829302..07050955d6e0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4936,7 +4936,7 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
> >   {
> >       unsigned long capacity = arch_scale_cpu_capacity(cpu);
> >
> > -     capacity -= max(thermal_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
> > +     capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
> >
> >       return capacity;
> >   }
> > @@ -4968,7 +4968,7 @@ static inline int util_fits_cpu(unsigned long util,
> >        * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
> >        * should fit a little cpu even if there's some pressure.
> >        *
> > -      * Only exception is for thermal pressure since it has a direct impact
> > +      * Only exception is for hw or cpufreq pressure since it has a direct impact
>
> HW?
>
> >        * on available OPP of the system.
> >        *
> >        * We honour it for uclamp_min only as a drop in performance level
> > @@ -9224,7 +9224,7 @@ static inline bool others_have_blocked(struct rq *rq)
> >       if (READ_ONCE(rq->avg_dl.util_avg))
> >               return true;
> >
> > -     if (thermal_load_avg(rq))
> > +     if (hw_load_avg(rq))
> >               return true;
> >
> >   #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> > @@ -9256,7 +9256,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> >   {
> >       const struct sched_class *curr_class;
> >       u64 now = rq_clock_pelt(rq);
> > -     unsigned long thermal_pressure;
> > +     unsigned long hw_pressure;
> >       bool decayed;
> >
> >       /*
> > @@ -9265,11 +9265,11 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> >        */
> >       curr_class = rq->curr->sched_class;
> >
> > -     thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> > +     hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> >
> >       decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> >                 update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > -               update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
> > +               update_hw_load_avg(now, rq, hw_pressure) |
>
> Here also the rq_clock_thermal() is not used anymore. Shouldn't we
> remove the rq_clock_thermal() if not used anymore (I cannot se this in
> the patch)?
>
> >                 update_irq_load_avg(rq, 0);
> >
> >       if (others_have_blocked(rq))
>
> [snip]
>
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index e58a54bda77d..7eaf186d071e 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1078,8 +1078,8 @@ struct rq {
> >   #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> >       struct sched_avg        avg_irq;
> >   #endif
> > -#ifdef CONFIG_SCHED_THERMAL_PRESSURE
> > -     struct sched_avg        avg_thermal;
> > +#ifdef CONFIG_SCHED_HW_PRESSURE
> > +     struct sched_avg        avg_hw;
> >   #endif
> >       u64                     idle_stamp;
> >       u64                     avg_idle;
>
> I don't see that rq_clock_thermal() removal in this header.
> Maybe I miss some patch?
>
> BTW, I like the name 'HW pressure' for this information/signal.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure
  2023-12-14  8:36     ` Vincent Guittot
@ 2023-12-14  8:54       ` Lukasz Luba
  2023-12-14  8:54         ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Lukasz Luba @ 2023-12-14  8:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, bristot, linux-pm, juri.lelli, agross,
	viresh.kumar, rafael, linux-kernel, rui.zhang, dietmar.eggemann,
	mgorman, linux-trace-kernel, mingo, peterz, konrad.dybcio,
	andersson, rostedt, linux-arm-msm, linux-arm-kernel, bsegall,
	vschneid, will, sudeep.holla, daniel.lezcano, mhiramat,
	amit.kachhap



On 12/14/23 08:36, Vincent Guittot wrote:
> On Thu, 14 Dec 2023 at 09:30, Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>
>> On 12/12/23 14:27, Vincent Guittot wrote:

[snip]

>>>        update_rq_clock(rq);
>>> -     thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>>> -     update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
>>> +     hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
>>> +     update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
>>
>> We switch to task clock here, could you tell me why?
>> Don't we have to maintain the boot command parameter for the shift?
> 
> This should have been part of the patch5 that I finally removed. IMO,
> the additional time shift with rq_clock_thermal is no more needed now
> that we have 2 separates signals
> 

I didn't like the left-shift which causes the signal to converge slowly.
I rather wanted right-shift to converge (react faster), so you have my
vote for this change. Also, I agree that with the two-signal approach
this shift trick can go away now. I just worry about the dropped boot
parameter.

So, are going to send that patach5 which removes the
'sched_thermal_decay_shift' and documentation bit?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure
  2023-12-14  8:54       ` Lukasz Luba
@ 2023-12-14  8:54         ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2023-12-14  8:54 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: catalin.marinas, bristot, linux-pm, juri.lelli, agross,
	viresh.kumar, rafael, linux-kernel, rui.zhang, dietmar.eggemann,
	mgorman, linux-trace-kernel, mingo, peterz, konrad.dybcio,
	andersson, rostedt, linux-arm-msm, linux-arm-kernel, bsegall,
	vschneid, will, sudeep.holla, daniel.lezcano, mhiramat,
	amit.kachhap

On Thu, 14 Dec 2023 at 09:53, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 12/14/23 08:36, Vincent Guittot wrote:
> > On Thu, 14 Dec 2023 at 09:30, Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >>
> >> On 12/12/23 14:27, Vincent Guittot wrote:
>
> [snip]
>
> >>>        update_rq_clock(rq);
> >>> -     thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> >>> -     update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> >>> +     hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> >>> +     update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
> >>
> >> We switch to task clock here, could you tell me why?
> >> Don't we have to maintain the boot command parameter for the shift?
> >
> > This should have been part of the patch5 that I finally removed. IMO,
> > the additional time shift with rq_clock_thermal is no more needed now
> > that we have 2 separates signals
> >
>
> I didn't like the left-shift which causes the signal to converge slowly.
> I rather wanted right-shift to converge (react faster), so you have my
> vote for this change. Also, I agree that with the two-signal approach
> this shift trick can go away now. I just worry about the dropped boot
> parameter.
>
> So, are going to send that patach5 which removes the
> 'sched_thermal_decay_shift' and documentation bit?

Yes, i will add it back for the next version

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-14  7:57     ` Vincent Guittot
@ 2023-12-14  9:08       ` Lukasz Luba
  2023-12-14  9:40         ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Lukasz Luba @ 2023-12-14  9:08 UTC (permalink / raw)
  To: Vincent Guittot, Viresh Kumar
  Cc: catalin.marinas, will, sudeep.holla, rafael, agross, andersson,
	konrad.dybcio, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, rui.zhang, mhiramat,
	daniel.lezcano, amit.kachhap, linux-arm-kernel, linux-kernel,
	linux-pm, linux-arm-msm, linux-trace-kernel



On 12/14/23 07:57, Vincent Guittot wrote:
> On Thu, 14 Dec 2023 at 06:43, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>
>> On 12-12-23, 15:27, Vincent Guittot wrote:
>>> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>>>        policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
>>>        trace_cpu_frequency_limits(policy);
>>>
>>> +     cpus = policy->related_cpus;
>>> +     cpufreq_update_pressure(cpus, policy->max);
>>> +
>>>        policy->cached_target_freq = UINT_MAX;
>>
>> One more question, why are you doing this from cpufreq_set_policy ? If
>> due to cpufreq cooling or from userspace, we end up limiting the
>> maximum possible frequency, will this routine always get called ?
> 
> Yes, any update of a FREQ_QOS_MAX ends up calling cpufreq_set_policy()
> to update the policy->max
> 

Agree, cpufreq sysfs scaling_max_freq is also important to handle
in this new design. Currently we don't reflect that as reduced CPU
capacity in the scheduler. There was discussion when I proposed to feed
that CPU frequency reduction into thermal_pressure [1].

The same applies for the DTPM which is missing currently the proper
impact to the CPU reduced capacity in the scheduler.

IMHO any limit set into FREQ_QOS_MAX should be visible in this
new design of capacity reduction signaling.

[1] https://lore.kernel.org/lkml/20220930094821.31665-2-lukasz.luba@arm.com/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
                     ` (2 preceding siblings ...)
  2023-12-14  5:43   ` Viresh Kumar
@ 2023-12-14  9:21   ` Lukasz Luba
  2023-12-14 11:06     ` Vincent Guittot
  3 siblings, 1 reply; 26+ messages in thread
From: Lukasz Luba @ 2023-12-14  9:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, daniel.lezcano, mhiramat, rui.zhang,
	vschneid, bristot, bsegall, dietmar.eggemann, juri.lelli, peterz,
	mingo, linux-arm-msm, linux-pm, linux-arm-kernel, agross,
	konrad.dybcio, andersson, viresh.kumar, sudeep.holla, rafael,
	rostedt, mgorman, amit.kachhap, linux-kernel, linux-trace-kernel



On 12/12/23 14:27, Vincent Guittot wrote:
> Provide to the scheduler a feedback about the temporary max available
> capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> filtered as the pressure will happen for dozens ms or more.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
>   include/linux/cpufreq.h   | 10 ++++++++
>   2 files changed, 58 insertions(+)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 44db4f59c4cc..7d5f71be8d29 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
>   }
>   EXPORT_SYMBOL(cpufreq_get_policy);
>   
> +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);

Why do we export this variable when we have get/update functions?
Do we expect modules would manipulate those per-cpu variables
independently and not like we do per-cpumask in the update func.?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-14  9:08       ` Lukasz Luba
@ 2023-12-14  9:40         ` Rafael J. Wysocki
  2023-12-14 10:41           ` Lukasz Luba
  0 siblings, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2023-12-14  9:40 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Vincent Guittot, Viresh Kumar, catalin.marinas, will,
	sudeep.holla, rafael, agross, andersson, konrad.dybcio, mingo,
	peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, rui.zhang, mhiramat, daniel.lezcano,
	amit.kachhap, linux-arm-kernel, linux-kernel, linux-pm,
	linux-arm-msm, linux-trace-kernel

On Thu, Dec 14, 2023 at 10:07 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> On 12/14/23 07:57, Vincent Guittot wrote:
> > On Thu, 14 Dec 2023 at 06:43, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >>
> >> On 12-12-23, 15:27, Vincent Guittot wrote:
> >>> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> >>>        policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> >>>        trace_cpu_frequency_limits(policy);
> >>>
> >>> +     cpus = policy->related_cpus;
> >>> +     cpufreq_update_pressure(cpus, policy->max);
> >>> +
> >>>        policy->cached_target_freq = UINT_MAX;
> >>
> >> One more question, why are you doing this from cpufreq_set_policy ? If
> >> due to cpufreq cooling or from userspace, we end up limiting the
> >> maximum possible frequency, will this routine always get called ?
> >
> > Yes, any update of a FREQ_QOS_MAX ends up calling cpufreq_set_policy()
> > to update the policy->max
> >
>
> Agree, cpufreq sysfs scaling_max_freq is also important to handle
> in this new design. Currently we don't reflect that as reduced CPU
> capacity in the scheduler. There was discussion when I proposed to feed
> that CPU frequency reduction into thermal_pressure [1].
>
> The same applies for the DTPM which is missing currently the proper
> impact to the CPU reduced capacity in the scheduler.
>
> IMHO any limit set into FREQ_QOS_MAX should be visible in this
> new design of capacity reduction signaling.
>
> [1] https://lore.kernel.org/lkml/20220930094821.31665-2-lukasz.luba@arm.com/

Actually, freq_qos_read_value(&policy->constraints, FREQ_QOS_MAX) will
return the requisite limit.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-14  9:40         ` Rafael J. Wysocki
@ 2023-12-14 10:41           ` Lukasz Luba
  0 siblings, 0 replies; 26+ messages in thread
From: Lukasz Luba @ 2023-12-14 10:41 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Vincent Guittot, Viresh Kumar, catalin.marinas, will,
	sudeep.holla, agross, andersson, konrad.dybcio, mingo, peterz,
	juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman, bristot,
	vschneid, rui.zhang, mhiramat, daniel.lezcano, amit.kachhap,
	linux-arm-kernel, linux-kernel, linux-pm, linux-arm-msm,
	linux-trace-kernel



On 12/14/23 09:40, Rafael J. Wysocki wrote:
> On Thu, Dec 14, 2023 at 10:07 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> On 12/14/23 07:57, Vincent Guittot wrote:
>>> On Thu, 14 Dec 2023 at 06:43, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>>>
>>>> On 12-12-23, 15:27, Vincent Guittot wrote:
>>>>> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>>>>>         policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
>>>>>         trace_cpu_frequency_limits(policy);
>>>>>
>>>>> +     cpus = policy->related_cpus;
>>>>> +     cpufreq_update_pressure(cpus, policy->max);
>>>>> +
>>>>>         policy->cached_target_freq = UINT_MAX;
>>>>
>>>> One more question, why are you doing this from cpufreq_set_policy ? If
>>>> due to cpufreq cooling or from userspace, we end up limiting the
>>>> maximum possible frequency, will this routine always get called ?
>>>
>>> Yes, any update of a FREQ_QOS_MAX ends up calling cpufreq_set_policy()
>>> to update the policy->max
>>>
>>
>> Agree, cpufreq sysfs scaling_max_freq is also important to handle
>> in this new design. Currently we don't reflect that as reduced CPU
>> capacity in the scheduler. There was discussion when I proposed to feed
>> that CPU frequency reduction into thermal_pressure [1].
>>
>> The same applies for the DTPM which is missing currently the proper
>> impact to the CPU reduced capacity in the scheduler.
>>
>> IMHO any limit set into FREQ_QOS_MAX should be visible in this
>> new design of capacity reduction signaling.
>>
>> [1] https://lore.kernel.org/lkml/20220930094821.31665-2-lukasz.luba@arm.com/
> 
> Actually, freq_qos_read_value(&policy->constraints, FREQ_QOS_MAX) will
> return the requisite limit.

Yes, but we need to translate that information from freq domain
into capacity domain and plumb ii into scheduler as stolen CPU capacity.
Ideally, w/o any 'smoothing' but just instant value.
That's the hope of this patch set re-design.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler
  2023-12-14  9:21   ` Lukasz Luba
@ 2023-12-14 11:06     ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2023-12-14 11:06 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: catalin.marinas, will, daniel.lezcano, mhiramat, rui.zhang,
	vschneid, bristot, bsegall, dietmar.eggemann, juri.lelli, peterz,
	mingo, linux-arm-msm, linux-pm, linux-arm-kernel, agross,
	konrad.dybcio, andersson, viresh.kumar, sudeep.holla, rafael,
	rostedt, mgorman, amit.kachhap, linux-kernel, linux-trace-kernel

On Thu, 14 Dec 2023 at 10:20, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 12/12/23 14:27, Vincent Guittot wrote:
> > Provide to the scheduler a feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> > filtered as the pressure will happen for dozens ms or more.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
> >   include/linux/cpufreq.h   | 10 ++++++++
> >   2 files changed, 58 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 44db4f59c4cc..7d5f71be8d29 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
> >   }
> >   EXPORT_SYMBOL(cpufreq_get_policy);
> >
> > +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> > +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
>
> Why do we export this variable when we have get/update functions?
> Do we expect modules would manipulate those per-cpu variables
> independently and not like we do per-cpumask in the update func.?

No, I will remove the EXPORT_PER_CPU_SYMBOL_GPL

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure()
  2023-12-12 14:27 ` [PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure() Vincent Guittot
@ 2023-12-15 15:38   ` Lukasz Luba
  0 siblings, 0 replies; 26+ messages in thread
From: Lukasz Luba @ 2023-12-15 15:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, linux-arm-msm, linux-kernel,
	daniel.lezcano, mhiramat, rui.zhang, linux-trace-kernel, vschneid,
	bristot, mgorman, bsegall, rostedt, dietmar.eggemann, juri.lelli,
	peterz, mingo, konrad.dybcio, andersson, agross, viresh.kumar,
	sudeep.holla, rafael, amit.kachhap, linux-arm-kernel, linux-pm



On 12/12/23 14:27, Vincent Guittot wrote:
> arch_update_thermal_pressure() aims to update fast changing signal which
> should be averaged using PELT filtering before being provided to the
> scheduler which can't make smart use of fast changing signal.
> cpufreq now provides the maximum freq_qos pressure on the capacity to the
> scheduler, which includes cpufreq cooling device. Remove the call to
> arch_update_thermal_pressure() in cpufreq cooling device as this is
> handled by cpufreq_get_pressure().
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   drivers/thermal/cpufreq_cooling.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index e2cc7bd30862..e77d3b44903e 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -448,7 +448,6 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>   				 unsigned long state)
>   {
>   	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
> -	struct cpumask *cpus;
>   	unsigned int frequency;
>   	int ret;
>   
> @@ -465,8 +464,6 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>   	ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
>   	if (ret >= 0) {
>   		cpufreq_cdev->cpufreq_state = state;
> -		cpus = cpufreq_cdev->policy->related_cpus;
> -		arch_update_thermal_pressure(cpus, frequency);
>   		ret = 0;
>   	}
>   

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] Rework system pressure interface to the scheduler
  2023-12-14  8:32     ` Lukasz Luba
@ 2023-12-15 15:54       ` Lukasz Luba
  0 siblings, 0 replies; 26+ messages in thread
From: Lukasz Luba @ 2023-12-15 15:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, linux-trace-kernel, amit.kachhap,
	daniel.lezcano, mhiramat, vschneid, bristot, mgorman, bsegall,
	juri.lelli, peterz, mingo, linux-pm, linux-kernel, konrad.dybcio,
	andersson, agross, rui.zhang, viresh.kumar, rafael, sudeep.holla,
	dietmar.eggemann, rostedt, linux-arm-kernel, linux-arm-msm



On 12/14/23 08:32, Lukasz Luba wrote:
> 
> 
> On 12/14/23 08:29, Vincent Guittot wrote:
>> On Thu, 14 Dec 2023 at 09:21, Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Vincent,
>>>
>>> I've been waiting for this feature, thanks!
>>>
>>>
>>> On 12/12/23 14:27, Vincent Guittot wrote:
>>>> Following the consolidation and cleanup of CPU capacity in [1], this 
>>>> serie
>>>> reworks how the scheduler gets the pressures on CPUs. We need to 
>>>> take into
>>>> account all pressures applied by cpufreq on the compute capacity of 
>>>> a CPU
>>>> for dozens of ms or more and not only cpufreq cooling device or HW
>>>> mitigiations. we split the pressure applied on CPU's capacity in 2 
>>>> parts:
>>>> - one from cpufreq and freq_qos
>>>> - one from HW high freq mitigiation.
>>>>
>>>> The next step will be to add a dedicated interface for long standing
>>>> capping of the CPU capacity (i.e. for seconds or more) like the
>>>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
>>>> account by this serie but as a temporary pressure which is not 
>>>> always the
>>>> best choice when we know that it will happen for seconds or more.
>>>>
>>>> [1] 
>>>> https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
>>>>
>>>> Vincent Guittot (4):
>>>>     cpufreq: Add a cpufreq pressure feedback for the scheduler
>>>>     sched: Take cpufreq feedback into account
>>>>     thermal/cpufreq: Remove arch_update_thermal_pressure()
>>>>     sched: Rename arch_update_thermal_pressure into
>>>>       arch_update_hw_pressure
>>>>
>>>>    arch/arm/include/asm/topology.h               |  6 +--
>>>>    arch/arm64/include/asm/topology.h             |  6 +--
>>>>    drivers/base/arch_topology.c                  | 26 ++++-----
>>>>    drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
>>>>    drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
>>>>    drivers/thermal/cpufreq_cooling.c             |  3 --
>>>>    include/linux/arch_topology.h                 |  8 +--
>>>>    include/linux/cpufreq.h                       | 10 ++++
>>>>    include/linux/sched/topology.h                |  8 +--
>>>>    .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
>>>>    include/trace/events/sched.h                  |  2 +-
>>>>    init/Kconfig                                  | 12 ++---
>>>>    kernel/sched/core.c                           |  8 +--
>>>>    kernel/sched/fair.c                           | 53 
>>>> ++++++++++---------
>>>>    kernel/sched/pelt.c                           | 18 +++----
>>>>    kernel/sched/pelt.h                           | 16 +++---
>>>>    kernel/sched/sched.h                          |  4 +-
>>>>    17 files changed, 152 insertions(+), 94 deletions(-)
>>>>    rename include/trace/events/{thermal_pressure.h => hw_pressure.h} 
>>>> (55%)
>>>>
>>>
>>> I would like to test it, but something worries me. Why there is 0/5 in
>>> this subject and only 4 patches?
>>
>> I removed a patch from the series but copied/pasted the cover letter
>> subject without noticing the /5 instead of /4
> 
> OK
> 
>>
>>>
>>> Could you tell me your base branch that I can apply this, please?
>>
>> It applies on top of tip/sched/core + [1]
>> and you can find it here:
>> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/system-pressure
> 
> Thanks for the info and handy link.
> 

I've tested your patches with: DTPM/PowerCap + thermal gov + cpufreq
sysfs scaling_max_freq. It works fine all my cases (couldn't cause
any issues). If you like to test DTPM you will need 2 fixed pending
in Rafael's tree.

So, I'm looking for a your v2 to continue reviewing it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched: Take cpufreq feedback into account
  2023-12-12 14:27 ` [PATCH 2/4] sched: Take cpufreq feedback into account Vincent Guittot
@ 2023-12-15 16:03   ` Lukasz Luba
  0 siblings, 0 replies; 26+ messages in thread
From: Lukasz Luba @ 2023-12-15 16:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: catalin.marinas, will, linux-trace-kernel, linux-arm-msm, mgorman,
	linux-pm, amit.kachhap, daniel.lezcano, mhiramat, vschneid,
	bristot, dietmar.eggemann, juri.lelli, agross, peterz,
	linux-kernel, linux-arm-kernel, konrad.dybcio, andersson,
	viresh.kumar, sudeep.holla, rafael, mingo, rostedt, bsegall,
	rui.zhang



On 12/12/23 14:27, Vincent Guittot wrote:
> Aggregate the different pressures applied on the capacity of CPUs and
> create a new function that returns the actual capacity of the CPU:
>    get_actual_cpu_capacity()
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 43 +++++++++++++++++++++++--------------------
>   1 file changed, 23 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcea3d55d95d..11d3be829302 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4932,12 +4932,20 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>   	trace_sched_util_est_se_tp(&p->se);
>   }
>   
> +static inline unsigned long get_actual_cpu_capacity(int cpu)
> +{
> +	unsigned long capacity = arch_scale_cpu_capacity(cpu);
> +
> +	capacity -= max(thermal_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
> +
> +	return capacity;
> +}
>   static inline int util_fits_cpu(unsigned long util,
>   				unsigned long uclamp_min,
>   				unsigned long uclamp_max,
>   				int cpu)
>   {
> -	unsigned long capacity_orig, capacity_orig_thermal;
> +	unsigned long capacity_orig;
>   	unsigned long capacity = capacity_of(cpu);
>   	bool fits, uclamp_max_fits;
>   
> @@ -4970,7 +4978,6 @@ static inline int util_fits_cpu(unsigned long util,
>   	 * goal is to cap the task. So it's okay if it's getting less.
>   	 */
>   	capacity_orig = arch_scale_cpu_capacity(cpu);
> -	capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
>   
>   	/*
>   	 * We want to force a task to fit a cpu as implied by uclamp_max.
> @@ -5045,7 +5052,7 @@ static inline int util_fits_cpu(unsigned long util,
>   	 * handle the case uclamp_min > uclamp_max.
>   	 */
>   	uclamp_min = min(uclamp_min, uclamp_max);
> -	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> +	if (fits && (util < uclamp_min) && (uclamp_min > get_actual_cpu_capacity(cpu)))

That's quite long line, I would break it into 2.

>   		return -1;
>   
>   	return fits;
> @@ -7426,7 +7433,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>   		 * Look for the CPU with best capacity.
>   		 */
>   		else if (fits < 0)
> -			cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));
> +			cpu_cap = get_actual_cpu_capacity(cpu);
>   
>   		/*
>   		 * First, select CPU which fits better (-1 being better than 0).
> @@ -7919,8 +7926,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   	struct root_domain *rd = this_rq()->rd;
>   	int cpu, best_energy_cpu, target = -1;
>   	int prev_fits = -1, best_fits = -1;
> -	unsigned long best_thermal_cap = 0;
> -	unsigned long prev_thermal_cap = 0;
> +	unsigned long best_actual_cap = 0;
> +	unsigned long prev_actual_cap = 0;
>   	struct sched_domain *sd;
>   	struct perf_domain *pd;
>   	struct energy_env eenv;
> @@ -7950,7 +7957,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   
>   	for (; pd; pd = pd->next) {
>   		unsigned long util_min = p_util_min, util_max = p_util_max;
> -		unsigned long cpu_cap, cpu_thermal_cap, util;
> +		unsigned long cpu_cap, cpu_actual_cap, util;
>   		long prev_spare_cap = -1, max_spare_cap = -1;
>   		unsigned long rq_util_min, rq_util_max;
>   		unsigned long cur_delta, base_energy;
> @@ -7962,18 +7969,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   		if (cpumask_empty(cpus))
>   			continue;
>   
> -		/* Account thermal pressure for the energy estimation */
> +		/* Account external pressure for the energy estimation */
>   		cpu = cpumask_first(cpus);
> -		cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
> -		cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
> +		cpu_actual_cap = get_actual_cpu_capacity(cpu);
>   
> -		eenv.cpu_cap = cpu_thermal_cap;
> +		eenv.cpu_cap = cpu_actual_cap;
>   		eenv.pd_cap = 0;
>   
>   		for_each_cpu(cpu, cpus) {
>   			struct rq *rq = cpu_rq(cpu);
>   
> -			eenv.pd_cap += cpu_thermal_cap;
> +			eenv.pd_cap += cpu_actual_cap;
>   
>   			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
>   				continue;
> @@ -8044,7 +8050,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   			if (prev_delta < base_energy)
>   				goto unlock;
>   			prev_delta -= base_energy;
> -			prev_thermal_cap = cpu_thermal_cap;
> +			prev_actual_cap = cpu_actual_cap;
>   			best_delta = min(best_delta, prev_delta);
>   		}
>   
> @@ -8059,7 +8065,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   			 * but best energy cpu has better capacity.
>   			 */
>   			if ((max_fits < 0) &&
> -			    (cpu_thermal_cap <= best_thermal_cap))
> +			    (cpu_actual_cap <= best_actual_cap))
>   				continue;
>   
>   			cur_delta = compute_energy(&eenv, pd, cpus, p,
> @@ -8080,14 +8086,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   			best_delta = cur_delta;
>   			best_energy_cpu = max_spare_cap_cpu;
>   			best_fits = max_fits;
> -			best_thermal_cap = cpu_thermal_cap;
> +			best_actual_cap = cpu_actual_cap;
>   		}
>   	}
>   	rcu_read_unlock();
>   
>   	if ((best_fits > prev_fits) ||
>   	    ((best_fits > 0) && (best_delta < prev_delta)) ||
> -	    ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
> +	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
>   		target = best_energy_cpu;
>   
>   	return target;
> @@ -9466,7 +9472,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>   static unsigned long scale_rt_capacity(int cpu)
>   {
>   	struct rq *rq = cpu_rq(cpu);
> -	unsigned long max = arch_scale_cpu_capacity(cpu);
> +	unsigned long max = get_actual_cpu_capacity(cpu);

Maybe it's not strictly related to this patch, but I would swap the
2 above lines, so that they will be sorted (from longest to shortest).

>   	unsigned long used, free;
>   	unsigned long irq;
>   
> @@ -9478,12 +9484,9 @@ static unsigned long scale_rt_capacity(int cpu)
>   	/*
>   	 * avg_rt.util_avg and avg_dl.util_avg track binary signals
>   	 * (running and not running) with weights 0 and 1024 respectively.
> -	 * avg_thermal.load_avg tracks thermal pressure and the weighted
> -	 * average uses the actual delta max capacity(load).
>   	 */
>   	used = READ_ONCE(rq->avg_rt.util_avg);
>   	used += READ_ONCE(rq->avg_dl.util_avg);
> -	used += thermal_load_avg(rq);
>   
>   	if (unlikely(used >= max))
>   		return 1;

The rest looks good

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-12-15 16:02 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-12 14:27 [PATCH 0/5] Rework system pressure interface to the scheduler Vincent Guittot
2023-12-12 14:27 ` [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for " Vincent Guittot
2023-12-13  7:17   ` Viresh Kumar
2023-12-13  8:05     ` Vincent Guittot
2023-12-14  0:41   ` Tim Chen
2023-12-14  5:36     ` Viresh Kumar
2023-12-14  5:43   ` Viresh Kumar
2023-12-14  7:57     ` Vincent Guittot
2023-12-14  9:08       ` Lukasz Luba
2023-12-14  9:40         ` Rafael J. Wysocki
2023-12-14 10:41           ` Lukasz Luba
2023-12-14  9:21   ` Lukasz Luba
2023-12-14 11:06     ` Vincent Guittot
2023-12-12 14:27 ` [PATCH 2/4] sched: Take cpufreq feedback into account Vincent Guittot
2023-12-15 16:03   ` Lukasz Luba
2023-12-12 14:27 ` [PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure() Vincent Guittot
2023-12-15 15:38   ` Lukasz Luba
2023-12-12 14:27 ` [PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure Vincent Guittot
2023-12-14  8:31   ` Lukasz Luba
2023-12-14  8:36     ` Vincent Guittot
2023-12-14  8:54       ` Lukasz Luba
2023-12-14  8:54         ` Vincent Guittot
2023-12-14  8:22 ` [PATCH 0/5] Rework system pressure interface to the scheduler Lukasz Luba
2023-12-14  8:29   ` Vincent Guittot
2023-12-14  8:32     ` Lukasz Luba
2023-12-15 15:54       ` Lukasz Luba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).