[PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases
@ 2024-12-17 16:07 Vincent Guittot
  2024-12-17 16:07 ` [PATCH 1/7 v2] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

The current Energy Aware Scheduler has some known limitations which have
became more and more visible with features like uclamp as an example. This
serie tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is capped to a lower compute capacity. This wrong classification
can prevent periodic load balancer to select a group_misfit_task CPU
because group_overloaded has higher priority.

Patch 2 creates a new EM interface that will be used by Patch 3

Patch 3 fixes the issue of tasks being stacked on same CPU of a PD whereas
others might be a better choice. feec() looks for the CPU with the highest
spare capacity in a PD assuming that it will be the best CPU from a energy
efficiency PoV because it will require the smallest increase of OPP.
This is often but not always true, this policy filters some others CPUs
which would be as efficients because of using the same OPP but with less
running tasks as an example.
In fact, we only care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and as a result having the same energy cost. In
such cases, we can use other metrics to select the best CPU with the same
energy cost. Patch 3 rework feec() to look 1st for the lowest cost in a PD
and then the most performant CPU between CPUs. At now, this only tries to
evenly spread the number of runnable tasks on CPUs but this can be
improved with other metric like the sched slice duration in a follow up
series.

perf sched pipe on a dragonboard rb5 has been used to compare the overhead
of the new feec() vs current implementation.

9 iterations of perf bench sched pipe -T -l 80000
                ops/sec  stdev 
tip/sched/core  13001    (+/- 1.2%)
+ patches 1-3   14349    (+/- 5.4%)  +10.4%

Patch 4 removed the now unused em_cpu_energy()

Patch 5 solves another problem with tasks being stuck on a CPU forever
because it doesn't sleep anymore and as a result never wakeup and call
feec(). Such task can be detected by comparing util_avg or runnable_avg
with the compute capacity of the CPU. Once detected, we can call feec() to
check if there is a better CPU for the stuck task. The call can be done in
2 places:
- When the task is put back in the runnnable list after its running slice
  with the balance callback mecanism similarly to the rt/dl push callback.
- During cfs tick when there is only 1 running task stuck on the CPU in
  which case the balance callback can't be used.

This push callback mecanism with the new feec() algorithm ensures that
tasks always get a chance to migrate on the best suitable CPU and don't
stay stuck on a CPU which is no more the most suitable one. As examples:
- A task waking on a big CPU with a uclamp max preventing it to sleep and
  wake up, can migrate on a smaller CPU once it's more power efficient.
- The tasks are spread on CPUs in the PD when they target the same OPP.

Patch 6 adds task misfit migration case in the cfs tick and push callback
mecanism to prevent waking up an idle cpu unnecessarily.

Patch 7 removes the need of testing uclamp_min in cpu_overutilized to
trigger the active migration of a task on another CPU.

Compared to v1:
- The call to feec() even when overutilized has been removed
from this serie and will be adressed in a separate series. Only the case
of uclamp_min has been kept as it is now handled by push callback and
tick mecanism.
- The push mecanism has been cleanup, fixed and simplified.

This series implements some of the topics discussed at OSPM [1]. Other
topics will be part of an other serie

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

Vincent Guittot (7):
  sched/fair: Filter false overloaded_group case for EAS
  energy model: Add a get previous state function
  sched/fair: Rework feec() to use cost instead of spare capacity
  energy model: Remove unused em_cpu_energy()
  sched/fair: Add push task callback for EAS
  sched/fair: Add misfit case to push task callback for EAS
  sched/fair: Update overutilized detection

 include/linux/energy_model.h | 112 ++----
 kernel/sched/fair.c          | 707 +++++++++++++++++++++++------------
 kernel/sched/sched.h         |   2 +
 3 files changed, 503 insertions(+), 318 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/7 v2] sched/fair: Filter false overloaded_group case for EAS
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
@ 2024-12-17 16:07 ` Vincent Guittot
  2024-12-17 19:15   ` Dhaval Giani
  2024-12-17 16:07 ` [PATCH 2/7 v2] energy model: Add a get previous state function Vincent Guittot
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

With EAS, a group should be set overloaded if at least 1 CPU in the group
is overutilized bit it can happen that a CPU is fully utilized by tasks
because of clamping the compute capacity of the CPU. In such case, the CPU
is not overutilized and as a result should not be set overloaded as well.

group_overloaded being a higher priority than group_misfit, such group can
be selected as the busiest group instead of a group with a mistfit task
and prevents load_balance to select the CPU with the misfit task to pull
the latter on a fitting CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c4ebfc82917..893eb6844642 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9916,6 +9916,7 @@ struct sg_lb_stats {
 	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
+	unsigned int group_overutilized;	/* At least one CPU is overutilized in the group */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -10148,6 +10149,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
+	/*
+	 * With EAS and uclamp, 1 CPU in the group must be overutilized to
+	 * consider the group overloaded.
+	 */
+	if (sched_energy_enabled() && !sgs->group_overutilized)
+		return false;
+
 	if (sgs->sum_nr_running <= sgs->group_weight)
 		return false;
 
@@ -10361,8 +10369,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*sg_overloaded = 1;
 
-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
 			*sg_overutilized = 1;
+			sgs->group_overutilized = 1;
+		}
 
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/7 v2] sched/fair: Filter false overloaded_group case for EAS
  2024-12-17 16:07 ` [PATCH 1/7 v2] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2024-12-17 19:15   ` Dhaval Giani
  0 siblings, 0 replies; 19+ messages in thread
From: Dhaval Giani @ 2024-12-17 19:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret

On Tue, Dec 17, 2024 at 05:07:14PM +0100, Vincent Guittot wrote:
> With EAS, a group should be set overloaded if at least 1 CPU in the group
> is overutilized bit it can happen that a CPU is fully utilized by tasks

typo - s/bit/but

> because of clamping the compute capacity of the CPU. In such case, the CPU
> is not overutilized and as a result should not be set overloaded as well.
> 
> group_overloaded being a higher priority than group_misfit, such group can
> be selected as the busiest group instead of a group with a mistfit task
> and prevents load_balance to select the CPU with the misfit task to pull
> the latter on a fitting CPU.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Pierre Gondois <pierre.gondois@arm.com>
 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/7 v2] energy model: Add a get previous state function
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
  2024-12-17 16:07 ` [PATCH 1/7 v2] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2024-12-17 16:07 ` Vincent Guittot
  2024-12-17 16:07 ` [PATCH 3/7 v2] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

Instead of parsing all EM table everytime, add a function to get the
previous state.

Will be used in the scheduler feec() function.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/energy_model.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 752e0b297582..26d0ff72feac 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -215,6 +215,26 @@ em_pd_get_efficient_state(struct em_perf_state *table,
 	return max_ps;
 }
 
+static inline int
+em_pd_get_previous_state(struct em_perf_state *table,
+			 struct em_perf_domain *pd, int idx)
+{
+	unsigned long pd_flags = pd->flags;
+	int min_ps = pd->min_perf_state;
+	struct em_perf_state *ps;
+	int i;
+
+	for (i = idx - 1; i >= min_ps; i--) {
+		ps = &table[i];
+		if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+		    ps->flags & EM_PERF_STATE_INEFFICIENT)
+			continue;
+		return i;
+	}
+
+	return -1;
+}
+
 /**
  * em_cpu_energy() - Estimates the energy consumed by the CPUs of a
  *		performance domain
@@ -361,6 +381,19 @@ static inline struct em_perf_domain *em_pd_get(struct device *dev)
 {
 	return NULL;
 }
+static inline int
+em_pd_get_efficient_state(struct em_perf_state *table,
+			  struct em_perf_domain *pd, unsigned long max_util)
+{
+	return 0;
+}
+
+static inline int
+em_pd_get_previous_state(struct em_perf_state *table, int nr_perf_states,
+			  int idx, unsigned long pd_flags)
+{
+	return -1;
+}
 static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 			unsigned long max_util, unsigned long sum_util,
 			unsigned long allowed_cpu_cap)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/7 v2] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
  2024-12-17 16:07 ` [PATCH 1/7 v2] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
  2024-12-17 16:07 ` [PATCH 2/7 v2] energy model: Add a get previous state function Vincent Guittot
@ 2024-12-17 16:07 ` Vincent Guittot
  2024-12-24 16:46   ` Luis Machado
  2024-12-17 16:07 ` [PATCH 4/7 v2] energy model: Remove unused em_cpu_energy() Vincent Guittot
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

feec() looks for the CPU with highest spare capacity in a PD assuming that
it will be the best CPU from a energy efficiency PoV because it will
require the smallest increase of OPP. Although this is true generally
speaking, this policy also filters some others CPUs which will be as
efficients because of using the same OPP.
In fact, we really care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and as a result using the same energy cost. In
these cases, we can use other metrics to select the best CPU for the same
energy cost.

Rework feec() to look 1st for the lowest cost in a PD and then the most
performant CPU between CPUs. The cost of the OPP remains the only
comparison criteria between Performance Domains.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 463 +++++++++++++++++++++++---------------------
 1 file changed, 241 insertions(+), 222 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 893eb6844642..cd046e8216a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8228,29 +8228,37 @@ unsigned long sched_cpu_util(int cpu)
 }
 
 /*
- * energy_env - Utilization landscape for energy estimation.
- * @task_busy_time: Utilization contribution by the task for which we test the
- *                  placement. Given by eenv_task_busy_time().
- * @pd_busy_time:   Utilization of the whole perf domain without the task
- *                  contribution. Given by eenv_pd_busy_time().
- * @cpu_cap:        Maximum CPU capacity for the perf domain.
- * @pd_cap:         Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
- */
-struct energy_env {
-	unsigned long task_busy_time;
-	unsigned long pd_busy_time;
-	unsigned long cpu_cap;
-	unsigned long pd_cap;
+ * energy_cpu_stat - Utilization landscape for energy estimation.
+ * @idx :        Index of the OPP in the performance domain
+ * @cost :       Cost of the OPP
+ * @max_perf :   Compute capacity of OPP
+ * @min_perf :   Compute capacity of the previous OPP
+ * @capa :       Capacity of the CPU
+ * @runnable :   runnbale_avg of the CPU
+ * @nr_running : number of cfs running task
+ * @fits :       Fits level of the CPU
+ * @cpu :        current best CPU
+ */
+struct energy_cpu_stat {
+	unsigned long idx;
+	unsigned long cost;
+	unsigned long max_perf;
+	unsigned long min_perf;
+	unsigned long capa;
+	unsigned long util;
+	unsigned long runnable;
+	unsigned int nr_running;
+	int fits;
+	int cpu;
 };
 
 /*
- * Compute the task busy time for compute_energy(). This time cannot be
- * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * Compute the task busy time for computing its energy impact. This time cannot
+ * be injected directly into effective_cpu_util() because of the IRQ scaling.
  * The latter only makes sense with the most recent CPUs where the task has
  * run.
  */
-static inline void eenv_task_busy_time(struct energy_env *eenv,
-				       struct task_struct *p, int prev_cpu)
+static inline unsigned long task_busy_time(struct task_struct *p, int prev_cpu)
 {
 	unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
 	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
@@ -8260,124 +8268,150 @@ static inline void eenv_task_busy_time(struct energy_env *eenv,
 	else
 		busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
 
-	eenv->task_busy_time = busy_time;
+	return busy_time;
 }
 
-/*
- * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
- * utilization for each @pd_cpus, it however doesn't take into account
- * clamping since the ratio (utilization / cpu_capacity) is already enough to
- * scale the EM reported power consumption at the (eventually clamped)
- * cpu_capacity.
- *
- * The contribution of the task @p for which we want to estimate the
- * energy cost is removed (by cpu_util()) and must be calculated
- * separately (see eenv_task_busy_time). This ensures:
- *
- *   - A stable PD utilization, no matter which CPU of that PD we want to place
- *     the task on.
- *
- *   - A fair comparison between CPUs as the task contribution (task_util())
- *     will always be the same no matter which CPU utilization we rely on
- *     (util_avg or util_est).
- *
- * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
- * exceed @eenv->pd_cap.
- */
-static inline void eenv_pd_busy_time(struct energy_env *eenv,
-				     struct cpumask *pd_cpus,
-				     struct task_struct *p)
+/* Estimate the utilization of the CPU that is then used to select the OPP */
+static unsigned long find_cpu_max_util(int cpu, struct task_struct *p, int dst_cpu)
 {
-	unsigned long busy_time = 0;
-	int cpu;
+	unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+	unsigned long eff_util, min, max;
+
+	/*
+	 * Performance domain frequency: utilization clamping
+	 * must be considered since it affects the selection
+	 * of the performance domain frequency.
+	 */
+	eff_util = effective_cpu_util(cpu, util, &min, &max);
 
-	for_each_cpu(cpu, pd_cpus) {
-		unsigned long util = cpu_util(cpu, p, -1, 0);
+	/* Task's uclamp can modify min and max value */
+	if (uclamp_is_used() && cpu == dst_cpu) {
+		min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		/*
+		 * If there is no active max uclamp constraint,
+		 * directly use task's one, otherwise keep max.
+		 */
+		if (uclamp_rq_is_idle(cpu_rq(cpu)))
+			max = uclamp_eff_value(p, UCLAMP_MAX);
+		else
+			max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
 	}
 
-	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
+	eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+	return eff_util;
 }
 
-/*
- * Compute the maximum utilization for compute_energy() when the task @p
- * is placed on the cpu @dst_cpu.
- *
- * Returns the maximum utilization among @eenv->cpus. This utilization can't
- * exceed @eenv->cpu_cap.
- */
-static inline unsigned long
-eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
-		 struct task_struct *p, int dst_cpu)
+/* Estimate the utilization of the CPU without the task */
+static unsigned long find_cpu_actual_util(int cpu, struct task_struct *p)
 {
-	unsigned long max_util = 0;
-	int cpu;
+	unsigned long util = cpu_util(cpu, p, -1, 0);
+	unsigned long eff_util;
 
-	for_each_cpu(cpu, pd_cpus) {
-		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
-		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
-		unsigned long eff_util, min, max;
+	eff_util = effective_cpu_util(cpu, util, NULL, NULL);
 
-		/*
-		 * Performance domain frequency: utilization clamping
-		 * must be considered since it affects the selection
-		 * of the performance domain frequency.
-		 * NOTE: in case RT tasks are running, by default the min
-		 * utilization can be max OPP.
-		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
+	return eff_util;
+}
 
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
+/* Find the cost of a performance domain for the estimated utilization */
+static inline void find_pd_cost(struct em_perf_domain *pd,
+				unsigned long max_util,
+				struct energy_cpu_stat *stat)
+{
+	struct em_perf_table *em_table;
+	struct em_perf_state *ps;
+	int i;
 
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
+	/*
+	 * Find the lowest performance state of the Energy Model above the
+	 * requested performance.
+	 */
+	em_table = rcu_dereference(pd->em_table);
+	i = em_pd_get_efficient_state(em_table->state, pd, max_util);
+	ps = &em_table->state[i];
 
-		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
-		max_util = max(max_util, eff_util);
+	/* Save the cost and performance range of the OPP */
+	stat->max_perf = ps->performance;
+	stat->cost = ps->cost;
+	i = em_pd_get_previous_state(em_table->state, pd, i);
+	if (i < 0)
+		stat->min_perf = 0;
+	else {
+		ps = &em_table->state[i];
+		stat->min_perf = ps->performance;
 	}
-
-	return min(max_util, eenv->cpu_cap);
 }
 
-/*
- * compute_energy(): Use the Energy Model to estimate the energy that @pd would
- * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
- * contribution is ignored.
- */
-static inline unsigned long
-compute_energy(struct energy_env *eenv, struct perf_domain *pd,
-	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
+/*Check if the CPU can handle the waking task */
+static int check_cpu_with_task(struct task_struct *p, int cpu)
 {
-	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
-	unsigned long busy_time = eenv->pd_busy_time;
-	unsigned long energy;
+	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
+	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+	unsigned long util_min = p_util_min;
+	unsigned long util_max = p_util_max;
+	unsigned long util = cpu_util(cpu, p, cpu, 0);
+	struct rq *rq = cpu_rq(cpu);
 
-	if (dst_cpu >= 0)
-		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
+	/*
+	 * Skip CPUs that cannot satisfy the capacity request.
+	 * IOW, placing the task there would make the CPU
+	 * overutilized. Take uclamp into account to see how
+	 * much capacity we can get out of the CPU; this is
+	 * aligned with sched_cpu_util().
+	 */
+	if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
+		unsigned long rq_util_min, rq_util_max;
+		/*
+		 * Open code uclamp_rq_util_with() except for
+		 * the clamp() part. I.e.: apply max aggregation
+		 * only. util_fits_cpu() logic requires to
+		 * operate on non clamped util but must use the
+		 * max-aggregated uclamp_{min, max}.
+		 */
+		rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
+		rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
+		util_min = max(rq_util_min, p_util_min);
+		util_max = max(rq_util_max, p_util_max);
+	}
+	return util_fits_cpu(util, util_min, util_max, cpu);
+}
+
+/* For a same cost, select the CPU that will povide best performance for the task */
+static bool select_best_cpu(struct energy_cpu_stat *target,
+			    struct energy_cpu_stat *min,
+			    int prev, struct sched_domain *sd)
+{
+	/*  Select the one with the least number of running tasks */
+	if (target->nr_running < min->nr_running)
+		return true;
+	if (target->nr_running > min->nr_running)
+		return false;
 
-	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+	/* Favor previous CPU otherwise */
+	if (target->cpu == prev)
+		return true;
+	if (min->cpu == prev)
+		return false;
 
-	trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
+	/*
+	 * Choose CPU with lowest contention. One might want to consider load instead of
+	 * runnable but we are supposed to not be overutilized so there is enough compute
+	 * capacity for everybody.
+	 */
+	if ((target->runnable * min->capa * sd->imbalance_pct) >=
+			(min->runnable * target->capa * 100))
+		return false;
 
-	return energy;
+	return true;
 }
 
 /*
  * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
- * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
- * spare capacity in each performance domain and uses it as a potential
- * candidate to execute the task. Then, it uses the Energy Model to figure
- * out which of the CPU candidates is the most energy-efficient.
+ * waking task. find_energy_efficient_cpu() looks for the CPU with the lowest
+ * power cost (usually with maximum spare capacity but not always) in each
+ * performance domain and uses it as a potential candidate to execute the task.
+ * Then, it uses the Energy Model to figure out which of the CPU candidates is
+ * the most energy-efficient.
  *
  * The rationale for this heuristic is as follows. In a performance domain,
  * all the most energy efficient CPU candidates (according to the Energy
@@ -8414,17 +8448,14 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
 static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
-	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
-	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
-	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
 	struct root_domain *rd = this_rq()->rd;
-	int cpu, best_energy_cpu, target = -1;
-	int prev_fits = -1, best_fits = -1;
-	unsigned long best_actual_cap = 0;
-	unsigned long prev_actual_cap = 0;
+	unsigned long best_nrg = ULONG_MAX;
+	unsigned long task_util;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
-	struct energy_env eenv;
+	int cpu, target = -1;
+	int best_fits = -1;
+	int best_cpu = -1;
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
@@ -8444,19 +8475,19 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	target = prev_cpu;
 
 	sync_entity_load_avg(&p->se);
-	if (!task_util_est(p) && p_util_min == 0)
-		goto unlock;
-
-	eenv_task_busy_time(&eenv, p, prev_cpu);
+	task_util = task_busy_time(p, prev_cpu);
 
 	for (; pd; pd = pd->next) {
-		unsigned long util_min = p_util_min, util_max = p_util_max;
-		unsigned long cpu_cap, cpu_actual_cap, util;
-		long prev_spare_cap = -1, max_spare_cap = -1;
-		unsigned long rq_util_min, rq_util_max;
-		unsigned long cur_delta, base_energy;
-		int max_spare_cap_cpu = -1;
-		int fits, max_fits = -1;
+		unsigned long pd_actual_util = 0, delta_nrg = 0;
+		unsigned long cpu_actual_cap, max_cost = 0;
+		struct energy_cpu_stat target_stat;
+		struct energy_cpu_stat min_stat = {
+			.cost = ULONG_MAX,
+			.max_perf = ULONG_MAX,
+			.min_perf = ULONG_MAX,
+			.fits = -2,
+			.cpu = -1,
+		};
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -8467,13 +8498,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		cpu = cpumask_first(cpus);
 		cpu_actual_cap = get_actual_cpu_capacity(cpu);
 
-		eenv.cpu_cap = cpu_actual_cap;
-		eenv.pd_cap = 0;
-
+		/* In a PD, the CPU with the lowest cost will be the most efficient */
 		for_each_cpu(cpu, cpus) {
-			struct rq *rq = cpu_rq(cpu);
-
-			eenv.pd_cap += cpu_actual_cap;
+			unsigned long target_perf;
 
 			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
 				continue;
@@ -8481,120 +8508,112 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
-			util = cpu_util(cpu, p, cpu, 0);
-			cpu_cap = capacity_of(cpu);
+			target_stat.fits = check_cpu_with_task(p, cpu);
+
+			if (!target_stat.fits)
+				continue;
+
+			/* 1st select the CPU that fits best */
+			if (target_stat.fits < min_stat.fits)
+				continue;
+
+			/* Then select the CPU with lowest cost */
+
+			/* Get the performance of the CPU w/ waking task. */
+			target_perf = find_cpu_max_util(cpu, p, cpu);
+			target_perf = min(target_perf, cpu_actual_cap);
+
+			/* Needing a higher OPP means a higher cost */
+			if (target_perf > min_stat.max_perf)
+				continue;
 
 			/*
-			 * Skip CPUs that cannot satisfy the capacity request.
-			 * IOW, placing the task there would make the CPU
-			 * overutilized. Take uclamp into account to see how
-			 * much capacity we can get out of the CPU; this is
-			 * aligned with sched_cpu_util().
+			 * At this point, target's cost can be either equal or
+			 * lower than the current minimum cost.
 			 */
-			if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
-				/*
-				 * Open code uclamp_rq_util_with() except for
-				 * the clamp() part. I.e.: apply max aggregation
-				 * only. util_fits_cpu() logic requires to
-				 * operate on non clamped util but must use the
-				 * max-aggregated uclamp_{min, max}.
-				 */
-				rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
-				rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
 
-				util_min = max(rq_util_min, p_util_min);
-				util_max = max(rq_util_max, p_util_max);
-			}
+			/* Gather more statistics */
+			target_stat.cpu = cpu;
+			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
+			target_stat.capa = capacity_of(cpu);
+			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
 
-			fits = util_fits_cpu(util, util_min, util_max, cpu);
-			if (!fits)
+			/* If the target needs a lower OPP, then look up for
+			 * the corresponding OPP and its associated cost.
+			 * Otherwise at same cost level, select the CPU which
+			 * provides best performance.
+			 */
+			if (target_perf < min_stat.min_perf)
+				find_pd_cost(pd->em_pd, target_perf, &target_stat);
+			else if (!select_best_cpu(&target_stat, &min_stat, prev_cpu, sd))
 				continue;
 
-			lsub_positive(&cpu_cap, util);
-
-			if (cpu == prev_cpu) {
-				/* Always use prev_cpu as a candidate. */
-				prev_spare_cap = cpu_cap;
-				prev_fits = fits;
-			} else if ((fits > max_fits) ||
-				   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
-				/*
-				 * Find the CPU with the maximum spare capacity
-				 * among the remaining CPUs in the performance
-				 * domain.
-				 */
-				max_spare_cap = cpu_cap;
-				max_spare_cap_cpu = cpu;
-				max_fits = fits;
-			}
+			/* Save the new most efficient CPU of the PD */
+			min_stat = target_stat;
 		}
 
-		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+		if (min_stat.cpu == -1)
 			continue;
 
-		eenv_pd_busy_time(&eenv, cpus, p);
-		/* Compute the 'base' energy of the pd, without @p */
-		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
+		if (min_stat.fits < best_fits)
+			continue;
 
-		/* Evaluate the energy impact of using prev_cpu. */
-		if (prev_spare_cap > -1) {
-			prev_delta = compute_energy(&eenv, pd, cpus, p,
-						    prev_cpu);
-			/* CPU utilization has changed */
-			if (prev_delta < base_energy)
-				goto unlock;
-			prev_delta -= base_energy;
-			prev_actual_cap = cpu_actual_cap;
-			best_delta = min(best_delta, prev_delta);
-		}
+		/* Idle system costs nothing */
+		target_stat.max_perf = 0;
+		target_stat.cost = 0;
 
-		/* Evaluate the energy impact of using max_spare_cap_cpu. */
-		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
-			/* Current best energy cpu fits better */
-			if (max_fits < best_fits)
-				continue;
+		/* Estimate utilization and cost without p */
+		for_each_cpu(cpu, cpus) {
+			unsigned long target_util;
 
-			/*
-			 * Both don't fit performance hint (i.e. uclamp_min)
-			 * but best energy cpu has better capacity.
-			 */
-			if ((max_fits < 0) &&
-			    (cpu_actual_cap <= best_actual_cap))
-				continue;
+			/* Accumulate actual utilization w/o task p */
+			pd_actual_util += find_cpu_actual_util(cpu, p);
 
-			cur_delta = compute_energy(&eenv, pd, cpus, p,
-						   max_spare_cap_cpu);
-			/* CPU utilization has changed */
-			if (cur_delta < base_energy)
-				goto unlock;
-			cur_delta -= base_energy;
+			/* Get the max utilization of the CPU w/o task p */
+			target_util = find_cpu_max_util(cpu, p, -1);
+			target_util = min(target_util, cpu_actual_cap);
 
-			/*
-			 * Both fit for the task but best energy cpu has lower
-			 * energy impact.
-			 */
-			if ((max_fits > 0) && (best_fits > 0) &&
-			    (cur_delta >= best_delta))
+			/* Current OPP is enough */
+			if (target_util <= target_stat.max_perf)
 				continue;
 
-			best_delta = cur_delta;
-			best_energy_cpu = max_spare_cap_cpu;
-			best_fits = max_fits;
-			best_actual_cap = cpu_actual_cap;
+			/* Compute and save the cost of the OPP */
+			find_pd_cost(pd->em_pd, target_util, &target_stat);
+			max_cost = target_stat.cost;
 		}
-	}
-	rcu_read_unlock();
 
-	if ((best_fits > prev_fits) ||
-	    ((best_fits > 0) && (best_delta < prev_delta)) ||
-	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
-		target = best_energy_cpu;
+		/* Add the NRG cost of p */
+		delta_nrg = task_util * min_stat.cost;
 
-	return target;
+		/* Compute the NRG cost of others running at higher OPP because of p */
+		if (min_stat.cost > max_cost)
+			delta_nrg += pd_actual_util * (min_stat.cost - max_cost);
+
+		/* nrg with p */
+		trace_sched_compute_energy_tp(p, min_stat.cpu, delta_nrg,
+				min_stat.max_perf, pd_actual_util + task_util);
+
+		/*
+		 * The probability that delta NRGs are equals is almost null. PDs being sorted
+		 * by max capacity, keep the one with highest max capacity if this
+		 * happens.
+		 * TODO: add a margin in nrg cost and take into account other stats
+		 */
+		if ((min_stat.fits == best_fits) &&
+		    (delta_nrg >= best_nrg))
+			continue;
+
+		best_fits = min_stat.fits;
+		best_nrg = delta_nrg;
+		best_cpu = min_stat.cpu;
+	}
 
 unlock:
 	rcu_read_unlock();
 
+	if (best_cpu >= 0)
+		target = best_cpu;
+
 	return target;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/7 v2] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-12-17 16:07 ` [PATCH 3/7 v2] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
@ 2024-12-24 16:46   ` Luis Machado
  0 siblings, 0 replies; 19+ messages in thread
From: Luis Machado @ 2024-12-24 16:46 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret

Hi,

Just spotted a few things amidst going through the series to get a feel
for what it is trying to accomplish. Feel free to take or ignore the
suggestions, as they're mostly cosmetic anyway.

On 12/17/24 16:07, Vincent Guittot wrote:
> feec() looks for the CPU with highest spare capacity in a PD assuming that
> it will be the best CPU from a energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is true generally
> speaking, this policy also filters some others CPUs which will be as
> efficients because of using the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
> 
> Rework feec() to look 1st for the lowest cost in a PD and then the most
> performant CPU between CPUs. The cost of the OPP remains the only
> comparison criteria between Performance Domains.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 463 +++++++++++++++++++++++---------------------
>  1 file changed, 241 insertions(+), 222 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 893eb6844642..cd046e8216a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8228,29 +8228,37 @@ unsigned long sched_cpu_util(int cpu)
>  }
>  
>  /*
> - * energy_env - Utilization landscape for energy estimation.
> - * @task_busy_time: Utilization contribution by the task for which we test the
> - *                  placement. Given by eenv_task_busy_time().
> - * @pd_busy_time:   Utilization of the whole perf domain without the task
> - *                  contribution. Given by eenv_pd_busy_time().
> - * @cpu_cap:        Maximum CPU capacity for the perf domain.
> - * @pd_cap:         Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
> - */
> -struct energy_env {
> -	unsigned long task_busy_time;
> -	unsigned long pd_busy_time;
> -	unsigned long cpu_cap;
> -	unsigned long pd_cap;
> + * energy_cpu_stat - Utilization landscape for energy estimation.
> + * @idx :        Index of the OPP in the performance domain
> + * @cost :       Cost of the OPP
> + * @max_perf :   Compute capacity of OPP
> + * @min_perf :   Compute capacity of the previous OPP
> + * @capa :       Capacity of the CPU
> + * @runnable :   runnbale_avg of the CPU

Typo: runnbale/runnable

> + * @nr_running : number of cfs running task
> + * @fits :       Fits level of the CPU
> + * @cpu :        current best CPU
> + */
> +struct energy_cpu_stat {
> +	unsigned long idx;
> +	unsigned long cost;
> +	unsigned long max_perf;
> +	unsigned long min_perf;
> +	unsigned long capa;
> +	unsigned long util;
> +	unsigned long runnable;
> +	unsigned int nr_running;
> +	int fits;
> +	int cpu;
>  };
>  
>  /*
> - * Compute the task busy time for compute_energy(). This time cannot be
> - * injected directly into effective_cpu_util() because of the IRQ scaling.
> + * Compute the task busy time for computing its energy impact. This time cannot
> + * be injected directly into effective_cpu_util() because of the IRQ scaling.
>   * The latter only makes sense with the most recent CPUs where the task has
>   * run.
>   */
> -static inline void eenv_task_busy_time(struct energy_env *eenv,
> -				       struct task_struct *p, int prev_cpu)
> +static inline unsigned long task_busy_time(struct task_struct *p, int prev_cpu)
>  {
>  	unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
>  	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
> @@ -8260,124 +8268,150 @@ static inline void eenv_task_busy_time(struct energy_env *eenv,
>  	else
>  		busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
>  
> -	eenv->task_busy_time = busy_time;
> +	return busy_time;
>  }
>  
> -/*
> - * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
> - * utilization for each @pd_cpus, it however doesn't take into account
> - * clamping since the ratio (utilization / cpu_capacity) is already enough to
> - * scale the EM reported power consumption at the (eventually clamped)
> - * cpu_capacity.
> - *
> - * The contribution of the task @p for which we want to estimate the
> - * energy cost is removed (by cpu_util()) and must be calculated
> - * separately (see eenv_task_busy_time). This ensures:
> - *
> - *   - A stable PD utilization, no matter which CPU of that PD we want to place
> - *     the task on.
> - *
> - *   - A fair comparison between CPUs as the task contribution (task_util())
> - *     will always be the same no matter which CPU utilization we rely on
> - *     (util_avg or util_est).
> - *
> - * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
> - * exceed @eenv->pd_cap.
> - */
> -static inline void eenv_pd_busy_time(struct energy_env *eenv,
> -				     struct cpumask *pd_cpus,
> -				     struct task_struct *p)
> +/* Estimate the utilization of the CPU that is then used to select the OPP */
> +static unsigned long find_cpu_max_util(int cpu, struct task_struct *p, int dst_cpu)
>  {
> -	unsigned long busy_time = 0;
> -	int cpu;
> +	unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> +	unsigned long eff_util, min, max;
> +
> +	/*
> +	 * Performance domain frequency: utilization clamping
> +	 * must be considered since it affects the selection
> +	 * of the performance domain frequency.
> +	 */
> +	eff_util = effective_cpu_util(cpu, util, &min, &max);
>  
> -	for_each_cpu(cpu, pd_cpus) {
> -		unsigned long util = cpu_util(cpu, p, -1, 0);
> +	/* Task's uclamp can modify min and max value */
> +	if (uclamp_is_used() && cpu == dst_cpu) {
> +		min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
>  
> -		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> +		/*
> +		 * If there is no active max uclamp constraint,
> +		 * directly use task's one, otherwise keep max.
> +		 */
> +		if (uclamp_rq_is_idle(cpu_rq(cpu)))
> +			max = uclamp_eff_value(p, UCLAMP_MAX);
> +		else
> +			max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
>  	}
>  
> -	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
> +	eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
> +	return eff_util;
>  }
>  
> -/*
> - * Compute the maximum utilization for compute_energy() when the task @p
> - * is placed on the cpu @dst_cpu.
> - *
> - * Returns the maximum utilization among @eenv->cpus. This utilization can't
> - * exceed @eenv->cpu_cap.
> - */
> -static inline unsigned long
> -eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
> -		 struct task_struct *p, int dst_cpu)
> +/* Estimate the utilization of the CPU without the task */
> +static unsigned long find_cpu_actual_util(int cpu, struct task_struct *p)
>  {
> -	unsigned long max_util = 0;
> -	int cpu;
> +	unsigned long util = cpu_util(cpu, p, -1, 0);
> +	unsigned long eff_util;
>  
> -	for_each_cpu(cpu, pd_cpus) {
> -		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> -		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> -		unsigned long eff_util, min, max;
> +	eff_util = effective_cpu_util(cpu, util, NULL, NULL);
>  
> -		/*
> -		 * Performance domain frequency: utilization clamping
> -		 * must be considered since it affects the selection
> -		 * of the performance domain frequency.
> -		 * NOTE: in case RT tasks are running, by default the min
> -		 * utilization can be max OPP.
> -		 */
> -		eff_util = effective_cpu_util(cpu, util, &min, &max);
> +	return eff_util;
> +}
>  
> -		/* Task's uclamp can modify min and max value */
> -		if (tsk && uclamp_is_used()) {
> -			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
> +/* Find the cost of a performance domain for the estimated utilization */
> +static inline void find_pd_cost(struct em_perf_domain *pd,
> +				unsigned long max_util,
> +				struct energy_cpu_stat *stat)
> +{
> +	struct em_perf_table *em_table;
> +	struct em_perf_state *ps;
> +	int i;
>  
> -			/*
> -			 * If there is no active max uclamp constraint,
> -			 * directly use task's one, otherwise keep max.
> -			 */
> -			if (uclamp_rq_is_idle(cpu_rq(cpu)))
> -				max = uclamp_eff_value(p, UCLAMP_MAX);
> -			else
> -				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
> -		}
> +	/*
> +	 * Find the lowest performance state of the Energy Model above the
> +	 * requested performance.
> +	 */
> +	em_table = rcu_dereference(pd->em_table);
> +	i = em_pd_get_efficient_state(em_table->state, pd, max_util);
> +	ps = &em_table->state[i];
>  
> -		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
> -		max_util = max(max_util, eff_util);
> +	/* Save the cost and performance range of the OPP */
> +	stat->max_perf = ps->performance;
> +	stat->cost = ps->cost;
> +	i = em_pd_get_previous_state(em_table->state, pd, i);
> +	if (i < 0)
> +		stat->min_perf = 0;
> +	else {
> +		ps = &em_table->state[i];
> +		stat->min_perf = ps->performance;
>  	}
> -
> -	return min(max_util, eenv->cpu_cap);
>  }
>  
> -/*
> - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
> - * contribution is ignored.
> - */
> -static inline unsigned long
> -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> -	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> +/*Check if the CPU can handle the waking task */
> +static int check_cpu_with_task(struct task_struct *p, int cpu)
>  {
> -	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> -	unsigned long busy_time = eenv->pd_busy_time;
> -	unsigned long energy;
> +	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> +	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> +	unsigned long util_min = p_util_min;
> +	unsigned long util_max = p_util_max;
> +	unsigned long util = cpu_util(cpu, p, cpu, 0);
> +	struct rq *rq = cpu_rq(cpu);
>  
> -	if (dst_cpu >= 0)
> -		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> +	/*
> +	 * Skip CPUs that cannot satisfy the capacity request.
> +	 * IOW, placing the task there would make the CPU
> +	 * overutilized. Take uclamp into account to see how
> +	 * much capacity we can get out of the CPU; this is
> +	 * aligned with sched_cpu_util().
> +	 */
> +	if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> +		unsigned long rq_util_min, rq_util_max;
> +		/*
> +		 * Open code uclamp_rq_util_with() except for
> +		 * the clamp() part. I.e.: apply max aggregation
> +		 * only. util_fits_cpu() logic requires to
> +		 * operate on non clamped util but must use the
> +		 * max-aggregated uclamp_{min, max}.
> +		 */
> +		rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> +		rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> +		util_min = max(rq_util_min, p_util_min);
> +		util_max = max(rq_util_max, p_util_max);
> +	}
> +	return util_fits_cpu(util, util_min, util_max, cpu);
> +}
> +
> +/* For a same cost, select the CPU that will povide best performance for the task */

s/For a same cost/For the same cost
s/povide/provide

> +static bool select_best_cpu(struct energy_cpu_stat *target,
> +			    struct energy_cpu_stat *min,
> +			    int prev, struct sched_domain *sd)
> +{
> +	/*  Select the one with the least number of running tasks */
> +	if (target->nr_running < min->nr_running)
> +		return true;
> +	if (target->nr_running > min->nr_running)
> +		return false;

Reading through the above, it seems obvious what to do for the cases where we're below
and above min->nr_running. Then the only case for which we want to do additional checks is
when target->nr_running == min->nr_running. Would you mind adding a comment to that effect,
on why we need further processing for that case? I feel it will help clarify things.

>  
> -	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> +	/* Favor previous CPU otherwise */
> +	if (target->cpu == prev)
> +		return true;
> +	if (min->cpu == prev)
> +		return false;
>  
> -	trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> +	/*
> +	 * Choose CPU with lowest contention. One might want to consider load instead of
> +	 * runnable but we are supposed to not be overutilized so there is enough compute
> +	 * capacity for everybody.
> +	 */
> +	if ((target->runnable * min->capa * sd->imbalance_pct) >=
> +			(min->runnable * target->capa * 100))
> +		return false;
>  
> -	return energy;
> +	return true;
>  }

Might be just me, but the name of the function, select_best_cpu, seems to call for a
different answer than false/true. It seems to me it is trying to determine whether
target->cpu is the best cpu for the job at hand. Should we adjust the function name to
something a bit more appropriate? Maybe is_best_cpu?

>  
>  /*
>   * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
> - * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
> - * spare capacity in each performance domain and uses it as a potential
> - * candidate to execute the task. Then, it uses the Energy Model to figure
> - * out which of the CPU candidates is the most energy-efficient.
> + * waking task. find_energy_efficient_cpu() looks for the CPU with the lowest
> + * power cost (usually with maximum spare capacity but not always) in each
> + * performance domain and uses it as a potential candidate to execute the task.
> + * Then, it uses the Energy Model to figure out which of the CPU candidates is
> + * the most energy-efficient.
>   *
>   * The rationale for this heuristic is as follows. In a performance domain,
>   * all the most energy efficient CPU candidates (according to the Energy
> @@ -8414,17 +8448,14 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> -	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> -	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> -	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
>  	struct root_domain *rd = this_rq()->rd;
> -	int cpu, best_energy_cpu, target = -1;
> -	int prev_fits = -1, best_fits = -1;
> -	unsigned long best_actual_cap = 0;
> -	unsigned long prev_actual_cap = 0;
> +	unsigned long best_nrg = ULONG_MAX;
> +	unsigned long task_util;
>  	struct sched_domain *sd;
>  	struct perf_domain *pd;
> -	struct energy_env eenv;
> +	int cpu, target = -1;
> +	int best_fits = -1;
> +	int best_cpu = -1;
>  
>  	rcu_read_lock();
>  	pd = rcu_dereference(rd->pd);
> @@ -8444,19 +8475,19 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	target = prev_cpu;
>  
>  	sync_entity_load_avg(&p->se);
> -	if (!task_util_est(p) && p_util_min == 0)
> -		goto unlock;
> -
> -	eenv_task_busy_time(&eenv, p, prev_cpu);
> +	task_util = task_busy_time(p, prev_cpu);
>  
>  	for (; pd; pd = pd->next) {
> -		unsigned long util_min = p_util_min, util_max = p_util_max;
> -		unsigned long cpu_cap, cpu_actual_cap, util;
> -		long prev_spare_cap = -1, max_spare_cap = -1;
> -		unsigned long rq_util_min, rq_util_max;
> -		unsigned long cur_delta, base_energy;
> -		int max_spare_cap_cpu = -1;
> -		int fits, max_fits = -1;
> +		unsigned long pd_actual_util = 0, delta_nrg = 0;
> +		unsigned long cpu_actual_cap, max_cost = 0;
> +		struct energy_cpu_stat target_stat;
> +		struct energy_cpu_stat min_stat = {
> +			.cost = ULONG_MAX,
> +			.max_perf = ULONG_MAX,
> +			.min_perf = ULONG_MAX,
> +			.fits = -2,
> +			.cpu = -1,
> +		};
>  
>  		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>  
> @@ -8467,13 +8498,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  		cpu = cpumask_first(cpus);
>  		cpu_actual_cap = get_actual_cpu_capacity(cpu);
>  
> -		eenv.cpu_cap = cpu_actual_cap;
> -		eenv.pd_cap = 0;
> -
> +		/* In a PD, the CPU with the lowest cost will be the most efficient */
>  		for_each_cpu(cpu, cpus) {
> -			struct rq *rq = cpu_rq(cpu);
> -
> -			eenv.pd_cap += cpu_actual_cap;
> +			unsigned long target_perf;
>  
>  			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
>  				continue;
> @@ -8481,120 +8508,112 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>  				continue;
>  
> -			util = cpu_util(cpu, p, cpu, 0);
> -			cpu_cap = capacity_of(cpu);
> +			target_stat.fits = check_cpu_with_task(p, cpu);
> +
> +			if (!target_stat.fits)
> +				continue;
> +
> +			/* 1st select the CPU that fits best */
> +			if (target_stat.fits < min_stat.fits)
> +				continue;
> +
> +			/* Then select the CPU with lowest cost */
> +
> +			/* Get the performance of the CPU w/ waking task. */
> +			target_perf = find_cpu_max_util(cpu, p, cpu);
> +			target_perf = min(target_perf, cpu_actual_cap);
> +
> +			/* Needing a higher OPP means a higher cost */
> +			if (target_perf > min_stat.max_perf)
> +				continue;
>  
>  			/*
> -			 * Skip CPUs that cannot satisfy the capacity request.
> -			 * IOW, placing the task there would make the CPU
> -			 * overutilized. Take uclamp into account to see how
> -			 * much capacity we can get out of the CPU; this is
> -			 * aligned with sched_cpu_util().
> +			 * At this point, target's cost can be either equal or
> +			 * lower than the current minimum cost.
>  			 */
> -			if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> -				/*
> -				 * Open code uclamp_rq_util_with() except for
> -				 * the clamp() part. I.e.: apply max aggregation
> -				 * only. util_fits_cpu() logic requires to
> -				 * operate on non clamped util but must use the
> -				 * max-aggregated uclamp_{min, max}.
> -				 */
> -				rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> -				rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
>  
> -				util_min = max(rq_util_min, p_util_min);
> -				util_max = max(rq_util_max, p_util_max);
> -			}
> +			/* Gather more statistics */
> +			target_stat.cpu = cpu;
> +			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
> +			target_stat.capa = capacity_of(cpu);
> +			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
>  
> -			fits = util_fits_cpu(util, util_min, util_max, cpu);
> -			if (!fits)
> +			/* If the target needs a lower OPP, then look up for
> +			 * the corresponding OPP and its associated cost.
> +			 * Otherwise at same cost level, select the CPU which
> +			 * provides best performance.
> +			 */
> +			if (target_perf < min_stat.min_perf)
> +				find_pd_cost(pd->em_pd, target_perf, &target_stat);
> +			else if (!select_best_cpu(&target_stat, &min_stat, prev_cpu, sd))
>  				continue;
>  
> -			lsub_positive(&cpu_cap, util);
> -
> -			if (cpu == prev_cpu) {
> -				/* Always use prev_cpu as a candidate. */
> -				prev_spare_cap = cpu_cap;
> -				prev_fits = fits;
> -			} else if ((fits > max_fits) ||
> -				   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> -				/*
> -				 * Find the CPU with the maximum spare capacity
> -				 * among the remaining CPUs in the performance
> -				 * domain.
> -				 */
> -				max_spare_cap = cpu_cap;
> -				max_spare_cap_cpu = cpu;
> -				max_fits = fits;
> -			}
> +			/* Save the new most efficient CPU of the PD */
> +			min_stat = target_stat;
>  		}
>  
> -		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> +		if (min_stat.cpu == -1)
>  			continue;
>  
> -		eenv_pd_busy_time(&eenv, cpus, p);
> -		/* Compute the 'base' energy of the pd, without @p */
> -		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
> +		if (min_stat.fits < best_fits)
> +			continue;
>  
> -		/* Evaluate the energy impact of using prev_cpu. */
> -		if (prev_spare_cap > -1) {
> -			prev_delta = compute_energy(&eenv, pd, cpus, p,
> -						    prev_cpu);
> -			/* CPU utilization has changed */
> -			if (prev_delta < base_energy)
> -				goto unlock;
> -			prev_delta -= base_energy;
> -			prev_actual_cap = cpu_actual_cap;
> -			best_delta = min(best_delta, prev_delta);
> -		}
> +		/* Idle system costs nothing */
> +		target_stat.max_perf = 0;
> +		target_stat.cost = 0;
>  
> -		/* Evaluate the energy impact of using max_spare_cap_cpu. */
> -		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> -			/* Current best energy cpu fits better */
> -			if (max_fits < best_fits)
> -				continue;
> +		/* Estimate utilization and cost without p */
> +		for_each_cpu(cpu, cpus) {
> +			unsigned long target_util;
>  
> -			/*
> -			 * Both don't fit performance hint (i.e. uclamp_min)
> -			 * but best energy cpu has better capacity.
> -			 */
> -			if ((max_fits < 0) &&
> -			    (cpu_actual_cap <= best_actual_cap))
> -				continue;
> +			/* Accumulate actual utilization w/o task p */
> +			pd_actual_util += find_cpu_actual_util(cpu, p);
>  
> -			cur_delta = compute_energy(&eenv, pd, cpus, p,
> -						   max_spare_cap_cpu);
> -			/* CPU utilization has changed */
> -			if (cur_delta < base_energy)
> -				goto unlock;
> -			cur_delta -= base_energy;
> +			/* Get the max utilization of the CPU w/o task p */
> +			target_util = find_cpu_max_util(cpu, p, -1);
> +			target_util = min(target_util, cpu_actual_cap);
>  
> -			/*
> -			 * Both fit for the task but best energy cpu has lower
> -			 * energy impact.
> -			 */
> -			if ((max_fits > 0) && (best_fits > 0) &&
> -			    (cur_delta >= best_delta))
> +			/* Current OPP is enough */
> +			if (target_util <= target_stat.max_perf)
>  				continue;
>  
> -			best_delta = cur_delta;
> -			best_energy_cpu = max_spare_cap_cpu;
> -			best_fits = max_fits;
> -			best_actual_cap = cpu_actual_cap;
> +			/* Compute and save the cost of the OPP */
> +			find_pd_cost(pd->em_pd, target_util, &target_stat);
> +			max_cost = target_stat.cost;
>  		}
> -	}
> -	rcu_read_unlock();
>  
> -	if ((best_fits > prev_fits) ||
> -	    ((best_fits > 0) && (best_delta < prev_delta)) ||
> -	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
> -		target = best_energy_cpu;
> +		/* Add the NRG cost of p */
> +		delta_nrg = task_util * min_stat.cost;
>  
> -	return target;
> +		/* Compute the NRG cost of others running at higher OPP because of p */
> +		if (min_stat.cost > max_cost)
> +			delta_nrg += pd_actual_util * (min_stat.cost - max_cost);
> +
> +		/* nrg with p */
> +		trace_sched_compute_energy_tp(p, min_stat.cpu, delta_nrg,
> +				min_stat.max_perf, pd_actual_util + task_util);
> +
> +		/*
> +		 * The probability that delta NRGs are equals is almost null. PDs being sorted
> +		 * by max capacity, keep the one with highest max capacity if this
> +		 * happens.
> +		 * TODO: add a margin in nrg cost and take into account other stats
> +		 */
> +		if ((min_stat.fits == best_fits) &&
> +		    (delta_nrg >= best_nrg))
> +			continue;
> +
> +		best_fits = min_stat.fits;
> +		best_nrg = delta_nrg;

General comment. Maybe using nrg to mean energy is a known abbreviation in the area. If so,
feel free to ignore this. But it took me a bit to make the connection between the two, whereas
best_energy/delta_energy would've been a bit more clear.

> +		best_cpu = min_stat.cpu;
> +	}
>  
>  unlock:
>  	rcu_read_unlock();
>  
> +	if (best_cpu >= 0)
> +		target = best_cpu;
> +
>  	return target;
>  }
>  


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 4/7 v2] energy model: Remove unused em_cpu_energy()
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (2 preceding siblings ...)
  2024-12-17 16:07 ` [PATCH 3/7 v2] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
@ 2024-12-17 16:07 ` Vincent Guittot
  2024-12-18 14:59   ` Christian Loehle
  2024-12-17 16:07 ` [PATCH 5/7 v2] sched/fair: Add push task callback for EAS Vincent Guittot
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

Remove the unused function em_cpu_energy()

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/energy_model.h | 99 ------------------------------------
 1 file changed, 99 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 26d0ff72feac..c766642dc541 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -235,99 +235,6 @@ em_pd_get_previous_state(struct em_perf_state *table,
 	return -1;
 }
 
-/**
- * em_cpu_energy() - Estimates the energy consumed by the CPUs of a
- *		performance domain
- * @pd		: performance domain for which energy has to be estimated
- * @max_util	: highest utilization among CPUs of the domain
- * @sum_util	: sum of the utilization of all CPUs in the domain
- * @allowed_cpu_cap	: maximum allowed CPU capacity for the @pd, which
- *			  might reflect reduced frequency (due to thermal)
- *
- * This function must be used only for CPU devices. There is no validation,
- * i.e. if the EM is a CPU type and has cpumask allocated. It is called from
- * the scheduler code quite frequently and that is why there is not checks.
- *
- * Return: the sum of the energy consumed by the CPUs of the domain assuming
- * a capacity state satisfying the max utilization of the domain.
- */
-static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-				unsigned long max_util, unsigned long sum_util,
-				unsigned long allowed_cpu_cap)
-{
-	struct em_perf_table *em_table;
-	struct em_perf_state *ps;
-	int i;
-
-#ifdef CONFIG_SCHED_DEBUG
-	WARN_ONCE(!rcu_read_lock_held(), "EM: rcu read lock needed\n");
-#endif
-
-	if (!sum_util)
-		return 0;
-
-	/*
-	 * In order to predict the performance state, map the utilization of
-	 * the most utilized CPU of the performance domain to a requested
-	 * performance, like schedutil. Take also into account that the real
-	 * performance might be set lower (due to thermal capping). Thus, clamp
-	 * max utilization to the allowed CPU capacity before calculating
-	 * effective performance.
-	 */
-	max_util = min(max_util, allowed_cpu_cap);
-
-	/*
-	 * Find the lowest performance state of the Energy Model above the
-	 * requested performance.
-	 */
-	em_table = rcu_dereference(pd->em_table);
-	i = em_pd_get_efficient_state(em_table->state, pd, max_util);
-	ps = &em_table->state[i];
-
-	/*
-	 * The performance (capacity) of a CPU in the domain at the performance
-	 * state (ps) can be computed as:
-	 *
-	 *                     ps->freq * scale_cpu
-	 *   ps->performance = --------------------                  (1)
-	 *                         cpu_max_freq
-	 *
-	 * So, ignoring the costs of idle states (which are not available in
-	 * the EM), the energy consumed by this CPU at that performance state
-	 * is estimated as:
-	 *
-	 *             ps->power * cpu_util
-	 *   cpu_nrg = --------------------                          (2)
-	 *               ps->performance
-	 *
-	 * since 'cpu_util / ps->performance' represents its percentage of busy
-	 * time.
-	 *
-	 *   NOTE: Although the result of this computation actually is in
-	 *         units of power, it can be manipulated as an energy value
-	 *         over a scheduling period, since it is assumed to be
-	 *         constant during that interval.
-	 *
-	 * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
-	 * of two terms:
-	 *
-	 *             ps->power * cpu_max_freq
-	 *   cpu_nrg = ------------------------ * cpu_util           (3)
-	 *               ps->freq * scale_cpu
-	 *
-	 * The first term is static, and is stored in the em_perf_state struct
-	 * as 'ps->cost'.
-	 *
-	 * Since all CPUs of the domain have the same micro-architecture, they
-	 * share the same 'ps->cost', and the same CPU capacity. Hence, the
-	 * total energy of the domain (which is the simple sum of the energy of
-	 * all of its CPUs) can be factorized as:
-	 *
-	 *   pd_nrg = ps->cost * \Sum cpu_util                       (4)
-	 */
-	return ps->cost * sum_util;
-}
-
 /**
  * em_pd_nr_perf_states() - Get the number of performance states of a perf.
  *				domain
@@ -394,12 +301,6 @@ em_pd_get_previous_state(struct em_perf_state *table, int nr_perf_states,
 {
 	return -1;
 }
-static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-			unsigned long max_util, unsigned long sum_util,
-			unsigned long allowed_cpu_cap)
-{
-	return 0;
-}
 static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
 {
 	return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/7 v2] energy model: Remove unused em_cpu_energy()
  2024-12-17 16:07 ` [PATCH 4/7 v2] energy model: Remove unused em_cpu_energy() Vincent Guittot
@ 2024-12-18 14:59   ` Christian Loehle
  0 siblings, 0 replies; 19+ messages in thread
From: Christian Loehle @ 2024-12-18 14:59 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, qperret

On 12/17/24 16:07, Vincent Guittot wrote:
> Remove the unused function em_cpu_energy()
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

NIT: Title should be s/energy model:/PM: EM:/ I believe
for 2/7 as well.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 5/7 v2] sched/fair: Add push task callback for EAS
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (3 preceding siblings ...)
  2024-12-17 16:07 ` [PATCH 4/7 v2] energy model: Remove unused em_cpu_energy() Vincent Guittot
@ 2024-12-17 16:07 ` Vincent Guittot
  2025-01-16 17:34   ` Pierre Gondois
  2024-12-17 16:07 ` [PATCH 6/7 v2] sched/fair: Add misfit case to " Vincent Guittot
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

EAS is based on wakeup events to efficiently place tasks on the system, but
there are cases where a task will not have wakeup events anymore or at a
far too low pace. For such situation, we can take advantage of the task
being put back in the enqueued list to check if it should be migrated on
another CPU.

Wake up events remain the main way to migrate tasks but we now detect
situation where a task is stuck on a CPU by checking that its utilization
is larger than the max available compute capacity (max cpu capacity or
uclamp max setting)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 206 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   2 +
 2 files changed, 208 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd046e8216a9..2affc063da55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7088,6 +7088,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
+static void dequeue_pushable_task(struct rq *rq, struct task_struct *p);
 static void set_next_buddy(struct sched_entity *se);
 
 /*
@@ -7118,6 +7119,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		h_nr_idle = task_has_idle_policy(p);
 		if (task_sleep || task_delayed || !se->sched_delayed)
 			h_nr_runnable = 1;
+
+		if (task_sleep || task_on_rq_migrating(p))
+			dequeue_pushable_task(rq, p);
 	} else {
 		cfs_rq = group_cfs_rq(se);
 		slice = cfs_rq_min_slice(cfs_rq);
@@ -8617,6 +8621,182 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
+{
+	unsigned long max_capa = get_actual_cpu_capacity(cpu);
+	unsigned long util = task_util_est(p);
+
+	max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
+	util = max(util, task_runnable(p));
+
+	/*
+	 * Return true only if the task might not sleep/wakeup because of a low
+	 * compute capacity. Tasks, which wake up regularly, will be handled by
+	 * feec().
+	 */
+	return (util > max_capa);
+}
+
+static int active_load_balance_cpu_stop(void *data);
+
+static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq)
+{
+	int new_cpu, cpu = cpu_of(rq);
+
+	if (!sched_energy_enabled() || is_rd_overutilized(rq->rd))
+		return;
+
+	if (WARN_ON(!p))
+		return;
+
+	if (WARN_ON(p != rq->curr))
+		return;
+
+	if (is_migration_disabled(p))
+		return;
+
+	if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
+		return;
+
+	if (!task_misfit_cpu(p, cpu))
+		return;
+
+	new_cpu = find_energy_efficient_cpu(p, cpu);
+
+	if (new_cpu == cpu)
+		return;
+
+	/*
+	 * ->active_balance synchronizes accesses to
+	 * ->active_balance_work.  Once set, it's cleared
+	 * only after active load balance is finished.
+	 */
+	if (!rq->active_balance) {
+		rq->active_balance = 1;
+		rq->push_cpu = new_cpu;
+	} else
+		return;
+
+	raw_spin_rq_unlock(rq);
+	stop_one_cpu_nowait(cpu,
+		active_load_balance_cpu_stop, rq,
+		&rq->active_balance_work);
+	raw_spin_rq_lock(rq);
+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+	return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_tasks(rq))
+		return NULL;
+
+	p = plist_first_entry(&rq->cfs.pushable_tasks,
+			      struct task_struct, pushable_tasks);
+
+	WARN_ON_ONCE(rq->cpu != task_cpu(p));
+	WARN_ON_ONCE(task_current(rq, p));
+	WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+	WARN_ON_ONCE(!task_on_rq_queued(p));
+
+	/*
+	 * Remove task from the pushable list as we try only once after that
+	 * the task has been put back in enqueued list.
+	 */
+	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+	return p;
+}
+
+/*
+ * See if the non running fair tasks on this rq can be sent on other CPUs
+ * that fits better with their profile.
+ */
+static bool push_fair_task(struct rq *rq)
+{
+	struct task_struct *next_task;
+	int prev_cpu, new_cpu;
+	struct rq *new_rq;
+
+	next_task = pick_next_pushable_fair_task(rq);
+	if (!next_task)
+		return false;
+
+	if (is_migration_disabled(next_task))
+		return true;
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	prev_cpu = rq->cpu;
+
+	new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
+
+	if (new_cpu == prev_cpu)
+		goto out;
+
+	new_rq = cpu_rq(new_cpu);
+
+	if (double_lock_balance(rq, new_rq)) {
+		/* The task has already migrated in between */
+		if (task_cpu(next_task) != rq->cpu) {
+			double_unlock_balance(rq, new_rq);
+			goto out;
+		}
+
+		deactivate_task(rq, next_task, 0);
+		set_task_cpu(next_task, new_cpu);
+		activate_task(new_rq, next_task, 0);
+
+		resched_curr(new_rq);
+
+		double_unlock_balance(rq, new_rq);
+	}
+
+out:
+	put_task_struct(next_task);
+
+	return true;
+}
+
+static void push_fair_tasks(struct rq *rq)
+{
+	/* push_fair_task() will return true if it moved a fair task */
+	while (push_fair_task(rq))
+		;
+}
+
+static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
+
+static inline void fair_queue_push_tasks(struct rq *rq)
+{
+	if (!sched_energy_enabled() || !has_pushable_tasks(rq))
+		return;
+
+	queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
+}
+static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
+{
+	if (sched_energy_enabled())
+		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+}
+
+static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
+{
+	if (sched_energy_enabled() && task_on_rq_queued(p) && !p->se.sched_delayed) {
+		if (!is_rd_overutilized(rq->rd) && task_misfit_cpu(p, rq->cpu)) {
+			plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+			plist_node_init(&p->pushable_tasks, p->prio);
+			plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+		}
+	}
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8786,6 +8966,10 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return sched_balance_newidle(rq, rf) != 0;
 }
 #else
+static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq) {}
+static inline void fair_queue_push_tasks(struct rq *rq) {}
+static void dequeue_pushable_task(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+static inline void enqueue_pushable_task(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
 #endif /* CONFIG_SMP */
 
@@ -8968,6 +9152,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		put_prev_entity(cfs_rq, pse);
 		set_next_entity(cfs_rq, se);
 
+		/*
+		 * The previous task might be eligible for being pushed on
+		 * another cpu if it is still runnable.
+		 */
+		enqueue_pushable_task(rq, prev);
+
 		__set_next_task_fair(rq, p, true);
 	}
 
@@ -9040,6 +9230,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * The previous task might be eligible for pushing it on
+	 * another cpu if it is still active.
+	 */
+	enqueue_pushable_task(rq, prev);
+
 }
 
 /*
@@ -13102,6 +13299,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	migrate_misfit_task(curr, rq);
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
 
@@ -13254,6 +13452,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 {
 	struct sched_entity *se = &p->se;
 
+	dequeue_pushable_task(rq, p);
+
 #ifdef CONFIG_SMP
 	if (task_on_rq_queued(p)) {
 		/*
@@ -13271,6 +13471,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 	if (hrtick_enabled_fair(rq))
 		hrtick_start_fair(rq, p);
 
+	/*
+	 * Try to push prev task before checking misfit for next task as
+	 * the migration of prev can make next fitting the CPU
+	 */
+	fair_queue_push_tasks(rq);
 	update_misfit_status(p, rq);
 	sched_fair_update_stop_tick(rq, p);
 }
@@ -13301,6 +13506,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
+	plist_head_init(&cfs_rq->pushable_tasks);
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aef716c41edb..c9875cd4c986 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -717,6 +717,8 @@ struct cfs_rq {
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
+	struct plist_head	pushable_tasks;
+
 	/* Locally cached copy of our task_group's idle value */
 	int			idle;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7 v2] sched/fair: Add push task callback for EAS
  2024-12-17 16:07 ` [PATCH 5/7 v2] sched/fair: Add push task callback for EAS Vincent Guittot
@ 2025-01-16 17:34   ` Pierre Gondois
  2025-01-20 15:50     ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Pierre Gondois @ 2025-01-16 17:34 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, christian.loehle, qperret

Hello Vincent,

On 12/17/24 17:07, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task will not have wakeup events anymore or at a
> far too low pace. For such situation, we can take advantage of the task
> being put back in the enqueued list to check if it should be migrated on
> another CPU.
> 
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)

It seems there are 2 distinct cases:
a- The task is alone on a rq
b- The task shares the rq and is enqueued/dequeued

a. doesn't seem to need any of the push functions, and b. doesn't seem to
need any of the misfit functions. Maybe it's worth splitting the patch in 2.

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c  | 206 +++++++++++++++++++++++++++++++++++++++++++
>   kernel/sched/sched.h |   2 +
>   2 files changed, 208 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cd046e8216a9..2affc063da55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7088,6 +7088,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	hrtick_update(rq);
>   }
>   
> +static void dequeue_pushable_task(struct rq *rq, struct task_struct *p);
>   static void set_next_buddy(struct sched_entity *se);
>   
>   /*
> @@ -7118,6 +7119,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>   		h_nr_idle = task_has_idle_policy(p);
>   		if (task_sleep || task_delayed || !se->sched_delayed)
>   			h_nr_runnable = 1;
> +
> +		if (task_sleep || task_on_rq_migrating(p))
> +			dequeue_pushable_task(rq, p);
>   	} else {
>   		cfs_rq = group_cfs_rq(se);
>   		slice = cfs_rq_min_slice(cfs_rq);
> @@ -8617,6 +8621,182 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   	return target;
>   }
>   
> +static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
> +{
> +	unsigned long max_capa = get_actual_cpu_capacity(cpu);
> +	unsigned long util = task_util_est(p);
> +
> +	max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
> +	util = max(util, task_runnable(p));
> +
> +	/*
> +	 * Return true only if the task might not sleep/wakeup because of a low
> +	 * compute capacity. Tasks, which wake up regularly, will be handled by
> +	 * feec().
> +	 */

NIT:
On a little CPU with min_OPP=256 and max_OPP=512,
a task with a util=100 and U_Max=10 will trigger this condition.
However:
- the task is already well placed from a power PoV
- the tasks has opportunities to sleep/wake-up
Shouldn't we ideally take:

unsigned long max_capa;
max_capa = max(min_capa(cpu), uclamp_eff_value(p, UCLAMP_MAX));
max_capa = min(get_actual_cpu_capacity(cpu), max_capa);

with min_capa(cpu) returning 256 in this case, i.e. the CPU capacity at the
lowest OPP ?

> +	return (util > max_capa);
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7 v2] sched/fair: Add push task callback for EAS
  2025-01-16 17:34   ` Pierre Gondois
@ 2025-01-20 15:50     ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2025-01-20 15:50 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, christian.loehle, qperret

On Thu, 16 Jan 2025 at 18:34, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 12/17/24 17:07, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task will not have wakeup events anymore or at a
> > far too low pace. For such situation, we can take advantage of the task
> > being put back in the enqueued list to check if it should be migrated on
> > another CPU.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
>
> It seems there are 2 distinct cases:
> a- The task is alone on a rq
> b- The task shares the rq and is enqueued/dequeued

Do you mean pick/set and put  instead of enqueued/dequeued ? which are
the events used for push callback

The enqueue/dequeue_pushable_task names are maybe a bit misleading
because they mean tasks are enqueued/dequeued from the pushable list
but not enqueued/dequeued from the rq. I should probably rename them
add/remove_pushable_task to avoid confusion

>
> a. doesn't seem to need any of the push functions, and b. doesn't seem to
> need any of the misfit functions. Maybe it's worth splitting the patch in 2.

In both case we check if there is a reason for the task not being
enqueued on the right cpu but I can split anyway this patch in 2 if
it's make it easier to review

>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c  | 206 +++++++++++++++++++++++++++++++++++++++++++
> >   kernel/sched/sched.h |   2 +
> >   2 files changed, 208 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cd046e8216a9..2affc063da55 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7088,6 +7088,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >       hrtick_update(rq);
> >   }
> >
> > +static void dequeue_pushable_task(struct rq *rq, struct task_struct *p);
> >   static void set_next_buddy(struct sched_entity *se);
> >
> >   /*
> > @@ -7118,6 +7119,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >               h_nr_idle = task_has_idle_policy(p);
> >               if (task_sleep || task_delayed || !se->sched_delayed)
> >                       h_nr_runnable = 1;
> > +
> > +             if (task_sleep || task_on_rq_migrating(p))
> > +                     dequeue_pushable_task(rq, p);
> >       } else {
> >               cfs_rq = group_cfs_rq(se);
> >               slice = cfs_rq_min_slice(cfs_rq);
> > @@ -8617,6 +8621,182 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >       return target;
> >   }
> >
> > +static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
> > +{
> > +     unsigned long max_capa = get_actual_cpu_capacity(cpu);
> > +     unsigned long util = task_util_est(p);
> > +
> > +     max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
> > +     util = max(util, task_runnable(p));
> > +
> > +     /*
> > +      * Return true only if the task might not sleep/wakeup because of a low
> > +      * compute capacity. Tasks, which wake up regularly, will be handled by
> > +      * feec().
> > +      */
>
> NIT:
> On a little CPU with min_OPP=256 and max_OPP=512,
> a task with a util=100 and U_Max=10 will trigger this condition.
> However:
> - the task is already well placed from a power PoV
> - the tasks has opportunities to sleep/wake-up

I agree. I took a wide condition to start with and plan to narrow it
step by step

> Shouldn't we ideally take:
>
> unsigned long max_capa;
> max_capa = max(min_capa(cpu), uclamp_eff_value(p, UCLAMP_MAX));

fair enough. will add it for next version



> max_capa = min(get_actual_cpu_capacity(cpu), max_capa);
>
> with min_capa(cpu) returning 256 in this case, i.e. the CPU capacity at the
> lowest OPP ?
>
> > +     return (util > max_capa);
> > +}
> > +
>
> [...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 6/7 v2] sched/fair: Add misfit case to push task callback for EAS
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (4 preceding siblings ...)
  2024-12-17 16:07 ` [PATCH 5/7 v2] sched/fair: Add push task callback for EAS Vincent Guittot
@ 2024-12-17 16:07 ` Vincent Guittot
  2025-01-16 17:35   ` Pierre Gondois
  2024-12-17 16:07 ` [PATCH 7/7 v2] sched/fair: Update overutilized detection Vincent Guittot
  2024-12-18 14:06 ` [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Christian Loehle
  7 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

Some task misfit cases can be handled directly by the push callback
instead of triggering an idle load balance to pull the task on a better
CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

# Conflicts:
#	kernel/sched/fair.c
---
 kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2affc063da55..9bddb094ee21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8541,6 +8541,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
 			target_stat.capa = capacity_of(cpu);
 			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
+			if ((p->on_rq) && (!p->se.sched_delayed) && (cpu == prev_cpu))
+				target_stat.nr_running--;
 
 			/* If the target needs a lower OPP, then look up for
 			 * the corresponding OPP and its associated cost.
@@ -8623,48 +8625,58 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
 {
-	unsigned long max_capa = get_actual_cpu_capacity(cpu);
-	unsigned long util = task_util_est(p);
+	unsigned long max_capa, util;
+
+	if (p->nr_cpus_allowed == 1)
+		return false;
 
-	max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
-	util = max(util, task_runnable(p));
+	max_capa = min(get_actual_cpu_capacity(cpu),
+		       uclamp_eff_value(p, UCLAMP_MAX));
+	util = max(task_util_est(p), task_runnable(p));
 
 	/*
 	 * Return true only if the task might not sleep/wakeup because of a low
 	 * compute capacity. Tasks, which wake up regularly, will be handled by
 	 * feec().
 	 */
-	return (util > max_capa);
+	if (util > max_capa)
+		return true;
+
+	/* Return true if the task doesn't fit anymore to run on the cpu */
+	if ((arch_scale_cpu_capacity(cpu) < p->max_allowed_capacity) && !task_fits_cpu(p, cpu))
+		return true;
+
+	return false;
 }
 
 static int active_load_balance_cpu_stop(void *data);
 
-static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq)
+static inline bool migrate_misfit_task(struct task_struct *p, struct rq *rq)
 {
 	int new_cpu, cpu = cpu_of(rq);
 
 	if (!sched_energy_enabled() || is_rd_overutilized(rq->rd))
-		return;
+		return false;
 
 	if (WARN_ON(!p))
-		return;
+		return false;
 
-	if (WARN_ON(p != rq->curr))
-		return;
+	if (WARN_ON(!task_current(rq, p)))
+		return false;
 
 	if (is_migration_disabled(p))
-		return;
+		return false;
 
-	if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
-		return;
+	if (rq->nr_running > 1)
+		return false;
 
 	if (!task_misfit_cpu(p, cpu))
-		return;
+		return false;
 
 	new_cpu = find_energy_efficient_cpu(p, cpu);
 
 	if (new_cpu == cpu)
-		return;
+		return false;
 
 	/*
 	 * ->active_balance synchronizes accesses to
@@ -8675,13 +8687,15 @@ static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq)
 		rq->active_balance = 1;
 		rq->push_cpu = new_cpu;
 	} else
-		return;
+		return false;
 
 	raw_spin_rq_unlock(rq);
 	stop_one_cpu_nowait(cpu,
 		active_load_balance_cpu_stop, rq,
 		&rq->active_balance_work);
 	raw_spin_rq_lock(rq);
+
+	return true;
 }
 
 static inline int has_pushable_tasks(struct rq *rq)
@@ -13299,9 +13313,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
-	migrate_misfit_task(curr, rq);
-	update_misfit_status(curr, rq);
-	check_update_overutilized_status(task_rq(curr));
+	if (!migrate_misfit_task(curr, rq)) {
+		update_misfit_status(curr, rq);
+		check_update_overutilized_status(task_rq(curr));
+	}
 
 	task_tick_core(rq, curr);
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/7 v2] sched/fair: Add misfit case to push task callback for EAS
  2024-12-17 16:07 ` [PATCH 6/7 v2] sched/fair: Add misfit case to " Vincent Guittot
@ 2025-01-16 17:35   ` Pierre Gondois
  2025-01-20 15:50     ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Pierre Gondois @ 2025-01-16 17:35 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, christian.loehle, qperret



On 12/17/24 17:07, Vincent Guittot wrote:
> Some task misfit cases can be handled directly by the push callback
> instead of triggering an idle load balance to pull the task on a better
> CPU.

Aren't misfit tasks migrated using active_load_balance_cpu_stop() rather than
the push mechanism ?

Also, I don't see cases where a misfit task would not be migrated by either
the push mechanism or the misfit handling present in this patch. Is it possible
to detail a case where the misfit load balancer would still be needed ?

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> # Conflicts:
> #	kernel/sched/fair.c
> ---
>   kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++----------------
>   1 file changed, 34 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2affc063da55..9bddb094ee21 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8541,6 +8541,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
>   			target_stat.capa = capacity_of(cpu);
>   			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
> +			if ((p->on_rq) && (!p->se.sched_delayed) && (cpu == prev_cpu))
> +				target_stat.nr_running--;
>   
>   			/* If the target needs a lower OPP, then look up for
>   			 * the corresponding OPP and its associated cost.
> @@ -8623,48 +8625,58 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   
>   static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
>   {
> -	unsigned long max_capa = get_actual_cpu_capacity(cpu);
> -	unsigned long util = task_util_est(p);
> +	unsigned long max_capa, util;
> +
> +	if (p->nr_cpus_allowed == 1)
> +		return false;
>   
> -	max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
> -	util = max(util, task_runnable(p));
> +	max_capa = min(get_actual_cpu_capacity(cpu),
> +		       uclamp_eff_value(p, UCLAMP_MAX));
> +	util = max(task_util_est(p), task_runnable(p));
>   
>   	/*
>   	 * Return true only if the task might not sleep/wakeup because of a low
>   	 * compute capacity. Tasks, which wake up regularly, will be handled by
>   	 * feec().
>   	 */
> -	return (util > max_capa);
> +	if (util > max_capa)
> +		return true;
> +
> +	/* Return true if the task doesn't fit anymore to run on the cpu */
> +	if ((arch_scale_cpu_capacity(cpu) < p->max_allowed_capacity) && !task_fits_cpu(p, cpu))
> +		return true;

This logic seems to already be present in update_misfit_status(). Maybe it would be
good to factorize it to have a common criteria for misfit tasks.

> +
> +	return false;
>   }
>   
>   static int active_load_balance_cpu_stop(void *data);
>   
> -static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq)
> +static inline bool migrate_misfit_task(struct task_struct *p, struct rq *rq)
>   {
>   	int new_cpu, cpu = cpu_of(rq);
>   
>   	if (!sched_energy_enabled() || is_rd_overutilized(rq->rd))
> -		return;
> +		return false;
>   
>   	if (WARN_ON(!p))
> -		return;
> +		return false;
>   
> -	if (WARN_ON(p != rq->curr))
> -		return;
> +	if (WARN_ON(!task_current(rq, p)))
> +		return false;
>   
>   	if (is_migration_disabled(p))
> -		return;
> +		return false;
>   
> -	if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
> -		return;
> +	if (rq->nr_running > 1)
> +		return false;

NIT: Maybe the condition (p->nr_cpus_allowed == 1) could have already been
part of task_misfit_cpu() in the previous patch.

>   
>   	if (!task_misfit_cpu(p, cpu))
> -		return;
> +		return false;
>   
>   	new_cpu = find_energy_efficient_cpu(p, cpu);
>   
>   	if (new_cpu == cpu)
> -		return;
> +		return false;
>   
>   	/*
>   	 * ->active_balance synchronizes accesses to
> @@ -8675,13 +8687,15 @@ static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq)
>   		rq->active_balance = 1;
>   		rq->push_cpu = new_cpu;
>   	} else
> -		return;
> +		return false;
>   
>   	raw_spin_rq_unlock(rq);
>   	stop_one_cpu_nowait(cpu,
>   		active_load_balance_cpu_stop, rq,
>   		&rq->active_balance_work);
>   	raw_spin_rq_lock(rq);
> +
> +	return true;
>   }
>   
>   static inline int has_pushable_tasks(struct rq *rq)
> @@ -13299,9 +13313,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>   	if (static_branch_unlikely(&sched_numa_balancing))
>   		task_tick_numa(rq, curr);
>   
> -	migrate_misfit_task(curr, rq);
> -	update_misfit_status(curr, rq);
> -	check_update_overutilized_status(task_rq(curr));
> +	if (!migrate_misfit_task(curr, rq)) {
> +		update_misfit_status(curr, rq);

If the system is not-OU, the only case I see where migrate_misfit_task() would
not detect a misfit task and update_misfit_status() would is if there is another
task on the rq. I.e. through:
migrate_misfit_task()
\-if (rq->nr_running > 1) return false;

However in this case, the push callback should migrate the misfit task. So is it still
necessary to look for misfit task through sched_balance_find_src_group() ?

> +		check_update_overutilized_status(task_rq(curr));
> +	}
>   
>   	task_tick_core(rq, curr);
>   }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/7 v2] sched/fair: Add misfit case to push task callback for EAS
  2025-01-16 17:35   ` Pierre Gondois
@ 2025-01-20 15:50     ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2025-01-20 15:50 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, christian.loehle, qperret

On Thu, 16 Jan 2025 at 18:35, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
>
>
> On 12/17/24 17:07, Vincent Guittot wrote:
> > Some task misfit cases can be handled directly by the push callback
> > instead of triggering an idle load balance to pull the task on a better
> > CPU.
>
> Aren't misfit tasks migrated using active_load_balance_cpu_stop() rather than
> the push mechanism ?

Both. the push mechanism will check when the task is put back and
another task becomes the current task. active_load_balance_cpu_stop()
is used when the task is alone and can't be put when letting the CPU
to another task

>
> Also, I don't see cases where a misfit task would not be migrated by either
> the push mechanism or the misfit handling present in this patch. Is it possible
> to detail a case where the misfit load balancer would still be needed ?

When the system is overutilized

>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >
> > # Conflicts:
> > #     kernel/sched/fair.c
> > ---
> >   kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++----------------
> >   1 file changed, 34 insertions(+), 19 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2affc063da55..9bddb094ee21 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8541,6 +8541,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                       target_stat.runnable = cpu_runnable(cpu_rq(cpu));
> >                       target_stat.capa = capacity_of(cpu);
> >                       target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
> > +                     if ((p->on_rq) && (!p->se.sched_delayed) && (cpu == prev_cpu))
> > +                             target_stat.nr_running--;
> >
> >                       /* If the target needs a lower OPP, then look up for
> >                        * the corresponding OPP and its associated cost.
> > @@ -8623,48 +8625,58 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >
> >   static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
> >   {
> > -     unsigned long max_capa = get_actual_cpu_capacity(cpu);
> > -     unsigned long util = task_util_est(p);
> > +     unsigned long max_capa, util;
> > +
> > +     if (p->nr_cpus_allowed == 1)
> > +             return false;
> >
> > -     max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
> > -     util = max(util, task_runnable(p));
> > +     max_capa = min(get_actual_cpu_capacity(cpu),
> > +                    uclamp_eff_value(p, UCLAMP_MAX));
> > +     util = max(task_util_est(p), task_runnable(p));
> >
> >       /*
> >        * Return true only if the task might not sleep/wakeup because of a low
> >        * compute capacity. Tasks, which wake up regularly, will be handled by
> >        * feec().
> >        */
> > -     return (util > max_capa);
> > +     if (util > max_capa)
> > +             return true;
> > +
> > +     /* Return true if the task doesn't fit anymore to run on the cpu */
> > +     if ((arch_scale_cpu_capacity(cpu) < p->max_allowed_capacity) && !task_fits_cpu(p, cpu))
> > +             return true;
>
> This logic seems to already be present in update_misfit_status(). Maybe it would be
> good to factorize it to have a common criteria for misfit tasks.

i will think about it but the condition was so short that I didn't see
any real benefit of adding a helper function for that

>
> > +
> > +     return false;
> >   }
> >
> >   static int active_load_balance_cpu_stop(void *data);
> >
> > -static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq)
> > +static inline bool migrate_misfit_task(struct task_struct *p, struct rq *rq)
> >   {
> >       int new_cpu, cpu = cpu_of(rq);
> >
> >       if (!sched_energy_enabled() || is_rd_overutilized(rq->rd))
> > -             return;
> > +             return false;
> >
> >       if (WARN_ON(!p))
> > -             return;
> > +             return false;
> >
> > -     if (WARN_ON(p != rq->curr))
> > -             return;
> > +     if (WARN_ON(!task_current(rq, p)))
> > +             return false;
> >
> >       if (is_migration_disabled(p))
> > -             return;
> > +             return false;
> >
> > -     if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
> > -             return;
> > +     if (rq->nr_running > 1)
> > +             return false;
>
> NIT: Maybe the condition (p->nr_cpus_allowed == 1) could have already been
> part of task_misfit_cpu() in the previous patch.

I vaguely  remember that there was a reason why I didn't put the
condition in the previous patch. I need to check in my log

>
> >
> >       if (!task_misfit_cpu(p, cpu))
> > -             return;
> > +             return false;
> >
> >       new_cpu = find_energy_efficient_cpu(p, cpu);
> >
> >       if (new_cpu == cpu)
> > -             return;
> > +             return false;
> >
> >       /*
> >        * ->active_balance synchronizes accesses to
> > @@ -8675,13 +8687,15 @@ static inline void migrate_misfit_task(struct task_struct *p, struct rq *rq)
> >               rq->active_balance = 1;
> >               rq->push_cpu = new_cpu;
> >       } else
> > -             return;
> > +             return false;
> >
> >       raw_spin_rq_unlock(rq);
> >       stop_one_cpu_nowait(cpu,
> >               active_load_balance_cpu_stop, rq,
> >               &rq->active_balance_work);
> >       raw_spin_rq_lock(rq);
> > +
> > +     return true;
> >   }
> >
> >   static inline int has_pushable_tasks(struct rq *rq)
> > @@ -13299,9 +13313,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >       if (static_branch_unlikely(&sched_numa_balancing))
> >               task_tick_numa(rq, curr);
> >
> > -     migrate_misfit_task(curr, rq);
> > -     update_misfit_status(curr, rq);
> > -     check_update_overutilized_status(task_rq(curr));
> > +     if (!migrate_misfit_task(curr, rq)) {
> > +             update_misfit_status(curr, rq);
>
> If the system is not-OU, the only case I see where migrate_misfit_task() would
> not detect a misfit task and update_misfit_status() would is if there is another
> task on the rq. I.e. through:
> migrate_misfit_task()
> \-if (rq->nr_running > 1) return false;
>
> However in this case, the push callback should migrate the misfit task. So is it still
> necessary to look for misfit task through sched_balance_find_src_group() ?

isn't the case overutilized enough to keep it ?
Also, the push callback only happens when the task is put in favor of
another one which can be several ms later depending on nice and slice
values. Or the task is preempted by a higher priority class



>
> > +             check_update_overutilized_status(task_rq(curr));
> > +     }
> >
> >       task_tick_core(rq, curr);
> >   }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 7/7 v2] sched/fair: Update overutilized detection
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (5 preceding siblings ...)
  2024-12-17 16:07 ` [PATCH 6/7 v2] sched/fair: Add misfit case to " Vincent Guittot
@ 2024-12-17 16:07 ` Vincent Guittot
  2025-01-17 10:27   ` Pierre Gondois
  2024-12-18 14:06 ` [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Christian Loehle
  7 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2024-12-17 16:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, christian.loehle, qperret,
	Vincent Guittot

Checking uclamp_min is useless and counterproductive for overutilized state
as misfit can now happen without being in overutilized state

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9bddb094ee21..9eb4c4946ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6870,16 +6870,15 @@ static inline void hrtick_update(struct rq *rq)
 #ifdef CONFIG_SMP
 static inline bool cpu_overutilized(int cpu)
 {
-	unsigned long  rq_util_min, rq_util_max;
+	unsigned long rq_util_max;
 
 	if (!sched_energy_enabled())
 		return false;
 
-	rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
 	rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
 
 	/* Return true only if the utilization doesn't fit CPU's capacity */
-	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+	return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7 v2] sched/fair: Update overutilized detection
  2024-12-17 16:07 ` [PATCH 7/7 v2] sched/fair: Update overutilized detection Vincent Guittot
@ 2025-01-17 10:27   ` Pierre Gondois
  2025-01-20 15:50     ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Pierre Gondois @ 2025-01-17 10:27 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, christian.loehle, qperret

Hello Vincent,

On 12/17/24 17:07, Vincent Guittot wrote:
> Checking uclamp_min is useless and counterproductive for overutilized state
> as misfit can now happen without being in overutilized state

Before this patch a task was misfit if:
   !task_fits_cpu(p, cpu)
This is, if either:
- task_util > 80% * CPU_capacity
- task_UCLAMP_MIN > get_actual_cpu_capacity(cpu)

A CPU was OU if:
   !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu)
This is:
- CPU_util > 80% * CPU_capacity

This should be the same after this patch. Just to be sure I understand correctly,
this patch has no functional change and is independent from this serie right ?

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 5 ++---
>   1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9bddb094ee21..9eb4c4946ddc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6870,16 +6870,15 @@ static inline void hrtick_update(struct rq *rq)
>   #ifdef CONFIG_SMP
>   static inline bool cpu_overutilized(int cpu)
>   {
> -	unsigned long  rq_util_min, rq_util_max;
> +	unsigned long rq_util_max;
>   
>   	if (!sched_energy_enabled())
>   		return false;
>   
> -	rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
>   	rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
>   
>   	/* Return true only if the utilization doesn't fit CPU's capacity */
> -	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> +	return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu);
>   }
>   
>   /*

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7 v2] sched/fair: Update overutilized detection
  2025-01-17 10:27   ` Pierre Gondois
@ 2025-01-20 15:50     ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2025-01-20 15:50 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, christian.loehle, qperret

On Fri, 17 Jan 2025 at 11:27, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 12/17/24 17:07, Vincent Guittot wrote:
> > Checking uclamp_min is useless and counterproductive for overutilized state
> > as misfit can now happen without being in overutilized state
>
> Before this patch a task was misfit if:
>    !task_fits_cpu(p, cpu)
> This is, if either:
> - task_util > 80% * CPU_capacity
> - task_UCLAMP_MIN > get_actual_cpu_capacity(cpu)
>
> A CPU was OU if:
>    !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu)
> This is:
> - CPU_util > 80% * CPU_capacity
>
> This should be the same after this patch. Just to be sure I understand correctly,
> this patch has no functional change and is independent from this serie right ?

yeah, I find it while running some functional tests on the patchset



>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c | 5 ++---
> >   1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9bddb094ee21..9eb4c4946ddc 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6870,16 +6870,15 @@ static inline void hrtick_update(struct rq *rq)
> >   #ifdef CONFIG_SMP
> >   static inline bool cpu_overutilized(int cpu)
> >   {
> > -     unsigned long  rq_util_min, rq_util_max;
> > +     unsigned long rq_util_max;
> >
> >       if (!sched_energy_enabled())
> >               return false;
> >
> > -     rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> >       rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> >
> >       /* Return true only if the utilization doesn't fit CPU's capacity */
> > -     return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> > +     return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu);
> >   }
> >
> >   /*

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases
  2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (6 preceding siblings ...)
  2024-12-17 16:07 ` [PATCH 7/7 v2] sched/fair: Update overutilized detection Vincent Guittot
@ 2024-12-18 14:06 ` Christian Loehle
  2024-12-19 16:22   ` Vincent Guittot
  7 siblings, 1 reply; 19+ messages in thread
From: Christian Loehle @ 2024-12-18 14:06 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, pierre.gondois, qperret

Hi Vincent,
just some quick remarks, I won't have time to actually review and test this
in-depth until January. Sorry for that.

On 12/17/24 16:07, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
> 
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
> 
> Patch 2 creates a new EM interface that will be used by Patch 3
> 
> Patch 3 fixes the issue of tasks being stacked on same CPU of a PD whereas
> others might be a better choice. feec() looks for the CPU with the highest
> spare capacity in a PD assuming that it will be the best CPU from a energy
> efficiency PoV because it will require the smallest increase of OPP.
> This is often but not always true, this policy filters some others CPUs
> which would be as efficients because of using the same OPP but with less
> running tasks as an example.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result having the same energy cost. In
> such cases, we can use other metrics to select the best CPU with the same
> energy cost. Patch 3 rework feec() to look 1st for the lowest cost in a PD
> and then the most performant CPU between CPUs. At now, this only tries to
> evenly spread the number of runnable tasks on CPUs but this can be
> improved with other metric like the sched slice duration in a follow up
> series.


Could you elaborate why this is the better strategy instead of max_spare_cap?
Presumably the highest max_spare_cap has to have rather small tasks if it
still has more runnable tasks than the other (higher util) CPUs of the PD.
So nr of runnable tasks should intuitively be the less stable metric (to
me anyway).

For which workloads does it make a difference?
Which benefit from nr of runnable tasks? Which for max_spare_cap?

> 
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
> 
> 9 iterations of perf bench sched pipe -T -l 80000
>                 ops/sec  stdev 
> tip/sched/core  13001    (+/- 1.2%)
> + patches 1-3   14349    (+/- 5.4%)  +10.4%

I'm confused, the feec() rework in patch 3 does more comparisons overall,
so should be slower, but here we have a 10% improvement?
OTOH feec() shouldn't be running much in the first place, since you
don't run it when overutilized anymore (i.e. keep mainline behavior).
The difference should be negligible then and for me it basically is (rk3399
and -l 5000 to get roughly comparable test duration (results in seconds,
lower is better), 10 iterations:
tip/sched/core:
20.4573 +-0.0832
vingu/rework-eas-v2-patches-1-to-3:
20.7054 +-0.0411

> 
> 
> Patch 4 removed the now unused em_cpu_energy()
> 
> Patch 5 solves another problem with tasks being stuck on a CPU forever
> because it doesn't sleep anymore and as a result never wakeup and call
> feec(). Such task can be detected by comparing util_avg or runnable_avg
> with the compute capacity of the CPU. Once detected, we can call feec() to
> check if there is a better CPU for the stuck task. The call can be done in
> 2 places:
> - When the task is put back in the runnnable list after its running slice
>   with the balance callback mecanism similarly to the rt/dl push callback.
> - During cfs tick when there is only 1 running task stuck on the CPU in
>   which case the balance callback can't be used.
> 
> This push callback mecanism with the new feec() algorithm ensures that
> tasks always get a chance to migrate on the best suitable CPU and don't
> stay stuck on a CPU which is no more the most suitable one. As examples:
> - A task waking on a big CPU with a uclamp max preventing it to sleep and
>   wake up, can migrate on a smaller CPU once it's more power efficient.
> - The tasks are spread on CPUs in the PD when they target the same OPP.
> 
> Patch 6 adds task misfit migration case in the cfs tick and push callback
> mecanism to prevent waking up an idle cpu unnecessarily.
> 
> Patch 7 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.

Would it make sense to further split 5-7 for ease of reviewing?
Maybe even 1 and 4 as fixes, too?

Regards,
Christian


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases
  2024-12-18 14:06 ` [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Christian Loehle
@ 2024-12-19 16:22   ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2024-12-19 16:22 UTC (permalink / raw)
  To: Christian Loehle
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, pierre.gondois, qperret

On Wed, 18 Dec 2024 at 15:06, Christian Loehle <christian.loehle@arm.com> wrote:
>
> Hi Vincent,
> just some quick remarks, I won't have time to actually review and test this
> in-depth until January. Sorry for that.

no problem

>
> On 12/17/24 16:07, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > became more and more visible with features like uclamp as an example. This
> > serie tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > whereas it is capped to a lower compute capacity. This wrong classification
> > can prevent periodic load balancer to select a group_misfit_task CPU
> > because group_overloaded has higher priority.
> >
> > Patch 2 creates a new EM interface that will be used by Patch 3
> >
> > Patch 3 fixes the issue of tasks being stacked on same CPU of a PD whereas
> > others might be a better choice. feec() looks for the CPU with the highest
> > spare capacity in a PD assuming that it will be the best CPU from a energy
> > efficiency PoV because it will require the smallest increase of OPP.
> > This is often but not always true, this policy filters some others CPUs
> > which would be as efficients because of using the same OPP but with less
> > running tasks as an example.
> > In fact, we only care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result having the same energy cost. In
> > such cases, we can use other metrics to select the best CPU with the same
> > energy cost. Patch 3 rework feec() to look 1st for the lowest cost in a PD
> > and then the most performant CPU between CPUs. At now, this only tries to
> > evenly spread the number of runnable tasks on CPUs but this can be
> > improved with other metric like the sched slice duration in a follow up
> > series.
>
>
> Could you elaborate why this is the better strategy instead of max_spare_cap?
> Presumably the highest max_spare_cap has to have rather small tasks if it
> still has more runnable tasks than the other (higher util) CPUs of the PD.

You don't always have a direct relation between nr_runnable,
max_spare_cap and task "size" because of blocked utilization. This
rework keeps the same behavior of highest max_spare_cap in a lot of
cases which includes a spare capacity that make  selecting a different
OPP but It also covers other cases when blocked utilization,
uclamp_min, uclamp_max, cpufreq clamping min/max freq breaks this
relation

While studying trace, we can often see small tasks being packed on a
CPU whereas another one is idle in the same PD

> So nr of runnable tasks should intuitively be the less stable metric (to
> me anyway).

spreading tasks helps to reduce the average scheduling latency which
is beneficial for small tasks. This performance decision is a 1st
simple version which aimed to be improved with other hints like the
sched_slice

>
> For which workloads does it make a difference?
> Which benefit from nr of runnable tasks? Which for max_spare_cap?

I have started to run some tests on Android device but doesn't have
consolidated results yet and I didn't want to delay more the v2

>
> >
> > perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> > of the new feec() vs current implementation.
> >
> > 9 iterations of perf bench sched pipe -T -l 80000
> >                 ops/sec  stdev
> > tip/sched/core  13001    (+/- 1.2%)
> > + patches 1-3   14349    (+/- 5.4%)  +10.4%
>
> I'm confused, the feec() rework in patch 3 does more comparisons overall,
> so should be slower, but here we have a 10% improvement?

TBH, I didn't expect perf improvement but wanted to test that there is
no regression. I run the tests several time and the results are always
in the same range

> OTOH feec() shouldn't be running much in the first place, since you
> don't run it when overutilized anymore (i.e. keep mainline behavior).

This should not make any difference here as the system is not
overutilized anyway

> The difference should be negligible then and for me it basically is (rk3399
> and -l 5000 to get roughly comparable test duration (results in seconds,
> lower is better), 10 iterations:
> tip/sched/core:
> 20.4573 +-0.0832
> vingu/rework-eas-v2-patches-1-to-3:
> 20.7054 +-0.0411
>
> >
> >
> > Patch 4 removed the now unused em_cpu_energy()
> >
> > Patch 5 solves another problem with tasks being stuck on a CPU forever
> > because it doesn't sleep anymore and as a result never wakeup and call
> > feec(). Such task can be detected by comparing util_avg or runnable_avg
> > with the compute capacity of the CPU. Once detected, we can call feec() to
> > check if there is a better CPU for the stuck task. The call can be done in
> > 2 places:
> > - When the task is put back in the runnnable list after its running slice
> >   with the balance callback mecanism similarly to the rt/dl push callback.
> > - During cfs tick when there is only 1 running task stuck on the CPU in
> >   which case the balance callback can't be used.
> >
> > This push callback mecanism with the new feec() algorithm ensures that
> > tasks always get a chance to migrate on the best suitable CPU and don't
> > stay stuck on a CPU which is no more the most suitable one. As examples:
> > - A task waking on a big CPU with a uclamp max preventing it to sleep and
> >   wake up, can migrate on a smaller CPU once it's more power efficient.
> > - The tasks are spread on CPUs in the PD when they target the same OPP.
> >
> > Patch 6 adds task misfit migration case in the cfs tick and push callback
> > mecanism to prevent waking up an idle cpu unnecessarily.
> >
> > Patch 7 removes the need of testing uclamp_min in cpu_overutilized to
> > trigger the active migration of a task on another CPU.
>
> Would it make sense to further split 5-7 for ease of reviewing?
> Maybe even 1 and 4 as fixes, too?
>
> Regards,
> Christian
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-01-20 15:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-17 16:07 [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Vincent Guittot
2024-12-17 16:07 ` [PATCH 1/7 v2] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
2024-12-17 19:15   ` Dhaval Giani
2024-12-17 16:07 ` [PATCH 2/7 v2] energy model: Add a get previous state function Vincent Guittot
2024-12-17 16:07 ` [PATCH 3/7 v2] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
2024-12-24 16:46   ` Luis Machado
2024-12-17 16:07 ` [PATCH 4/7 v2] energy model: Remove unused em_cpu_energy() Vincent Guittot
2024-12-18 14:59   ` Christian Loehle
2024-12-17 16:07 ` [PATCH 5/7 v2] sched/fair: Add push task callback for EAS Vincent Guittot
2025-01-16 17:34   ` Pierre Gondois
2025-01-20 15:50     ` Vincent Guittot
2024-12-17 16:07 ` [PATCH 6/7 v2] sched/fair: Add misfit case to " Vincent Guittot
2025-01-16 17:35   ` Pierre Gondois
2025-01-20 15:50     ` Vincent Guittot
2024-12-17 16:07 ` [PATCH 7/7 v2] sched/fair: Update overutilized detection Vincent Guittot
2025-01-17 10:27   ` Pierre Gondois
2025-01-20 15:50     ` Vincent Guittot
2024-12-18 14:06 ` [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases Christian Loehle
2024-12-19 16:22   ` Vincent Guittot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox