* [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases
@ 2025-12-01 9:13 Vincent Guittot
2025-12-01 9:13 ` [PATCH 1/6 v7] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
` (7 more replies)
0 siblings, 8 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 9:13 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
[1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features like uclamp. This series
tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.
Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is capped to a lower compute capacity. This wrong classification
can prevent the periodic load balancer from selecting a group_misfit_task CPU
because group_overloaded has higher priority.
Patch 2 removes the need to test uclamp_min in cpu_overutilized to
trigger the active migration of a task on another CPU.
Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
Exec flags when we just want to look for a possible better CPU.
Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
enable it.
Patch 5 enables has_idle_core for !SMT systems to track if there may be an
idle CPU in the LLC.
Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
- when a task is stuck on a CPU and the system is not overutilized.
- if there is a possible idle CPU when the system is overutilized.
More test results will come later as I wanted to send the patchset before
LPC.
Tbench on dragonboard rb5
schedutil and EAS enabled
# process tip +patchset
1 29.1(+/-4.1%) 124.7(+/-12.3%) +329%
2 60.0(+/-0.9%) 216.1(+/- 7.9%) +260%
4 255.8(+/-1.9%) 421.4(+/- 2.0%) +65%
8 1317.3(+/-4.6%) 1396.1(+/- 3.0%) +6%
16 958.2(+/-4.6%) 979.6(+/- 2.0%) +2%
Hackbench didn't show any difference
Vincent Guittot (6):
sched/fair: Filter false overloaded_group case for EAS
sched/fair: Update overutilized detection
sched/fair: Prepare select_task_rq_fair() to be called for new cases
sched/fair: Add push task mechanism for fair
sched/fair: Enable idle core tracking for !SMT
sched/fair: Add EAS and idle cpu push trigger
kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 46 ++++--
kernel/sched/topology.c | 3 +
3 files changed, 346 insertions(+), 53 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/6 v7] sched/fair: Filter false overloaded_group case for EAS
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
@ 2025-12-01 9:13 ` Vincent Guittot
2025-12-01 9:13 ` [PATCH 2/6 v7] sched/fair: Update overutilized detection Vincent Guittot
` (6 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 9:13 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
With EAS, a group should be set overloaded only if at least 1 CPU in the
group is overutilized, but it can happen that a CPU is fully utilized by
tasks because its compute capacity has been clamped. In such a case, the
CPU is not overutilized and, as a result, should not be marked overloaded
either.
group_overloaded having a higher priority than group_misfit, such a group
can be selected as the busiest group instead of a group with a misfit task,
which prevents load_balance from selecting the CPU with the misfit task to
pull the latter onto a fitting CPU.
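The classification order described above can be sketched with a toy userspace model. The struct fields, the `classify()` helper, and the enum are illustrative only, not the kernel's actual sg_lb_stats handling; they just show why gating the overloaded state on an overutilized CPU stops a capped group from hiding a misfit task:

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative subset of the kernel's group types, in priority order */
enum group_type { GROUP_HAS_SPARE, GROUP_MISFIT_TASK, GROUP_OVERLOADED };

struct sg_stats {
	int sum_nr_running;
	int group_weight;
	bool group_overutilized;	/* the flag added by this patch */
	bool has_misfit_task;
};

static enum group_type classify(const struct sg_stats *sgs, bool eas)
{
	bool overloaded = sgs->sum_nr_running > sgs->group_weight;

	/* The fix: with EAS, a group with no overutilized CPU (e.g. one
	 * capped to a lower compute capacity) is never overloaded. */
	if (eas && !sgs->group_overutilized)
		overloaded = false;

	if (overloaded)
		return GROUP_OVERLOADED;	/* highest priority */
	if (sgs->has_misfit_task)
		return GROUP_MISFIT_TASK;
	return GROUP_HAS_SPARE;
}
```

With the filter, a capped group carrying a misfit task ranks as group_misfit_task instead of group_overloaded, so the load balancer can pick it as busiest.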
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
---
kernel/sched/fair.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1855975b8248..b10f04715251 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9987,6 +9987,7 @@ struct sg_lb_stats {
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned int group_smt_balance; /* Task on busy SMT be moved */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned int group_overutilized; /* At least one CPU is overutilized in the group */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -10219,6 +10220,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
static inline bool
group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
{
+ /*
+ * With EAS and uclamp, at least 1 CPU in the group must be
+ * overutilized to consider the group overloaded.
+ */
+ if (sched_energy_enabled() && !sgs->group_overutilized)
+ return false;
+
if (sgs->sum_nr_running <= sgs->group_weight)
return false;
@@ -10402,14 +10410,12 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
* @group: sched_group whose statistics are to be updated.
* @sgs: variable to hold the statistics for this group.
* @sg_overloaded: sched_group is overloaded
- * @sg_overutilized: sched_group is overutilized
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *group,
struct sg_lb_stats *sgs,
- bool *sg_overloaded,
- bool *sg_overutilized)
+ bool *sg_overloaded)
{
int i, nr_running, local_group, sd_flags = env->sd->flags;
bool balancing_at_rd = !env->sd->parent;
@@ -10431,7 +10437,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_nr_running += nr_running;
if (cpu_overutilized(i))
- *sg_overutilized = 1;
+ sgs->group_overutilized = 1;
/*
* No need to call idle_cpu() if nr_running is not 0
@@ -11103,13 +11109,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_capacity(env->sd, env->dst_cpu);
}
- update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
+ update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded);
if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
}
+ sg_overutilized |= sgs->group_overutilized;
+
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
--
2.43.0
* [PATCH 2/6 v7] sched/fair: Update overutilized detection
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
2025-12-01 9:13 ` [PATCH 1/6 v7] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2025-12-01 9:13 ` Vincent Guittot
2025-12-01 9:13 ` [PATCH 3/6 v7] sched/fair: Prepare select_task_rq_fair() to be called for new cases Vincent Guittot
` (5 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 9:13 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
Checking uclamp_min is useless and counterproductive for the overutilized
state, as misfit can now happen without the system being in the
overutilized state.
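A simplified userspace sketch of the check after this patch: only the rq's uclamp-max cap is considered and uclamp_min is ignored. The 1280/1024 margin mimics the ~80% headroom of `fits_capacity()`; the real `util_fits_cpu()` has more corner cases than shown here, so treat this as an illustration of the shape of the test, not the exact kernel logic:

```c
#include <assert.h>
#include <stdbool.h>

/* capacity and util use the usual 1024-based scale */
static bool cpu_overutilized_sim(unsigned long util, unsigned long capacity,
				 unsigned long rq_uclamp_max)
{
	/* uclamp max caps the utilization the rq may present */
	unsigned long demand = util < rq_uclamp_max ? util : rq_uclamp_max;

	/* overutilized when demand exceeds ~80% of the CPU's capacity */
	return demand * 1280 > capacity * 1024;
}
```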
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b10f04715251..f430ec890b72 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6785,16 +6785,15 @@ static inline void hrtick_update(struct rq *rq)
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min, rq_util_max;
+ unsigned long rq_util_max;
if (!sched_energy_enabled())
return false;
- rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
/* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu);
}
/*
--
2.43.0
* [PATCH 3/6 v7] sched/fair: Prepare select_task_rq_fair() to be called for new cases
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases Vincent Guittot
2025-12-01 9:13 ` [PATCH 1/6 v7] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
2025-12-01 9:13 ` [PATCH 2/6 v7] sched/fair: Update overutilized detection Vincent Guittot
@ 2025-12-01 9:13 ` Vincent Guittot
2025-12-01 9:13 ` [PATCH 4/6 v7] sched/fair: Add push task mechanism for fair Vincent Guittot
` (4 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 9:13 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
Update select_task_rq_fair() so it can be called outside of the 3 current
cases, which are:
- wake up
- exec
- fork
We want to select a rq in some new cases, like pushing a runnable task to a
better CPU than the local one. Such a case is neither a wakeup, nor an
exec, nor a fork. We make sure not to disturb these existing cases but
still go through EAS and the fast path.
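The gating above boils down to one predicate on the wake flags. The WF_* values below are illustrative placeholders, not the kernel's actual constants; the point is only that a push-callback caller passes wake_flags == 0 and must still take the EAS and fast paths, while exec and fork must not:

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative flag values, not the kernel's real ones */
#define WF_TTWU 0x01
#define WF_EXEC 0x02
#define WF_FORK 0x04

/* mirrors the want_sibling test added to select_task_rq_fair() */
static bool want_sibling(int wake_flags)
{
	return !(wake_flags & (WF_EXEC | WF_FORK));
}
```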
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f430ec890b72..80c4131fb35b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8518,6 +8518,7 @@ static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ int want_sibling = !(wake_flags & (WF_EXEC | WF_FORK));
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
@@ -8535,16 +8536,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if ((wake_flags & WF_CURRENT_CPU) &&
cpumask_test_cpu(cpu, p->cpus_ptr))
return cpu;
+ }
- if (!is_rd_overutilized(this_rq()->rd)) {
- new_cpu = find_energy_efficient_cpu(p, prev_cpu);
- if (new_cpu >= 0)
- return new_cpu;
- new_cpu = prev_cpu;
- }
+ /*
+ * We don't want EAS to be called for exec or fork but it should be
+ * called for any other case such as wake up or push callback.
+ */
+ if (!is_rd_overutilized(this_rq()->rd) && want_sibling) {
+ new_cpu = find_energy_efficient_cpu(p, prev_cpu);
+ if (new_cpu >= 0)
+ return new_cpu;
+ new_cpu = prev_cpu;
+ }
+ if (wake_flags & WF_TTWU)
want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
- }
rcu_read_lock();
for_each_domain(cpu, tmp) {
@@ -8575,7 +8581,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (unlikely(sd)) {
/* Slow path */
new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
- } else if (wake_flags & WF_TTWU) { /* XXX always ? */
+ } else if (want_sibling) {
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
}
--
2.43.0
* [PATCH 4/6 v7] sched/fair: Add push task mechanism for fair
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (2 preceding siblings ...)
2025-12-01 9:13 ` [PATCH 3/6 v7] sched/fair: Prepare select_task_rq_fair() to be called for new cases Vincent Guittot
@ 2025-12-01 9:13 ` Vincent Guittot
2025-12-01 9:13 ` [RFC PATCH 5/6 v7] sched/fair: Enable idle core tracking for !SMT Vincent Guittot
` (3 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 9:13 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
EAS is based on wakeup events to efficiently place tasks on the system,
but there are cases where a task no longer has wakeup events, or has them
at far too low a pace. For such situations, we can take advantage of the
task being put back in the enqueued list to check whether it should be
pushed to another CPU.
When the task is alone on the CPU, it is never put back in the enqueued
list; in this special case, we use the tick to run the check.
Add a push task mechanism that enables the fair scheduler to push runnable
tasks. EAS will be one user, but other features, like filling idle CPUs,
can also take advantage of it.
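The requeue-driven flow can be modelled with a toy priority-ordered list standing in for rq->cfs.pushable_tasks. It is array-backed purely for illustration (the kernel uses a plist); only the ordering and the "one push attempt per requeue" pop semantics are modelled:

```c
#include <assert.h>
#include <string.h>

#define MAX_PUSHABLE 8

struct pushable_list {
	int prio[MAX_PUSHABLE];	/* lower value == higher priority */
	int nr;
};

static void pushable_add(struct pushable_list *l, int prio)
{
	int i = l->nr++;

	/* keep the array sorted by ascending prio, like plist_add() */
	while (i > 0 && l->prio[i - 1] > prio) {
		l->prio[i] = l->prio[i - 1];
		i--;
	}
	l->prio[i] = prio;
}

static int pushable_pick_first(struct pushable_list *l)
{
	int first;

	if (!l->nr)
		return -1;

	/* the entry is removed on pick: one push attempt per requeue */
	first = l->prio[0];
	memmove(&l->prio[0], &l->prio[1], --l->nr * sizeof(int));
	return first;
}
```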
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 211 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 4 +
2 files changed, 213 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 80c4131fb35b..4e94a4cb8caa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6989,6 +6989,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -7017,6 +7018,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
h_nr_idle = task_has_idle_policy(p);
if (task_sleep || task_delayed || !se->sched_delayed)
h_nr_runnable = 1;
+
+ fair_remove_pushable_task(rq, p);
}
for_each_sched_entity(se) {
@@ -8504,6 +8507,187 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
return target;
}
+DEFINE_STATIC_KEY_FALSE(sched_push_task);
+
+static inline bool sched_push_task_enabled(void)
+{
+ return static_branch_unlikely(&sched_push_task);
+}
+
+static bool fair_push_task(struct rq *rq, struct task_struct *p)
+{
+ return false;
+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+ return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->cfs.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+
+ /*
+ * Remove the task from the pushable list as we only try once after
+ * the task has been put back in the enqueued list.
+ */
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+ return p;
+}
+
+static int
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags);
+
+/*
+ * See if the non-running fair tasks on this rq can be sent to other CPUs
+ * that fit their profile better.
+ */
+static bool push_fair_task(struct rq *rq)
+{
+ struct task_struct *next_task;
+ int prev_cpu, new_cpu;
+ struct rq *new_rq;
+
+ next_task = pick_next_pushable_fair_task(rq);
+ if (!next_task)
+ return false;
+
+ if (is_migration_disabled(next_task))
+ return true;
+
+ /* We might release rq lock */
+ get_task_struct(next_task);
+
+ prev_cpu = rq->cpu;
+
+ new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
+
+ if (new_cpu == prev_cpu)
+ goto out;
+
+ new_rq = cpu_rq(new_cpu);
+
+ if (double_lock_balance(rq, new_rq)) {
+ /* The task has already migrated in between */
+ if (task_cpu(next_task) != rq->cpu) {
+ double_unlock_balance(rq, new_rq);
+ goto out;
+ }
+
+ deactivate_task(rq, next_task, 0);
+ set_task_cpu(next_task, new_cpu);
+ activate_task(new_rq, next_task, 0);
+
+ resched_curr(new_rq);
+
+ double_unlock_balance(rq, new_rq);
+ }
+
+out:
+ put_task_struct(next_task);
+
+ return true;
+}
+
+static void push_fair_tasks(struct rq *rq)
+{
+ /* push_fair_task() will return true if it moved a fair task */
+ while (push_fair_task(rq))
+ ;
+}
+
+static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
+
+static inline void fair_queue_pushable_tasks(struct rq *rq)
+{
+ if (!sched_push_task_enabled() || !has_pushable_tasks(rq))
+ return;
+
+ queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
+}
+
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ if (sched_push_task_enabled())
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+}
+
+static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ if (sched_push_task_enabled() && fair_push_task(rq, p)) {
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+ plist_node_init(&p->pushable_tasks, p->prio);
+ plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+ }
+}
+
+static int active_load_balance_cpu_stop(void *data);
+
+/*
+ * See if the lone task running on the CPU should migrate to a better CPU
+ * than the local one.
+ */
+static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
+{
+ int new_cpu, cpu = cpu_of(rq);
+
+ if (!sched_push_task_enabled())
+ return false;
+
+ if (WARN_ON(!p))
+ return false;
+
+ if (WARN_ON(!task_current(rq, p)))
+ return false;
+
+ if (is_migration_disabled(p))
+ return false;
+
+ /* If there are several tasks, wait for the task to be put back */
+ if (rq->nr_running > 1)
+ return false;
+
+ if (!fair_push_task(rq, p))
+ return false;
+
+ new_cpu = select_task_rq_fair(p, cpu, 0);
+
+ if (new_cpu == cpu)
+ return false;
+
+ /*
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
+ */
+ if (!rq->active_balance) {
+ rq->active_balance = 1;
+ rq->push_cpu = new_cpu;
+ } else
+ return false;
+
+ raw_spin_rq_unlock(rq);
+ stop_one_cpu_nowait(cpu,
+ active_load_balance_cpu_stop, rq,
+ &rq->active_balance_work);
+ raw_spin_rq_lock(rq);
+
+ return true;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8973,6 +9157,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
put_prev_entity(cfs_rq, pse);
set_next_entity(cfs_rq, se);
+ /*
+ * The previous task might be eligible for being pushed to
+ * another CPU if it is still active.
+ */
+ fair_add_pushable_task(rq, prev);
+
__set_next_task_fair(rq, p, true);
}
@@ -9036,6 +9226,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
+
+ /*
+ * The previous task might be eligible for being pushed to another CPU
+ * if it is still active.
+ */
+ fair_add_pushable_task(rq, prev);
+
}
/*
@@ -13390,8 +13587,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- update_misfit_status(curr, rq);
- check_update_overutilized_status(task_rq(curr));
+ if (!check_pushable_task(curr, rq)) {
+ update_misfit_status(curr, rq);
+ check_update_overutilized_status(task_rq(curr));
+ }
task_tick_core(rq, curr);
}
@@ -13552,6 +13751,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
{
struct sched_entity *se = &p->se;
+ fair_remove_pushable_task(rq, p);
+
if (task_on_rq_queued(p)) {
/*
* Move the next running task to the front of the list, so our
@@ -13567,6 +13768,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
if (hrtick_enabled_fair(rq))
hrtick_start_fair(rq, p);
+ /*
+ * Try to push prev task before checking misfit for next task as
+ * the migration of prev can make next fit the CPU
+ */
+ fair_queue_pushable_tasks(rq);
update_misfit_status(p, rq);
sched_fair_update_stop_tick(rq, p);
}
@@ -13596,6 +13802,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
{
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
+ plist_head_init(&cfs_rq->pushable_tasks);
raw_spin_lock_init(&cfs_rq->removed.lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b419a4d98461..697bd654298a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -711,6 +711,8 @@ struct cfs_rq {
unsigned long runnable_avg;
} removed;
+ struct plist_head pushable_tasks;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u64 last_update_tg_load_avg;
unsigned long tg_load_avg_contrib;
@@ -3620,6 +3622,8 @@ static inline bool sched_energy_enabled(void) { return false; }
#endif /* !(CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
+DECLARE_STATIC_KEY_FALSE(sched_push_task);
+
#ifdef CONFIG_MEMBARRIER
/*
--
2.43.0
* [RFC PATCH 5/6 v7] sched/fair: Enable idle core tracking for !SMT
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (3 preceding siblings ...)
2025-12-01 9:13 ` [PATCH 4/6 v7] sched/fair: Add push task mechanism for fair Vincent Guittot
@ 2025-12-01 9:13 ` Vincent Guittot
2025-12-01 9:13 ` [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
` (2 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 9:13 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
Enable the has_idle_cores feature at the LLC level for !SMT systems, for
which a CPU equals a core.
We don't enable the has_idle_core feature of select_idle_cpu(), to stay
conservative and avoid parsing all CPUs of the LLC.
For now, has_idle_cores can be cleared even if a CPU is idle, because of
SIS_UTIL, but this looks reasonable as the probability of getting an idle
CPU is low anyway.
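A minimal model of the shared LLC flag this patch repurposes for !SMT. The struct and function names are illustrative stand-ins for sd_llc_shared->has_idle_cores, update_idle_core() and the scan in select_idle_cpu(): a CPU entering idle sets the flag, and a failed scan clears it, so later checks can cheaply skip a fully busy LLC; as noted above, a clear flag may occasionally hide an idle CPU:

```c
#include <assert.h>
#include <stdbool.h>

struct llc_shared {
	bool has_idle_cores;
};

/* on !SMT, a CPU going idle marks the LLC as possibly having idle CPUs */
static void cpu_enters_idle(struct llc_shared *sds)
{
	sds->has_idle_cores = true;
}

static bool scan_for_idle_cpu(struct llc_shared *sds, int nr_idle)
{
	if (!sds->has_idle_cores)
		return false;		/* cheap early exit: no scan */
	if (nr_idle > 0)
		return true;
	/* scan failed: remember there is no idle CPU in the LLC */
	sds->has_idle_cores = false;
	return false;
}
```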
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 29 +++++++----------------------
kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++-------------
2 files changed, 36 insertions(+), 35 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e94a4cb8caa..9af8d0a61856 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7500,19 +7500,6 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
return -1;
}
-#ifdef CONFIG_SCHED_SMT
-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
-EXPORT_SYMBOL_GPL(sched_smt_present);
-
-static inline void set_idle_cores(int cpu, int val)
-{
- struct sched_domain_shared *sds;
-
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- WRITE_ONCE(sds->has_idle_cores, val);
-}
-
static inline bool test_idle_cores(int cpu)
{
struct sched_domain_shared *sds;
@@ -7524,6 +7511,10 @@ static inline bool test_idle_cores(int cpu)
return false;
}
+#ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+EXPORT_SYMBOL_GPL(sched_smt_present);
+
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
* information in sd_llc_shared->has_idle_cores.
@@ -7611,15 +7602,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
#else /* !CONFIG_SCHED_SMT: */
-static inline void set_idle_cores(int cpu, int val)
-{
-}
-
-static inline bool test_idle_cores(int cpu)
-{
- return false;
-}
-
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core, p);
@@ -7885,6 +7867,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
+ if (!sched_smt_active())
+ set_idle_cores(target, 0);
+
/*
* For cluster machines which have lower sharing cache like L2 or
* LLC Tag, we tend to find an idle CPU in the target's cluster
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 697bd654298a..b9e228333d5e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1591,19 +1591,6 @@ do { \
flags = _raw_spin_rq_lock_irqsave(rq); \
} while (0)
-#ifdef CONFIG_SCHED_SMT
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
- if (static_branch_unlikely(&sched_smt_present))
- __update_idle_core(rq);
-}
-
-#else /* !CONFIG_SCHED_SMT: */
-static inline void update_idle_core(struct rq *rq) { }
-#endif /* !CONFIG_SCHED_SMT */
-
#ifdef CONFIG_FAIR_GROUP_SCHED
static inline struct task_struct *task_of(struct sched_entity *se)
@@ -2091,6 +2078,35 @@ static __always_inline bool sched_asym_cpucap_active(void)
return static_branch_unlikely(&sched_asym_cpucapacity);
}
+static inline void set_idle_cores(int cpu, int val)
+{
+ struct sched_domain_shared *sds;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ WRITE_ONCE(sds->has_idle_cores, val);
+}
+
+#ifdef CONFIG_SCHED_SMT
+extern void __update_idle_core(struct rq *rq);
+
+static inline void update_idle_core(struct rq *rq)
+{
+ if (static_branch_unlikely(&sched_smt_present))
+ __update_idle_core(rq);
+ else
+ set_idle_cores(cpu_of(rq), 1);
+
+}
+
+#else /* !CONFIG_SCHED_SMT: */
+static inline void update_idle_core(struct rq *rq)
+{
+ set_idle_cores(cpu_of(rq), 1);
+}
+#endif /* !CONFIG_SCHED_SMT */
+
+
struct sched_group_capacity {
atomic_t ref;
/*
--
2.43.0
* [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (4 preceding siblings ...)
2025-12-01 9:13 ` [RFC PATCH 5/6 v7] sched/fair: Enable idle core tracking for !SMT Vincent Guittot
@ 2025-12-01 9:13 ` Vincent Guittot
2025-12-01 13:53 ` Christian Loehle
2025-12-02 9:44 ` Hillf Danton
2025-12-01 13:31 ` [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases Christian Loehle
2025-12-01 22:02 ` David Laight
7 siblings, 2 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 9:13 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
EAS is based on wakeup events to efficiently place tasks on the system,
but there are cases where a task no longer has wakeup events, or has them
at far too low a pace. For such cases, we check whether it's worth pushing
the task to another CPU instead of putting it back in the enqueued list.
Wakeup events remain the main way to migrate tasks, but we now detect
situations where a task is stuck on a CPU by checking that its utilization
is larger than the max available compute capacity (max CPU capacity or
uclamp max setting).
When the system becomes overutilized and some CPUs are idle, we try to
push tasks instead of waiting for the periodic load balance.
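The stuck-task condition can be sketched as a pure function, a simplified stand-in for the task_stuck_on_cpu() helper added by this patch. The capacity values in the test are illustrative, using the usual 1024-based capacity scale:

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* A task is "stuck" when its demand exceeds the max compute capacity it
 * can ever get on this CPU (actual capacity or its uclamp-max cap), so
 * it stops sleeping and no longer generates wakeup events for feec(). */
static bool task_stuck_on_cpu_sim(unsigned long task_util_est,
				  unsigned long task_runnable,
				  unsigned long cpu_capacity,
				  unsigned long uclamp_max)
{
	unsigned long max_capa = min_ul(cpu_capacity, uclamp_max);
	unsigned long util = max_ul(task_util_est, task_runnable);

	return util > max_capa;
}
```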
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/topology.c | 3 ++
2 files changed, 68 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9af8d0a61856..e9e1d0c05805 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
}
static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
+
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
return static_branch_unlikely(&sched_push_task);
}
+static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
+{
+ unsigned long max_capa, util;
+
+ max_capa = min(get_actual_cpu_capacity(cpu),
+ uclamp_eff_value(p, UCLAMP_MAX));
+ util = max(task_util_est(p), task_runnable(p));
+
+ /*
+ * Return true only if the task might not sleep/wakeup because of a low
+ * compute capacity. Tasks which wake up regularly will be handled by
+ * feec().
+ */
+ return (util > max_capa);
+}
+
+static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
+{
+ if (!sched_energy_enabled())
+ return false;
+
+ if (is_rd_overutilized(rq->rd))
+ return false;
+
+ if (task_stuck_on_cpu(p, cpu_of(rq)))
+ return true;
+
+ if (!task_fits_cpu(p, cpu_of(rq)))
+ return true;
+
+ return false;
+}
+
+static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
+{
+ if (rq->nr_running == 1)
+ return false;
+
+ if (!is_rd_overutilized(rq->rd))
+ return false;
+
+ /* If there are idle CPUs in the LLC then try to push the task there */
+ if (test_idle_cores(cpu_of(rq)))
+ return true;
+
+ return false;
+}
+
+
static bool fair_push_task(struct rq *rq, struct task_struct *p)
{
+ if (!task_on_rq_queued(p))
+ return false;
+
+ if (p->se.sched_delayed)
+ return false;
+
+ if (p->nr_cpus_allowed == 1)
+ return false;
+
+ if (sched_energy_push_task(p, rq))
+ return true;
+
+ if (sched_idle_push_task(p, rq))
+ return true;
+
return false;
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..5edf7b117ed9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -391,10 +391,13 @@ static void sched_energy_set(bool has_eas)
if (sched_debug())
pr_info("%s: stopping EAS\n", __func__);
static_branch_disable_cpuslocked(&sched_energy_present);
+ static_branch_dec_cpuslocked(&sched_push_task);
} else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
if (sched_debug())
pr_info("%s: starting EAS\n", __func__);
static_branch_enable_cpuslocked(&sched_energy_present);
+ static_branch_inc_cpuslocked(&sched_push_task);
}
}
--
2.43.0
* Re: [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (5 preceding siblings ...)
2025-12-01 9:13 ` [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
@ 2025-12-01 13:31 ` Christian Loehle
2025-12-01 13:57 ` Christian Loehle
2025-12-01 17:48 ` Vincent Guittot
2025-12-01 22:02 ` David Laight
7 siblings, 2 replies; 23+ messages in thread
From: Christian Loehle @ 2025-12-01 13:31 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, pierre.gondois,
kprateek.nayak
Cc: qyousef, hongyan.xia2, luis.machado
On 12/1/25 09:13, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp. This series
> tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent the periodic load balancer from selecting a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need to test uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
> enable it.
>
> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
>
> More tests results will come later as I wanted to send the pachtset before
> LPC.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process tip +patchset
> 1 29.1(+/-4.1%) 124.7(+/-12.3%) +329%
> 2 60.0(+/-0.9%) 216.1(+/- 7.9%) +260%
> 4 255.8(+/-1.9%) 421.4(+/- 2.0%) +65%
> 8 1317.3(+/-4.6%) 1396.1(+/- 3.0%) +6%
> 16 958.2(+/-4.6%) 979.6(+/- 2.0%) +2%
Just so I understand, there's no uclamp in the workload here?
Could you expand on the workload a little, what were the parameters/settings?
So the significant increase is really only for nr_proc < nr_cpus, with the
observed throughput increase it'll probably be something like "always running
on little CPUs" vs "always running on big CPUs", is that what's happening?
Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
TCP anyway.
>
> Hackbench didn't show any difference
>
>
> Vincent Guittot (6):
> sched/fair: Filter false overloaded_group case for EAS
> sched/fair: Update overutilized detection
> sched/fair: Prepare select_task_rq_fair() to be called for new cases
> sched/fair: Add push task mechanism for fair
> sched/fair: Enable idle core tracking for !SMT
> sched/fair: Add EAS and idle cpu push trigger
>
> kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
> kernel/sched/sched.h | 46 ++++--
> kernel/sched/topology.c | 3 +
> 3 files changed, 346 insertions(+), 53 deletions(-)
>
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-01 9:13 ` [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
@ 2025-12-01 13:53 ` Christian Loehle
2025-12-01 17:49 ` Vincent Guittot
2025-12-02 9:44 ` Hillf Danton
1 sibling, 1 reply; 23+ messages in thread
From: Christian Loehle @ 2025-12-01 13:53 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, pierre.gondois,
kprateek.nayak
Cc: qyousef, hongyan.xia2, luis.machado
Some nits below for now
On 12/1/25 09:13, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore or at a far
> too low pace. For such cases, we check if it's worht pushing hte task on
worth
the
> another CPUs instead of putting it back in the enqueued list.
>
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)
>
> When the system becomes overutilized and some CPUs are idle, we try to
> push tasks instead of waiting periodic load balance.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched/topology.c | 3 ++
> 2 files changed, 68 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9af8d0a61856..e9e1d0c05805 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> }
>
> static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> +
This doesn't belong here
> /*
> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> * failing half-way through and resume the dequeue later.
> @@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> return static_branch_unlikely(&sched_push_task);
> }
>
> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned long max_capa, util;
> +
> + max_capa = min(get_actual_cpu_capacity(cpu),
> + uclamp_eff_value(p, UCLAMP_MAX));
> + util = max(task_util_est(p), task_runnable(p));
> +
> + /*
> + * Return true only if the task might not sleep/wakeup because of a low
> + * compute capacity. Tasks, which wake up regularly, will be handled by
> + * feec().
> + */
> + return (util > max_capa);
> +}
> +
> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (!sched_energy_enabled())
> + return false;
> +
> + if (is_rd_overutilized(rq->rd))
> + return false;
> +
> + if (task_stuck_on_cpu(p, cpu_of(rq)))
> + return true;
> +
> + if (!task_fits_cpu(p, cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (rq->nr_running == 1)
> + return false;
> +
> + if (!is_rd_overutilized(rq->rd))
> + return false;
> +
> + /* If there are idle cpus in the llc then try to push the task on it */
> + if (test_idle_cores(cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +
> static bool fair_push_task(struct rq *rq, struct task_struct *p)
> {
> + if (!task_on_rq_queued(p))
> + return false;
> +
> + if (p->se.sched_delayed)
> + return false;
> +
> + if (p->nr_cpus_allowed == 1)
> + return false;
> +
> + if (sched_energy_push_task(p, rq))
> + return true;
> +
> + if (sched_idle_push_task(p, rq))
> + return true;
> +
> return false;
> }
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..5edf7b117ed9 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -391,10 +391,13 @@ static void sched_energy_set(bool has_eas)
> if (sched_debug())
> pr_info("%s: stopping EAS\n", __func__);
> static_branch_disable_cpuslocked(&sched_energy_present);
> + static_branch_dec_cpuslocked(&sched_push_task);
> + } else if (has_eas && !sched_energy_enabled()) {
> } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
This could just be (has_eas && sched_energy_enabled() && !static_branch_unlikely(&sched_energy_present))
to avoid the awkward else if above
> if (sched_debug())
> pr_info("%s: starting EAS\n", __func__);
> static_branch_enable_cpuslocked(&sched_energy_present);
> + static_branch_inc_cpuslocked(&sched_push_task);
> }
> }
>
* Re: [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases
2025-12-01 13:31 ` [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases Christian Loehle
@ 2025-12-01 13:57 ` Christian Loehle
2025-12-01 17:48 ` Vincent Guittot
2025-12-01 17:48 ` Vincent Guittot
1 sibling, 1 reply; 23+ messages in thread
From: Christian Loehle @ 2025-12-01 13:57 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, pierre.gondois,
kprateek.nayak
Cc: qyousef, hongyan.xia2, luis.machado
Nit in the title: mechanism, handle
On 12/1/25 13:31, Christian Loehle wrote:
> On 12/1/25 09:13, Vincent Guittot wrote:
>> [...]
>>
>> Tbench on dragonboard rb5
>> schedutil and EAS enabled
>>
>> # process tip +patchset
>> 1 29.1(+/-4.1%) 124.7(+/-12.3%) +329%
>> 2 60.0(+/-0.9%) 216.1(+/- 7.9%) +260%
>> 4 255.8(+/-1.9%) 421.4(+/- 2.0%) +65%
>> 8 1317.3(+/-4.6%) 1396.1(+/- 3.0%) +6%
>> 16 958.2(+/-4.6%) 979.6(+/- 2.0%) +2%
>
> Just so I understand, there's no uclamp in the workload here?
> Could you expand on the workload a little, what were the parameters/settings?
> So the significant increase is really only for nr_proc < nr_cpus, with the
> observed throughput increase it'll probably be something like "always running
> on little CPUs" vs "always running on big CPUs", is that what's happening?
> Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> TCP anyway.
... or if not why does OU not trigger on tip?
>
I can't apply this on yesterday's 6.18 release, nor on tip/sched-core. What is
this based on? Can I get a branch or a 6.18 rebase?
* Re: [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases
2025-12-01 13:31 ` [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases Christian Loehle
2025-12-01 13:57 ` Christian Loehle
@ 2025-12-01 17:48 ` Vincent Guittot
1 sibling, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 17:48 UTC (permalink / raw)
To: Christian Loehle
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
qyousef, hongyan.xia2, luis.machado
On Mon, 1 Dec 2025 at 14:31, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 12/1/25 09:13, Vincent Guittot wrote:
> > [...]
> >
> > Tbench on dragonboard rb5
> > schedutil and EAS enabled
> >
> > # process tip +patchset
> > 1 29.1(+/-4.1%) 124.7(+/-12.3%) +329%
> > 2 60.0(+/-0.9%) 216.1(+/- 7.9%) +260%
> > 4 255.8(+/-1.9%) 421.4(+/- 2.0%) +65%
> > 8 1317.3(+/-4.6%) 1396.1(+/- 3.0%) +6%
> > 16 958.2(+/-4.6%) 979.6(+/- 2.0%) +2%
>
> Just so I understand, there's no uclamp in the workload here?
Yes, no uclamp
> Could you expand on the workload a little, what were the parameters/settings?
for g in 1 2 4 8 16; do
for i in {0..8}; do
sync
sleep 3.777
tbench -t 10 $g
done
done
> So the significant increase is really only for nr_proc < nr_cpus, with the
yes
> observed throughput increase it'll probably be something like "always running
> on little CPUs" vs "always running on big CPUs", is that what's happening?
I have looked at the details. These results are part of the bench suite that
I'm running with hackbench, but the gain most probably comes from migrating
the task to a better CPU.
> Also shouldn't tbench still have plenty of wakeup events? It issues plenty of
> TCP anyway.
Yes
* Re: [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases
2025-12-01 13:57 ` Christian Loehle
@ 2025-12-01 17:48 ` Vincent Guittot
0 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 17:48 UTC (permalink / raw)
To: Christian Loehle
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
qyousef, hongyan.xia2, luis.machado
On Mon, 1 Dec 2025 at 14:57, Christian Loehle <christian.loehle@arm.com> wrote:
>
> Nit in the title: mechanism, handle
>
> On 12/1/25 13:31, Christian Loehle wrote:
> [...]
>
> I can't apply this on yesterday's released 6.18 and not on tip/sched-core, what's
> this based on? Can I get a branch or a 6.18 rebase?
The patchset is based on tip/sched/core commit 33cf66d88306
("sched/fair: Proportional newidle balance")
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-01 13:53 ` Christian Loehle
@ 2025-12-01 17:49 ` Vincent Guittot
2025-12-01 19:33 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 17:49 UTC (permalink / raw)
To: Christian Loehle
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
qyousef, hongyan.xia2, luis.machado
On Mon, 1 Dec 2025 at 14:53, Christian Loehle <christian.loehle@arm.com> wrote:
>
> Some nits below for now
>
> On 12/1/25 09:13, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore or at a far
> > too low pace. For such cases, we check if it's worht pushing hte task on
>
> worth
> the
+1
>
> > another CPUs instead of putting it back in the enqueued list.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
> >
> > When the system becomes overutilized and some CPUs are idle, we try to
> > push tasks instead of waiting periodic load balance.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/topology.c | 3 ++
> > 2 files changed, 68 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9af8d0a61856..e9e1d0c05805 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > }
> >
> > static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> > +
>
> This doesn't belong here
yes, I don't know what I messed up with my patches
>
> > /*
> > * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> > * failing half-way through and resume the dequeue later.
> > @@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> > return static_branch_unlikely(&sched_push_task);
> > }
> >
> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned long max_capa, util;
> > +
> > + max_capa = min(get_actual_cpu_capacity(cpu),
> > + uclamp_eff_value(p, UCLAMP_MAX));
> > + util = max(task_util_est(p), task_runnable(p));
> > +
> > + /*
> > + * Return true only if the task might not sleep/wakeup because of a low
> > + * compute capacity. Tasks, which wake up regularly, will be handled by
> > + * feec().
> > + */
> > + return (util > max_capa);
> > +}
> > +
> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (!sched_energy_enabled())
> > + return false;
> > +
> > + if (is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + if (!task_fits_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (rq->nr_running == 1)
> > + return false;
> > +
> > + if (!is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + /* If there are idle cpus in the llc then try to push the task on it */
> > + if (test_idle_cores(cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +
> > static bool fair_push_task(struct rq *rq, struct task_struct *p)
> > {
> > + if (!task_on_rq_queued(p))
> > + return false;
> > +
> > + if (p->se.sched_delayed)
> > + return false;
> > +
> > + if (p->nr_cpus_allowed == 1)
> > + return false;
> > +
> > + if (sched_energy_push_task(p, rq))
> > + return true;
> > +
> > + if (sched_idle_push_task(p, rq))
> > + return true;
> > +
> > return false;
> > }
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index cf643a5ddedd..5edf7b117ed9 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -391,10 +391,13 @@ static void sched_energy_set(bool has_eas)
> > if (sched_debug())
> > pr_info("%s: stopping EAS\n", __func__);
> > static_branch_disable_cpuslocked(&sched_energy_present);
> > + static_branch_dec_cpuslocked(&sched_push_task);
> > + } else if (has_eas && !sched_energy_enabled()) {
> > } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
>
> This could just be (has_eas && sched_energy_enabled() && !static_branch_unlikely(&sched_energy_present))
> to avoid the awkward else if above
Argh, I messed up something with this patchset and another pending
cleanup patch when I rebased it.
It should be :
static_branch_disable_cpuslocked(&sched_energy_present);
+ static_branch_dec_cpuslocked(&sched_push_task);
} else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
I need to rerun the bench to check that the results of the cover
letter are still correct.
That's what happens when you want to send a patchset too quickly ...
>
> > if (sched_debug())
> > pr_info("%s: starting EAS\n", __func__);
> > static_branch_enable_cpuslocked(&sched_energy_present);
> > + static_branch_inc_cpuslocked(&sched_push_task);
> > }
> > }
> >
>
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-01 17:49 ` Vincent Guittot
@ 2025-12-01 19:33 ` Vincent Guittot
0 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-01 19:33 UTC (permalink / raw)
To: Christian Loehle
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
qyousef, hongyan.xia2, luis.machado
On Mon, 1 Dec 2025 at 18:49, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Mon, 1 Dec 2025 at 14:53, Christian Loehle <christian.loehle@arm.com> wrote:
> >
> > Some nits below for now
> >
> > On 12/1/25 09:13, Vincent Guittot wrote:
> > > EAS is based on wakeup events to efficiently place tasks on the system, but
> > > there are cases where a task doesn't have wakeup events anymore or at a far
> > > too low pace. For such cases, we check if it's worht pushing hte task on
> >
> > worth
> > the
>
> +1
>
> > [...]
> > >
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index cf643a5ddedd..5edf7b117ed9 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -391,10 +391,13 @@ static void sched_energy_set(bool has_eas)
> > > if (sched_debug())
> > > pr_info("%s: stopping EAS\n", __func__);
> > > static_branch_disable_cpuslocked(&sched_energy_present);
> > > + static_branch_dec_cpuslocked(&sched_push_task);
> > > + } else if (has_eas && !sched_energy_enabled()) {
> > > } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
> >
> > This could just be (has_eas && sched_energy_enabled() && !static_branch_unlikely(&sched_energy_present))
> > to avoid the awkward else if above
>
> Argh, I messed up something with this patchset and another pending
> cleanup patch when I rebased it.
> It should be :
>
> static_branch_disable_cpuslocked(&sched_energy_present);
> + static_branch_dec_cpuslocked(&sched_push_task);
> } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
>
> I need to rerun the bench to check that the results of the cover
> letter are still correct.
And the results are now the same
Sorry for the noise, I'm going to fix this in a v8
>
> That's what happens when you want to send a patchset too quickly ...
* Re: [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases Vincent Guittot
` (6 preceding siblings ...)
2025-12-01 13:31 ` [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases Christian Loehle
@ 2025-12-01 22:02 ` David Laight
2025-12-02 13:24 ` Vincent Guittot
7 siblings, 1 reply; 23+ messages in thread
From: David Laight @ 2025-12-01 22:02 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
qyousef, hongyan.xia2, christian.loehle, luis.machado
On Mon, 1 Dec 2025 10:13:02 +0100
Vincent Guittot <vincent.guittot@linaro.org> wrote:
...
If you've got sched/fair.c out on the operating table, have a look at all the
code that multiplies by PELT_MIN_DIVISOR (about 48k).
There are max_t(u32) uses that (I think) mask the product to 32 bits (on 64bit)
before assigning to a u64.
Conversely on 32bit the product is only 32bits - even though it is assigned
to a u64.
There might be a valid justification for the 'utilisation' fitting in 32 bits,
but I'm not sure it applies to any of the other fields.
There are also all the 'long' variables in the code - which change size
between 32bit and 64bit.
I failed to spot an explanation as to why this is valid.
I suspect they should all be either u32 or u64.
This all means that variables like 'runnable_sum' may be truncated and end up
much smaller than they ought to be.
I think that means the scheduler can incorrectly think a 'session' is idle
when, in fact, it is very busy.
I didn't do a full analysis of the code, just looked at a few expressions.
The 64bit code calculates 'long_var * PELT_MIN_DIVISOR' to get a 64bit product.
Doing a full 64x64 multiply on 32bit is rather more expensive.
Given PELT_MIN_DIVISOR is just a scale factor to get extra precision
(I think the product decays with time), multiplying by 32768 would be much
cheaper and have much the same effect.
David
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-01 9:13 ` [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
2025-12-01 13:53 ` Christian Loehle
@ 2025-12-02 9:44 ` Hillf Danton
2025-12-02 13:01 ` Vincent Guittot
1 sibling, 1 reply; 23+ messages in thread
From: Hillf Danton @ 2025-12-02 9:44 UTC (permalink / raw)
To: Vincent Guittot
Cc: peterz, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle, luis.machado
On Mon, 1 Dec 2025 10:13:08 +0100 Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore, or has them
> at far too low a pace. For such cases, we check if it's worth pushing the
> task to another CPU instead of putting it back in the enqueued list.
>
> Wake up events remain the main way to migrate tasks but we now detect
> situations where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting).
>
> When the system becomes overutilized and some CPUs are idle, we try to
> push tasks instead of waiting for the periodic load balance.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched/topology.c | 3 ++
> 2 files changed, 68 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9af8d0a61856..e9e1d0c05805 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> }
>
> static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> +
> /*
> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> * failing half-way through and resume the dequeue later.
> @@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> return static_branch_unlikely(&sched_push_task);
> }
>
> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned long max_capa, util;
> +
> + max_capa = min(get_actual_cpu_capacity(cpu),
> + uclamp_eff_value(p, UCLAMP_MAX));
> + util = max(task_util_est(p), task_runnable(p));
> +
> + /*
> + * Return true only if the task might not sleep/wakeup because of a low
> + * compute capacity. Tasks, which wake up regularly, will be handled by
> + * feec().
> + */
> + return (util > max_capa);
> +}
> +
> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (!sched_energy_enabled())
> + return false;
> +
> + if (is_rd_overutilized(rq->rd))
> + return false;
> +
> + if (task_stuck_on_cpu(p, cpu_of(rq)))
> + return true;
> +
> + if (!task_fits_cpu(p, cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (rq->nr_running == 1)
> + return false;
> +
> + if (!is_rd_overutilized(rq->rd))
> + return false;
> +
> + /* If there are idle cpus in the llc then try to push the task on it */
> + if (test_idle_cores(cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +
> static bool fair_push_task(struct rq *rq, struct task_struct *p)
> {
> + if (!task_on_rq_queued(p))
> + return false;
Task is queued on rq.
> +
> + if (p->se.sched_delayed)
> + return false;
> +
> + if (p->nr_cpus_allowed == 1)
> + return false;
> +
> + if (sched_energy_push_task(p, rq))
> + return true;
If task is stuck on CPU, it could not be on rq. Weird.
> +
> + if (sched_idle_push_task(p, rq))
> + return true;
> +
> return false;
> }
>
More, in the tick path,
task_tick_fair
check_pushable_task
fair_push_task
task_on_rq_queued // this check makes no sense
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-02 9:44 ` Hillf Danton
@ 2025-12-02 13:01 ` Vincent Guittot
2025-12-03 9:00 ` Hillf Danton
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-12-02 13:01 UTC (permalink / raw)
To: Hillf Danton
Cc: peterz, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle, luis.machado
On Tue, 2 Dec 2025 at 10:45, Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 1 Dec 2025 10:13:08 +0100 Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore, or has them
> > at far too low a pace. For such cases, we check if it's worth pushing the
> > task to another CPU instead of putting it back in the enqueued list.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situations where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting).
> >
> > When the system becomes overutilized and some CPUs are idle, we try to
> > push tasks instead of waiting for the periodic load balance.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/topology.c | 3 ++
> > 2 files changed, 68 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9af8d0a61856..e9e1d0c05805 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > }
> >
> > static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> > +
> > /*
> > * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> > * failing half-way through and resume the dequeue later.
> > @@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> > return static_branch_unlikely(&sched_push_task);
> > }
> >
> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned long max_capa, util;
> > +
> > + max_capa = min(get_actual_cpu_capacity(cpu),
> > + uclamp_eff_value(p, UCLAMP_MAX));
> > + util = max(task_util_est(p), task_runnable(p));
> > +
> > + /*
> > + * Return true only if the task might not sleep/wakeup because of a low
> > + * compute capacity. Tasks, which wake up regularly, will be handled by
> > + * feec().
> > + */
> > + return (util > max_capa);
> > +}
> > +
> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (!sched_energy_enabled())
> > + return false;
> > +
> > + if (is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + if (!task_fits_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (rq->nr_running == 1)
> > + return false;
> > +
> > + if (!is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + /* If there are idle cpus in the llc then try to push the task on it */
> > + if (test_idle_cores(cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +
> > static bool fair_push_task(struct rq *rq, struct task_struct *p)
> > {
> > + if (!task_on_rq_queued(p))
> > + return false;
>
> Task is queued on rq.
> > +
> > + if (p->se.sched_delayed)
> > + return false;
> > +
> > + if (p->nr_cpus_allowed == 1)
> > + return false;
> > +
> > + if (sched_energy_push_task(p, rq))
> > + return true;
>
> If task is stuck on CPU, it could not be on rq. Weird.
Maybe it comes from my description and I should use task_stuck_on_rq.
By stuck, I mean that the task doesn't have any opportunity to migrate
to another cpu/rq and stays "forever" (at least until its next sleep) on
this cpu/rq because load balancing is disabled/bypassed with EAS.
Here stuck does not mean blocked/sleeping.
> > +
> > + if (sched_idle_push_task(p, rq))
> > + return true;
> > +
> > return false;
> > }
> >
> More, in the tick path,
>
> task_tick_fair
> check_pushable_task
> fair_push_task
> task_on_rq_queued // this check makes no sense
I want to use a single entry point (fair_push_task) for deciding to
push a task, so I agree that testing task_on_rq_queued() at tick is
useless, but it is needed for other cases, when the task is put back in
the rb tree.
* Re: [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases
2025-12-01 22:02 ` David Laight
@ 2025-12-02 13:24 ` Vincent Guittot
0 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2025-12-02 13:24 UTC (permalink / raw)
To: David Laight
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
qyousef, hongyan.xia2, christian.loehle, luis.machado
On Mon, 1 Dec 2025 at 23:03, David Laight <david.laight.linux@gmail.com> wrote:
>
> On Mon, 1 Dec 2025 10:13:02 +0100
> Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> ...
>
> If you've got sched/fair.c out on the operating table have a look at all the
> code that multiplies by PELT_MIN_DIVISOR (about 48k).
> There are max_t(u32) that (I think) mask the product to 32bits (on 64bit)
> before assigning to a u64.
I'm going to have a look. Some stay in the 32-bit range, like util_sum,
but some others don't, and we have scale_load_down(), which is either a
nop or >> 10, in the picture.
> Conversely on 32bit the product is only 32bits - even though it is assigned
> to a u64.
>
> There might be a valid justification for the 'utilisation' fitting in 32 bits,
> but I'm not sure it applies to any of the other fields.
>
> There are also all the 'long' variables in the code - which change size
> between 32bit and 64bit.
> I failed to spot an explanation as to why this is valid.
> I suspect they should all be either u32 or u64.
>
> This all means that variables like 'runnable_sum' may be truncated and much
> smaller than they ought to be.
> I think that means the scheduler can incorrectly think a 'session' is idle
> when, in fact, it is very busy.
>
> I didn't do a full analysis of the code, just looked at a few expressions.
>
> The 64bit code calculates 'long_var * PELT_MIN_DIVISOR' to get a 64bit product.
> Doing a full 64x64 multiply on 32bit is rather more expensive.
> Given PELT_MIN_DIVISOR is just a scale factor to get extra precision
> (I think the product decays with time) multiplying by 32768 would be much
> cheaper and have much the same effect.
>
> David
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-02 13:01 ` Vincent Guittot
@ 2025-12-03 9:00 ` Hillf Danton
2025-12-03 13:32 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Hillf Danton @ 2025-12-03 9:00 UTC (permalink / raw)
To: Vincent Guittot
Cc: peterz, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle, luis.machado
On Tue, 2 Dec 2025 14:01:39 +0100 Vincent Guittot wrote:
>On Tue, 2 Dec 2025 at 10:45, Hillf Danton <hdanton@sina.com> wrote:
>> On Mon, 1 Dec 2025 10:13:08 +0100 Vincent Guittot wrote:
>> > EAS is based on wakeup events to efficiently place tasks on the system, but
>> > there are cases where a task doesn't have wakeup events anymore, or has them
>> > at far too low a pace. For such cases, we check if it's worth pushing the
>> > task to another CPU instead of putting it back in the enqueued list.
>> >
>> > Wake up events remain the main way to migrate tasks but we now detect
>> > situations where a task is stuck on a CPU by checking that its utilization
>> > is larger than the max available compute capacity (max cpu capacity or
>> > uclamp max setting).
>> >
>> > When the system becomes overutilized and some CPUs are idle, we try to
>> > push tasks instead of waiting for the periodic load balance.
>> >
>> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> > ---
>> > kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++
>> > kernel/sched/topology.c | 3 ++
>> > 2 files changed, 68 insertions(+)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 9af8d0a61856..e9e1d0c05805 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > }
>> >
>> > static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
>> > +
>> > /*
>> > * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
>> > * failing half-way through and resume the dequeue later.
>> > @@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
>> > return static_branch_unlikely(&sched_push_task);
>> > }
>> >
>> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
>> > +{
>> > + unsigned long max_capa, util;
>> > +
>> > + max_capa = min(get_actual_cpu_capacity(cpu),
>> > + uclamp_eff_value(p, UCLAMP_MAX));
>> > + util = max(task_util_est(p), task_runnable(p));
>> > +
>> > + /*
>> > + * Return true only if the task might not sleep/wakeup because of a low
>> > + * compute capacity. Tasks, which wake up regularly, will be handled by
>> > + * feec().
>> > + */
>> > + return (util > max_capa);
>> > +}
>> > +
>> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
>> > +{
>> > + if (!sched_energy_enabled())
>> > + return false;
>> > +
>> > + if (is_rd_overutilized(rq->rd))
>> > + return false;
>> > +
>> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
>> > + return true;
>> > +
>> > + if (!task_fits_cpu(p, cpu_of(rq)))
>> > + return true;
>> > +
>> > + return false;
>> > +}
>> > +
>> > +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
>> > +{
>> > + if (rq->nr_running == 1)
>> > + return false;
>> > +
>> > + if (!is_rd_overutilized(rq->rd))
>> > + return false;
>> > +
>> > + /* If there are idle cpus in the llc then try to push the task on it */
>> > + if (test_idle_cores(cpu_of(rq)))
>> > + return true;
>> > +
>> > + return false;
>> > +}
>> > +
>> > +
>> > static bool fair_push_task(struct rq *rq, struct task_struct *p)
>> > {
>> > + if (!task_on_rq_queued(p))
>> > + return false;
>>
>> Task is queued on rq.
>> > +
>> > + if (p->se.sched_delayed)
>> > + return false;
>> > +
>> > + if (p->nr_cpus_allowed == 1)
>> > + return false;
>> > +
>> > + if (sched_energy_push_task(p, rq))
>> > + return true;
>>
>> If task is stuck on CPU, it could not be on rq. Weird.
>
> Maybe it comes from my description and I should use task_stuck_on_rq.
> By stuck, I mean that the task doesn't have any opportunity to migrate
> to another cpu/rq and stays "forever" (at least until its next sleep) on
> this cpu/rq because load balancing is disabled/bypassed with EAS.
> Here stuck does not mean blocked/sleeping.
>
Given the task is queued on the rq, I find the correct phrase, stacked, in
the cover letter instead of stuck, and long-standing task stacking means
the load balancer fails to cure that stacking. 1/7 fixes that failure, no?
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-03 9:00 ` Hillf Danton
@ 2025-12-03 13:32 ` Vincent Guittot
2025-12-04 6:59 ` Hillf Danton
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-12-03 13:32 UTC (permalink / raw)
To: Hillf Danton
Cc: peterz, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle, luis.machado
On Wed, 3 Dec 2025 at 10:00, Hillf Danton <hdanton@sina.com> wrote:
>
> On Tue, 2 Dec 2025 14:01:39 +0100 Vincent Guittot wrote:
> >On Tue, 2 Dec 2025 at 10:45, Hillf Danton <hdanton@sina.com> wrote:
> >> On Mon, 1 Dec 2025 10:13:08 +0100 Vincent Guittot wrote:
> >> > EAS is based on wakeup events to efficiently place tasks on the system, but
> >> > there are cases where a task doesn't have wakeup events anymore, or has them
> >> > at far too low a pace. For such cases, we check if it's worth pushing the
> >> > task to another CPU instead of putting it back in the enqueued list.
> >> >
> >> > Wake up events remain the main way to migrate tasks but we now detect
> >> > situations where a task is stuck on a CPU by checking that its utilization
> >> > is larger than the max available compute capacity (max cpu capacity or
> >> > uclamp max setting).
> >> >
> >> > When the system becomes overutilized and some CPUs are idle, we try to
> >> > push tasks instead of waiting for the periodic load balance.
> >> >
> >> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> > ---
> >> > kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++
> >> > kernel/sched/topology.c | 3 ++
> >> > 2 files changed, 68 insertions(+)
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 9af8d0a61856..e9e1d0c05805 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> > }
> >> >
> >> > static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> >> > +
> >> > /*
> >> > * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> >> > * failing half-way through and resume the dequeue later.
> >> > @@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> >> > return static_branch_unlikely(&sched_push_task);
> >> > }
> >> >
> >> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> >> > +{
> >> > + unsigned long max_capa, util;
> >> > +
> >> > + max_capa = min(get_actual_cpu_capacity(cpu),
> >> > + uclamp_eff_value(p, UCLAMP_MAX));
> >> > + util = max(task_util_est(p), task_runnable(p));
> >> > +
> >> > + /*
> >> > + * Return true only if the task might not sleep/wakeup because of a low
> >> > + * compute capacity. Tasks, which wake up regularly, will be handled by
> >> > + * feec().
> >> > + */
> >> > + return (util > max_capa);
> >> > +}
> >> > +
> >> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> >> > +{
> >> > + if (!sched_energy_enabled())
> >> > + return false;
> >> > +
> >> > + if (is_rd_overutilized(rq->rd))
> >> > + return false;
> >> > +
> >> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
> >> > + return true;
> >> > +
> >> > + if (!task_fits_cpu(p, cpu_of(rq)))
> >> > + return true;
> >> > +
> >> > + return false;
> >> > +}
> >> > +
> >> > +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> >> > +{
> >> > + if (rq->nr_running == 1)
> >> > + return false;
> >> > +
> >> > + if (!is_rd_overutilized(rq->rd))
> >> > + return false;
> >> > +
> >> > + /* If there are idle cpus in the llc then try to push the task on it */
> >> > + if (test_idle_cores(cpu_of(rq)))
> >> > + return true;
> >> > +
> >> > + return false;
> >> > +}
> >> > +
> >> > +
> >> > static bool fair_push_task(struct rq *rq, struct task_struct *p)
> >> > {
> >> > + if (!task_on_rq_queued(p))
> >> > + return false;
> >>
> >> Task is queued on rq.
> >> > +
> >> > + if (p->se.sched_delayed)
> >> > + return false;
> >> > +
> >> > + if (p->nr_cpus_allowed == 1)
> >> > + return false;
> >> > +
> >> > + if (sched_energy_push_task(p, rq))
> >> > + return true;
> >>
> >> If task is stuck on CPU, it could not be on rq. Weird.
> >
> > Maybe it comes from my description and I should use task_stuck_on_rq.
> > By stuck, I mean that the task doesn't have any opportunity to migrate
> > to another cpu/rq and stays "forever" (at least until its next sleep) on
> > this cpu/rq because load balancing is disabled/bypassed with EAS.
> > Here stuck does not mean blocked/sleeping.
> >
> Given the task is queued on the rq, I find the correct phrase, stacked, in
> the cover letter instead of stuck, and long-standing task stacking means
> the load balancer fails to cure that stacking. 1/7 fixes that failure, no?
It's not just stacking, because we sometimes/often want to stack tasks
on the same CPU. EAS is based on the assumption that tasks will sleep
and wake up regularly and that EAS will select a new CPU at each wakeup,
but that's not always true. We can have situations where task A has been
put on CPU0 when waking up, sharing the CPU with other tasks. But after
some time, task A would now be better placed on CPU1, not because it no
longer fits on CPU0 but just because the system state has changed since
its wakeup. Because task A shares CPU0 with other tasks, it can take
dozens/hundreds of ms to finish its work and go to sleep, and we don't
want to wait those hundreds of ms when CPU1 might be a better choice
now.
Patch 1 fixes a case where a CPU was wrongly classified as overloaded
when it is not (because of uclamp max, as an example).
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-03 13:32 ` Vincent Guittot
@ 2025-12-04 6:59 ` Hillf Danton
2025-12-05 15:02 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Hillf Danton @ 2025-12-04 6:59 UTC (permalink / raw)
To: Vincent Guittot
Cc: peterz, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle, luis.machado
On Wed, 3 Dec 2025 14:32:06 +0100 Vincent Guittot wrote:
> On Wed, 3 Dec 2025 at 10:00, Hillf Danton <hdanton@sina.com> wrote:
> > Given the task is queued on the rq, I find the correct phrase, stacked, in
> > the cover letter instead of stuck, and long-standing task stacking means
> > the load balancer fails to cure that stacking. 1/7 fixes that failure, no?
>
> It's not just stacking, because we sometimes/often want to stack tasks
> on the same CPU. EAS is based on the assumption that tasks will sleep
> and wake up regularly and that EAS will select a new CPU at each wakeup,
> but that's not always true. We can have situations where task A has been
> put on CPU0 when waking up, sharing the CPU with other tasks. But after
> some time, task A would now be better placed on CPU1, not because it no
> longer fits on CPU0 but just because the system state has changed since
> its wakeup. Because task A shares CPU0 with other tasks, it can take
> dozens/hundreds of ms to finish its work and go to sleep, and we don't
> want to wait those hundreds of ms when CPU1 might be a better choice
> now.
>
Even if a task is pushed from an ARM little core to a big one, the net
result could be zero, either because the number of stacked tasks on the
dst CPU increases or, more importantly, because the dst CPU's cycles are
shared at the pace of the tick. In general, if stacking is not mitigated
but merely migrated from one CPU to another, pushing cannot make much
difference.
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-04 6:59 ` Hillf Danton
@ 2025-12-05 15:02 ` Vincent Guittot
2025-12-06 10:31 ` Hillf Danton
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2025-12-05 15:02 UTC (permalink / raw)
To: Hillf Danton
Cc: peterz, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle, luis.machado
On Thu, 4 Dec 2025 at 07:59, Hillf Danton <hdanton@sina.com> wrote:
>
> On Wed, 3 Dec 2025 14:32:06 +0100 Vincent Guittot wrote:
> > On Wed, 3 Dec 2025 at 10:00, Hillf Danton <hdanton@sina.com> wrote:
> > > Given the task is queued on the rq, I find the correct phrase, stacked, in
> > > the cover letter instead of stuck, and long-standing task stacking means
> > > the load balancer fails to cure that stacking. 1/7 fixes that failure, no?
> >
> > It's not just stacking, because we sometimes/often want to stack tasks
> > on the same CPU. EAS is based on the assumption that tasks will sleep
> > and wake up regularly and that EAS will select a new CPU at each wakeup,
> > but that's not always true. We can have situations where task A has been
> > put on CPU0 when waking up, sharing the CPU with other tasks. But after
> > some time, task A would now be better placed on CPU1, not because it no
> > longer fits on CPU0 but just because the system state has changed since
> > its wakeup. Because task A shares CPU0 with other tasks, it can take
> > dozens/hundreds of ms to finish its work and go to sleep, and we don't
> > want to wait those hundreds of ms when CPU1 might be a better choice
> > now.
> >
> Even if a task is pushed from an ARM little core to a big one, the net
> result could be zero, either because the number of stacked tasks on the
> dst CPU increases or, more importantly, because the dst CPU's cycles are
> shared at the pace of the tick. In general, if stacking is not mitigated
> but merely migrated from one CPU to another, pushing cannot make much
> difference.
If select_task_rq/feec returns a new CPU, it means that it will make a
difference in the consumed energy or the available capacity for the
task. And when overutilized, it looks for an idle CPU.
* Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger
2025-12-05 15:02 ` Vincent Guittot
@ 2025-12-06 10:31 ` Hillf Danton
0 siblings, 0 replies; 23+ messages in thread
From: Hillf Danton @ 2025-12-06 10:31 UTC (permalink / raw)
To: Vincent Guittot
Cc: peterz, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle, luis.machado
On Fri, 5 Dec 2025 16:02:27 +0100 Vincent Guittot wrote:
> On Thu, 4 Dec 2025 at 07:59, Hillf Danton <hdanton@sina.com> wrote:
> > On Wed, 3 Dec 2025 14:32:06 +0100 Vincent Guittot wrote:
> > > On Wed, 3 Dec 2025 at 10:00, Hillf Danton <hdanton@sina.com> wrote:
> > > > Given the task is queued on the rq, I find the correct phrase, stacked, in
> > > > the cover letter instead of stuck, and long-standing task stacking means
> > > > the load balancer fails to cure that stacking. 1/7 fixes that failure, no?
> > >
> > > It's not just stacking, because we sometimes/often want to stack tasks
> > > on the same CPU. EAS is based on the assumption that tasks will sleep
> > > and wake up regularly and that EAS will select a new CPU at each wakeup,
> > > but that's not always true. We can have situations where task A has been
> > > put on CPU0 when waking up, sharing the CPU with other tasks. But after
> > > some time, task A would now be better placed on CPU1, not because it no
> > > longer fits on CPU0 but just because the system state has changed since
> > > its wakeup. Because task A shares CPU0 with other tasks, it can take
> > > dozens/hundreds of ms to finish its work and go to sleep, and we don't
> > > want to wait those hundreds of ms when CPU1 might be a better choice
> > > now.
> > >
> > Even if a task is pushed from an ARM little core to a big one, the net
> > result could be zero, either because the number of stacked tasks on the
> > dst CPU increases or, more importantly, because the dst CPU's cycles are
> > shared at the pace of the tick. In general, if stacking is not mitigated
> > but merely migrated from one CPU to another, pushing cannot make much
> > difference.
>
> If select_task_rq/feec returns a new CPU, it means that it will make a
> difference in the consumed energy or the available capacity for the
> task. And when overutilized, it looks for an idle CPU.
>
Yeah, given the correct CPU from select_task_rq/feec, in the case of
stacked tasks what push does is blindly search for the idlest CPU.
On the contrary, when a task sleeps, what pull does is correctly search
for the busiest CPU. By correctly I mean it is the right time to migrate
the task.
Thread overview: 23+ messages
2025-12-01 9:13 [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases Vincent Guittot
2025-12-01 9:13 ` [PATCH 1/6 v7] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
2025-12-01 9:13 ` [PATCH 2/6 v7] sched/fair: Update overutilized detection Vincent Guittot
2025-12-01 9:13 ` [PATCH 3/6 v7] sched/fair: Prepare select_task_rq_fair() to be called for new cases Vincent Guittot
2025-12-01 9:13 ` [PATCH 4/6 v7] sched/fair: Add push task mechanism for fair Vincent Guittot
2025-12-01 9:13 ` [RFC PATCH 5/6 v7] sched/fair: Enable idle core tracking for !SMT Vincent Guittot
2025-12-01 9:13 ` [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
2025-12-01 13:53 ` Christian Loehle
2025-12-01 17:49 ` Vincent Guittot
2025-12-01 19:33 ` Vincent Guittot
2025-12-02 9:44 ` Hillf Danton
2025-12-02 13:01 ` Vincent Guittot
2025-12-03 9:00 ` Hillf Danton
2025-12-03 13:32 ` Vincent Guittot
2025-12-04 6:59 ` Hillf Danton
2025-12-05 15:02 ` Vincent Guittot
2025-12-06 10:31 ` Hillf Danton
2025-12-01 13:31 ` [PATCH 0/6 v7] sched/fair: Add push task mecansim and hadle more EAS cases Christian Loehle
2025-12-01 13:57 ` Christian Loehle
2025-12-01 17:48 ` Vincent Guittot
2025-12-01 17:48 ` Vincent Guittot
2025-12-01 22:02 ` David Laight
2025-12-02 13:24 ` Vincent Guittot