* [PATCH 1/6 v8] sched/fair: Filter false overloaded_group case for EAS
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
@ 2025-12-02 18:12 ` Vincent Guittot
2025-12-02 18:12 ` [PATCH 2/6 v8] sched/fair: Update overutilized detection Vincent Guittot
` (8 subsequent siblings)
9 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-02 18:12 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
With EAS, a group should be set overloaded only if at least 1 CPU in the
group is overutilized, but it can happen that a CPU is fully utilized by
tasks because uclamp caps the compute capacity of the CPU. In such a case,
the CPU is not overutilized and, as a result, should not be set overloaded
either. group_overloaded having a higher priority than group_misfit, such a
group can be selected as the busiest group instead of a group with a misfit
task, which prevents load_balance from selecting the CPU with the misfit
task to pull the latter onto a fitting CPU.
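For reference, update_sd_pick_busiest() compares candidate groups by
group_type, where a higher value wins. A rough sketch of the enum in
kernel/sched/fair.c (exact members vary by kernel version):

	enum group_type {
		group_has_spare = 0,
		group_fully_busy,
		group_misfit_task,	/* a task doesn't fit the CPU's capacity */
		group_smt_balance,
		group_asym_packing,
		group_imbalanced,
		group_overloaded,	/* wins over all of the above */
	};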
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
---
kernel/sched/fair.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1855975b8248..b10f04715251 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9987,6 +9987,7 @@ struct sg_lb_stats {
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned int group_smt_balance; /* Task on busy SMT be moved */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned int group_overutilized; /* At least one CPU is overutilized in the group */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -10219,6 +10220,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
static inline bool
group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
{
+ /*
+ * With EAS and uclamp, at least 1 CPU in the group must be overutilized
+ * to consider the group overloaded.
+ */
+ if (sched_energy_enabled() && !sgs->group_overutilized)
+ return false;
+
if (sgs->sum_nr_running <= sgs->group_weight)
return false;
@@ -10402,14 +10410,12 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
* @group: sched_group whose statistics are to be updated.
* @sgs: variable to hold the statistics for this group.
* @sg_overloaded: sched_group is overloaded
- * @sg_overutilized: sched_group is overutilized
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *group,
struct sg_lb_stats *sgs,
- bool *sg_overloaded,
- bool *sg_overutilized)
+ bool *sg_overloaded)
{
int i, nr_running, local_group, sd_flags = env->sd->flags;
bool balancing_at_rd = !env->sd->parent;
@@ -10431,7 +10437,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_nr_running += nr_running;
if (cpu_overutilized(i))
- *sg_overutilized = 1;
+ sgs->group_overutilized = 1;
/*
* No need to call idle_cpu() if nr_running is not 0
@@ -11103,13 +11109,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_capacity(env->sd, env->dst_cpu);
}
- update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
+ update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded);
if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
}
+ sg_overutilized |= sgs->group_overutilized;
+
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
--
2.43.0
* [PATCH 2/6 v8] sched/fair: Update overutilized detection
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
2025-12-02 18:12 ` [PATCH 1/6 v8] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2025-12-02 18:12 ` Vincent Guittot
2026-02-06 17:42 ` Qais Yousef
2025-12-02 18:12 ` [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases Vincent Guittot
` (7 subsequent siblings)
9 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2025-12-02 18:12 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
Checking uclamp_min is useless and counterproductive for the overutilized
state, as a misfit can now happen without being in the overutilized state.
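To illustrate with made-up numbers: a task with a utilization of 100 but a
uclamp_min of 900, enqueued on a CPU with a capacity of 512, makes
util_fits_cpu() fail only because of the min clamp. That is a misfit
situation, which no longer implies the overutilized state, so only the real
utilization and uclamp_max are checked below.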
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b10f04715251..f430ec890b72 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6785,16 +6785,15 @@ static inline void hrtick_update(struct rq *rq)
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min, rq_util_max;
+ unsigned long rq_util_max;
if (!sched_energy_enabled())
return false;
- rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
/* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu);
}
/*
--
2.43.0
* Re: [PATCH 2/6 v8] sched/fair: Update overutilized detection
2025-12-02 18:12 ` [PATCH 2/6 v8] sched/fair: Update overutilized detection Vincent Guittot
@ 2026-02-06 17:42 ` Qais Yousef
0 siblings, 0 replies; 47+ messages in thread
From: Qais Yousef @ 2026-02-06 17:42 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On 12/02/25 19:12, Vincent Guittot wrote:
> Checking uclamp_min is useless and counterproductive for overutilized state
> as misfit can now happen without being in overutilized state
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
> ---
> kernel/sched/fair.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b10f04715251..f430ec890b72 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6785,16 +6785,15 @@ static inline void hrtick_update(struct rq *rq)
>
> static inline bool cpu_overutilized(int cpu)
> {
> - unsigned long rq_util_min, rq_util_max;
> + unsigned long rq_util_max;
>
> if (!sched_energy_enabled())
> return false;
>
> - rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
>
> /* Return true only if the utilization doesn't fit CPU's capacity */
> - return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> + return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu);
> }
>
> /*
> --
> 2.43.0
>
* [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
2025-12-02 18:12 ` [PATCH 1/6 v8] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
2025-12-02 18:12 ` [PATCH 2/6 v8] sched/fair: Update overutilized detection Vincent Guittot
@ 2025-12-02 18:12 ` Vincent Guittot
2025-12-07 13:23 ` Shrikanth Hegde
2026-02-06 18:03 ` Qais Yousef
2025-12-02 18:12 ` [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair Vincent Guittot
` (6 subsequent siblings)
9 siblings, 2 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-02 18:12 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
Update select_task_rq_fair() so that it can be called outside of the 3
current cases, which are:
- wake up
- exec
- fork
We want to select an rq in some new cases, like pushing a runnable task on a
better CPU than the local one. In such a case, it's not a wakeup, nor an
exec, nor a fork. We make sure to not distrub these cases but still
go through EAS and the fast path.
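For illustration, the push callback added later in this series calls it
like (sketch):

	/* not a wakeup, nor an exec, nor a fork: wake_flags == 0 */
	new_cpu = select_task_rq_fair(p, task_cpu(p), 0);

With wake_flags == 0, none of WF_TTWU, WF_EXEC or WF_FORK is set:
want_sibling is true, so EAS and the select_idle_sibling() fast path are
still used, while the WF_TTWU-only wakeup heuristics (sync, want_affine)
are skipped.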
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f430ec890b72..80c4131fb35b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8518,6 +8518,7 @@ static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ int want_sibling = !(wake_flags & (WF_EXEC | WF_FORK));
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
@@ -8535,16 +8536,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if ((wake_flags & WF_CURRENT_CPU) &&
cpumask_test_cpu(cpu, p->cpus_ptr))
return cpu;
+ }
- if (!is_rd_overutilized(this_rq()->rd)) {
- new_cpu = find_energy_efficient_cpu(p, prev_cpu);
- if (new_cpu >= 0)
- return new_cpu;
- new_cpu = prev_cpu;
- }
+ /*
+ * We don't want EAS to be called for exec or fork but it should be
+ * called for any other case such as wake up or push callback.
+ */
+ if (!is_rd_overutilized(this_rq()->rd) && want_sibling) {
+ new_cpu = find_energy_efficient_cpu(p, prev_cpu);
+ if (new_cpu >= 0)
+ return new_cpu;
+ new_cpu = prev_cpu;
+ }
+ if (wake_flags & WF_TTWU)
want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
- }
rcu_read_lock();
for_each_domain(cpu, tmp) {
@@ -8575,7 +8581,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (unlikely(sd)) {
/* Slow path */
new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
- } else if (wake_flags & WF_TTWU) { /* XXX always ? */
+ } else if (want_sibling) {
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
}
--
2.43.0
* Re: [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases
2025-12-02 18:12 ` [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases Vincent Guittot
@ 2025-12-07 13:23 ` Shrikanth Hegde
2026-02-09 13:21 ` Vincent Guittot
2026-02-06 18:03 ` Qais Yousef
1 sibling, 1 reply; 47+ messages in thread
From: Shrikanth Hegde @ 2025-12-07 13:23 UTC (permalink / raw)
To: Vincent Guittot
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, mingo,
peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak
On 12/2/25 11:42 PM, Vincent Guittot wrote:
> Update select_task_rq_fair() to be called out of the 3 current cases which
> are :
> - wake up
> - exec
> - fork
>
> We wants to select a rq in some new cases like pushing a runnable task on a
> better CPU than the local one. In such case, it's not a wakeup , nor an
> exec nor a fork. We make sure to not distrub these cases but still
nit: s/distrub/disturb
> go through EAS and fast-path.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f430ec890b72..80c4131fb35b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8518,6 +8518,7 @@ static int
> select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> {
> int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> + int want_sibling = !(wake_flags & (WF_EXEC | WF_FORK));
> struct sched_domain *tmp, *sd = NULL;
> int cpu = smp_processor_id();
> int new_cpu = prev_cpu;
> @@ -8535,16 +8536,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> if ((wake_flags & WF_CURRENT_CPU) &&
> cpumask_test_cpu(cpu, p->cpus_ptr))
> return cpu;
> + }
>
> - if (!is_rd_overutilized(this_rq()->rd)) {
> - new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> - if (new_cpu >= 0)
> - return new_cpu;
> - new_cpu = prev_cpu;
> - }
> + /*
> + * We don't want EAS to be called for exec or fork but it should be
> + * called for any other case such as wake up or push callback.
> + */
> + if (!is_rd_overutilized(this_rq()->rd) && want_sibling) {
> + new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> + if (new_cpu >= 0)
> + return new_cpu;
> + new_cpu = prev_cpu;
> + }
>
> + if (wake_flags & WF_TTWU)
> want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> - }
>
> rcu_read_lock();
> for_each_domain(cpu, tmp) {
> @@ -8575,7 +8581,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> if (unlikely(sd)) {
> /* Slow path */
> new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
> - } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> + } else if (want_sibling) {
It is going to find an idle core within the LLC first, then an idle
sibling, right? So it may need a better name than want_sibling.
> /* Fast path */
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> }
* Re: [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases
2025-12-07 13:23 ` Shrikanth Hegde
@ 2026-02-09 13:21 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-02-09 13:21 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, mingo,
peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak
On Sun, 7 Dec 2025 at 14:23, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
>
>
> On 12/2/25 11:42 PM, Vincent Guittot wrote:
> > Update select_task_rq_fair() to be called out of the 3 current cases which
> > are :
> > - wake up
> > - exec
> > - fork
> >
> > We wants to select a rq in some new cases like pushing a runnable task on a
> > better CPU than the local one. In such case, it's not a wakeup , nor an
> > exec nor a fork. We make sure to not distrub these cases but still
>
> nit: s/distrub/disturb
>
> > go through EAS and fast-path.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 22 ++++++++++++++--------
> > 1 file changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f430ec890b72..80c4131fb35b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8518,6 +8518,7 @@ static int
> > select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > {
> > int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> > + int want_sibling = !(wake_flags & (WF_EXEC | WF_FORK));
> > struct sched_domain *tmp, *sd = NULL;
> > int cpu = smp_processor_id();
> > int new_cpu = prev_cpu;
> > @@ -8535,16 +8536,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > if ((wake_flags & WF_CURRENT_CPU) &&
> > cpumask_test_cpu(cpu, p->cpus_ptr))
> > return cpu;
> > + }
> >
> > - if (!is_rd_overutilized(this_rq()->rd)) {
> > - new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > - if (new_cpu >= 0)
> > - return new_cpu;
> > - new_cpu = prev_cpu;
> > - }
> > + /*
> > + * We don't want EAS to be called for exec or fork but it should be
> > + * called for any other case such as wake up or push callback.
> > + */
> > + if (!is_rd_overutilized(this_rq()->rd) && want_sibling) {
> > + new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > + if (new_cpu >= 0)
> > + return new_cpu;
> > + new_cpu = prev_cpu;
> > + }
> >
> > + if (wake_flags & WF_TTWU)
> > want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> > - }
> >
> > rcu_read_lock();
> > for_each_domain(cpu, tmp) {
> > @@ -8575,7 +8581,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > if (unlikely(sd)) {
> > /* Slow path */
> > new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
> > - } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> > + } else if (want_sibling) {
>
> It is going to find a idle core withing LLC first. then idle sibling. right?
> So may need a better name than want_sibling.
it was a shortcut for wanting select_idle_sibling() vs the larger search
space of sched_balance_find_dst_cpu(), but I will try to find a better
name
>
> > /* Fast path */
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > }
>
* Re: [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases
2025-12-02 18:12 ` [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases Vincent Guittot
2025-12-07 13:23 ` Shrikanth Hegde
@ 2026-02-06 18:03 ` Qais Yousef
2026-02-09 13:21 ` Vincent Guittot
1 sibling, 1 reply; 47+ messages in thread
From: Qais Yousef @ 2026-02-06 18:03 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On 12/02/25 19:12, Vincent Guittot wrote:
> Update select_task_rq_fair() to be called out of the 3 current cases which
> are :
> - wake up
> - exec
> - fork
>
> We wants to select a rq in some new cases like pushing a runnable task on a
> better CPU than the local one. In such case, it's not a wakeup , nor an
> exec nor a fork. We make sure to not distrub these cases but still
> go through EAS and fast-path.
I'd add that we have a fallback mechanism for when moving between cpusets
causes us to pick a random cpu. We have been carrying an out-of-tree hack in
Android for a while to make this use the wake up path. Especially on an HMP
system, a random cpu could mean a bad placement decision, as not all cores
are equal. And it seems the server market is catching up with quirky caching
systems.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
> ---
> kernel/sched/fair.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f430ec890b72..80c4131fb35b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8518,6 +8518,7 @@ static int
> select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> {
> int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> + int want_sibling = !(wake_flags & (WF_EXEC | WF_FORK));
> struct sched_domain *tmp, *sd = NULL;
> int cpu = smp_processor_id();
> int new_cpu = prev_cpu;
> @@ -8535,16 +8536,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> if ((wake_flags & WF_CURRENT_CPU) &&
> cpumask_test_cpu(cpu, p->cpus_ptr))
> return cpu;
> + }
>
> - if (!is_rd_overutilized(this_rq()->rd)) {
> - new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> - if (new_cpu >= 0)
> - return new_cpu;
> - new_cpu = prev_cpu;
> - }
> + /*
> + * We don't want EAS to be called for exec or fork but it should be
> + * called for any other case such as wake up or push callback.
> + */
> + if (!is_rd_overutilized(this_rq()->rd) && want_sibling) {
> + new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> + if (new_cpu >= 0)
> + return new_cpu;
> + new_cpu = prev_cpu;
> + }
>
> + if (wake_flags & WF_TTWU)
> want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> - }
>
> rcu_read_lock();
> for_each_domain(cpu, tmp) {
> @@ -8575,7 +8581,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> if (unlikely(sd)) {
> /* Slow path */
> new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
> - } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> + } else if (want_sibling) {
> /* Fast path */
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> }
> --
> 2.43.0
>
* Re: [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases
2026-02-06 18:03 ` Qais Yousef
@ 2026-02-09 13:21 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-02-09 13:21 UTC (permalink / raw)
To: Qais Yousef
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On Fri, 6 Feb 2026 at 19:03, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/02/25 19:12, Vincent Guittot wrote:
> > Update select_task_rq_fair() to be called out of the 3 current cases which
> > are :
> > - wake up
> > - exec
> > - fork
> >
> > We wants to select a rq in some new cases like pushing a runnable task on a
> > better CPU than the local one. In such case, it's not a wakeup , nor an
> > exec nor a fork. We make sure to not distrub these cases but still
> > go through EAS and fast-path.
>
> I'd add we have a fallback mechanism when moving between cpusets causes to pick
> a random cpu. We have been carrying out of tree hack in Android for a while to
> make this use the wake up path. Especially on HMP system, a random cpu could
> mean bad placement decision as not all cores are equal. And it seems server
> market is catching up with quirky caching systems.
I will have a look at this case
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Reviewed-by: Qais Yousef <qyousef@layalina.io>
>
> > ---
> > kernel/sched/fair.c | 22 ++++++++++++++--------
> > 1 file changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f430ec890b72..80c4131fb35b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8518,6 +8518,7 @@ static int
> > select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > {
> > int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> > + int want_sibling = !(wake_flags & (WF_EXEC | WF_FORK));
> > struct sched_domain *tmp, *sd = NULL;
> > int cpu = smp_processor_id();
> > int new_cpu = prev_cpu;
> > @@ -8535,16 +8536,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > if ((wake_flags & WF_CURRENT_CPU) &&
> > cpumask_test_cpu(cpu, p->cpus_ptr))
> > return cpu;
> > + }
> >
> > - if (!is_rd_overutilized(this_rq()->rd)) {
> > - new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > - if (new_cpu >= 0)
> > - return new_cpu;
> > - new_cpu = prev_cpu;
> > - }
> > + /*
> > + * We don't want EAS to be called for exec or fork but it should be
> > + * called for any other case such as wake up or push callback.
> > + */
> > + if (!is_rd_overutilized(this_rq()->rd) && want_sibling) {
> > + new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > + if (new_cpu >= 0)
> > + return new_cpu;
> > + new_cpu = prev_cpu;
> > + }
> >
> > + if (wake_flags & WF_TTWU)
> > want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> > - }
> >
> > rcu_read_lock();
> > for_each_domain(cpu, tmp) {
> > @@ -8575,7 +8581,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > if (unlikely(sd)) {
> > /* Slow path */
> > new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
> > - } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> > + } else if (want_sibling) {
> > /* Fast path */
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > }
> > --
> > 2.43.0
> >
* [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (2 preceding siblings ...)
2025-12-02 18:12 ` [PATCH 3/6 v8] sched/fair: Prepare select_task_rq_fair() to be called for new cases Vincent Guittot
@ 2025-12-02 18:12 ` Vincent Guittot
2025-12-04 10:46 ` Peter Zijlstra
` (4 more replies)
2025-12-02 18:12 ` [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT Vincent Guittot
` (5 subsequent siblings)
9 siblings, 5 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-02 18:12 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
EAS is based on wakeup events to efficiently place tasks on the system, but
there are cases where a task doesn't have wakeup events anymore, or has them
at far too low a pace. For such situations, we can take advantage of the
task being put back in the enqueued list to check whether it should be
pushed on another CPU.
When the task is alone on the CPU, it's never put back in the enqueued
list; in this special case, we use the tick to run the check.
Add a push task mechanism that enables the fair scheduler to push runnable
tasks. EAS will be one user, but other features like filling idle CPUs
can also take advantage of it.
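In short, the flow added below looks like this (sketch):

	put_prev_task_fair() / pick_next_task_fair()
	    fair_add_pushable_task(rq, prev)      /* prev runnable, not running */
	__set_next_task_fair()
	    fair_queue_pushable_tasks(rq)         /* queue the balance callback */
	push_fair_tasks()                         /* balance callback */
	    push_fair_task()
	        select_task_rq_fair(p, cpu, 0)    /* look for a better CPU */
	        deactivate/set_task_cpu/activate  /* migrate if one is found */
	task_tick_fair()
	    check_pushable_task()                 /* lone running task case */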
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 212 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 4 +
2 files changed, 214 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 80c4131fb35b..252254168c92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6989,6 +6989,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
+
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -7017,6 +7019,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
h_nr_idle = task_has_idle_policy(p);
if (task_sleep || task_delayed || !se->sched_delayed)
h_nr_runnable = 1;
+
+ fair_remove_pushable_task(rq, p);
}
for_each_sched_entity(se) {
@@ -8504,6 +8508,187 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
return target;
}
+DEFINE_STATIC_KEY_FALSE(sched_push_task);
+
+static inline bool sched_push_task_enabled(void)
+{
+ return static_branch_unlikely(&sched_push_task);
+}
+
+static bool fair_push_task(struct rq *rq, struct task_struct *p)
+{
+ return false;
+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+ return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->cfs.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+
+ /*
+ * Remove the task from the pushable list as we only try to push it
+ * once after it has been put back in the enqueued list.
+ */
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+ return p;
+}
+
+static int
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags);
+
+/*
+ * See if the non running fair tasks on this rq can be sent on other CPUs
+ * that fit better with their profile.
+ */
+static bool push_fair_task(struct rq *rq)
+{
+ struct task_struct *next_task;
+ int prev_cpu, new_cpu;
+ struct rq *new_rq;
+
+ next_task = pick_next_pushable_fair_task(rq);
+ if (!next_task)
+ return false;
+
+ if (is_migration_disabled(next_task))
+ return true;
+
+ /* We might release rq lock */
+ get_task_struct(next_task);
+
+ prev_cpu = rq->cpu;
+
+ new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
+
+ if (new_cpu == prev_cpu)
+ goto out;
+
+ new_rq = cpu_rq(new_cpu);
+
+ if (double_lock_balance(rq, new_rq)) {
+ /* The task has already migrated in between */
+ if (task_cpu(next_task) != rq->cpu) {
+ double_unlock_balance(rq, new_rq);
+ goto out;
+ }
+
+ deactivate_task(rq, next_task, 0);
+ set_task_cpu(next_task, new_cpu);
+ activate_task(new_rq, next_task, 0);
+
+ resched_curr(new_rq);
+
+ double_unlock_balance(rq, new_rq);
+ }
+
+out:
+ put_task_struct(next_task);
+
+ return true;
+}
+
+static void push_fair_tasks(struct rq *rq)
+{
+ /* push_fair_task() will return true if it moved a fair task */
+ while (push_fair_task(rq))
+ ;
+}
+
+static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
+
+static inline void fair_queue_pushable_tasks(struct rq *rq)
+{
+ if (!sched_push_task_enabled() || !has_pushable_tasks(rq))
+ return;
+
+ queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
+}
+
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ if (sched_push_task_enabled())
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+}
+
+static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ if (sched_push_task_enabled() && fair_push_task(rq, p)) {
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+ plist_node_init(&p->pushable_tasks, p->prio);
+ plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+ }
+}
+
+static int active_load_balance_cpu_stop(void *data);
+
+/*
+ * See if the lone task running on the CPU should migrate to a better CPU
+ * than the local one.
+ */
+static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
+{
+ int new_cpu, cpu = cpu_of(rq);
+
+ if (!sched_push_task_enabled())
+ return false;
+
+ if (WARN_ON(!p))
+ return false;
+
+ if (WARN_ON(!task_current(rq, p)))
+ return false;
+
+ if (is_migration_disabled(p))
+ return false;
+
+ /* If there are several tasks, wait for the task to be put back */
+ if (rq->nr_running > 1)
+ return false;
+
+ if (!fair_push_task(rq, p))
+ return false;
+
+ new_cpu = select_task_rq_fair(p, cpu, 0);
+
+ if (new_cpu == cpu)
+ return false;
+
+ /*
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
+ */
+ if (!rq->active_balance) {
+ rq->active_balance = 1;
+ rq->push_cpu = new_cpu;
+ } else
+ return false;
+
+ raw_spin_rq_unlock(rq);
+ stop_one_cpu_nowait(cpu,
+ active_load_balance_cpu_stop, rq,
+ &rq->active_balance_work);
+ raw_spin_rq_lock(rq);
+
+ return true;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8973,6 +9158,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
put_prev_entity(cfs_rq, pse);
set_next_entity(cfs_rq, se);
+ /*
+ * The previous task might be eligible for being pushed on
+ * another cpu if it is still active.
+ */
+ fair_add_pushable_task(rq, prev);
+
__set_next_task_fair(rq, p, true);
}
@@ -9036,6 +9227,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
+
+ /*
+ * The previous task might be eligible for being pushed on another cpu
+ * if it is still active.
+ */
+ fair_add_pushable_task(rq, prev);
+
}
/*
@@ -13390,8 +13588,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- update_misfit_status(curr, rq);
- check_update_overutilized_status(task_rq(curr));
+ if (!check_pushable_task(curr, rq)) {
+ update_misfit_status(curr, rq);
+ check_update_overutilized_status(task_rq(curr));
+ }
task_tick_core(rq, curr);
}
@@ -13552,6 +13752,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
{
struct sched_entity *se = &p->se;
+ fair_remove_pushable_task(rq, p);
+
if (task_on_rq_queued(p)) {
/*
* Move the next running task to the front of the list, so our
@@ -13567,6 +13769,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
if (hrtick_enabled_fair(rq))
hrtick_start_fair(rq, p);
+ /*
+ * Try to push prev task before checking misfit for next task as
+ * the migration of prev can make next fitting the CPU
+ */
+ fair_queue_pushable_tasks(rq);
update_misfit_status(p, rq);
sched_fair_update_stop_tick(rq, p);
}
@@ -13596,6 +13803,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
{
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
+ plist_head_init(&cfs_rq->pushable_tasks);
raw_spin_lock_init(&cfs_rq->removed.lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b419a4d98461..697bd654298a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -711,6 +711,8 @@ struct cfs_rq {
unsigned long runnable_avg;
} removed;
+ struct plist_head pushable_tasks;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u64 last_update_tg_load_avg;
unsigned long tg_load_avg_contrib;
@@ -3620,6 +3622,8 @@ static inline bool sched_energy_enabled(void) { return false; }
#endif /* !(CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
+DECLARE_STATIC_KEY_FALSE(sched_push_task);
+
#ifdef CONFIG_MEMBARRIER
/*
--
2.43.0
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-02 18:12 ` [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair Vincent Guittot
@ 2025-12-04 10:46 ` Peter Zijlstra
2025-12-04 14:32 ` Vincent Guittot
2025-12-04 11:29 ` Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-12-04 10:46 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On Tue, Dec 02, 2025 at 07:12:40PM +0100, Vincent Guittot wrote:
> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_push_task_enabled() && fair_push_task(rq, p)) {
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + plist_node_init(&p->pushable_tasks, p->prio);
> + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + }
> +}
I might have asked before, but if there was an answer, I have forgotten
:/
Why is this a prio-list? It seems to me that we don't particularly care
about keeping this push sorted by nice value, right?
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-04 10:46 ` Peter Zijlstra
@ 2025-12-04 14:32 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-04 14:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On Thu, 4 Dec 2025 at 11:46, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 02, 2025 at 07:12:40PM +0100, Vincent Guittot wrote:
>
>
> > +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_push_task_enabled() && fair_push_task(rq, p)) {
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + plist_node_init(&p->pushable_tasks, p->prio);
> > + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + }
> > +}
>
> I might have asked before, but if there was an answer, I have forgotten
> :/
>
> Why is this a prio-list? It seems to me that we don't particularly care
> about keeping this push sorted by nice value, right?
We re-use the same struct plist_node pushable_tasks field as RT in
task_struct, and we might add some ordering later based on slice as an
example, but other than that there is no other reason.
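(For reference, this is the existing field in include/linux/sched.h:

	struct plist_node		pushable_tasks;

which the RT push/pull machinery already uses, so fair can re-use it
without growing task_struct.)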
>
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-02 18:12 ` [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair Vincent Guittot
2025-12-04 10:46 ` Peter Zijlstra
@ 2025-12-04 11:29 ` Peter Zijlstra
2025-12-04 14:34 ` Vincent Guittot
2025-12-07 12:13 ` Shrikanth Hegde
` (2 subsequent siblings)
4 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-12-04 11:29 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On Tue, Dec 02, 2025 at 07:12:40PM +0100, Vincent Guittot wrote:
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
> + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> +
> + if (new_cpu == prev_cpu)
> + goto out;
> +
> + new_rq = cpu_rq(new_cpu);
> +
> + if (double_lock_balance(rq, new_rq)) {
> + /* The task has already migrated in between */
> + if (task_cpu(next_task) != rq->cpu) {
> + double_unlock_balance(rq, new_rq);
> + goto out;
> + }
> +
> + deactivate_task(rq, next_task, 0);
> + set_task_cpu(next_task, new_cpu);
> + activate_task(new_rq, next_task, 0);
> +
> + resched_curr(new_rq);
> +
> + double_unlock_balance(rq, new_rq);
> + }
Why not use move_queued_task()?
> +
> +out:
> + put_task_struct(next_task);
> +
> + return true;
> +}
> +
> +static void push_fair_tasks(struct rq *rq)
> +{
> + /* push_fair_task() will return true if it moved a fair task */
> + while (push_fair_task(rq))
> + ;
If we're going to be looping on that, why not also loop in
pick_next_pushable_task() like:
	plist_for_each_entry(p, &rq->cfs.pushable_tasks, pushable_tasks) {
		if (!is_migration_disabled(p)) {
			plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
			return p;
		}
	}
	return NULL;
Because as is, I think you'll fail the moment there's a
migrate_disable() task at the head of things.
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-04 11:29 ` Peter Zijlstra
@ 2025-12-04 14:34 ` Vincent Guittot
2025-12-05 8:59 ` Peter Zijlstra
0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2025-12-04 14:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On Thu, 4 Dec 2025 at 12:29, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 02, 2025 at 07:12:40PM +0100, Vincent Guittot wrote:
> > +/*
> > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > + * that fits better with their profile.
> > + */
> > +static bool push_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *next_task;
> > + int prev_cpu, new_cpu;
> > + struct rq *new_rq;
> > +
> > + next_task = pick_next_pushable_fair_task(rq);
> > + if (!next_task)
> > + return false;
> > +
> > + if (is_migration_disabled(next_task))
> > + return true;
> > +
> > + /* We might release rq lock */
> > + get_task_struct(next_task);
> > +
> > + prev_cpu = rq->cpu;
> > +
> > + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> > +
> > + if (new_cpu == prev_cpu)
> > + goto out;
> > +
> > + new_rq = cpu_rq(new_cpu);
> > +
> > + if (double_lock_balance(rq, new_rq)) {
> > + /* The task has already migrated in between */
> > + if (task_cpu(next_task) != rq->cpu) {
> > + double_unlock_balance(rq, new_rq);
> > + goto out;
> > + }
> > +
> > + deactivate_task(rq, next_task, 0);
> > + set_task_cpu(next_task, new_cpu);
> > + activate_task(new_rq, next_task, 0);
> > +
> > + resched_curr(new_rq);
> > +
> > + double_unlock_balance(rq, new_rq);
> > + }
>
> Why not use move_queued_task() ?
double_lock_balance() can fail, which prevents being blocked waiting for
the new rq, whereas move_queued_task() will wait, won't it?
Do you think move_queued_task() would be better?
>
>
> > +
> > +out:
> > + put_task_struct(next_task);
> > +
> > + return true;
> > +}
> > +
> > +static void push_fair_tasks(struct rq *rq)
> > +{
> > + /* push_fair_task() will return true if it moved a fair task */
> > + while (push_fair_task(rq))
> > + ;
>
> If we're going to be looping on that, why not also loop in
> pick_next_pushable_task() like:
>
> list_for_each_entity(p, &rq->cfs.pushable_tasks, pushable_tasks) {
> if (!is_migration_disabled(p)) {
> list_del(&p->pushable_tasks);
> return p;
> }
> }
> return NULL;
>
> Because as is, I think you'll fail the moment there's a
> migrate_disable() tasks at the head of things.
In the case of migrate_disable, push_fair_task() returns true and we
continue with the next task (there should not be many of them anyway). If
the task is migrate_disabled when we try to push it, we remove it from the
list anyway. For now, we try to not have more than 1 task in the list
to cap the overhead on sched_switch.
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-04 14:34 ` Vincent Guittot
@ 2025-12-05 8:59 ` Peter Zijlstra
2025-12-05 12:49 ` K Prateek Nayak
2025-12-05 13:26 ` Vincent Guittot
0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-12-05 8:59 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On Thu, Dec 04, 2025 at 03:34:15PM +0100, Vincent Guittot wrote:
> On Thu, 4 Dec 2025 at 12:29, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Dec 02, 2025 at 07:12:40PM +0100, Vincent Guittot wrote:
> > > +/*
> > > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > > + * that fits better with their profile.
> > > + */
> > > +static bool push_fair_task(struct rq *rq)
> > > +{
> > > + struct task_struct *next_task;
> > > + int prev_cpu, new_cpu;
> > > + struct rq *new_rq;
> > > +
> > > + next_task = pick_next_pushable_fair_task(rq);
> > > + if (!next_task)
> > > + return false;
> > > +
> > > + if (is_migration_disabled(next_task))
> > > + return true;
> > > +
> > > + /* We might release rq lock */
> > > + get_task_struct(next_task);
> > > +
> > > + prev_cpu = rq->cpu;
> > > +
> > > + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> > > +
> > > + if (new_cpu == prev_cpu)
> > > + goto out;
> > > +
> > > + new_rq = cpu_rq(new_cpu);
> > > +
> > > + if (double_lock_balance(rq, new_rq)) {
> > > + /* The task has already migrated in between */
> > > + if (task_cpu(next_task) != rq->cpu) {
> > > + double_unlock_balance(rq, new_rq);
> > > + goto out;
> > > + }
> > > +
> > > + deactivate_task(rq, next_task, 0);
> > > + set_task_cpu(next_task, new_cpu);
> > > + activate_task(new_rq, next_task, 0);
> > > +
> > > + resched_curr(new_rq);
> > > +
> > > + double_unlock_balance(rq, new_rq);
> > > + }
> >
> > Why not use move_queued_task() ?
>
> double_lock_balance() can fail and prevent being blocked waiting for
> new rq whereas move_queued_task() will wait, won't it ?
>
> Do you think move_queued_task() would be better ?
No, double_lock_balance() never fails, the return value indicates if the
currently held rq-lock, (the first argument) was unlocked while
attaining both -- this is required when the first rq is a higher address
than the second.
double_lock_balance() also puts the wait-time and hold time of the
second inside the hold time of the first, which gets you a quadratic term
in the rq hold times IIRC. Something that's best avoided.
move_queued_task() OTOH takes the task off the runqueue you already hold
locked, drops this lock, acquires the second, puts the task there, and
returns with the dst rq locked.
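For reference, a simplified sketch of move_queued_task() from
kernel/sched/core.c (details vary by kernel version):

	static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
					   struct task_struct *p, int new_cpu)
	{
		deactivate_task(rq, p, DEQUEUE_NOCLOCK);
		set_task_cpu(p, new_cpu);
		rq_unlock(rq, rf);

		rq = cpu_rq(new_cpu);

		rq_lock(rq, rf);
		WARN_ON_ONCE(task_cpu(p) != new_cpu);
		activate_task(rq, p, 0);
		wakeup_preempt(rq, p, 0);

		return rq;
	}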
> In case of migrate_disable, push_fair_task() returns true and we
> continue with the next task (It should not have much anyway). If the
> task is migrate_disabled when we try to push it, we remove it from the
> list anyway. At now, we try to not have more than 1 task in the list
> to cap the overhead on sched_switch
Right, clearly I needed more wake-up juice, I thought it returned false
and would stick around.
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-05 8:59 ` Peter Zijlstra
@ 2025-12-05 12:49 ` K Prateek Nayak
2025-12-05 12:56 ` Peter Zijlstra
2025-12-05 13:36 ` Vincent Guittot
2025-12-05 13:26 ` Vincent Guittot
1 sibling, 2 replies; 47+ messages in thread
From: K Prateek Nayak @ 2025-12-05 12:49 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, qyousef, hongyan.xia2,
christian.loehle, luis.machado
On 12/5/2025 2:29 PM, Peter Zijlstra wrote:
>>> Why not use move_queued_task() ?
>>
>> double_lock_balance() can fail and prevent being blocked waiting for
>> new rq whereas move_queued_task() will wait, won't it ?
>>
>> Do you think move_queued_task() would be better ?
>
> No, double_lock_balance() never fails, the return value indicates if the
> currently held rq-lock, (the first argument) was unlocked while
> attaining both -- this is required when the first rq is a higher address
> than the second.
>
> double_lock_balance() also puts the wait-time and hold time of the
> second inside the hold time of the first, which gets you a quadric term
> in the rq hold times IIRC. Something that's best avoided.
>
> move_queued_task() OTOH takes the task off the runqueue you already hold
> locked, drops this lock, acquires the second, puts the task there, and
> returns with the dst rq locked.
So I was experimenting with:
deactivate_task(rq, p, 0);
set_task_cpu(p, target_cpu);
__ttwu_queue_wakelist(p, target_cpu, 0);
and nothing has screamed at me yet during the benchmark runs.
Would this be any good instead of the whole lock juggling?
Since this CPU is found to be going overloaded, pushing via an
IPI vs taking the overhead ourselves seems to make more sense
to me from an EAS standpoint.
Given TTWU_QUEUE is disabled for PREEMPT_RT, I'm assuming this
might be problematic too?
--
Thanks and Regards,
Prateek
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-05 12:49 ` K Prateek Nayak
@ 2025-12-05 12:56 ` Peter Zijlstra
2025-12-05 13:05 ` K Prateek Nayak
2025-12-05 13:36 ` Vincent Guittot
1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-12-05 12:56 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, pierre.gondois, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On Fri, Dec 05, 2025 at 06:19:07PM +0530, K Prateek Nayak wrote:
> On 12/5/2025 2:29 PM, Peter Zijlstra wrote:
> >>> Why not use move_queued_task() ?
> >>
> >> double_lock_balance() can fail and prevent being blocked waiting for
> >> new rq whereas move_queued_task() will wait, won't it ?
> >>
> >> Do you think move_queued_task() would be better ?
> >
> > No, double_lock_balance() never fails, the return value indicates if the
> > currently held rq-lock, (the first argument) was unlocked while
> > attaining both -- this is required when the first rq is a higher address
> > than the second.
> >
> > double_lock_balance() also puts the wait-time and hold time of the
> > second inside the hold time of the first, which gets you a quadric term
> > in the rq hold times IIRC. Something that's best avoided.
> >
> > move_queued_task() OTOH takes the task off the runqueue you already hold
> > locked, drops this lock, acquires the second, puts the task there, and
> > returns with the dst rq locked.
>
> So I was experimenting with:
>
> deactivate_task(rq, p, 0);
> set_task_cpu(p, target_cpu);
> __ttwu_queue_wakelist(p, target_cpu, 0);
>
> and nothing has screamed at me yet during the benchmark runs.
> Would this be any good instead of the whole lock juggling?
This will get schedstats and any class with ->task_woken confused I
think. The IPI handler (sched_ttwu_pending / ttwu_do_activate) is
currently only geared towards doing the remote bit of ttwu.
This is fixable of course.
> Since this CPU is found to be going overloaded, pushing via an
> IPI vs taking the overhead ourselves seems to make more sense
> to me from EAS standpoint.
The performance characteristics here are very platform dependent.
Sometimes raising an IPI can be very expensive in and of itself.
> Given TTWU_QUEUE is disabled for PREEMPT_RT, I'm assuming this
> might be problematic too?
Yes, you have to check TTWU_QUEUE and have a fallback.
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-05 12:56 ` Peter Zijlstra
@ 2025-12-05 13:05 ` K Prateek Nayak
0 siblings, 0 replies; 47+ messages in thread
From: K Prateek Nayak @ 2025-12-05 13:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, pierre.gondois, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On 12/5/2025 6:26 PM, Peter Zijlstra wrote:
> On Fri, Dec 05, 2025 at 06:19:07PM +0530, K Prateek Nayak wrote:
>> On 12/5/2025 2:29 PM, Peter Zijlstra wrote:
>>>>> Why not use move_queued_task() ?
>>>>
>>>> double_lock_balance() can fail and prevent being blocked waiting for
>>>> new rq whereas move_queued_task() will wait, won't it ?
>>>>
>>>> Do you think move_queued_task() would be better ?
>>>
>>> No, double_lock_balance() never fails, the return value indicates if the
>>> currently held rq-lock, (the first argument) was unlocked while
>>> attaining both -- this is required when the first rq is a higher address
>>> than the second.
>>>
>>> double_lock_balance() also puts the wait-time and hold time of the
>>> second inside the hold time of the first, which gets you a quadric term
>>> in the rq hold times IIRC. Something that's best avoided.
>>>
>>> move_queued_task() OTOH takes the task off the runqueue you already hold
>>> locked, drops this lock, acquires the second, puts the task there, and
>>> returns with the dst rq locked.
>>
>> So I was experimenting with:
>>
>> deactivate_task(rq, p, 0);
>> set_task_cpu(p, target_cpu);
>> __ttwu_queue_wakelist(p, target_cpu, 0);
>>
>> and nothing has screamed at me yet during the benchmark runs.
>> Would this be any good instead of the whole lock juggling?
>
> This will get schedstats and any class with ->task_woken confused I
> think. The IPI handler (sched_ttwu_pending / ttwu_do_activate) is
> currently only geared towards doing the remote bit of ttwu.
>
> This is fixable of course.
I believe a simple enough check could be:
task_on_rq_migrating() -> queued (vs) blocked -> queued
during activation.
>
>> Since this CPU is found to be going overloaded, pushing via an
>> IPI vs taking the overhead ourselves seems to make more sense
>> to me from EAS standpoint.
>
> The performance characteristics here are very platform dependent.
> Sometimes raising an IPI can be very expensive in and of itself.
Ack! But I assume those platforms already disable TTWU_QUEUE?
Or maybe the ttwu_queue_cond() takes care of it indirectly. We
can add some guards if necessary and use the same fallback as
PREEMPT_RT.
--
Thanks and Regards,
Prateek
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-05 12:49 ` K Prateek Nayak
2025-12-05 12:56 ` Peter Zijlstra
@ 2025-12-05 13:36 ` Vincent Guittot
2025-12-06 3:08 ` K Prateek Nayak
1 sibling, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2025-12-05 13:36 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, pierre.gondois, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On Fri, 5 Dec 2025 at 13:49, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> On 12/5/2025 2:29 PM, Peter Zijlstra wrote:
> >>> Why not use move_queued_task() ?
> >>
> >> double_lock_balance() can fail and prevent being blocked waiting for
> >> new rq whereas move_queued_task() will wait, won't it ?
> >>
> >> Do you think move_queued_task() would be better ?
> >
> > No, double_lock_balance() never fails, the return value indicates if the
> > currently held rq-lock, (the first argument) was unlocked while
> > attaining both -- this is required when the first rq is a higher address
> > than the second.
> >
> > double_lock_balance() also puts the wait-time and hold time of the
> > second inside the hold time of the first, which gets you a quadric term
> > in the rq hold times IIRC. Something that's best avoided.
> >
> > move_queued_task() OTOH takes the task off the runqueue you already hold
> > locked, drops this lock, acquires the second, puts the task there, and
> > returns with the dst rq locked.
>
> So I was experimenting with:
>
> deactivate_task(rq, p, 0);
> set_task_cpu(p, target_cpu);
> __ttwu_queue_wakelist(p, target_cpu, 0);
>
> and nothing has screamed at me yet during the benchmark runs.
> Would this be any good instead of the whole lock juggling?
>
> Since this CPU is found to be going overloaded, pushing via an
Just to make sure that we speak about the same thing: with EAS,
overloaded and overutilized are 2 different things. EAS doesn't care
about and sometimes wants to overload a CPU (having more than 1 task
on the CPU), but EAS is disabled once the CPU becomes overutilized.
> IPI vs taking the overhead ourselves seems to make more sense
> to me from EAS standpoint.
I suppose that it's worth trying the IPI on EAS and embedded devices
>
> Given TTWU_QUEUE is disabled for PREEMPT_RT, I'm assuming this
> might be problematic too?
>
> --
> Thanks and Regards,
> Prateek
>
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-05 13:36 ` Vincent Guittot
@ 2025-12-06 3:08 ` K Prateek Nayak
0 siblings, 0 replies; 47+ messages in thread
From: K Prateek Nayak @ 2025-12-06 3:08 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, pierre.gondois, qyousef,
hongyan.xia2, christian.loehle, luis.machado
On 12/5/2025 7:06 PM, Vincent Guittot wrote:
>> So I was experimenting with:
>>
>> deactivate_task(rq, p, 0);
>> set_task_cpu(p, target_cpu);
>> __ttwu_queue_wakelist(p, target_cpu, 0);
>>
>> and nothing has screamed at me yet during the benchmark runs.
>> Would this be any good instead of the whole lock juggling?
>>
>> Since this CPU is found to be going overloaded, pushing via an
>
> Just to make sure that we speak about the same thing: with EAS,
> overloaded and overutilized are 2 different things. EAS doesn't care
> about and sometimes wants to overload a CPU (having more than 1 task
> on the CPU), but EAS is disabled once the CPU becomes overutilized.
I meant to say overutilized in this context. Sorry about that.
>
>> IPI vs taking the overhead ourselves seems to make more sense
>> to me from EAS standpoint.
>
> I suppose that it's worth trying the IPI on EAS and embedded device
If they are cheap enough, we can simply use TTWU_QUEUE check for the
!PREEMPT_RT and use move queued task otherwise.
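Something like this, perhaps (hypothetical sketch combining the snippet
above with the move_queued_task() suggestion; rf would be the caller's
rq_flags):

	if (sched_feat(TTWU_QUEUE) && !IS_ENABLED(CONFIG_PREEMPT_RT)) {
		/* hand the activation over to the target CPU via IPI */
		deactivate_task(rq, p, 0);
		set_task_cpu(p, new_cpu);
		__ttwu_queue_wakelist(p, new_cpu, 0);
	} else {
		/* fall back to migrating synchronously */
		rq = move_queued_task(rq, &rf, p, new_cpu);
	}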
--
Thanks and Regards,
Prateek
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-05 8:59 ` Peter Zijlstra
2025-12-05 12:49 ` K Prateek Nayak
@ 2025-12-05 13:26 ` Vincent Guittot
1 sibling, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-05 13:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
hongyan.xia2, christian.loehle, luis.machado
Le vendredi 05 déc. 2025 à 09:59:12 (+0100), Peter Zijlstra a écrit :
> On Thu, Dec 04, 2025 at 03:34:15PM +0100, Vincent Guittot wrote:
> > On Thu, 4 Dec 2025 at 12:29, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Dec 02, 2025 at 07:12:40PM +0100, Vincent Guittot wrote:
> > > > +/*
> > > > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > > > + * that fits better with their profile.
> > > > + */
> > > > +static bool push_fair_task(struct rq *rq)
> > > > +{
> > > > + struct task_struct *next_task;
> > > > + int prev_cpu, new_cpu;
> > > > + struct rq *new_rq;
> > > > +
> > > > + next_task = pick_next_pushable_fair_task(rq);
> > > > + if (!next_task)
> > > > + return false;
> > > > +
> > > > + if (is_migration_disabled(next_task))
> > > > + return true;
> > > > +
> > > > + /* We might release rq lock */
> > > > + get_task_struct(next_task);
> > > > +
> > > > + prev_cpu = rq->cpu;
> > > > +
> > > > + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> > > > +
> > > > + if (new_cpu == prev_cpu)
> > > > + goto out;
> > > > +
> > > > + new_rq = cpu_rq(new_cpu);
> > > > +
> > > > + if (double_lock_balance(rq, new_rq)) {
> > > > + /* The task has already migrated in between */
> > > > + if (task_cpu(next_task) != rq->cpu) {
> > > > + double_unlock_balance(rq, new_rq);
> > > > + goto out;
> > > > + }
> > > > +
> > > > + deactivate_task(rq, next_task, 0);
> > > > + set_task_cpu(next_task, new_cpu);
> > > > + activate_task(new_rq, next_task, 0);
> > > > +
> > > > + resched_curr(new_rq);
> > > > +
> > > > + double_unlock_balance(rq, new_rq);
> > > > + }
> > >
> > > Why not use move_queued_task() ?
> >
> > double_lock_balance() can fail and prevent being blocked waiting for
> > new rq whereas move_queued_task() will wait, won't it ?
> >
> > Do you think move_queued_task() would be better ?
>
> No, double_lock_balance() never fails, the return value indicates if the
> currently held rq-lock, (the first argument) was unlocked while
> attaining both -- this is required when the first rq is a higher address
> than the second.
>
> double_lock_balance() also puts the wait-time and hold time of the
> second inside the hold time of the first, which gets you a quadratic term
> in the rq hold times IIRC. Something that's best avoided.
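For readers following along, the !PREEMPT_RT logic being described
looks roughly like this (simplified from kernel/sched/sched.h,
clock-update details dropped):

        static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
        {
                /* Same lock (core scheduling) or uncontended: nothing dropped. */
                if (__rq_lockp(this_rq) == __rq_lockp(busiest) ||
                    likely(raw_spin_rq_trylock(busiest)))
                        return 0;

                /* Already in the right order: just nest the second lock. */
                if (rq_order_less(this_rq, busiest)) {
                        raw_spin_rq_lock_nested(busiest, SINGLE_DEPTH_NESTING);
                        return 0;
                }

                /* Wrong order: release this_rq, retake both in order. */
                raw_spin_rq_unlock(this_rq);
                double_rq_lock(this_rq, busiest);

                return 1;       /* the caller's rq lock was dropped meanwhile */
        }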
Yeah, I misread the return value and my current code needs to be fixed like:
---
kernel/sched/fair.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbbe325dc633..35c7c968ddd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8629,19 +8629,18 @@ static bool push_fair_task(struct rq *rq)
if (double_lock_balance(rq, new_rq)) {
/* The task has already migrated in between */
- if (task_cpu(next_task) != rq->cpu) {
- double_unlock_balance(rq, new_rq);
- goto out;
- }
+ if (task_cpu(next_task) != rq->cpu)
+ goto unlock;
+ }
- deactivate_task(rq, next_task, 0);
- set_task_cpu(next_task, new_cpu);
- activate_task(new_rq, next_task, 0);
+ deactivate_task(rq, next_task, DEQUEUE_NOCLOCK);
+ set_task_cpu(next_task, new_cpu);
+ activate_task(new_rq, next_task, 0);
- resched_curr(new_rq);
+ wakeup_preempt(new_rq, next_task, 0);
- double_unlock_balance(rq, new_rq);
- }
+unlock:
+ double_unlock_balance(rq, new_rq);
out:
put_task_struct(next_task);
--
2.43.0
>
> move_queued_task() OTOH takes the task off the runqueue you already hold
> locked, drops this lock, acquires the second, puts the task there, and
> returns with the dst rq locked.
I suppose it's doable even if we don't have rq_flags.
But we would need to re-lock the current rq and release the new one to leave
the balance_callback loop in the same state.
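I.e. something like this (sketch only, assuming an rq_flags can be
conjured up for move_queued_task()):

        /* move_queued_task() returns with new_rq locked and rq unlocked */
        new_rq = move_queued_task(rq, &rf, next_task, new_cpu);

        /* swap back so the balance_callback loop sees rq locked again */
        raw_spin_rq_unlock(new_rq);
        raw_spin_rq_lock(rq);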
>
> > In case of migrate_disable, push_fair_task() returns true and we
> > continue with the next task (It should not have much anyway). If the
> > task is migrate_disabled when we try to push it, we remove it from the
> > list anyway. At now, we try to not have more than 1 task in the list
> > to cap the overhead on sched_switch
>
> Right, clearly I needed more wake-up juice, I thought it returned false
> and would stick around.
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-02 18:12 ` [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair Vincent Guittot
2025-12-04 10:46 ` Peter Zijlstra
2025-12-04 11:29 ` Peter Zijlstra
@ 2025-12-07 12:13 ` Shrikanth Hegde
2026-02-09 13:17 ` Vincent Guittot
2025-12-10 14:01 ` Dietmar Eggemann
2026-02-06 18:21 ` Qais Yousef
4 siblings, 1 reply; 47+ messages in thread
From: Shrikanth Hegde @ 2025-12-07 12:13 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, vschneid, juri.lelli
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel,
pierre.gondois, kprateek.nayak
On 12/2/25 11:42 PM, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore, or has them
> at far too low a pace. For such situations, we can take advantage of the task
> being put back in the enqueued list to check if it should be pushed on
> another CPU.
> When the task is alone on the CPU, it's never put back in the enqueued
> list; in this special case, we use the tick to run the check.
>
> Add a push task mechanism that enables the fair scheduler to push runnable
> tasks. EAS will be one user, but other features like filling idle CPUs
> can also take advantage of it.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 212 ++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 4 +
> 2 files changed, 214 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 80c4131fb35b..252254168c92 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6989,6 +6989,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> +
> /*
> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> * failing half-way through and resume the dequeue later.
> @@ -7017,6 +7019,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> h_nr_idle = task_has_idle_policy(p);
> if (task_sleep || task_delayed || !se->sched_delayed)
> h_nr_runnable = 1;
> +
> + fair_remove_pushable_task(rq, p);
> }
>
> for_each_sched_entity(se) {
> @@ -8504,6 +8508,187 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> return target;
> }
>
> +DEFINE_STATIC_KEY_FALSE(sched_push_task);
> +
> +static inline bool sched_push_task_enabled(void)
> +{
> + return static_branch_unlikely(&sched_push_task);
> +}
> +
> +static bool fair_push_task(struct rq *rq, struct task_struct *p)
> +{
> + return false;
> +}
> +
> +static inline int has_pushable_tasks(struct rq *rq)
> +{
> + return !plist_head_empty(&rq->cfs.pushable_tasks);
> +}
> +
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + p = plist_first_entry(&rq->cfs.pushable_tasks,
> + struct task_struct, pushable_tasks);
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> +
> + /*
> + * Remove task from the pushable list as we try only once after that
> + * the task has been put back in enqueued list.
> + */
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> + return p;
> +}
> +
> +static int
> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags);
> +
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
> + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> +
> + if (new_cpu == prev_cpu)
> + goto out;
> +
> + new_rq = cpu_rq(new_cpu);
> +
> + if (double_lock_balance(rq, new_rq)) {
> + /* The task has already migrated in between */
> + if (task_cpu(next_task) != rq->cpu) {
> + double_unlock_balance(rq, new_rq);
> + goto out;
> + }
> +
> + deactivate_task(rq, next_task, 0);
> + set_task_cpu(next_task, new_cpu);
> + activate_task(new_rq, next_task, 0);
> +
> + resched_curr(new_rq);
> +
> + double_unlock_balance(rq, new_rq);
> + }
> +
> +out:
> + put_task_struct(next_task);
> +
> + return true;
> +}
> +
> +static void push_fair_tasks(struct rq *rq)
> +{
> + /* push_fair_task() will return true if it moved a fair task */
> + while (push_fair_task(rq))
> + ;
> +}
> +
> +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> +
> +static inline void fair_queue_pushable_tasks(struct rq *rq)
> +{
> + if (!sched_push_task_enabled() || !has_pushable_tasks(rq))
> + return;
> +
> + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> +}
> +
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_push_task_enabled())
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +}
> +
> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_push_task_enabled() && fair_push_task(rq, p)) {
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + plist_node_init(&p->pushable_tasks, p->prio);
> + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + }
> +}
> +
> +static int active_load_balance_cpu_stop(void *data);
> +
> +/*
> + * See if the alone task running on the CPU should migrate on a better than
> + * the local one.
> + */
> +static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
> +{
> + int new_cpu, cpu = cpu_of(rq);
> +
> + if (!sched_push_task_enabled())
> + return false;
> +
> + if (WARN_ON(!p))
> + return false;
> +
> + if (WARN_ON(!task_current(rq, p)))
> + return false;
> +
> + if (is_migration_disabled(p))
> + return false;
> +
> + /* If there are several task, wait for being put back */
> + if (rq->nr_running > 1)
> + return false;
> +
> + if (!fair_push_task(rq, p))
> + return false;
> +
Does RT matter for EAS too, or only CFS?
Since we have quite a few patches floating around the push task framework,
can we generalize the framework for pushing the current task out?
push_current_task(rq, CFS|RT|DL|IDLE|EXT|ALL)
- Depending on the second argument, push the task out after doing the
necessary class-specific checks? Maybe a new method could be added per class.
- The current CPU hotplug code can make use of this infra with (ALL).
- push_rt_task with (RT), sched_balance_rq with (CFS).
- push_current_from_paravirt_cpu with (CFS|RT) (patch series which I sent a
few days ago).
I know it is tricky right now due to the specific checks in each path, and
the way the new CPU is found is different, and all that. affine_move_task()
seems quite complicated to fit in.
Maybe I'm thinking too far ahead.
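Purely to illustrate the shape, with every name below made up (nothing
like this exists in the tree):

        #define PUSH_CFS        BIT(0)
        #define PUSH_RT         BIT(1)
        #define PUSH_DL         BIT(2)
        #define PUSH_ALL        (~0U)

        /* hypothetical: each class provides its own mask and push method */
        static bool push_current_task(struct rq *rq, unsigned int classes)
        {
                const struct sched_class *class = rq->curr->sched_class;

                if (!(classes & class->push_class_mask))  /* made-up field */
                        return false;

                /* made-up per-class hook doing the class-specific checks */
                return class->push_current ? class->push_current(rq, rq->curr)
                                           : false;
        }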
> + new_cpu = select_task_rq_fair(p, cpu, 0);
> +
> + if (new_cpu == cpu)
> + return false;
> +
> + /*
> + * ->active_balance synchronizes accesses to
> + * ->active_balance_work. Once set, it's cleared
> + * only after active load balance is finished.
> + */
> + if (!rq->active_balance) {
> + rq->active_balance = 1;
> + rq->push_cpu = new_cpu;
> + } else
> + return false;
> +
> + raw_spin_rq_unlock(rq);
Can this race with sched_balance_rq()?
I think it is okay since rq->active_balance is set back to 0 at the end, so
the work buffer should be protected.
> + stop_one_cpu_nowait(cpu,
> + active_load_balance_cpu_stop, rq,
> + &rq->active_balance_work);
> + raw_spin_rq_lock(rq);
> +
> + return true;
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -8973,6 +9158,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_entity(cfs_rq, pse);
> set_next_entity(cfs_rq, se);
>
> + /*
> + * The previous task might be eligible for being pushed on
> + * another cpu if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> __set_next_task_fair(rq, p, true);
> }
>
> @@ -9036,6 +9227,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> cfs_rq = cfs_rq_of(se);
> put_prev_entity(cfs_rq, se);
> }
> +
> + /*
> + * The previous task might be eligible for being pushed on another cpu
> + * if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> }
>
> /*
> @@ -13390,8 +13588,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> - update_misfit_status(curr, rq);
> - check_update_overutilized_status(task_rq(curr));
> + if (!check_pushable_task(curr, rq)) {
> + update_misfit_status(curr, rq);
> + check_update_overutilized_status(task_rq(curr));
> + }
>
> task_tick_core(rq, curr);
> }
> @@ -13552,6 +13752,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> {
> struct sched_entity *se = &p->se;
>
> + fair_remove_pushable_task(rq, p);
> +
> if (task_on_rq_queued(p)) {
> /*
> * Move the next running task to the front of the list, so our
> @@ -13567,6 +13769,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> if (hrtick_enabled_fair(rq))
> hrtick_start_fair(rq, p);
>
> + /*
> + * Try to push prev task before checking misfit for next task as
> + * the migration of prev can make next fitting the CPU
> + */
> + fair_queue_pushable_tasks(rq);
> update_misfit_status(p, rq);
> sched_fair_update_stop_tick(rq, p);
> }
> @@ -13596,6 +13803,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> {
> cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
> + plist_head_init(&cfs_rq->pushable_tasks);
> raw_spin_lock_init(&cfs_rq->removed.lock);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b419a4d98461..697bd654298a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -711,6 +711,8 @@ struct cfs_rq {
> unsigned long runnable_avg;
> } removed;
>
> + struct plist_head pushable_tasks;
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> u64 last_update_tg_load_avg;
> unsigned long tg_load_avg_contrib;
> @@ -3620,6 +3622,8 @@ static inline bool sched_energy_enabled(void) { return false; }
>
> #endif /* !(CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
>
> +DECLARE_STATIC_KEY_FALSE(sched_push_task);
> +
You have sched_energy_present which is also enabled at the same point.
Do you see more use cases for sched_push_task?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-07 12:13 ` Shrikanth Hegde
@ 2026-02-09 13:17 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-02-09 13:17 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, vschneid, juri.lelli, qyousef, hongyan.xia2,
christian.loehle, luis.machado, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, pierre.gondois, kprateek.nayak
On Sun, 7 Dec 2025 at 13:13, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
>
>
> On 12/2/25 11:42 PM, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore, or has them
> > at far too low a pace. For such situations, we can take advantage of the task
> > being put back in the enqueued list to check if it should be pushed on
> > another CPU.
> > When the task is alone on the CPU, it's never put back in the enqueued
> > list; in this special case, we use the tick to run the check.
> >
> > Add a push task mechanism that enables the fair scheduler to push runnable
> > tasks. EAS will be one user, but other features like filling idle CPUs
> > can also take advantage of it.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 212 ++++++++++++++++++++++++++++++++++++++++++-
> > kernel/sched/sched.h | 4 +
> > 2 files changed, 214 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 80c4131fb35b..252254168c92 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6989,6 +6989,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > }
> >
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> > +
> > /*
> > * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> > * failing half-way through and resume the dequeue later.
> > @@ -7017,6 +7019,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > h_nr_idle = task_has_idle_policy(p);
> > if (task_sleep || task_delayed || !se->sched_delayed)
> > h_nr_runnable = 1;
> > +
> > + fair_remove_pushable_task(rq, p);
> > }
> >
> > for_each_sched_entity(se) {
> > @@ -8504,6 +8508,187 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > return target;
> > }
> >
> > +DEFINE_STATIC_KEY_FALSE(sched_push_task);
> > +
> > +static inline bool sched_push_task_enabled(void)
> > +{
> > + return static_branch_unlikely(&sched_push_task);
> > +}
> > +
> > +static bool fair_push_task(struct rq *rq, struct task_struct *p)
> > +{
> > + return false;
> > +}
> > +
> > +static inline int has_pushable_tasks(struct rq *rq)
> > +{
> > + return !plist_head_empty(&rq->cfs.pushable_tasks);
> > +}
> > +
> > +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *p;
> > +
> > + if (!has_pushable_tasks(rq))
> > + return NULL;
> > +
> > + p = plist_first_entry(&rq->cfs.pushable_tasks,
> > + struct task_struct, pushable_tasks);
> > +
> > + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> > + WARN_ON_ONCE(task_current(rq, p));
> > + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> > + WARN_ON_ONCE(!task_on_rq_queued(p));
> > +
> > + /*
> > + * Remove task from the pushable list as we try only once after that
> > + * the task has been put back in enqueued list.
> > + */
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +
> > + return p;
> > +}
> > +
> > +static int
> > +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags);
> > +
> > +/*
> > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > + * that fits better with their profile.
> > + */
> > +static bool push_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *next_task;
> > + int prev_cpu, new_cpu;
> > + struct rq *new_rq;
> > +
> > + next_task = pick_next_pushable_fair_task(rq);
> > + if (!next_task)
> > + return false;
> > +
> > + if (is_migration_disabled(next_task))
> > + return true;
> > +
> > + /* We might release rq lock */
> > + get_task_struct(next_task);
> > +
> > + prev_cpu = rq->cpu;
> > +
> > + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> > +
> > + if (new_cpu == prev_cpu)
> > + goto out;
> > +
> > + new_rq = cpu_rq(new_cpu);
> > +
> > + if (double_lock_balance(rq, new_rq)) {
> > + /* The task has already migrated in between */
> > + if (task_cpu(next_task) != rq->cpu) {
> > + double_unlock_balance(rq, new_rq);
> > + goto out;
> > + }
> > +
> > + deactivate_task(rq, next_task, 0);
> > + set_task_cpu(next_task, new_cpu);
> > + activate_task(new_rq, next_task, 0);
> > +
> > + resched_curr(new_rq);
> > +
> > + double_unlock_balance(rq, new_rq);
> > + }
> > +
> > +out:
> > + put_task_struct(next_task);
> > +
> > + return true;
> > +}
> > +
> > +static void push_fair_tasks(struct rq *rq)
> > +{
> > + /* push_fair_task() will return true if it moved a fair task */
> > + while (push_fair_task(rq))
> > + ;
> > +}
> > +
> > +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> > +
> > +static inline void fair_queue_pushable_tasks(struct rq *rq)
> > +{
> > + if (!sched_push_task_enabled() || !has_pushable_tasks(rq))
> > + return;
> > +
> > + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> > +}
> > +
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_push_task_enabled())
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +}
> > +
> > +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_push_task_enabled() && fair_push_task(rq, p)) {
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + plist_node_init(&p->pushable_tasks, p->prio);
> > + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + }
> > +}
> > +
> > +static int active_load_balance_cpu_stop(void *data);
> > +
> > +/*
> > + * See if the alone task running on the CPU should migrate on a better than
> > + * the local one.
> > + */
> > +static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
> > +{
> > + int new_cpu, cpu = cpu_of(rq);
> > +
> > + if (!sched_push_task_enabled())
> > + return false;
> > +
> > + if (WARN_ON(!p))
> > + return false;
> > +
> > + if (WARN_ON(!task_current(rq, p)))
> > + return false;
> > +
> > + if (is_migration_disabled(p))
> > + return false;
> > +
> > + /* If there are several task, wait for being put back */
> > + if (rq->nr_running > 1)
> > + return false;
> > +
> > + if (!fair_push_task(rq, p))
> > + return false;
> > +
>
> Does RT matter for EAS too, or only CFS?
>
> Since we have quite a few patches floating around the push task framework,
> can we generalize the framework for pushing the current task out?
>
> push_current_task(rq, CFS|RT|DL|IDLE|EXT|ALL)
> - Depending on the second argument, push the task out after doing the
> necessary class-specific checks? Maybe a new method could be added per class.
Sorry, I thought that I had answered your email but I can't find it.
The generalization is not straightforward, as the classes are not all using
the same kind of list (DL, for instance, uses an rb tree), and the place
where we want to check if a task should be added to the pushable list differs.
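The mismatch shows up in the queue types themselves (fields as in
today's tree, quoted from memory):

        /* rt.c keeps pushable tasks on a prio-ordered plist: */
        struct plist_head       pushable_tasks;                 /* struct rt_rq */

        /* deadline.c keeps them in a deadline-ordered rb tree: */
        struct rb_root_cached   pushable_dl_tasks_root;         /* struct dl_rq */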
>
> - The current CPU hotplug code can make use of this infra with (ALL).
> - push_rt_task with (RT), sched_balance_rq with (CFS).
> - push_current_from_paravirt_cpu with (CFS|RT) (patch series which I sent a
> few days ago).
>
> I know it is tricky right now due to the specific checks in each path, and
> the way the new CPU is found is different, and all that. affine_move_task()
> seems quite complicated to fit in.
>
> Maybe I'm thinking too far ahead.
This could come in a 2nd step of consolidation once we know what we
want to put in each push callback.
>
>
> > + new_cpu = select_task_rq_fair(p, cpu, 0);
> > +
> > + if (new_cpu == cpu)
> > + return false;
> > +
> > + /*
> > + * ->active_balance synchronizes accesses to
> > + * ->active_balance_work. Once set, it's cleared
> > + * only after active load balance is finished.
> > + */
> > + if (!rq->active_balance) {
> > + rq->active_balance = 1;
> > + rq->push_cpu = new_cpu;
> > + } else
> > + return false;
> > +
> > + raw_spin_rq_unlock(rq);
>
> Can this race with sched_balance_rq()?
> I think it is okay since rq->active_balance is set back to 0 at the end, so
> the work buffer should be protected.
Yeah, rq->active_balance protects it
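For completeness, the stopper side of that handshake, abridged from
active_load_balance_cpu_stop() in fair.c (from memory, so treat it as a
sketch):

        static int active_load_balance_cpu_stop(void *data)
        {
                struct rq *busiest_rq = data;
                struct rq_flags rf;

                rq_lock_irq(busiest_rq, &rf);
                /* ... detach one task towards busiest_rq->push_cpu ... */
                busiest_rq->active_balance = 0; /* re-armed under the rq lock */
                rq_unlock(busiest_rq, &rf);
                /* ... attach the task to the target rq, irqs back on ... */
                return 0;
        }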
>
> > + stop_one_cpu_nowait(cpu,
> > + active_load_balance_cpu_stop, rq,
> > + &rq->active_balance_work);
> > + raw_spin_rq_lock(rq);
> > +
> > + return true;
> > +}
> > +
> > /*
> > * select_task_rq_fair: Select target runqueue for the waking task in domains
> > * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -8973,6 +9158,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > put_prev_entity(cfs_rq, pse);
> > set_next_entity(cfs_rq, se);
> >
> > + /*
> > + * The previous task might be eligible for being pushed on
> > + * another cpu if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > __set_next_task_fair(rq, p, true);
> > }
> >
> > @@ -9036,6 +9227,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> > cfs_rq = cfs_rq_of(se);
> > put_prev_entity(cfs_rq, se);
> > }
> > +
> > + /*
> > + * The previous task might be eligible for being pushed on another cpu
> > + * if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > }
> >
> > /*
> > @@ -13390,8 +13588,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > if (static_branch_unlikely(&sched_numa_balancing))
> > task_tick_numa(rq, curr);
> >
> > - update_misfit_status(curr, rq);
> > - check_update_overutilized_status(task_rq(curr));
> > + if (!check_pushable_task(curr, rq)) {
> > + update_misfit_status(curr, rq);
> > + check_update_overutilized_status(task_rq(curr));
> > + }
> >
> > task_tick_core(rq, curr);
> > }
> > @@ -13552,6 +13752,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > {
> > struct sched_entity *se = &p->se;
> >
> > + fair_remove_pushable_task(rq, p);
> > +
> > if (task_on_rq_queued(p)) {
> > /*
> > * Move the next running task to the front of the list, so our
> > @@ -13567,6 +13769,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > if (hrtick_enabled_fair(rq))
> > hrtick_start_fair(rq, p);
> >
> > + /*
> > + * Try to push prev task before checking misfit for next task as
> > + * the migration of prev can make next fitting the CPU
> > + */
> > + fair_queue_pushable_tasks(rq);
> > update_misfit_status(p, rq);
> > sched_fair_update_stop_tick(rq, p);
> > }
> > @@ -13596,6 +13803,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> > {
> > cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> > cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
> > + plist_head_init(&cfs_rq->pushable_tasks);
> > raw_spin_lock_init(&cfs_rq->removed.lock);
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b419a4d98461..697bd654298a 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -711,6 +711,8 @@ struct cfs_rq {
> > unsigned long runnable_avg;
> > } removed;
> >
> > + struct plist_head pushable_tasks;
> > +
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > u64 last_update_tg_load_avg;
> > unsigned long tg_load_avg_contrib;
> > @@ -3620,6 +3622,8 @@ static inline bool sched_energy_enabled(void) { return false; }
> >
> > #endif /* !(CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
> >
> > +DECLARE_STATIC_KEY_FALSE(sched_push_task);
> > +
> You have sched_energy_present which is also enabled at the same point.
> Do you see more use cases for sched_push_task?
In my current patchset, sched_push_task is only enabled for EAS, but I
wanted to make it possible to enable it for other cases.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-02 18:12 ` [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair Vincent Guittot
` (2 preceding siblings ...)
2025-12-07 12:13 ` Shrikanth Hegde
@ 2025-12-10 14:01 ` Dietmar Eggemann
2026-02-09 13:17 ` Vincent Guittot
2026-02-06 18:21 ` Qais Yousef
4 siblings, 1 reply; 47+ messages in thread
From: Dietmar Eggemann @ 2025-12-10 14:01 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, christian.loehle
- hongyan.xia2@arm.com
- luis.machado@arm.com
On 02.12.25 19:12, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore, or has them
> at far too low a pace. For such situations, we can take advantage of the task
> being put back in the enqueued list to check if it should be pushed on
> another CPU.
> When the task is alone on the CPU, it's never put back in the enqueued
> list; in this special case, we use the tick to run the check.
>
> Add a push task mechanism that enables the fair scheduler to push runnable
> tasks. EAS will be one user, but other features like filling idle CPUs
> can also take advantage of it.
[...]
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
select_task_rq_fair() requires p->pi_lock to be held. I assume
check_pushable_task() (push single running task) has the same issue.
> + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> +
[...]
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-10 14:01 ` Dietmar Eggemann
@ 2026-02-09 13:17 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-02-09 13:17 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, vschneid,
linux-kernel, pierre.gondois, kprateek.nayak, qyousef,
christian.loehle
On Wed, 10 Dec 2025 at 15:01, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> - hongyan.xia2@arm.com
> - luis.machado@arm.com
>
> On 02.12.25 19:12, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore, or has them
> > at far too low a pace. For such situations, we can take advantage of the task
> > being put back in the enqueued list to check if it should be pushed on
> > another CPU.
> > When the task is alone on the CPU, it's never put back in the enqueued
> > list; in this special case, we use the tick to run the check.
> >
> > Add a push task mechanism that enables the fair scheduler to push runnable
> > tasks. EAS will be one user, but other features like filling idle CPUs
> > can also take advantage of it.
>
> [...]
>
> > +/*
> > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > + * that fits better with their profile.
> > + */
> > +static bool push_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *next_task;
> > + int prev_cpu, new_cpu;
> > + struct rq *new_rq;
> > +
> > + next_task = pick_next_pushable_fair_task(rq);
> > + if (!next_task)
> > + return false;
> > +
> > + if (is_migration_disabled(next_task))
> > + return true;
> > +
> > + /* We might release rq lock */
> > + get_task_struct(next_task);
> > +
> > + prev_cpu = rq->cpu;
> > +
>
> select_task_rq_fair() requires p->pi_lock to be held. I assume
> check_pushable_task() (push single running task) has the same issue.
Fair enough.
>
> > + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> > +
>
> [...]
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2025-12-02 18:12 ` [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair Vincent Guittot
` (3 preceding siblings ...)
2025-12-10 14:01 ` Dietmar Eggemann
@ 2026-02-06 18:21 ` Qais Yousef
2026-02-09 13:18 ` Vincent Guittot
4 siblings, 1 reply; 47+ messages in thread
From: Qais Yousef @ 2026-02-06 18:21 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On 12/02/25 19:12, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore, or has them
> at far too low a pace. For such situations, we can take advantage of the task
> being put back in the enqueued list to check if it should be pushed on
> another CPU.
> When the task is alone on the CPU, it's never put back in the enqueued
> list; in this special case, we use the tick to run the check.
>
> Add a push task mechanism that enables the fair scheduler to push runnable
> tasks. EAS will be one user, but other features like filling idle CPUs
> can also take advantage of it.
I think it's worth adding that we are improving the responsiveness of load
balancing; this is a critical side effect. Currently the pull mechanism is
too slow - and takes wrong decisions for systems that rely on feec(), as you
pointed out.
It also prepares for a unified decision between wakeup and load balance for
more coherent task placement decisions.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 212 ++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 4 +
> 2 files changed, 214 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 80c4131fb35b..252254168c92 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6989,6 +6989,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> +
> /*
> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> * failing half-way through and resume the dequeue later.
> @@ -7017,6 +7019,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> h_nr_idle = task_has_idle_policy(p);
> if (task_sleep || task_delayed || !se->sched_delayed)
> h_nr_runnable = 1;
> +
> + fair_remove_pushable_task(rq, p);
> }
>
> for_each_sched_entity(se) {
> @@ -8504,6 +8508,187 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> return target;
> }
>
> +DEFINE_STATIC_KEY_FALSE(sched_push_task);
> +
> +static inline bool sched_push_task_enabled(void)
> +{
> + return static_branch_unlikely(&sched_push_task);
> +}
> +
> +static bool fair_push_task(struct rq *rq, struct task_struct *p)
I expected this to be named is_pushable_task()?
> +{
> + return false;
> +}
> +
> +static inline int has_pushable_tasks(struct rq *rq)
> +{
> + return !plist_head_empty(&rq->cfs.pushable_tasks);
> +}
> +
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + p = plist_first_entry(&rq->cfs.pushable_tasks,
> + struct task_struct, pushable_tasks);
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> +
> + /*
> + * Remove task from the pushable list as we try only once after that
> + * the task has been put back in enqueued list.
> + */
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> + return p;
> +}
> +
> +static int
> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags);
> +
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
When we loop to push tasks, the task might become unpushable, say after
pushing another task. Should we add a late check to verify the task is still
pushable?
if (!fair_push_task(rq, next_task))
return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
> + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> +
> + if (new_cpu == prev_cpu)
> + goto out;
> +
> + new_rq = cpu_rq(new_cpu);
> +
> + if (double_lock_balance(rq, new_rq)) {
> + /* The task has already migrated in between */
> + if (task_cpu(next_task) != rq->cpu) {
> + double_unlock_balance(rq, new_rq);
> + goto out;
> + }
> +
> + deactivate_task(rq, next_task, 0);
> + set_task_cpu(next_task, new_cpu);
> + activate_task(new_rq, next_task, 0);
> +
> + resched_curr(new_rq);
> +
> + double_unlock_balance(rq, new_rq);
> + }
> +
> +out:
> + put_task_struct(next_task);
> +
> + return true;
> +}
> +
> +static void push_fair_tasks(struct rq *rq)
> +{
> + /* push_fair_task() will return true if it moved a fair task */
> + while (push_fair_task(rq))
> + ;
> +}
> +
> +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> +
> +static inline void fair_queue_pushable_tasks(struct rq *rq)
> +{
> + if (!sched_push_task_enabled() || !has_pushable_tasks(rq))
> + return;
> +
> + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> +}
> +
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_push_task_enabled())
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +}
> +
> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_push_task_enabled() && fair_push_task(rq, p)) {
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + plist_node_init(&p->pushable_tasks, p->prio);
> + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + }
> +}
> +
> +static int active_load_balance_cpu_stop(void *data);
> +
> +/*
> + * See if the alone task running on the CPU should migrate on a better than
> + * the local one.
> + */
> +static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
> +{
> + int new_cpu, cpu = cpu_of(rq);
> +
> + if (!sched_push_task_enabled())
> + return false;
> +
> + if (WARN_ON(!p))
> + return false;
> +
> + if (WARN_ON(!task_current(rq, p)))
> + return false;
> +
> + if (is_migration_disabled(p))
> + return false;
> +
> + /* If there are several task, wait for being put back */
> + if (rq->nr_running > 1)
> + return false;
> +
> + if (!fair_push_task(rq, p))
> + return false;
> +
> + new_cpu = select_task_rq_fair(p, cpu, 0);
> +
> + if (new_cpu == cpu)
> + return false;
> +
> + /*
> + * ->active_balance synchronizes accesses to
> + * ->active_balance_work. Once set, it's cleared
> + * only after active load balance is finished.
> + */
> + if (!rq->active_balance) {
> + rq->active_balance = 1;
> + rq->push_cpu = new_cpu;
> + } else
> + return false;
> +
> + raw_spin_rq_unlock(rq);
> + stop_one_cpu_nowait(cpu,
> + active_load_balance_cpu_stop, rq,
> + &rq->active_balance_work);
> + raw_spin_rq_lock(rq);
> +
> + return true;
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -8973,6 +9158,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_entity(cfs_rq, pse);
> set_next_entity(cfs_rq, se);
>
> + /*
> + * The previous task might be eligible for being pushed on
> + * another cpu if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> __set_next_task_fair(rq, p, true);
> }
>
> @@ -9036,6 +9227,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> cfs_rq = cfs_rq_of(se);
> put_prev_entity(cfs_rq, se);
> }
> +
> + /*
> + * The previous task might be eligible for being pushed on another cpu
> + * if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> }
>
> /*
> @@ -13390,8 +13588,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> - update_misfit_status(curr, rq);
> - check_update_overutilized_status(task_rq(curr));
> + if (!check_pushable_task(curr, rq)) {
> + update_misfit_status(curr, rq);
> + check_update_overutilized_status(task_rq(curr));
> + }
>
> task_tick_core(rq, curr);
> }
> @@ -13552,6 +13752,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> {
> struct sched_entity *se = &p->se;
>
> + fair_remove_pushable_task(rq, p);
> +
> if (task_on_rq_queued(p)) {
> /*
> * Move the next running task to the front of the list, so our
> @@ -13567,6 +13769,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> if (hrtick_enabled_fair(rq))
> hrtick_start_fair(rq, p);
>
> + /*
> + * Try to push prev task before checking misfit for next task as
> + * the migration of prev can make next fitting the CPU
> + */
> + fair_queue_pushable_tasks(rq);
> update_misfit_status(p, rq);
> sched_fair_update_stop_tick(rq, p);
> }
> @@ -13596,6 +13803,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> {
> cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
> + plist_head_init(&cfs_rq->pushable_tasks);
> raw_spin_lock_init(&cfs_rq->removed.lock);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b419a4d98461..697bd654298a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -711,6 +711,8 @@ struct cfs_rq {
> unsigned long runnable_avg;
> } removed;
>
> + struct plist_head pushable_tasks;
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> u64 last_update_tg_load_avg;
> unsigned long tg_load_avg_contrib;
> @@ -3620,6 +3622,8 @@ static inline bool sched_energy_enabled(void) { return false; }
>
> #endif /* !(CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
>
> +DECLARE_STATIC_KEY_FALSE(sched_push_task);
> +
> #ifdef CONFIG_MEMBARRIER
>
> /*
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair
2026-02-06 18:21 ` Qais Yousef
@ 2026-02-09 13:18 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-02-09 13:18 UTC (permalink / raw)
To: Qais Yousef
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On Fri, 6 Feb 2026 at 19:21, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/02/25 19:12, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore, or has them
> > at far too low a pace. For such situations, we can take advantage of the task
> > being put back in the enqueued list to check if it should be pushed on
> > another CPU.
> > When the task is alone on the CPU, it's never put back in the enqueued
> > list; in this special case, we use the tick to run the check.
> >
> > Add a push task mechanism that enables the fair scheduler to push runnable
> > tasks. EAS will be one user, but other features like filling idle CPUs
> > can also take advantage of it.
>
> I think it's worth adding that we are improving the responsiveness of load
> balancing; this is a critical side effect. Currently the pull mechanism is
> too slow - and takes wrong decisions for systems that rely on feec(), as you
> pointed out.
>
> It also prepares for a unified decision between wakeup and load balance for
> more coherent task placement decisions.
yes
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 212 ++++++++++++++++++++++++++++++++++++++++++-
> > kernel/sched/sched.h | 4 +
> > 2 files changed, 214 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 80c4131fb35b..252254168c92 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6989,6 +6989,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > }
> >
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> > +
> > /*
> > * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> > * failing half-way through and resume the dequeue later.
> > @@ -7017,6 +7019,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > h_nr_idle = task_has_idle_policy(p);
> > if (task_sleep || task_delayed || !se->sched_delayed)
> > h_nr_runnable = 1;
> > +
> > + fair_remove_pushable_task(rq, p);
> > }
> >
> > for_each_sched_entity(se) {
> > @@ -8504,6 +8508,187 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > return target;
> > }
> >
> > +DEFINE_STATIC_KEY_FALSE(sched_push_task);
> > +
> > +static inline bool sched_push_task_enabled(void)
> > +{
> > + return static_branch_unlikely(&sched_push_task);
> > +}
> > +
> > +static bool fair_push_task(struct rq *rq, struct task_struct *p)
>
> I expected this to be named is_pushable_task()?
>
> > +{
> > + return false;
> > +}
> > +
> > +static inline int has_pushable_tasks(struct rq *rq)
> > +{
> > + return !plist_head_empty(&rq->cfs.pushable_tasks);
> > +}
> > +
> > +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *p;
> > +
> > + if (!has_pushable_tasks(rq))
> > + return NULL;
> > +
> > + p = plist_first_entry(&rq->cfs.pushable_tasks,
> > + struct task_struct, pushable_tasks);
> > +
> > + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> > + WARN_ON_ONCE(task_current(rq, p));
> > + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> > + WARN_ON_ONCE(!task_on_rq_queued(p));
> > +
> > + /*
> > + * Remove task from the pushable list as we try only once after that
> > + * the task has been put back in enqueued list.
> > + */
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +
> > + return p;
> > +}
> > +
> > +static int
> > +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags);
> > +
> > +/*
> > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > + * that fits better with their profile.
> > + */
> > +static bool push_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *next_task;
> > + int prev_cpu, new_cpu;
> > + struct rq *new_rq;
> > +
> > + next_task = pick_next_pushable_fair_task(rq);
> > + if (!next_task)
> > + return false;
> > +
> > + if (is_migration_disabled(next_task))
> > + return true;
>
> When we loop to push tasks, the task might become unpushable, say after
> pushing another task. Should we add a late check to verify the task is still
> pushable?
>
> if (!fair_push_task(rq, next_task))
> return true;
I expect select_task_rq_fair() to return the current CPU in this case.
>
> > +
> > + /* We might release rq lock */
> > + get_task_struct(next_task);
> > +
> > + prev_cpu = rq->cpu;
> > +
> > + new_cpu = select_task_rq_fair(next_task, prev_cpu, 0);
> > +
> > + if (new_cpu == prev_cpu)
> > + goto out;
> > +
> > + new_rq = cpu_rq(new_cpu);
> > +
> > + if (double_lock_balance(rq, new_rq)) {
> > + /* The task has already migrated in between */
> > + if (task_cpu(next_task) != rq->cpu) {
> > + double_unlock_balance(rq, new_rq);
> > + goto out;
> > + }
> > +
> > + deactivate_task(rq, next_task, 0);
> > + set_task_cpu(next_task, new_cpu);
> > + activate_task(new_rq, next_task, 0);
> > +
> > + resched_curr(new_rq);
> > +
> > + double_unlock_balance(rq, new_rq);
> > + }
> > +
> > +out:
> > + put_task_struct(next_task);
> > +
> > + return true;
> > +}
> > +
> > +static void push_fair_tasks(struct rq *rq)
> > +{
> > + /* push_fair_task() will return true if it moved a fair task */
> > + while (push_fair_task(rq))
> > + ;
> > +}
> > +
> > +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> > +
> > +static inline void fair_queue_pushable_tasks(struct rq *rq)
> > +{
> > + if (!sched_push_task_enabled() || !has_pushable_tasks(rq))
> > + return;
> > +
> > + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> > +}
> > +
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_push_task_enabled())
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +}
> > +
> > +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_push_task_enabled() && fair_push_task(rq, p)) {
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + plist_node_init(&p->pushable_tasks, p->prio);
> > + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + }
> > +}
> > +
> > +static int active_load_balance_cpu_stop(void *data);
> > +
> > +/*
> > + * See if the alone task running on the CPU should migrate on a better than
> > + * the local one.
> > + */
> > +static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
> > +{
> > + int new_cpu, cpu = cpu_of(rq);
> > +
> > + if (!sched_push_task_enabled())
> > + return false;
> > +
> > + if (WARN_ON(!p))
> > + return false;
> > +
> > + if (WARN_ON(!task_current(rq, p)))
> > + return false;
> > +
> > + if (is_migration_disabled(p))
> > + return false;
> > +
> > + /* If there are several task, wait for being put back */
> > + if (rq->nr_running > 1)
> > + return false;
> > +
> > + if (!fair_push_task(rq, p))
> > + return false;
> > +
> > + new_cpu = select_task_rq_fair(p, cpu, 0);
> > +
> > + if (new_cpu == cpu)
> > + return false;
> > +
> > + /*
> > + * ->active_balance synchronizes accesses to
> > + * ->active_balance_work. Once set, it's cleared
> > + * only after active load balance is finished.
> > + */
> > + if (!rq->active_balance) {
> > + rq->active_balance = 1;
> > + rq->push_cpu = new_cpu;
> > + } else
> > + return false;
> > +
> > + raw_spin_rq_unlock(rq);
> > + stop_one_cpu_nowait(cpu,
> > + active_load_balance_cpu_stop, rq,
> > + &rq->active_balance_work);
> > + raw_spin_rq_lock(rq);
> > +
> > + return true;
> > +}
> > +
> > /*
> > * select_task_rq_fair: Select target runqueue for the waking task in domains
> > * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -8973,6 +9158,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > put_prev_entity(cfs_rq, pse);
> > set_next_entity(cfs_rq, se);
> >
> > + /*
> > + * The previous task might be eligible for being pushed on
> > + * another cpu if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > __set_next_task_fair(rq, p, true);
> > }
> >
> > @@ -9036,6 +9227,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> > cfs_rq = cfs_rq_of(se);
> > put_prev_entity(cfs_rq, se);
> > }
> > +
> > + /*
> > + * The previous task might be eligible for being pushed on another cpu
> > + * if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > }
> >
> > /*
> > @@ -13390,8 +13588,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > if (static_branch_unlikely(&sched_numa_balancing))
> > task_tick_numa(rq, curr);
> >
> > - update_misfit_status(curr, rq);
> > - check_update_overutilized_status(task_rq(curr));
> > + if (!check_pushable_task(curr, rq)) {
> > + update_misfit_status(curr, rq);
> > + check_update_overutilized_status(task_rq(curr));
> > + }
> >
> > task_tick_core(rq, curr);
> > }
> > @@ -13552,6 +13752,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > {
> > struct sched_entity *se = &p->se;
> >
> > + fair_remove_pushable_task(rq, p);
> > +
> > if (task_on_rq_queued(p)) {
> > /*
> > * Move the next running task to the front of the list, so our
> > @@ -13567,6 +13769,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > if (hrtick_enabled_fair(rq))
> > hrtick_start_fair(rq, p);
> >
> > + /*
> > + * Try to push prev task before checking misfit for next task as
> > + * the migration of prev can make next fitting the CPU
> > + */
> > + fair_queue_pushable_tasks(rq);
> > update_misfit_status(p, rq);
> > sched_fair_update_stop_tick(rq, p);
> > }
> > @@ -13596,6 +13803,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> > {
> > cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> > cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
> > + plist_head_init(&cfs_rq->pushable_tasks);
> > raw_spin_lock_init(&cfs_rq->removed.lock);
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b419a4d98461..697bd654298a 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -711,6 +711,8 @@ struct cfs_rq {
> > unsigned long runnable_avg;
> > } removed;
> >
> > + struct plist_head pushable_tasks;
> > +
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > u64 last_update_tg_load_avg;
> > unsigned long tg_load_avg_contrib;
> > @@ -3620,6 +3622,8 @@ static inline bool sched_energy_enabled(void) { return false; }
> >
> > #endif /* !(CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL) */
> >
> > +DECLARE_STATIC_KEY_FALSE(sched_push_task);
> > +
> > #ifdef CONFIG_MEMBARRIER
> >
> > /*
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 47+ messages in thread
* [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (3 preceding siblings ...)
2025-12-02 18:12 ` [PATCH 4/6 v8] sched/fair: Add push task mechanism for fair Vincent Guittot
@ 2025-12-02 18:12 ` Vincent Guittot
2025-12-05 15:52 ` Christian Loehle
2025-12-08 18:43 ` Christian Loehle
2025-12-02 18:12 ` [RFC PATCH 6/6 v8] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
` (4 subsequent siblings)
9 siblings, 2 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-02 18:12 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
Enable the has_idle_cores tracking at LLC level for !SMT systems, for which
a CPU equals a core.
We don't enable the has_idle_core path of select_idle_cpu(), to stay
conservative and not parse all CPUs of the LLC.
As of now, has_idle_cores can be cleared even if a CPU is idle because of
SIS_UTIL, but that looks reasonable as the probability to get an idle CPU is
low anyway.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 29 +++++++----------------------
kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++-------------
2 files changed, 36 insertions(+), 35 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 252254168c92..0c0c675f39cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7501,19 +7501,6 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
return -1;
}
-#ifdef CONFIG_SCHED_SMT
-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
-EXPORT_SYMBOL_GPL(sched_smt_present);
-
-static inline void set_idle_cores(int cpu, int val)
-{
- struct sched_domain_shared *sds;
-
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- WRITE_ONCE(sds->has_idle_cores, val);
-}
-
static inline bool test_idle_cores(int cpu)
{
struct sched_domain_shared *sds;
@@ -7525,6 +7512,10 @@ static inline bool test_idle_cores(int cpu)
return false;
}
+#ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+EXPORT_SYMBOL_GPL(sched_smt_present);
+
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
* information in sd_llc_shared->has_idle_cores.
@@ -7612,15 +7603,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
#else /* !CONFIG_SCHED_SMT: */
-static inline void set_idle_cores(int cpu, int val)
-{
-}
-
-static inline bool test_idle_cores(int cpu)
-{
- return false;
-}
-
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core, p);
@@ -7886,6 +7868,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
+ if (!sched_smt_active())
+ set_idle_cores(target, 0);
+
/*
* For cluster machines which have lower sharing cache like L2 or
* LLC Tag, we tend to find an idle CPU in the target's cluster
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 697bd654298a..b9e228333d5e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1591,19 +1591,6 @@ do { \
flags = _raw_spin_rq_lock_irqsave(rq); \
} while (0)
-#ifdef CONFIG_SCHED_SMT
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
- if (static_branch_unlikely(&sched_smt_present))
- __update_idle_core(rq);
-}
-
-#else /* !CONFIG_SCHED_SMT: */
-static inline void update_idle_core(struct rq *rq) { }
-#endif /* !CONFIG_SCHED_SMT */
-
#ifdef CONFIG_FAIR_GROUP_SCHED
static inline struct task_struct *task_of(struct sched_entity *se)
@@ -2091,6 +2078,35 @@ static __always_inline bool sched_asym_cpucap_active(void)
return static_branch_unlikely(&sched_asym_cpucapacity);
}
+static inline void set_idle_cores(int cpu, int val)
+{
+ struct sched_domain_shared *sds;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ WRITE_ONCE(sds->has_idle_cores, val);
+}
+
+#ifdef CONFIG_SCHED_SMT
+extern void __update_idle_core(struct rq *rq);
+
+static inline void update_idle_core(struct rq *rq)
+{
+ if (static_branch_unlikely(&sched_smt_present))
+ __update_idle_core(rq);
+ else
+ set_idle_cores(cpu_of(rq), 1);
+
+}
+
+#else /* !CONFIG_SCHED_SMT: */
+static inline void update_idle_core(struct rq *rq)
+{
+ set_idle_cores(cpu_of(rq), 1);
+}
+#endif /* !CONFIG_SCHED_SMT */
+
+
struct sched_group_capacity {
atomic_t ref;
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT
2025-12-02 18:12 ` [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT Vincent Guittot
@ 2025-12-05 15:52 ` Christian Loehle
2025-12-06 2:11 ` Chen, Yu C
2025-12-06 10:09 ` Vincent Guittot
2025-12-08 18:43 ` Christian Loehle
1 sibling, 2 replies; 47+ messages in thread
From: Christian Loehle @ 2025-12-05 15:52 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, pierre.gondois,
kprateek.nayak
Cc: qyousef, hongyan.xia2, luis.machado
On 12/2/25 18:12, Vincent Guittot wrote:
> Enable the has_idle_cores feature at LLC level for !SMT systems, for which
> a CPU equals a core.
>
> To stay conservative, we don't enable the has_idle_core feature of
> select_idle_cpu() and don't scan all CPUs of the LLC.
>
> For now, has_idle_cores can be cleared even if a CPU is idle because of
> SIS_UTIL, but this looks reasonable as the probability of finding an idle
> CPU is low anyway.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 29 +++++++----------------------
> kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++-------------
> 2 files changed, 36 insertions(+), 35 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 252254168c92..0c0c675f39cf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7501,19 +7501,6 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> return -1;
> }
>
> -#ifdef CONFIG_SCHED_SMT
> -DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> -EXPORT_SYMBOL_GPL(sched_smt_present);
> -
> -static inline void set_idle_cores(int cpu, int val)
> -{
> - struct sched_domain_shared *sds;
> -
> - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> - if (sds)
> - WRITE_ONCE(sds->has_idle_cores, val);
> -}
> -
> static inline bool test_idle_cores(int cpu)
> {
> struct sched_domain_shared *sds;
> @@ -7525,6 +7512,10 @@ static inline bool test_idle_cores(int cpu)
> return false;
> }
>
> +#ifdef CONFIG_SCHED_SMT
> +DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> +EXPORT_SYMBOL_GPL(sched_smt_present);
> +
> /*
> * Scans the local SMT mask to see if the entire core is idle, and records this
> * information in sd_llc_shared->has_idle_cores.
> @@ -7612,15 +7603,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>
> #else /* !CONFIG_SCHED_SMT: */
>
> -static inline void set_idle_cores(int cpu, int val)
> -{
> -}
> -
> -static inline bool test_idle_cores(int cpu)
> -{
> - return false;
> -}
> -
> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> {
> return __select_idle_cpu(core, p);
> @@ -7886,6 +7868,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
a
> +
> /*
> * For cluster machines which have lower sharing cache like L2 or
> * LLC Tag, we tend to find an idle CPU in the target's cluster
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 697bd654298a..b9e228333d5e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1591,19 +1591,6 @@ do { \
> flags = _raw_spin_rq_lock_irqsave(rq); \
> } while (0)
>
> -#ifdef CONFIG_SCHED_SMT
> -extern void __update_idle_core(struct rq *rq);
> -
> -static inline void update_idle_core(struct rq *rq)
> -{
> - if (static_branch_unlikely(&sched_smt_present))
> - __update_idle_core(rq);
> -}
> -
> -#else /* !CONFIG_SCHED_SMT: */
> -static inline void update_idle_core(struct rq *rq) { }
> -#endif /* !CONFIG_SCHED_SMT */
> -
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> static inline struct task_struct *task_of(struct sched_entity *se)
> @@ -2091,6 +2078,35 @@ static __always_inline bool sched_asym_cpucap_active(void)
> return static_branch_unlikely(&sched_asym_cpucapacity);
> }
>
> +static inline void set_idle_cores(int cpu, int val)
> +{
> + struct sched_domain_shared *sds;
> +
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (sds)
> + WRITE_ONCE(sds->has_idle_cores, val);
> +}
FWIW this triggers
[ 0.172174] =============================
[ 0.172177] WARNING: suspicious RCU usage
[ 0.172179] 6.18.0-rc7-cix-build+ #215 Not tainted
[ 0.172184] Detected PIPT I-cache on CPU1
[ 0.178161] -----------------------------
[ 0.178163] kernel/sched/sched.h:2085 suspicious rcu_dereference_check() usage!
[ 0.178165]
other info that might help us debug this:
[ 0.178177] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64MMFR1_EL1. Boot CPU: 0x1001111010312122, CPU1: 0x1001111011312122
[ 0.182211]
rcu_scheduler_active = 1, debug_locks = 1
[ 0.182213] 4 locks held by swapper/0/1:
[ 0.182224] CPU features: Unsupported CPU feature variation detected.
[ 0.186260] #0: ffff800082b2bf00
[ 0.186277] GICv3: CPU1: found redistributor 0 region 0:0x000000000e090000
[ 0.191101] (cpu_add_remove_lock){+.+.}-{4:4}, at: cpu_up+0x90/0x158
[ 0.191115] GICv3: CPU1: using allocated LPI pending table @0x0000000100330000
[ 0.195158] #1: ffff800082b2c0a0 (cpu_hotplug_lock
[ 0.195277] CPU1: Booted secondary processor 0x0000000000 [0x410fd801]
[ 0.199208] ){++++}-{0:0}, at: _cpu_up+0x58/0x268
[ 0.199213] #2: ffff800082ebddd0 (sparse_irq_lock){+.+.}-{4:4}, at: irq_lock_sparse+0x20/0x2c
[ 0.293548] #3: ffff0001feec1c18 (&rq->__lock){-...}-{2:2}, at: __schedule+0x144/0x1058
[ 0.301737]
stack backtrace:
[ 0.306136] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G S 6.18.0-rc7-cix-build+ #215 PREEMPT
[ 0.306141] Tainted: [S]=CPU_OUT_OF_SPEC
[ 0.306144] Call trace:
[ 0.306145] show_stack+0x18/0x24 (C)
[ 0.306150] dump_stack_lvl+0x90/0xd0
[ 0.306155] dump_stack+0x18/0x24
[ 0.306159] lockdep_rcu_suspicious+0x168/0x238
[ 0.306164] set_next_task_idle+0x144/0x148
[ 0.306167] __schedule+0xc50/0x1058
[ 0.306171] schedule+0x48/0x15c
[ 0.306173] schedule_timeout+0x90/0x128
[ 0.306177] wait_for_completion_timeout+0x88/0x13c
[ 0.306180] __cpu_up+0x80/0x1e4
[ 0.306186] bringup_cpu+0x48/0x2ac
[ 0.306189] cpuhp_invoke_callback+0x18c/0x358
[ 0.306191] __cpuhp_invoke_callback_range+0xf4/0x130
[ 0.306194] _cpu_up+0x150/0x268
[ 0.306196] cpu_up+0xcc/0x158
[ 0.306199] bringup_nonboot_cpus+0x84/0xcc
[ 0.306203] smp_init+0x30/0x8c
[ 0.306208] kernel_init_freeable+0x18c/0x504
[ 0.306215] kernel_init+0x20/0x1d8
[ 0.306218] ret_from_fork+0x10/0x20
on my machine...
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT
2025-12-05 15:52 ` Christian Loehle
@ 2025-12-06 2:11 ` Chen, Yu C
2025-12-06 10:18 ` Vincent Guittot
2025-12-06 10:09 ` Vincent Guittot
1 sibling, 1 reply; 47+ messages in thread
From: Chen, Yu C @ 2025-12-06 2:11 UTC (permalink / raw)
To: Christian Loehle
Cc: qyousef, hongyan.xia2, luis.machado, Vincent Guittot, mingo,
peterz, dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
vschneid, pierre.gondois, kprateek.nayak, linux-kernel
On 12/5/2025 11:52 PM, Christian Loehle wrote:
> On 12/2/25 18:12, Vincent Guittot wrote:
>> Enable the has_idle_cores feature at LLC level for !SMT systems, for which
>> a CPU equals a core.
>>
>> To stay conservative, we don't enable the has_idle_core feature of
>> select_idle_cpu() and don't scan all CPUs of the LLC.
>>
>> For now, has_idle_cores can be cleared even if a CPU is idle because of
>> SIS_UTIL, but this looks reasonable as the probability of finding an idle
>> CPU is low anyway.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>> kernel/sched/fair.c | 29 +++++++----------------------
>> kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++-------------
>> 2 files changed, 36 insertions(+), 35 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 252254168c92..0c0c675f39cf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7501,19 +7501,6 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>> return -1;
>> }
>>
>> -#ifdef CONFIG_SCHED_SMT
>> -DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>> -EXPORT_SYMBOL_GPL(sched_smt_present);
>> -
>> -static inline void set_idle_cores(int cpu, int val)
>> -{
>> - struct sched_domain_shared *sds;
>> -
>> - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> - if (sds)
>> - WRITE_ONCE(sds->has_idle_cores, val);
>> -}
>> -
>> static inline bool test_idle_cores(int cpu)
>> {
>> struct sched_domain_shared *sds;
>> @@ -7525,6 +7512,10 @@ static inline bool test_idle_cores(int cpu)
>> return false;
>> }
>>
>> +#ifdef CONFIG_SCHED_SMT
>> +DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>> +EXPORT_SYMBOL_GPL(sched_smt_present);
>> +
>> /*
>> * Scans the local SMT mask to see if the entire core is idle, and records this
>> * information in sd_llc_shared->has_idle_cores.
>> @@ -7612,15 +7603,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>>
>> #else /* !CONFIG_SCHED_SMT: */
>>
>> -static inline void set_idle_cores(int cpu, int val)
>> -{
>> -}
>> -
>> -static inline bool test_idle_cores(int cpu)
>> -{
>> - return false;
>> -}
>> -
>> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
>> {
>> return __select_idle_cpu(core, p);
>> @@ -7886,6 +7868,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> if ((unsigned)i < nr_cpumask_bits)
>> return i;
>>
> a
>> +
>> /*
>> * For cluster machines which have lower sharing cache like L2 or
>> * LLC Tag, we tend to find an idle CPU in the target's cluster
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 697bd654298a..b9e228333d5e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1591,19 +1591,6 @@ do { \
>> flags = _raw_spin_rq_lock_irqsave(rq); \
>> } while (0)
>>
>> -#ifdef CONFIG_SCHED_SMT
>> -extern void __update_idle_core(struct rq *rq);
>> -
>> -static inline void update_idle_core(struct rq *rq)
>> -{
>> - if (static_branch_unlikely(&sched_smt_present))
>> - __update_idle_core(rq);
>> -}
>> -
>> -#else /* !CONFIG_SCHED_SMT: */
>> -static inline void update_idle_core(struct rq *rq) { }
>> -#endif /* !CONFIG_SCHED_SMT */
>> -
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>>
>> static inline struct task_struct *task_of(struct sched_entity *se)
>> @@ -2091,6 +2078,35 @@ static __always_inline bool sched_asym_cpucap_active(void)
>> return static_branch_unlikely(&sched_asym_cpucapacity);
>> }
>>
>> +static inline void set_idle_cores(int cpu, int val)
>> +{
>> + struct sched_domain_shared *sds;
>> +
>> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> + if (sds)
>> + WRITE_ONCE(sds->has_idle_cores, val);
>> +}
>
> FWIW this triggers
> [ 0.172174] =============================
> [ 0.172177] WARNING: suspicious RCU usage
> [ 0.172179] 6.18.0-rc7-cix-build+ #215 Not tainted
> [ 0.172184] Detected PIPT I-cache on CPU1
> [ 0.178161] -----------------------------
> [ 0.178163] kernel/sched/sched.h:2085 suspicious rcu_dereference_check() usage!
> [ 0.178165]
> other info that might help us debug this:
>
> [ 0.178177] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64MMFR1_EL1. Boot CPU: 0x1001111010312122, CPU1: 0x1001111011312122
> [ 0.182211]
> rcu_scheduler_active = 1, debug_locks = 1
> [ 0.182213] 4 locks held by swapper/0/1:
> [ 0.182224] CPU features: Unsupported CPU feature variation detected.
> [ 0.186260] #0: ffff800082b2bf00
> [ 0.186277] GICv3: CPU1: found redistributor 0 region 0:0x000000000e090000
> [ 0.191101] (cpu_add_remove_lock){+.+.}-{4:4}, at: cpu_up+0x90/0x158
> [ 0.191115] GICv3: CPU1: using allocated LPI pending table @0x0000000100330000
> [ 0.195158] #1: ffff800082b2c0a0 (cpu_hotplug_lock
> [ 0.195277] CPU1: Booted secondary processor 0x0000000000 [0x410fd801]
> [ 0.199208] ){++++}-{0:0}, at: _cpu_up+0x58/0x268
> [ 0.199213] #2: ffff800082ebddd0 (sparse_irq_lock){+.+.}-{4:4}, at: irq_lock_sparse+0x20/0x2c
> [ 0.293548] #3: ffff0001feec1c18 (&rq->__lock){-...}-{2:2}, at: __schedule+0x144/0x1058
> [ 0.301737]
> stack backtrace:
> [ 0.306136] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G S 6.18.0-rc7-cix-build+ #215 PREEMPT
> [ 0.306141] Tainted: [S]=CPU_OUT_OF_SPEC
> [ 0.306144] Call trace:
> [ 0.306145] show_stack+0x18/0x24 (C)
> [ 0.306150] dump_stack_lvl+0x90/0xd0
> [ 0.306155] dump_stack+0x18/0x24
> [ 0.306159] lockdep_rcu_suspicious+0x168/0x238
> [ 0.306164] set_next_task_idle+0x144/0x148
> [ 0.306167] __schedule+0xc50/0x1058
> [ 0.306171] schedule+0x48/0x15c
> [ 0.306173] schedule_timeout+0x90/0x128
> [ 0.306177] wait_for_completion_timeout+0x88/0x13c
> [ 0.306180] __cpu_up+0x80/0x1e4
> [ 0.306186] bringup_cpu+0x48/0x2ac
> [ 0.306189] cpuhp_invoke_callback+0x18c/0x358
> [ 0.306191] __cpuhp_invoke_callback_range+0xf4/0x130
> [ 0.306194] _cpu_up+0x150/0x268
> [ 0.306196] cpu_up+0xcc/0x158
> [ 0.306199] bringup_nonboot_cpus+0x84/0xcc
> [ 0.306203] smp_init+0x30/0x8c
> [ 0.306208] kernel_init_freeable+0x18c/0x504
> [ 0.306215] kernel_init+0x20/0x1d8
> [ 0.306218] ret_from_fork+0x10/0x20
>
>
> on my machine...
>
update_idle_core() might need to deal with RCU protection in the original
code; maybe something like this would help:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9dfabaa314b1..4c9348075abf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2094,17 +2094,22 @@ extern void __update_idle_core(struct rq *rq);
static inline void update_idle_core(struct rq *rq)
{
- if (static_branch_unlikely(&sched_smt_present))
+ if (static_branch_unlikely(&sched_smt_present)) {
__update_idle_core(rq);
- else
+ } else {
+ rcu_read_lock();
set_idle_cores(cpu_of(rq), 1);
+ rcu_read_unlock();
+ }
}
#else /* !CONFIG_SCHED_SMT: */
static inline void update_idle_core(struct rq *rq)
{
+ rcu_read_lock();
set_idle_cores(cpu_of(rq), 1);
+ rcu_read_unlock();
}
#endif /* !CONFIG_SCHED_SMT */
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT
2025-12-06 2:11 ` Chen, Yu C
@ 2025-12-06 10:18 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-06 10:18 UTC (permalink / raw)
To: Chen, Yu C
Cc: Christian Loehle, qyousef, hongyan.xia2, luis.machado, mingo,
peterz, dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
vschneid, pierre.gondois, kprateek.nayak, linux-kernel
On Sat, 6 Dec 2025 at 03:11, Chen, Yu C <yu.c.chen@intel.com> wrote:
>
> On 12/5/2025 11:52 PM, Christian Loehle wrote:
> > On 12/2/25 18:12, Vincent Guittot wrote:
> >> Enable the has_idle_cores feature at LLC level for !SMT systems, for which
> >> a CPU equals a core.
> >>
> >> To stay conservative, we don't enable the has_idle_core feature of
> >> select_idle_cpu() and don't scan all CPUs of the LLC.
> >>
> >> For now, has_idle_cores can be cleared even if a CPU is idle because of
> >> SIS_UTIL, but this looks reasonable as the probability of finding an idle
> >> CPU is low anyway.
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >> kernel/sched/fair.c | 29 +++++++----------------------
> >> kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++-------------
> >> 2 files changed, 36 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 252254168c92..0c0c675f39cf 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7501,19 +7501,6 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >> return -1;
> >> }
> >>
> >> -#ifdef CONFIG_SCHED_SMT
> >> -DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> >> -EXPORT_SYMBOL_GPL(sched_smt_present);
> >> -
> >> -static inline void set_idle_cores(int cpu, int val)
> >> -{
> >> - struct sched_domain_shared *sds;
> >> -
> >> - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> >> - if (sds)
> >> - WRITE_ONCE(sds->has_idle_cores, val);
> >> -}
> >> -
> >> static inline bool test_idle_cores(int cpu)
> >> {
> >> struct sched_domain_shared *sds;
> >> @@ -7525,6 +7512,10 @@ static inline bool test_idle_cores(int cpu)
> >> return false;
> >> }
> >>
> >> +#ifdef CONFIG_SCHED_SMT
> >> +DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> >> +EXPORT_SYMBOL_GPL(sched_smt_present);
> >> +
> >> /*
> >> * Scans the local SMT mask to see if the entire core is idle, and records this
> >> * information in sd_llc_shared->has_idle_cores.
> >> @@ -7612,15 +7603,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> >>
> >> #else /* !CONFIG_SCHED_SMT: */
> >>
> >> -static inline void set_idle_cores(int cpu, int val)
> >> -{
> >> -}
> >> -
> >> -static inline bool test_idle_cores(int cpu)
> >> -{
> >> - return false;
> >> -}
> >> -
> >> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> >> {
> >> return __select_idle_cpu(core, p);
> >> @@ -7886,6 +7868,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >> if ((unsigned)i < nr_cpumask_bits)
> >> return i;
> >>
> >> + if (!sched_smt_active())
> >> + set_idle_cores(target, 0);
> >> +
> >> /*
> >> * For cluster machines which have lower sharing cache like L2 or
> >> * LLC Tag, we tend to find an idle CPU in the target's cluster
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 697bd654298a..b9e228333d5e 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -1591,19 +1591,6 @@ do { \
> >> flags = _raw_spin_rq_lock_irqsave(rq); \
> >> } while (0)
> >>
> >> -#ifdef CONFIG_SCHED_SMT
> >> -extern void __update_idle_core(struct rq *rq);
> >> -
> >> -static inline void update_idle_core(struct rq *rq)
> >> -{
> >> - if (static_branch_unlikely(&sched_smt_present))
> >> - __update_idle_core(rq);
> >> -}
> >> -
> >> -#else /* !CONFIG_SCHED_SMT: */
> >> -static inline void update_idle_core(struct rq *rq) { }
> >> -#endif /* !CONFIG_SCHED_SMT */
> >> -
> >> #ifdef CONFIG_FAIR_GROUP_SCHED
> >>
> >> static inline struct task_struct *task_of(struct sched_entity *se)
> >> @@ -2091,6 +2078,35 @@ static __always_inline bool sched_asym_cpucap_active(void)
> >> return static_branch_unlikely(&sched_asym_cpucapacity);
> >> }
> >>
> >> +static inline void set_idle_cores(int cpu, int val)
> >> +{
> >> + struct sched_domain_shared *sds;
> >> +
> >> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> >> + if (sds)
> >> + WRITE_ONCE(sds->has_idle_cores, val);
> >> +}
> >
> > FWIW this triggers
> > [ 0.172174] =============================
> > [ 0.172177] WARNING: suspicious RCU usage
> > [ 0.172179] 6.18.0-rc7-cix-build+ #215 Not tainted
> > [ 0.172184] Detected PIPT I-cache on CPU1
> > [ 0.178161] -----------------------------
> > [ 0.178163] kernel/sched/sched.h:2085 suspicious rcu_dereference_check() usage!
> > [ 0.178165]
> > other info that might help us debug this:
> >
> > [ 0.178177] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64MMFR1_EL1. Boot CPU: 0x1001111010312122, CPU1: 0x1001111011312122
> > [ 0.182211]
> > rcu_scheduler_active = 1, debug_locks = 1
> > [ 0.182213] 4 locks held by swapper/0/1:
> > [ 0.182224] CPU features: Unsupported CPU feature variation detected.
> > [ 0.186260] #0: ffff800082b2bf00
> > [ 0.186277] GICv3: CPU1: found redistributor 0 region 0:0x000000000e090000
> > [ 0.191101] (cpu_add_remove_lock){+.+.}-{4:4}, at: cpu_up+0x90/0x158
> > [ 0.191115] GICv3: CPU1: using allocated LPI pending table @0x0000000100330000
> > [ 0.195158] #1: ffff800082b2c0a0 (cpu_hotplug_lock
> > [ 0.195277] CPU1: Booted secondary processor 0x0000000000 [0x410fd801]
> > [ 0.199208] ){++++}-{0:0}, at: _cpu_up+0x58/0x268
> > [ 0.199213] #2: ffff800082ebddd0 (sparse_irq_lock){+.+.}-{4:4}, at: irq_lock_sparse+0x20/0x2c
> > [ 0.293548] #3: ffff0001feec1c18 (&rq->__lock){-...}-{2:2}, at: __schedule+0x144/0x1058
> > [ 0.301737]
> > stack backtrace:
> > [ 0.306136] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G S 6.18.0-rc7-cix-build+ #215 PREEMPT
> > [ 0.306141] Tainted: [S]=CPU_OUT_OF_SPEC
> > [ 0.306144] Call trace:
> > [ 0.306145] show_stack+0x18/0x24 (C)
> > [ 0.306150] dump_stack_lvl+0x90/0xd0
> > [ 0.306155] dump_stack+0x18/0x24
> > [ 0.306159] lockdep_rcu_suspicious+0x168/0x238
> > [ 0.306164] set_next_task_idle+0x144/0x148
> > [ 0.306167] __schedule+0xc50/0x1058
> > [ 0.306171] schedule+0x48/0x15c
> > [ 0.306173] schedule_timeout+0x90/0x128
> > [ 0.306177] wait_for_completion_timeout+0x88/0x13c
> > [ 0.306180] __cpu_up+0x80/0x1e4
> > [ 0.306186] bringup_cpu+0x48/0x2ac
> > [ 0.306189] cpuhp_invoke_callback+0x18c/0x358
> > [ 0.306191] __cpuhp_invoke_callback_range+0xf4/0x130
> > [ 0.306194] _cpu_up+0x150/0x268
> > [ 0.306196] cpu_up+0xcc/0x158
> > [ 0.306199] bringup_nonboot_cpus+0x84/0xcc
> > [ 0.306203] smp_init+0x30/0x8c
> > [ 0.306208] kernel_init_freeable+0x18c/0x504
> > [ 0.306215] kernel_init+0x20/0x1d8
> > [ 0.306218] ret_from_fork+0x10/0x20
> >
> >
> > on my machine...
> >
>
> update_idle_core() might need to deal with rcu protection in the original
> code, maybe something like this would help:
fair enough
>
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9dfabaa314b1..4c9348075abf 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2094,17 +2094,22 @@ extern void __update_idle_core(struct rq *rq);
>
> static inline void update_idle_core(struct rq *rq)
> {
> - if (static_branch_unlikely(&sched_smt_present))
> + if (static_branch_unlikely(&sched_smt_present)) {
> __update_idle_core(rq);
> - else
> + } else {
> + rcu_read_lock();
I will move it up so that it also covers __update_idle_core(), and remove
the rcu_read_lock()/unlock() pair inside the latter
> set_idle_cores(cpu_of(rq), 1);
> + rcu_read_unlock();
> + }
>
> }
>
> #else /* !CONFIG_SCHED_SMT: */
> static inline void update_idle_core(struct rq *rq)
> {
> + rcu_read_lock();
> set_idle_cores(cpu_of(rq), 1);
> + rcu_read_unlock();
> }
> #endif /* !CONFIG_SCHED_SMT */
>
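For reference, a minimal sketch of what that consolidation could look like
(an interpretation of Vincent's remark above, not a posted patch; the RCU
read section is taken once at the top and __update_idle_core() would then
drop its own rcu_read_lock()/rcu_read_unlock() pair):

#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);

static inline void update_idle_core(struct rq *rq)
{
	/* One RCU read section covering both branches */
	rcu_read_lock();
	if (static_branch_unlikely(&sched_smt_present))
		__update_idle_core(rq);
	else
		set_idle_cores(cpu_of(rq), 1);
	rcu_read_unlock();
}

#else /* !CONFIG_SCHED_SMT: */
static inline void update_idle_core(struct rq *rq)
{
	rcu_read_lock();
	set_idle_cores(cpu_of(rq), 1);
	rcu_read_unlock();
}
#endif /* !CONFIG_SCHED_SMT */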
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT
2025-12-05 15:52 ` Christian Loehle
2025-12-06 2:11 ` Chen, Yu C
@ 2025-12-06 10:09 ` Vincent Guittot
1 sibling, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2025-12-06 10:09 UTC (permalink / raw)
To: Christian Loehle
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
qyousef, hongyan.xia2, luis.machado
On Fri, 5 Dec 2025 at 16:53, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 12/2/25 18:12, Vincent Guittot wrote:
> > Enable the has_idle_cores feature at LLC level for !SMT systems, for which
> > a CPU equals a core.
> >
> > To stay conservative, we don't enable the has_idle_core feature of
> > select_idle_cpu() and don't scan all CPUs of the LLC.
> >
> > For now, has_idle_cores can be cleared even if a CPU is idle because of
> > SIS_UTIL, but this looks reasonable as the probability of finding an idle
> > CPU is low anyway.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 29 +++++++----------------------
> > kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++-------------
> > 2 files changed, 36 insertions(+), 35 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 252254168c92..0c0c675f39cf 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7501,19 +7501,6 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > return -1;
> > }
> >
> > -#ifdef CONFIG_SCHED_SMT
> > -DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> > -EXPORT_SYMBOL_GPL(sched_smt_present);
> > -
> > -static inline void set_idle_cores(int cpu, int val)
> > -{
> > - struct sched_domain_shared *sds;
> > -
> > - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > - if (sds)
> > - WRITE_ONCE(sds->has_idle_cores, val);
> > -}
> > -
> > static inline bool test_idle_cores(int cpu)
> > {
> > struct sched_domain_shared *sds;
> > @@ -7525,6 +7512,10 @@ static inline bool test_idle_cores(int cpu)
> > return false;
> > }
> >
> > +#ifdef CONFIG_SCHED_SMT
> > +DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> > +EXPORT_SYMBOL_GPL(sched_smt_present);
> > +
> > /*
> > * Scans the local SMT mask to see if the entire core is idle, and records this
> > * information in sd_llc_shared->has_idle_cores.
> > @@ -7612,15 +7603,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> >
> > #else /* !CONFIG_SCHED_SMT: */
> >
> > -static inline void set_idle_cores(int cpu, int val)
> > -{
> > -}
> > -
> > -static inline bool test_idle_cores(int cpu)
> > -{
> > - return false;
> > -}
> > -
> > static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> > {
> > return __select_idle_cpu(core, p);
> > @@ -7886,6 +7868,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > if ((unsigned)i < nr_cpumask_bits)
> > return i;
> >
> > + if (!sched_smt_active())
> > + set_idle_cores(target, 0);
> > +
> > /*
> > * For cluster machines which have lower sharing cache like L2 or
> > * LLC Tag, we tend to find an idle CPU in the target's cluster
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 697bd654298a..b9e228333d5e 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1591,19 +1591,6 @@ do { \
> > flags = _raw_spin_rq_lock_irqsave(rq); \
> > } while (0)
> >
> > -#ifdef CONFIG_SCHED_SMT
> > -extern void __update_idle_core(struct rq *rq);
> > -
> > -static inline void update_idle_core(struct rq *rq)
> > -{
> > - if (static_branch_unlikely(&sched_smt_present))
> > - __update_idle_core(rq);
> > -}
> > -
> > -#else /* !CONFIG_SCHED_SMT: */
> > -static inline void update_idle_core(struct rq *rq) { }
> > -#endif /* !CONFIG_SCHED_SMT */
> > -
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> >
> > static inline struct task_struct *task_of(struct sched_entity *se)
> > @@ -2091,6 +2078,35 @@ static __always_inline bool sched_asym_cpucap_active(void)
> > return static_branch_unlikely(&sched_asym_cpucapacity);
> > }
> >
> > +static inline void set_idle_cores(int cpu, int val)
> > +{
> > + struct sched_domain_shared *sds;
> > +
> > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > + if (sds)
> > + WRITE_ONCE(sds->has_idle_cores, val);
> > +}
>
> FWIW this triggers
Thanks for the report
> [ 0.172174] =============================
> [ 0.172177] WARNING: suspicious RCU usage
> [ 0.172179] 6.18.0-rc7-cix-build+ #215 Not tainted
> [ 0.172184] Detected PIPT I-cache on CPU1
> [ 0.178161] -----------------------------
> [ 0.178163] kernel/sched/sched.h:2085 suspicious rcu_dereference_check() usage!
> [ 0.178165]
> other info that might help us debug this:
>
> [ 0.178177] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64MMFR1_EL1. Boot CPU: 0x1001111010312122, CPU1: 0x1001111011312122
> [ 0.182211]
> rcu_scheduler_active = 1, debug_locks = 1
> [ 0.182213] 4 locks held by swapper/0/1:
> [ 0.182224] CPU features: Unsupported CPU feature variation detected.
> [ 0.186260] #0: ffff800082b2bf00
> [ 0.186277] GICv3: CPU1: found redistributor 0 region 0:0x000000000e090000
> [ 0.191101] (cpu_add_remove_lock){+.+.}-{4:4}, at: cpu_up+0x90/0x158
> [ 0.191115] GICv3: CPU1: using allocated LPI pending table @0x0000000100330000
> [ 0.195158] #1: ffff800082b2c0a0 (cpu_hotplug_lock
> [ 0.195277] CPU1: Booted secondary processor 0x0000000000 [0x410fd801]
> [ 0.199208] ){++++}-{0:0}, at: _cpu_up+0x58/0x268
> [ 0.199213] #2: ffff800082ebddd0 (sparse_irq_lock){+.+.}-{4:4}, at: irq_lock_sparse+0x20/0x2c
> [ 0.293548] #3: ffff0001feec1c18 (&rq->__lock){-...}-{2:2}, at: __schedule+0x144/0x1058
> [ 0.301737]
> stack backtrace:
> [ 0.306136] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G S 6.18.0-rc7-cix-build+ #215 PREEMPT
> [ 0.306141] Tainted: [S]=CPU_OUT_OF_SPEC
> [ 0.306144] Call trace:
> [ 0.306145] show_stack+0x18/0x24 (C)
> [ 0.306150] dump_stack_lvl+0x90/0xd0
> [ 0.306155] dump_stack+0x18/0x24
> [ 0.306159] lockdep_rcu_suspicious+0x168/0x238
> [ 0.306164] set_next_task_idle+0x144/0x148
> [ 0.306167] __schedule+0xc50/0x1058
> [ 0.306171] schedule+0x48/0x15c
> [ 0.306173] schedule_timeout+0x90/0x128
> [ 0.306177] wait_for_completion_timeout+0x88/0x13c
> [ 0.306180] __cpu_up+0x80/0x1e4
> [ 0.306186] bringup_cpu+0x48/0x2ac
> [ 0.306189] cpuhp_invoke_callback+0x18c/0x358
> [ 0.306191] __cpuhp_invoke_callback_range+0xf4/0x130
> [ 0.306194] _cpu_up+0x150/0x268
> [ 0.306196] cpu_up+0xcc/0x158
> [ 0.306199] bringup_nonboot_cpus+0x84/0xcc
> [ 0.306203] smp_init+0x30/0x8c
> [ 0.306208] kernel_init_freeable+0x18c/0x504
> [ 0.306215] kernel_init+0x20/0x1d8
> [ 0.306218] ret_from_fork+0x10/0x20
>
>
> on my machine...
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT
2025-12-02 18:12 ` [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT Vincent Guittot
2025-12-05 15:52 ` Christian Loehle
@ 2025-12-08 18:43 ` Christian Loehle
1 sibling, 0 replies; 47+ messages in thread
From: Christian Loehle @ 2025-12-08 18:43 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, pierre.gondois,
kprateek.nayak
Cc: qyousef, hongyan.xia2, luis.machado
On 12/2/25 18:12, Vincent Guittot wrote:
> Enable the has_idle_cores feature at LLC level for !SMT systems, for which
> a CPU equals a core.
>
> To stay conservative, we don't enable the has_idle_core feature of
> select_idle_cpu() and don't scan all CPUs of the LLC.
>
> For now, has_idle_cores can be cleared even if a CPU is idle because of
> SIS_UTIL, but this looks reasonable as the probability of finding an idle
> CPU is low anyway.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 29 +++++++----------------------
> kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++-------------
> 2 files changed, 36 insertions(+), 35 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 252254168c92..0c0c675f39cf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> [snip]
> @@ -7849,80 +7831,83 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> }
>
> /*
> * For asymmetric CPU capacity systems, our domain of interest is
> * sd_asym_cpucapacity rather than sd_llc.
> */
> if (sched_asym_cpucap_active()) {
> sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
> /*
> * On an asymmetric CPU capacity system where an exclusive
> * cpuset defines a symmetric island (i.e. one unique
> * capacity_orig value through the cpuset), the key will be set
> * but the CPUs within that cpuset will not have a domain with
> * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
> * capacity path.
> */
> if (sd) {
> i = select_idle_capacity(p, sd, target);
> return ((unsigned)i < nr_cpumask_bits) ? i : target;
> }
> }
>
> sd = rcu_dereference(per_cpu(sd_llc, target));
> if (!sd)
> return target;
>
> if (sched_smt_active()) {
> has_idle_core = test_idle_cores(target);
>
> if (!has_idle_core && cpus_share_cache(prev, target)) {
> i = select_idle_smt(p, sd, prev);
> if ((unsigned int)i < nr_cpumask_bits)
> return i;
> }
> }
>
> i = select_idle_cpu(p, sd, has_idle_core, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (!sched_smt_active())
> + set_idle_cores(target, 0);
I have added some more context for the patch that makes it rather obvious that
this is broken. For asym systems (and their subset EAS, which this series is
concerned with) the above is unreachable, as either select_idle_capacity() will
find a CPU or target is returned.
Thus this code here is never reached, idle_cores will be set but never cleared,
and the push mechanism keeps triggering when it shouldn't.
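To illustrate the point, one hypothetical way to make the clearing reachable
on asymmetric systems would be to also clear the hint on the early-return
path of select_idle_sibling() (a sketch only, nothing that was posted in
this thread, and the exact condition would need care since an idle CPU may
still exist that simply doesn't fit the task):

	if (sched_asym_cpucap_active()) {
		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
		if (sd) {
			i = select_idle_capacity(p, sd, target);
			/* Hypothetical: no idle fitting CPU found, clear the hint */
			if (!sched_smt_active() && (unsigned)i >= nr_cpumask_bits)
				set_idle_cores(target, 0);
			return ((unsigned)i < nr_cpumask_bits) ? i : target;
		}
	}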
^ permalink raw reply [flat|nested] 47+ messages in thread
* [RFC PATCH 6/6 v8] sched/fair: Add EAS and idle cpu push trigger
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (4 preceding siblings ...)
2025-12-02 18:12 ` [RFC PATCH 5/6 v8] sched/fair: Enable idle core tracking for !SMT Vincent Guittot
@ 2025-12-02 18:12 ` Vincent Guittot
2026-02-06 18:30 ` Qais Yousef
2025-12-03 14:06 ` [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Christian Loehle
` (3 subsequent siblings)
9 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2025-12-02 18:12 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado,
Vincent Guittot
EAS is based on wakeup events to efficiently place tasks on the system, but
there are cases where a task doesn't have wakeup events anymore, or has them
at far too low a pace. For such cases, we check if it's worth pushing the
task onto another CPU instead of putting it back in the enqueued list.
Wakeup events remain the main way to migrate tasks, but we now detect
situations where a task is stuck on a CPU by checking that its utilization
is larger than the max available compute capacity (max CPU capacity or
uclamp max setting).
When the system becomes overutilized and some CPUs are idle, we try to push
tasks instead of waiting for the periodic load balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/topology.c | 2 ++
2 files changed, 66 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c0c675f39cf..e9e1d0c05805 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8500,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
return static_branch_unlikely(&sched_push_task);
}
+static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
+{
+ unsigned long max_capa, util;
+
+ max_capa = min(get_actual_cpu_capacity(cpu),
+ uclamp_eff_value(p, UCLAMP_MAX));
+ util = max(task_util_est(p), task_runnable(p));
+
+ /*
+ * Return true only if the task might not sleep/wakeup because of a low
+ * compute capacity. Tasks, which wake up regularly, will be handled by
+ * feec().
+ */
+ return (util > max_capa);
+}
+
+static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
+{
+ if (!sched_energy_enabled())
+ return false;
+
+ if (is_rd_overutilized(rq->rd))
+ return false;
+
+ if (task_stuck_on_cpu(p, cpu_of(rq)))
+ return true;
+
+ if (!task_fits_cpu(p, cpu_of(rq)))
+ return true;
+
+ return false;
+}
+
+static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
+{
+ if (rq->nr_running == 1)
+ return false;
+
+ if (!is_rd_overutilized(rq->rd))
+ return false;
+
+ /* If there are idle cpus in the llc then try to push the task on it */
+ if (test_idle_cores(cpu_of(rq)))
+ return true;
+
+ return false;
+}
+
+
static bool fair_push_task(struct rq *rq, struct task_struct *p)
{
+ if (!task_on_rq_queued(p))
+ return false;
+
+ if (p->se.sched_delayed)
+ return false;
+
+ if (p->nr_cpus_allowed == 1)
+ return false;
+
+ if (sched_energy_push_task(p, rq))
+ return true;
+
+ if (sched_idle_push_task(p, rq))
+ return true;
+
return false;
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..00abd01acb84 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -391,10 +391,12 @@ static void sched_energy_set(bool has_eas)
if (sched_debug())
pr_info("%s: stopping EAS\n", __func__);
static_branch_disable_cpuslocked(&sched_energy_present);
+ static_branch_dec_cpuslocked(&sched_push_task);
} else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
if (sched_debug())
pr_info("%s: starting EAS\n", __func__);
static_branch_enable_cpuslocked(&sched_energy_present);
+ static_branch_inc_cpuslocked(&sched_push_task);
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [RFC PATCH 6/6 v8] sched/fair: Add EAS and idle cpu push trigger
2025-12-02 18:12 ` [RFC PATCH 6/6 v8] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
@ 2026-02-06 18:30 ` Qais Yousef
2026-02-09 13:20 ` Vincent Guittot
0 siblings, 1 reply; 47+ messages in thread
From: Qais Yousef @ 2026-02-06 18:30 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On 12/02/25 19:12, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore, or has them
> at far too low a pace. For such cases, we check if it's worth pushing the
> task onto another CPU instead of putting it back in the enqueued list.
>
> Wakeup events remain the main way to migrate tasks, but we now detect
> situations where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max CPU capacity or
> uclamp max setting).
>
> When the system becomes overutilized and some CPUs are idle, we try to push
> tasks instead of waiting for the periodic load balance.
I am fine with this wording, but I think "enable lb based on power" is a very
good description too. Basically we don't have the concept of down migration
for HMP systems to help save power for tasks that are hinted, via uclamp_max,
as being fine with running at a lower performance level.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched/topology.c | 2 ++
> 2 files changed, 66 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c0c675f39cf..e9e1d0c05805 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8500,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> return static_branch_unlikely(&sched_push_task);
> }
>
> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned long max_capa, util;
> +
> + max_capa = min(get_actual_cpu_capacity(cpu),
> + uclamp_eff_value(p, UCLAMP_MAX));
I think we should check if uclamp_max == SCHED_CAPACITY_SCALE. By definition
such tasks are not stuck. I found that without this condition we can trigger
this a lot unnecessarily.
> + util = max(task_util_est(p), task_runnable(p));
We must take the min(util, SCHED_CAPACITY_SCALE) here, since runnable can get
too large, making the condition above true even if you are on the biggest
capacity CPU.
> +
> + /*
> + * Return true only if the task might not sleep/wakeup because of a low
> + * compute capacity. Tasks, which wake up regularly, will be handled by
> + * feec().
> + */
> + return (util > max_capa);
> +}
> +
> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (!sched_energy_enabled())
> + return false;
> +
> + if (is_rd_overutilized(rq->rd))
> + return false;
> +
> + if (task_stuck_on_cpu(p, cpu_of(rq)))
> + return true;
> +
> + if (!task_fits_cpu(p, cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (rq->nr_running == 1)
> + return false;
> +
> + if (!is_rd_overutilized(rq->rd))
> + return false;
> +
> + /* If there are idle cpus in the llc then try to push the task on it */
> + if (test_idle_cores(cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +
> static bool fair_push_task(struct rq *rq, struct task_struct *p)
> {
> + if (!task_on_rq_queued(p))
> + return false;
> +
> + if (p->se.sched_delayed)
> + return false;
> +
> + if (p->nr_cpus_allowed == 1)
> + return false;
> +
> + if (sched_energy_push_task(p, rq))
> + return true;
> +
> + if (sched_idle_push_task(p, rq))
> + return true;
In my testing (of an earlier version of the patch) I found that adding a new
is_rq_overloaded(rq) test, which simply checks if rq->nr_running > 1, is
helpful to make the whole regular lb not required at all (get rid of
overutilized). Still testing it though; something to consider now or later,
I don't mind.
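Presumably something as simple as this (hypothetical helper name, for
illustration only):

static inline bool is_rq_overloaded(struct rq *rq)
{
	/* More than one runnable task: a candidate for pushing exists */
	return rq->nr_running > 1;
}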
> +
> return false;
> }
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..00abd01acb84 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -391,10 +391,12 @@ static void sched_energy_set(bool has_eas)
> if (sched_debug())
> pr_info("%s: stopping EAS\n", __func__);
> static_branch_disable_cpuslocked(&sched_energy_present);
> + static_branch_dec_cpuslocked(&sched_push_task);
> } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
> if (sched_debug())
> pr_info("%s: starting EAS\n", __func__);
> static_branch_enable_cpuslocked(&sched_energy_present);
> + static_branch_inc_cpuslocked(&sched_push_task);
> }
> }
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [RFC PATCH 6/6 v8] sched/fair: Add EAS and idle cpu push trigger
2026-02-06 18:30 ` Qais Yousef
@ 2026-02-09 13:20 ` Vincent Guittot
2026-02-11 0:59 ` Qais Yousef
0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2026-02-09 13:20 UTC (permalink / raw)
To: Qais Yousef
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On Fri, 6 Feb 2026 at 19:30, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/02/25 19:12, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore, or has them
> > at far too low a pace. For such cases, we check if it's worth pushing the
> > task onto another CPU instead of putting it back in the enqueued list.
> >
> > Wakeup events remain the main way to migrate tasks, but we now detect
> > situations where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max CPU capacity or
> > uclamp max setting).
> >
> > When the system becomes overutilized and some CPUs are idle, we try to push
> > tasks instead of waiting for the periodic load balance.
>
> I am fine with this wording, but I think "enable lb based on power" is a very
> good description too. Basically we don't have the concept of down migration
> for HMP systems to help save power for tasks that are hinted, via uclamp_max,
> as being fine with running at a lower performance level.
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/topology.c | 2 ++
> > 2 files changed, 66 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0c0c675f39cf..e9e1d0c05805 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8500,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> > return static_branch_unlikely(&sched_push_task);
> > }
> >
> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned long max_capa, util;
> > +
> > + max_capa = min(get_actual_cpu_capacity(cpu),
> > + uclamp_eff_value(p, UCLAMP_MAX));
>
> I think we should check if uclamp_max == SCHED_CAPACITY_SCALE. By definition
> such tasks are not stuck. I found that without this condition we can trigger
> this a lot unnecessarily.
okay, will check
>
> > + util = max(task_util_est(p), task_runnable(p));
>
> We must take the min(util, SCHED_CAPACITY_SCALE) here, since runnable can get
> too large, making the condition above true even if you are on the biggest
> capacity CPU.
hmm, task_runnable should not go above SCHED_CAPACITY_SCALE. Have you seen
cases where a task's runnable_avg goes above SCHED_CAPACITY_SCALE?
In fact neither task_util_est nor task_runnable should go above
SCHED_CAPACITY_SCALE
>
> > +
> > + /*
> > + * Return true only if the task might not sleep/wakeup because of a low
> > + * compute capacity. Tasks, which wake up regularly, will be handled by
> > + * feec().
> > + */
> > + return (util > max_capa);
> > +}
> > +
> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (!sched_energy_enabled())
> > + return false;
> > +
> > + if (is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + if (!task_fits_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (rq->nr_running == 1)
> > + return false;
> > +
> > + if (!is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + /* If there are idle cpus in the llc then try to push the task on it */
> > + if (test_idle_cores(cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +
> > static bool fair_push_task(struct rq *rq, struct task_struct *p)
> > {
> > + if (!task_on_rq_queued(p))
> > + return false;
> > +
> > + if (p->se.sched_delayed)
> > + return false;
> > +
> > + if (p->nr_cpus_allowed == 1)
> > + return false;
> > +
> > + if (sched_energy_push_task(p, rq))
> > + return true;
> > +
> > + if (sched_idle_push_task(p, rq))
> > + return true;
>
> In my testing (of an earlier version of the patch) I found that adding a new
> is_rq_overloaded(rq) test, which simply checks if rq->nr_running > 1, is
> helpful to make the whole regular lb not required at all (get rid of
> overutilized). Still testing it though; something to consider now or later,
> I don't mind.
I was conservative and didn't want to trigger push too often but it
might end up being better. I will check
>
> > +
> > return false;
> > }
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index cf643a5ddedd..00abd01acb84 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -391,10 +391,12 @@ static void sched_energy_set(bool has_eas)
> > if (sched_debug())
> > pr_info("%s: stopping EAS\n", __func__);
> > static_branch_disable_cpuslocked(&sched_energy_present);
> > + static_branch_dec_cpuslocked(&sched_push_task);
> > } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
> > if (sched_debug())
> > pr_info("%s: starting EAS\n", __func__);
> > static_branch_enable_cpuslocked(&sched_energy_present);
> > + static_branch_inc_cpuslocked(&sched_push_task);
> > }
> > }
> >
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [RFC PATCH 6/6 v8] sched/fair: Add EAS and idle cpu push trigger
2026-02-09 13:20 ` Vincent Guittot
@ 2026-02-11 0:59 ` Qais Yousef
0 siblings, 0 replies; 47+ messages in thread
From: Qais Yousef @ 2026-02-11 0:59 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On 02/09/26 14:20, Vincent Guittot wrote:
> >
> > > + util = max(task_util_est(p), task_runnable(p));
> >
> > We must take the min(util, SCHED_CAPACITY_SCALE) here, since runnable can get
> > too large, making the condition above true even if you are on the biggest
> > capacity CPU.
>
> hmm, task_runnable should not go above SCHED_CAPACITY_SCALE. Have you seen
> cases where a task's runnable_avg goes above SCHED_CAPACITY_SCALE?
>
> In fact neither task_util_est nor task_runnable should go above
> SCHED_CAPACITY_SCALE
Yes you're right. My memory is hazy now, but I recall I've seen runnable_avg
jump above 1024, but that might have been load_avg.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (5 preceding siblings ...)
2025-12-02 18:12 ` [RFC PATCH 6/6 v8] sched/fair: Add EAS and idle cpu push trigger Vincent Guittot
@ 2025-12-03 14:06 ` Christian Loehle
2025-12-10 13:30 ` Dietmar Eggemann
` (2 subsequent siblings)
9 siblings, 0 replies; 47+ messages in thread
From: Christian Loehle @ 2025-12-03 14:06 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, pierre.gondois,
kprateek.nayak
Cc: qyousef, hongyan.xia2, luis.machado
On 12/2/25 18:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp as an example. This
> series tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
This needs elaboration IMO as "tasks stacked on the same CPU of a PD" isn't
really an issue per se? What's the scenario being fixed here?
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.
nit: here's still the mecanism typo :)
>
> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.
s/!SMP/!SMT/
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
I'd find it helpful to have the motivation spelled out more verbosely here.
Why are there tasks stuck? UCLAMP_MAX? Temporarily reduced capacity?
Would be nice to have a very concrete list of scenarios/issues in mind that
are being fixed and a description of how they're fixed by this patchset.
(e.g. current behaviour, new behaviour, reason why this behaviour is the
'more' correct one).
>
> More test results will come later as I wanted to send the patchset before
> LPC.
>
> I have kept Tbench figures as I added them in v7 but results are the same
> with the correct patch 6.
Ah, I was confused by this sentence at first. So now, for v8, both hackbench
and tbench are the same for baseline and patchset.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process tip +patchset
> 1 29.3(+/-0.3%) 29.2(+/-0.2%) +0%
> 2 61.1(+/-1.8%) 61.7(+/-3.2%) +1%
> 4 260.0(+/-1.7%) 258.8(+/-2.8%) -1%
> 8 1361.2(+/-3.1%) 1377.1(+/-1.9%) +1%
> 16 981.5(+/-0.6%) 958.0(+/-1.7%) -2%
So I've done some analysis on tbench in the meantime, at least for the
1-process case, because I was puzzled by your v7 result. Indeed there are
plenty of wakeups; in a 10s run I see 62806 tbench wakeups, with a
distribution like so (time from one wakeup to the next):
0 ms - 1 ms: 62157
1 ms - 2 ms: 44
2 ms - 3 ms: 32
3 ms - 4 ms: 5
4 ms - 5 ms: 10
5 ms - 6 ms: 6
6 ms - 7 ms: 2
7 ms - 8 ms: 2
8 ms - 9 ms: 3
12 ms - 13 ms: 2
15 ms - 16 ms: 1
16 ms - 17 ms: 1
24 ms - 25 ms: 1
95 ms - 96 ms: 1
> Hackbench didn't show any difference
hackbench is always OU once it has ramped up anyway, right? So this is
expected.
If I'm not mistaken, neither of the workloads is then likely to exercise the
changes of the series? (Both have more than enough wakeup events; hackbench
is additionally OU, so EAS is mostly skipped.)
It would be helpful for reviewing then to have a workload that benefits from
this push mechanism, maybe at least one with and one without UCLAMP_MAX?
>
> Changes since v7:
> - Rebased on latest tip/sched/core
> - Fix some typos
> - Fix patch 6 mess
>
> Vincent Guittot (6):
> sched/fair: Filter false overloaded_group case for EAS
> sched/fair: Update overutilized detection
> sched/fair: Prepare select_task_rq_fair() to be called for new cases
> sched/fair: Add push task mechanism for fair
> sched/fair: Enable idle core tracking for !SMT
> sched/fair: Add EAS and idle cpu push trigger
>
> kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
> kernel/sched/sched.h | 46 ++++--
> kernel/sched/topology.c | 2 +
> 3 files changed, 345 insertions(+), 53 deletions(-)
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (6 preceding siblings ...)
2025-12-03 14:06 ` [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Christian Loehle
@ 2025-12-10 13:30 ` Dietmar Eggemann
2026-02-06 18:32 ` Qais Yousef
2026-02-26 17:34 ` Pierre Gondois
9 siblings, 0 replies; 47+ messages in thread
From: Dietmar Eggemann @ 2025-12-10 13:30 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak
Cc: qyousef, christian.loehle
- hongyan.xia2@arm.com
- luis.machado@arm.com
On 02.12.25 19:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp as an example. This
> series tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds push call back mecanism to fair scheduler but doesn't enable
> it.
>
> Patch 5 enable has_idle_core for !SMP system to track if there may be an
> idle CPU in the LLC.
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
>
> More test results will come later as I wanted to send the patchset before
> LPC.
>
> I have kept Tbench figures as I added them in v7 but results are the same
> with the correct patch 6.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process tip +patchset
> 1 29.3(+/-0.3%) 29.2(+/-0.2%) +0%
> 2 61.1(+/-1.8%) 61.7(+/-3.2%) +1%
> 4 260.0(+/-1.7%) 258.8(+/-2.8%) -1%
> 8 1361.2(+/-3.1%) 1377.1(+/-1.9%) +1%
> 16 981.5(+/-0.6%) 958.0(+/-1.7%) -2%
>
> Hackbench didn't show any difference
I guess the overall idea here is:
-->
(1) Push runnable tasks
[pick_next|put_prev]_task_fair() -> fair_add_pushable_task() ->
fair_push_task() (*)
__set_next_task_fair() -> fair_queue_pushable_tasks() ->
queue_balance_callback(..., push_fair_tasks)
push_fair_task() -> strf(), move_queued_task() (or similar)
(2) Push single running task
tick() -> check_pushable_task() -> fair_push_task() (*), strf(),
active_balance
<--
strf() ... select_task_rq_fair(..., 0)
(1) & (2) are invoked when the policy fair_push_task() (2 parts
according to OverUtilized (OU) scenario) says the task should be moved
fair_push_task() (*)
sched_energy_push_task() - non-OU
sched_idle_push_task() - OU
Pretty complex to reason about where this could be beneficial. I'm
thinking about the interaction of (1) and (2) with wakeup & MF handling
in non-OU and with load-balance in OU.
You mentioned that you will show more test results next to tbench soon.
I don't know right now how to interpret the tbench results above.
IMHO, a set of rt-app files (customisable to a specific asymmetric CPU
capacity system, potentially with uclamp max settings) with scenarios
to provoke the new functionality would help with the
understanding/evaluating here.
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (7 preceding siblings ...)
2025-12-10 13:30 ` Dietmar Eggemann
@ 2026-02-06 18:32 ` Qais Yousef
2026-02-09 13:20 ` Vincent Guittot
2026-02-26 17:34 ` Pierre Gondois
9 siblings, 1 reply; 47+ messages in thread
From: Qais Yousef @ 2026-02-06 18:32 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On 12/02/25 19:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> become increasingly visible with features such as uclamp. This series
> tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
I think you are underselling the importance of this improvement :-)
FWIW, in my view the new mechanism will help us:
1. Improve the slow reaction time of lb. Waiting for another CPU to pull is
   very slow. And with the 4ms TICK being the default, plus back-off
   mechanisms that I believe should be demolished (for most systems), by the
   time lb kicks in, things have already gone really bad.
2. It helps implement the energy-based misfit handling I brought up in the
   past [1].
3. It brings us a step closer to unifying the wakeup and load balancer
   paths, which, as we discussed, is necessary to get decent sched QoS. We
   need to add the concept of task placement based on latency, and if lb
   can't take a similar decision, it is hard to make this useful. With push
   lb, both paths can easily follow the same decision tree.
I have backported an earlier version of this to help verify it, and so far
I think it is an amazing addition. Thanks for this!
[1] https://lore.kernel.org/lkml/20231209011759.398021-1-qyousef@layalina.io/
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent the periodic load balancer from selecting a group_misfit_task
> CPU
> because group_overloaded has higher priority.
>
> Patch 2 removes the need to test uclamp_min in cpu_overutilized to
> trigger the active migration of a task to another CPU.
>
> Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
> Exec flags when we just want to look for a possible better CPU.
>
> Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
> enable it.
>
> Patch 5 enables has_idle_core for !SMT systems to track whether there may
> be an idle CPU in the LLC.
>
> Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
> - when a task is stuck on a CPU and the system is not overutilized.
> - if there is a possible idle CPU when the system is overutilized.
>
> More test results will come later, as I wanted to send the patchset before
> LPC.
>
> I have kept the tbench figures added in v7, but the results are the same
> with the corrected patch 6.
>
> Tbench on dragonboard rb5
> schedutil and EAS enabled
>
> # process tip +patchset
> 1 29.3(+/-0.3%) 29.2(+/-0.2%) +0%
> 2 61.1(+/-1.8%) 61.7(+/-3.2%) +1%
> 4 260.0(+/-1.7%) 258.8(+/-2.8%) -1%
> 8 1361.2(+/-3.1%) 1377.1(+/-1.9%) +1%
> 16 981.5(+/-0.6%) 958.0(+/-1.7%) -2%
>
> Hackbench didn't show any difference
>
> Changes since v7:
> - Rebased on latest tip/sched/core
> - Fix some typos
> - Fix patch 6 mess
>
> Vincent Guittot (6):
> sched/fair: Filter false overloaded_group case for EAS
> sched/fair: Update overutilized detection
> sched/fair: Prepare select_task_rq_fair() to be called for new cases
> sched/fair: Add push task mechanism for fair
> sched/fair: Enable idle core tracking for !SMT
> sched/fair: Add EAS and idle cpu push trigger
>
> kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
> kernel/sched/sched.h | 46 ++++--
> kernel/sched/topology.c | 2 +
> 3 files changed, 345 insertions(+), 53 deletions(-)
>
> --
> 2.43.0
>
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2026-02-06 18:32 ` Qais Yousef
@ 2026-02-09 13:20 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-02-09 13:20 UTC (permalink / raw)
To: Qais Yousef
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, pierre.gondois, kprateek.nayak,
hongyan.xia2, christian.loehle, luis.machado
On Fri, 6 Feb 2026 at 19:32, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/02/25 19:12, Vincent Guittot wrote:
> > This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
> >
> > [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
> >
> > The current Energy Aware Scheduler has some known limitations which have
> > become increasingly visible with features such as uclamp. This series
> > tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
>
> I think you are under selling the importance of this improvement :-)
>
> FWIW, in my view the new mechanism will help us:
>
> 1. Improve slow reaction time of lb. Waiting for another CPU to pull is very
> slow. And with 4ms TICK being the default and what I believe should be
> demolished (for most systems) back off mechanisms, when lb kicks in things
> has gone really bad already.
> 2. It helps implement misfit based on energy I brought up in the past [1]
> 3. It brings up a step closer to unify wake up and load balancer paths as we
> discussed is necessary to get a decent sched qos. We need to add the concept
> of task placement based on latency, and if lb can't take similar decision it
> is hard to make this useful. With push lb, both path can easily follow up
> the same decision tree.
>
> I have backported an earlier version of this to help verify it, but so far
> I think it is an amazing addition. Thanks for this!
Thanks
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2025-12-02 18:12 [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases Vincent Guittot
` (8 preceding siblings ...)
2026-02-06 18:32 ` Qais Yousef
@ 2026-02-26 17:34 ` Pierre Gondois
2026-03-10 4:16 ` Qais Yousef
9 siblings, 1 reply; 47+ messages in thread
From: Pierre Gondois @ 2026-02-26 17:34 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak
Cc: qyousef, christian.loehle
On 12/2/25 19:12, Vincent Guittot wrote:
> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>
> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>
> The current Energy Aware Scheduler has some known limitations which have
> become increasingly visible with features such as uclamp. This series
> tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
Following some other comments, I'm not sure I understand the use case
the patchset tries to solve.
- If this is for UCLAMP_MAX tasks:
As Christian said (somewhere), the utilization of a long-running task
doesn't represent anything, so using EAS to do task placement cannot give
a good placement. The push mechanism effectively allows down-migrating
UCLAMP_MAX tasks, but the distribution of these tasks is then subject to
randomness.
On a Radxa Orion:
- 12 CPUs
- CPU[1-4] are little CPUs with capacity=290
- using an artificial EM
Running 8 CPU-bound tasks with UCLAMP_MAX=100, the task placement can be:
- CPU1: 6 tasks
- CPU2: 1 task
- CPU3: 1 task
- CPU4: idle
The push mechanism triggers feec() and down-migrates tasks to little CPUs.
However, it doesn't balance the ratio of (load / capacity) between CPUs as
the load balancer would do. So, in that regard, the above placement is
considered correct.
Another point is that it is hard to reason about what a 'fair' task
placement is for UCLAMP_MAX tasks, as their throughput is limited on
purpose.
The previous version of your patchset was trying to solve that issue,
but IMO this issue is inherent to the UCLAMP_MAX setting. EAS doesn't
consider load during task placement, as all tasks are supposed
to be ~periodic and have wake-up events. CPUs are also supposed to have
some idle time, which guarantees that tasks are never really
starving, but UCLAMP_MAX contradicts this assumption.
With:
- Task[0-1]: NICE=-19, cpumask = CPUA,CPUB
- Task[2-3]: NICE=20, cpumask = CPUA,CPUB
The following task placement:
- CPUA: Task0 + Task1
- CPUB: Task2 + Task3
is fine for EAS, but sched_balance_find_dst_cpu() would do:
- CPUA: Task0 + Task2
- CPUB: Task1 + Task3
to balance the load, which is more 'fair'.
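For reference, with the kernel's default sched_prio_to_weight[] table,
nice -19 maps to weight 71755 and nice 19 to weight 15 (nice is clamped to
the [-20, 19] range), so the two placements compare as:

  EAS-acceptable placement:  CPUA = 71755 + 71755 = 143510
                             CPUB =    15 +    15 =     30
  load-balanced placement:   CPUA = 71755 +    15 =  71770
                             CPUB = 71755 +    15 =  71770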
------------
- If this is to get better energy results by running feec() more often:
You say later in the cover letter that other numbers would come
later, so I'm curious to see the improvement.
Also, I think Christian mentioned somewhere that feec() is subject to
concurrency. I quickly got some numbers and didn't see a huge increase
in concurrent decisions with the push mechanism, but this indeed seems
like something to worry about.
feec() is costly to run; I don't have any numbers to provide for that,
though.
------------
- If this is to bail out of the OU state faster by migrating tasks to idle
CPUs, or to run feec() before a CPU is considered overutilized:
I can understand this point. When testing the patches, it seemed that
a task with inflating utilization still triggered the OU state.
Indeed, other CPUs are still going through a load balance via:
  sched_balance_find_src_group()
  \-update_sd_lb_stats()
    \-set_rd_overutilized()
and trigger the OU state, or via:
  task_tick_fair()
  \-check_pushable_task()
    \-if (rq->nr_running > 1) -> return false
  \-check_update_overutilized_status()
Also, task_stuck_on_cpu() checks whether a single task fills the CPU
capacity, not whether the CPU utilization reaches the 80% threshold.
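To make that distinction concrete, a rough sketch of the two conditions
(task_stuck_on_cpu() is from the patchset, but the bodies below are my
reconstruction of the behaviour described above, not the actual code):

/* OU detection: utilization crosses ~80% of the CPU's capacity. */
static inline bool cpu_overutilized(int cpu)
{
	/* Simplified: the real check also folds in the rq's uclamp values. */
	return !fits_capacity(cpu_util_cfs(cpu), arch_scale_cpu_capacity(cpu));
}

/* Push trigger: a single runnable task fills the whole (possibly capped)
 * capacity, regardless of the 80% margin. */
static bool task_stuck_on_cpu(struct rq *rq, struct task_struct *p)
{
	if (rq->nr_running > 1)
		return false;

	return !task_fits_cpu(p, cpu_of(rq));
}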
So I didn't see that much improvement on the OU front.
However, as Qais noted, the load balancer is effectively quite slow to
migrate misfit tasks.
The patchset runs some checks on each sched tick and each time an rq
switches to another task. If the goal is to:
- non-EAS: push misfit tasks quickly
- EAS: avoid going into the OU state
then this would already be a great improvement. I assume this would also
allow removing the misfit handling code from the load balancer.
This would also mean extending the push mechanism to all HMP systems,
not just EAS-enabled systems.
------------
Summary:
- IMO UCLAMP_MAX tasks will always be an issue for EAS. Even if these tasks
  were down-migrated, other issues would come up.
- I'm interested in seeing energy consumption improvement numbers,
  or other performance numbers.
- Following Qais (IIUC), the push mechanism could be useful to improve
  misfit task migration latency and to avoid going into the OU state. I
  tried some modifications in that sense and didn't see any showstopper so
  far. This would also allow removing some code from the load balancer.
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2026-02-26 17:34 ` Pierre Gondois
@ 2026-03-10 4:16 ` Qais Yousef
2026-03-10 10:27 ` Pierre Gondois
0 siblings, 1 reply; 47+ messages in thread
From: Qais Yousef @ 2026-03-10 4:16 UTC (permalink / raw)
To: Pierre Gondois
Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
christian.loehle
On 02/26/26 18:34, Pierre Gondois wrote:
>
> On 12/2/25 19:12, Vincent Guittot wrote:
> > This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
> >
> > [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
> >
> > The current Energy Aware Scheduler has some known limitations which have
> > become increasingly visible with features such as uclamp. This series
> > tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
>
> Following some other comments, I'm not sure I understand the use case
> the patchset tries to solve.
> - If this is for UCLAMP_MAX tasks:
> As Christian said (somewhere), the utilization of a long-running task
> doesn't represent anything, so using EAS to do task placement cannot give
> a good placement. The push mechanism effectively allows down-migrating
> UCLAMP_MAX tasks, but the distribution of these tasks is then subject to
> randomness.
Why randomness? We should distribute within the same perf domain, no?
>
> On a Radxa Orion:
> - 12 CPUs
> - CPU[1-4] are little CPUs with capa=290
> - using an artificial EM
>
> Running 8 CPU-bound tasks with UCLAMP_MAX=100, the task placement can be:
> - CPU1: 6 tasks
> - CPU2: 1 task
> - CPU3: 1 task
> - CPU4: idle
> The push mechanism triggers feec() and down-migrates tasks to little CPUs.
> However, it doesn't balance the ratio of (load / capacity) between CPUs as
> the load balancer would do. So, in that regard, the above placement is
> considered correct.
Hmm. Energy should tell us which perf domain is cheaper. But within the same
perf domain we pick the CPU with the most spare capacity.
Do all the CPUs appear loaded with max_spare_cap = 0?
Worth noting, as part of looking at enabling overloaded support, it is
important to look at nr_running, which I think is something we should
consider as we evolve this handling. But for now, I think the max_spare_cap
checks should distribute within a perf domain. nr_running will handle this
more gracefully and is trivial to add later to feec(). But ideally we want
all wakeup code to look at nr_running, and I think it is better to defer
that until after the initial merge.
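For reference, the distribution within a perf domain boils down to a
max-spare-capacity pass, roughly like the sketch below (pd_best_cpu() is a
made-up name; the real feec() additionally compares energy deltas between
perf domains). It shows why all-busy little CPUs collapse to prev_cpu:

/* Sketch: pick the max-spare-capacity CPU of one perf domain. */
static int pd_best_cpu(struct perf_domain *pd, struct task_struct *p)
{
	unsigned long max_spare_cap = 0;
	int best_cpu = -1, cpu;

	for_each_cpu_and(cpu, perf_domain_span(pd), p->cpus_ptr) {
		unsigned long cap = arch_scale_cpu_capacity(cpu);
		unsigned long util = cpu_util_cfs(cpu);
		unsigned long spare = cap > util ? cap - util : 0;

		/*
		 * Fully busy CPUs all report spare == 0, so none of them
		 * ever wins here and the task stays on prev_cpu.
		 */
		if (spare > max_spare_cap) {
			max_spare_cap = spare;
			best_cpu = cpu;
		}
	}

	return best_cpu;
}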
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2026-03-10 4:16 ` Qais Yousef
@ 2026-03-10 10:27 ` Pierre Gondois
2026-03-10 15:11 ` Qais Yousef
0 siblings, 1 reply; 47+ messages in thread
From: Pierre Gondois @ 2026-03-10 10:27 UTC (permalink / raw)
To: Qais Yousef
Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
christian.loehle
On 3/10/26 05:16, Qais Yousef wrote:
> On 02/26/26 18:34, Pierre Gondois wrote:
>> On 12/2/25 19:12, Vincent Guittot wrote:
>>> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>>>
>>> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@linaro.org/
>>>
>>> The current Energy Aware Scheduler has some known limitations which have
>>> become increasingly visible with features such as uclamp. This series
>>> tries to fix some of those issues:
>>> - tasks stacked on the same CPU of a PD
>>> - tasks stuck on the wrong CPU.
>> Following some other comments, I'm not sure I understand the use case
>> the patchset tries to solve.
>> - If this is for UCLAMP_MAX tasks:
>> As Christian said (somewhere), the utilization of a long-running task
>> doesn't represent anything, so using EAS to do task placement cannot give
>> a good placement. The push mechanism effectively allows down-migrating
>> UCLAMP_MAX tasks, but the distribution of these tasks is then subject to
>> randomness.
> Why randomness? We should distribute within the same perf domain, no?
Yes right, but see the example below: UCLAMP_MAX tasks will be
distributed regardless of the load.
>
>> On a Radxa Orion:
>> - 12 CPUs
>> - CPU[1-4] are little CPUs with capa=290
>> - using an artificial EM
>>
>> Running 8 CPU-bound tasks with UCLAMP_MAX=100, the task placement can be:
>> - CPU1: 6 tasks
>> - CPU2: 1 task
>> - CPU3: 1 task
>> - CPU4: idle
>> The push mechanism triggers feec() and down-migrates tasks to little CPUs.
>> However, it doesn't balance the ratio of (load / capacity) between CPUs as
>> the load balancer would do. So, in that regard, the above placement is
>> considered correct.
> Hmm. Energy should tell us which perf domain is cheaper. But within the same
> perf domain we pick the CPU with the most spare capacity.
>
> Do all the CPUs appear loaded with max_spare_cap = 0?
Yes, as they all have no spare cycles. This results in prev_cpu being
picked. In a way feec() does its job: this is a correct placement
energy-wise. However, feec() wasn't made to handle cases where utilization
is not reliable.
>
> Worth noting, as part of looking at enabling overloaded support, it is
> important to look at nr_running, which I think is something we should
> consider as we evolve this handling. But for now, I think the
> max_spare_cap checks should distribute within a perf domain. nr_running
> will handle this more gracefully and is trivial to add later to feec().
> But ideally we want all wakeup code to look at nr_running, and I think it
> is better to defer that until after the initial merge.
If we have 2 little CPUs (CPU0/CPU1) with 4 tasks:
- TaskA: Nice=10 (i.e. weight=110)
- Task[B,C,D]: Nice=15 (i.e. weight=36)
Then using nr_running would yield a placement with 2 tasks
on each CPU:
- CPU0: TaskA + TaskB
  Total weight = 110 + 36 = 146
- CPU1: TaskC + TaskD
  Total weight = 36 + 36 = 72
With such a placement:
- TaskA and TaskB receive less throughput
- TaskC and TaskD receive more throughput
than they would if the placement were balanced.
This is not compliant with the scheduler Nice interface.
Also, the UCLAMP documentation states that it should only
be treated as a hint.
A more balanced placement is:
- CPU0: TaskA
  Total weight = 110
- CPU1: TaskB + TaskC + TaskD
  Total weight = 36 + 36 + 36 = 108
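Working out the per-task CPU shares (a task's share of a CPU is its weight
divided by the sum of runnable weights on that CPU) makes the difference
explicit:

  nr_running-based placement:
    CPU0: TaskA = 110/146 ~ 0.75 CPU, TaskB = 36/146 ~ 0.25 CPU
    CPU1: TaskC = 36/72 = 0.50 CPU,   TaskD = 36/72 = 0.50 CPU
    -> TaskB gets half of what TaskC/TaskD get, despite equal weight.
  load-based placement:
    CPU0: TaskA = 110/110 = 1.00 CPU
    CPU1: TaskB/C/D = 36/108 ~ 0.33 CPU each
    -> equal-weight tasks get equal throughput.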
The previous versions of Vincent's patchset were already using
nr_running to help balance UCLAMP_MAX tasks in feec(), IIRC.
However, this would likely lead to the creation of a second
load balancer in feec(), as the example above shows.
------------
The push mechanism allows down-migrating UCLAMP_MAX tasks,
which is indeed a better handling of UCLAMP_MAX. However, it is
likely the first step toward more complicated issues.
IMO the best way to handle UCLAMP_MAX tasks would be to make
them second-class tasks, as the documentation describes them:
"""
Like explained for Android case in the introduction. Any app can lower
UCLAMP_MAX for some background tasks that don't care about performance
but could end up being busy and consume unnecessary system resources
on the system.
"""
But this would require having QoS classes for fair tasks, and
this is also a large and complex problem.
Another solution would be to force the policy of every
UCLAMP_MAX task to SCHED_IDLE (see the sketch after the list below).
This would also allow just balancing the number of h_nr_idle tasks on
each CPU, as you and Vincent want to do IIUC. Indeed, if tasks have the
same weight, the example above doesn't hold anymore.
Using SCHED_IDLE for UCLAMP_MAX tasks can be viewed as a cheap
implementation of a lower QoS task class. Their priority is lower than
that of 'normal' CFS tasks (i.e. those without UCLAMP_MAX set) and they
cannot steal time from 'normal' tasks.
But:
- the higher the Nice value of a task, the less true this becomes.
- as UCLAMP_MAX tasks and normal CFS tasks are still part of the same
  'class', they compete for CPU time on the same level.
Thus UCLAMP_MAX tasks cannot be made 'background tasks' that actually
run on spare CPU cycles (and avoid going into the over-utilized state).
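For completeness, both knobs (SCHED_IDLE policy and UCLAMP_MAX) can
already be set together from userspace. A minimal sketch, assuming a
uclamp-capable kernel; the value 100 is just an example, and there is no
glibc wrapper for sched_setattr(), hence the raw syscall:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>
#include <linux/sched/types.h>

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_IDLE;			/* lowest CFS priority */
	attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MAX;
	attr.sched_util_max = 100;			/* cap requested capacity */

	/* pid 0 == the calling task */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}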
------------
So IMO, proper placement of UCLAMP_MAX tasks can only be achieved once QoS
classes are implemented. The push mechanism is still a good idea for
misfit/overutilized handling, though.
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2026-03-10 10:27 ` Pierre Gondois
@ 2026-03-10 15:11 ` Qais Yousef
2026-03-10 16:59 ` Pierre Gondois
0 siblings, 1 reply; 47+ messages in thread
From: Qais Yousef @ 2026-03-10 15:11 UTC (permalink / raw)
To: Pierre Gondois
Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
christian.loehle
On 03/10/26 11:27, Pierre Gondois wrote:
> If we have 2 little CPUs (CPU0/CPU1) with 4 tasks:
> - TaskA: Nice=10 (i.e. weight=110)
> - Task[B,C,D]: Nice=15 (i.e. weight=36)
>
> Then using nr_running would yield a placement as with 2 tasks
> on each CPU:
> - CPU0: TaskA + TaskB
> Total weight = 110 + 36 = 146
> - CPU1: TaskC + TaskD
> Total weight = 36 + 36 = 72
> With such placement:
> - TaskA and TaskB are receiving less throughput
> - TaskC and TaskD are receiving more throughput
> than what they would if the placement was balanced.
>
> This is not compliant with the scheduler Nice interface.
This is overthinking it. On a 2-core SMP system with no uclamp and no EAS,
4 always-busy tasks with different nice values will still be placed based
on load, and neither the wakeup path nor the load balancer has a notion of
nice-based throughput to manage task placement.
Generally, with EEVDF, managing the slice size is better than the nice
value, and with the QoS framework we are proposing, I think the nice value
is better locked down to 0. But we shall see.
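As a concrete (and hedged) example of that knob: on kernels with EEVDF
custom-slice support, sched_attr::sched_runtime doubles as the per-task
slice request for fair tasks, clamped by the kernel to a sane range. A
minimal sketch:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>
#include <linux/sched/types.h>

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_NORMAL;
	attr.sched_runtime = 3ULL * 1000 * 1000;	/* request a 3ms slice */

	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}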
Moreover, the idea is to enable the wakeup path to be multi-modal and
coherent with lb decisions (via push lb). So fixing all these problems is
possible in the future, fingers crossed without much added complexity. But
again, we shall see.
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2026-03-10 15:11 ` Qais Yousef
@ 2026-03-10 16:59 ` Pierre Gondois
2026-03-12 8:19 ` Vincent Guittot
0 siblings, 1 reply; 47+ messages in thread
From: Pierre Gondois @ 2026-03-10 16:59 UTC (permalink / raw)
To: Qais Yousef
Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
christian.loehle
On 3/10/26 16:11, Qais Yousef wrote:
> On 03/10/26 11:27, Pierre Gondois wrote:
>
>> If we have 2 little CPUs (CPU0/CPU1) with 4 tasks:
>> - TaskA: Nice=10 (i.e. weight=110)
>> - Task[B,C,D]: Nice=15 (i.e. weight=36)
>>
>> Then using nr_running would yield a placement as with 2 tasks
>> on each CPU:
>> - CPU0: TaskA + TaskB
>> Total weight = 110 + 36 = 146
>> - CPU1: TaskC + TaskD
>> Total weight = 36 + 36 = 72
>> With such placement:
>> - TaskA and TaskB are receiving less throughput
>> - TaskC and TaskD are receiving more throughput
>> than what they would if the placement was balanced.
>>
>> This is not compliant with the scheduler Nice interface.
> This is overthinking it. On a 2-core SMP system, no uclamp and no EAS. 4 always
> busy tasks with different nice values will still be placed based on load and
> neither wake up path nor load balancer has notion of throughput based on nice
> to manage task placement.
Yes right; by setting the Nice value of tasks and using the
associated weight (Nice=10 -> weight=110), I also meant that
the load of these tasks is approximately equal to their weight,
i.e.:
- TaskA: Nice=10 <-> weight=110 <-> load=110
- Task[B,C,D]: Nice=15 <-> weight=36 <-> load=36
In that regard, the load balancer balances load between CPUs
to try to provide an equal throughput to all tasks
(with respect to their weight or Nice value).
I only have doubts about the push mechanism for the setup with:
- EAS
- long-running tasks + UCLAMP_MAX
because in that case the Nice value and the CPU load are ignored,
leading to task placements that can be incorrect.
Just to be sure, I am not arguing about the non-EAS case. As the
load balancer is active in that case, there is a mechanism
to provide global 'fairness' among CPUs.
When EAS is active, the load balancer is disabled and there is
no mechanism to manage the load between CPUs.
Vincent's patchset was advertised to help EAS:
"sched/fair: Add push task mechanism and handle more EAS cases"
so I was more thinking about that case.
If the goal is to have a unified wake-up + load balancer framework,
I currently have nothing to object to.
(On a throughput-related subject)
I am working on a mechanism to try to help handle throughput
on HMP systems. This might be posted as an RFC at some point, if you
have some time to have a look later.
>
> Generally, with EEVDF, managing the slice size is better than the nice
> value, and with the QoS framework we are proposing, I think the nice value
> is better locked down to 0. But we shall see.
Maybe I'm completely off, but I thought the EEVDF slice length
and the Nice values were handling different things. If you
have a link that shows your QoS approach and how they interact,
I'm interested.
> Moreover, the idea is to enable the wakeup path to be multi-modal and
> coherent with lb decisions (via push lb). So fixing all these problems is
> possible in the future, fingers crossed without much added complexity. But
> again, we shall see.
* Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
2026-03-10 16:59 ` Pierre Gondois
@ 2026-03-12 8:19 ` Vincent Guittot
0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-03-12 8:19 UTC (permalink / raw)
To: Pierre Gondois
Cc: Qais Yousef, mingo, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
christian.loehle
On Tue, 10 Mar 2026 at 18:00, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
>
> On 3/10/26 16:11, Qais Yousef wrote:
> > On 03/10/26 11:27, Pierre Gondois wrote:
> >
> >> If we have 2 little CPUs (CPU0/CPU1) with 4 tasks:
> >> - TaskA: Nice=10 (i.e. weight=110)
> >> - Task[B,C,D]: Nice=15 (i.e. weight=36)
> >>
> >> Then using nr_running would yield a placement as with 2 tasks
> >> on each CPU:
> >> - CPU0: TaskA + TaskB
> >> Total weight = 110 + 36 = 146
> >> - CPU1: TaskC + TaskD
> >> Total weight = 36 + 36 = 72
> >> With such placement:
> >> - TaskA and TaskB are receiving less throughput
> >> - TaskC and TaskD are receiving more throughput
> >> than what they would if the placement was balanced.
> >>
> >> This is not compliant with the scheduler Nice interface.
> > This is overthinking it. On a 2-core SMP system, no uclamp and no EAS. 4 always
> > busy tasks with different nice values will still be placed based on load and
> > neither wake up path nor load balancer has notion of throughput based on nice
> > to manage task placement.
>
> Yes right, by setting the Nice value of tasks and using the
> associated weight (Nice=10 -> weight=110), I also meant that
> the load of these tasks was approximately equal to the weight.
> I.e.:
> - TaskA: Nice=10 <-> weight=110 <-> load=110
> - Task[B,C,D]: Nice=15 <-> weight=36 <-> load=36
> In that regard, the load balancer balances load between CPUs
> to try to provide an equal throughput to all tasks
> (in respect to their weight or Nice value).
>
> I only have doubts about the push mechanism for the setup with:
> - EAS
> - long running tasks + UCLAMP_MAX
> because in that setup case the Nice value and CPU load is ignored,
> leading to task placement that can be incorrect.
The previous rework of feec() that I sent was a first step in the
direction where we take into account not only energy but also other
hints, such as nr_running and, later, the slice duration.
>
> Just to be sure, I am not arguing in the non-EAS case. As the
> load balancer is active in that case, there is a mechanism
> to have a global 'fairness' among CPUs.
> When EAS is active, the load balancer is disabled and there is
> no mechanism to manage the load between CPUs.
>
> Vincent's patchset was advertised to help EAS:
> "sched/fair: Add push task mechanism and handle more EAS cases"
> so I was more thinking about that case.
This is a starting point, but the push task mechanism can be used for
other use cases too. One use case is pushing tasks to idle CPUs when
the system is overloaded, for example. The end goal is to call the
same select_task_rq function every time. And as Qais already said, we
could disable periodic load balancing at the LLC level or further
increase its period.