* [PATCH 1/7 v5] sched/fair: Filter false overloaded_group case for EAS
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
@ 2025-03-02 21:05 ` Vincent Guittot
2025-03-04 4:38 ` K Prateek Nayak
2025-03-02 21:05 ` [PATCH 2/7 v5] energy model: Add a get previous state function Vincent Guittot
` (7 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2025-03-02 21:05 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
With EAS, a group should be marked overloaded if at least one CPU in the
group is overutilized, but it can happen that a CPU is fully utilized by
tasks because the compute capacity of the CPU has been clamped. In such a
case, the CPU is not overutilized and, as a result, should not be marked
overloaded either.

Since group_overloaded has a higher priority than group_misfit, such a
group can be selected as the busiest group instead of a group with a
misfit task, which prevents load_balance() from selecting the CPU with
the misfit task to pull the latter onto a fitting CPU.
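The filtering described above can be sketched as a standalone C model
(hypothetical simplified types, not the kernel's struct sg_lb_stats; the
overload check is reduced to the two conditions the patch touches):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct sg_lb_stats: only the fields needed here. */
struct sg_stats {
	unsigned int sum_nr_running;	/* tasks running in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
	bool group_overutilized;	/* at least one CPU is overutilized */
};

/*
 * Mirror of the patch's idea: with EAS enabled, a group whose CPUs are
 * merely full because of capacity clamping (i.e. none overutilized) is
 * never classified overloaded, so a group_misfit group can still be
 * picked as the busiest group.
 */
static bool group_is_overloaded(bool eas_enabled, const struct sg_stats *sgs)
{
	if (eas_enabled && !sgs->group_overutilized)
		return false;

	return sgs->sum_nr_running > sgs->group_weight;
}
```

With EAS off, the same group is still reported overloaded, so non-EAS
behaviour is unchanged.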
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
---
kernel/sched/fair.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 857808da23d8..d3d1a2ba6b1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9931,6 +9931,7 @@ struct sg_lb_stats {
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned int group_smt_balance; /* Task on busy SMT be moved */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned int group_overutilized; /* At least one CPU is overutilized in the group */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -10163,6 +10164,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
static inline bool
group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
{
+ /*
+ * With EAS and uclamp, 1 CPU in the group must be overutilized to
+ * consider the group overloaded.
+ */
+ if (sched_energy_enabled() && !sgs->group_overutilized)
+ return false;
+
if (sgs->sum_nr_running <= sgs->group_weight)
return false;
@@ -10374,8 +10382,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
sgs->sum_nr_running += nr_running;
- if (cpu_overutilized(i))
+ if (cpu_overutilized(i)) {
*sg_overutilized = 1;
+ sgs->group_overutilized = 1;
+ }
/*
* No need to call idle_cpu() if nr_running is not 0
--
2.43.0
^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7 v5] sched/fair: Filter false overloaded_group case for EAS
2025-03-02 21:05 ` [PATCH 1/7 v5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2025-03-04 4:38 ` K Prateek Nayak
2025-03-05 8:13 ` Vincent Guittot
0 siblings, 1 reply; 31+ messages in thread
From: K Prateek Nayak @ 2025-03-04 4:38 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret
Hello Vincent,
On 3/3/2025 2:35 AM, Vincent Guittot wrote:
> With EAS, a group should be marked overloaded if at least one CPU in the
> group is overutilized, but it can happen that a CPU is fully utilized by
> tasks because the compute capacity of the CPU has been clamped. In such a
> case, the CPU is not overutilized and, as a result, should not be marked
> overloaded either.
>
> Since group_overloaded has a higher priority than group_misfit, such a
> group can be selected as the busiest group instead of a group with a
> misfit task, which prevents load_balance() from selecting the CPU with
> the misfit task to pull the latter onto a fitting CPU.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
> kernel/sched/fair.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 857808da23d8..d3d1a2ba6b1a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9931,6 +9931,7 @@ struct sg_lb_stats {
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> unsigned int group_smt_balance; /* Task on busy SMT be moved */
> unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> + unsigned int group_overutilized; /* At least one CPU is overutilized in the group */
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> @@ -10163,6 +10164,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> static inline bool
> group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> {
> + /*
> + * With EAS and uclamp, 1 CPU in the group must be overutilized to
> + * consider the group overloaded.
> + */
> + if (sched_energy_enabled() && !sgs->group_overutilized)
> + return false;
> +
> if (sgs->sum_nr_running <= sgs->group_weight)
> return false;
>
> @@ -10374,8 +10382,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> - if (cpu_overutilized(i))
> + if (cpu_overutilized(i)) {
> *sg_overutilized = 1;
Since sgs->group_overutilized is tracking the overutilized status, can we
avoid passing the "sg_overutilized" pointer to update_sg_lb_stats() and
just use sgs->group_overutilized in update_sd_lb_stats()?
Something like below:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 857808da23d8..de4a7e19d383 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10346,14 +10346,12 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
* @group: sched_group whose statistics are to be updated.
* @sgs: variable to hold the statistics for this group.
* @sg_overloaded: sched_group is overloaded
- * @sg_overutilized: sched_group is overutilized
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *group,
struct sg_lb_stats *sgs,
- bool *sg_overloaded,
- bool *sg_overutilized)
+ bool *sg_overloaded)
{
int i, nr_running, local_group, sd_flags = env->sd->flags;
bool balancing_at_rd = !env->sd->parent;
@@ -10375,7 +10373,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_nr_running += nr_running;
if (cpu_overutilized(i))
- *sg_overutilized = 1;
+ sgs->group_overutilized = 1;
/*
* No need to call idle_cpu() if nr_running is not 0
@@ -11046,7 +11044,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_capacity(env->sd, env->dst_cpu);
}
- update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
+ update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded);
if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
@@ -11056,6 +11054,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sg_overutilized |= sgs->group_overutilized;
sum_util += sgs->group_util;
sg = sg->next;
--
Thanks and Regards,
Prateek
> + sgs->group_overutilized = 1;
> + }
>
> /*
> * No need to call idle_cpu() if nr_running is not 0
* Re: [PATCH 1/7 v5] sched/fair: Filter false overloaded_group case for EAS
2025-03-04 4:38 ` K Prateek Nayak
@ 2025-03-05 8:13 ` Vincent Guittot
0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2025-03-05 8:13 UTC (permalink / raw)
To: K Prateek Nayak
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, christian.loehle,
luis.machado, qperret
Hi Prateek,
On Tue, 4 Mar 2025 at 05:38, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 3/3/2025 2:35 AM, Vincent Guittot wrote:
> > With EAS, a group should be marked overloaded if at least one CPU in the
> > group is overutilized, but it can happen that a CPU is fully utilized by
> > tasks because the compute capacity of the CPU has been clamped. In such a
> > case, the CPU is not overutilized and, as a result, should not be marked
> > overloaded either.
> >
> > Since group_overloaded has a higher priority than group_misfit, such a
> > group can be selected as the busiest group instead of a group with a
> > misfit task, which prevents load_balance() from selecting the CPU with
> > the misfit task to pull the latter onto a fitting CPU.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Tested-by: Pierre Gondois <pierre.gondois@arm.com>
> > ---
> > kernel/sched/fair.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 857808da23d8..d3d1a2ba6b1a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9931,6 +9931,7 @@ struct sg_lb_stats {
> > unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> > unsigned int group_smt_balance; /* Task on busy SMT be moved */
> > unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> > + unsigned int group_overutilized; /* At least one CPU is overutilized in the group */
> > #ifdef CONFIG_NUMA_BALANCING
> > unsigned int nr_numa_running;
> > unsigned int nr_preferred_running;
> > @@ -10163,6 +10164,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> > static inline bool
> > group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> > {
> > + /*
> > + * With EAS and uclamp, 1 CPU in the group must be overutilized to
> > + * consider the group overloaded.
> > + */
> > + if (sched_energy_enabled() && !sgs->group_overutilized)
> > + return false;
> > +
> > if (sgs->sum_nr_running <= sgs->group_weight)
> > return false;
> >
> > @@ -10374,8 +10382,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > nr_running = rq->nr_running;
> > sgs->sum_nr_running += nr_running;
> >
> > - if (cpu_overutilized(i))
> > + if (cpu_overutilized(i)) {
> > *sg_overutilized = 1;
>
> Since sgs->group_overutilized is tracking the overutilized status, can we
> avoid passing the "sg_overutilized" pointer to update_sg_lb_stats() and
> just use sgs->group_overutilized in update_sd_lb_stats()?
yes, makes sense
>
> Something like below:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 857808da23d8..de4a7e19d383 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10346,14 +10346,12 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> * @group: sched_group whose statistics are to be updated.
> * @sgs: variable to hold the statistics for this group.
> * @sg_overloaded: sched_group is overloaded
> - * @sg_overutilized: sched_group is overutilized
> */
> static inline void update_sg_lb_stats(struct lb_env *env,
> struct sd_lb_stats *sds,
> struct sched_group *group,
> struct sg_lb_stats *sgs,
> - bool *sg_overloaded,
> - bool *sg_overutilized)
> + bool *sg_overloaded)
> {
> int i, nr_running, local_group, sd_flags = env->sd->flags;
> bool balancing_at_rd = !env->sd->parent;
> @@ -10375,7 +10373,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->sum_nr_running += nr_running;
>
> if (cpu_overutilized(i))
> - *sg_overutilized = 1;
> + sgs->group_overutilized = 1;
>
> /*
> * No need to call idle_cpu() if nr_running is not 0
> @@ -11046,7 +11044,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> update_group_capacity(env->sd, env->dst_cpu);
> }
>
> - update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
> + update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded);
>
> if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
> sds->busiest = sg;
> @@ -11056,6 +11054,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> /* Now, start updating sd_lb_stats */
> sds->total_load += sgs->group_load;
> sds->total_capacity += sgs->group_capacity;
> + sg_overutilized |= sgs->group_overutilized;
>
> sum_util += sgs->group_util;
> sg = sg->next;
> --
> Thanks and Regards,
> Prateek
>
> > + sgs->group_overutilized = 1;
> > + }
> >
> > /*
> > * No need to call idle_cpu() if nr_running is not 0
>
* [PATCH 2/7 v5] energy model: Add a get previous state function
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
2025-03-02 21:05 ` [PATCH 1/7 v5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2025-03-02 21:05 ` Vincent Guittot
2025-03-02 21:05 ` [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
` (6 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2025-03-02 21:05 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
Instead of parsing the entire EM table every time, add a function to get
the previous performance state.

It will be used in the scheduler's feec() function.
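The lookup can be modelled as a small standalone C sketch (a hypothetical
flat table, not the real struct em_perf_state / em_perf_domain API): walk
downwards from the current state index and return the first lower state
that is not marked inefficient, or -1 when none exists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, minimal stand-in for an EM performance-state entry. */
struct perf_state {
	bool inefficient;	/* would be EM_PERF_STATE_INEFFICIENT */
};

/*
 * Return the index of the closest performance state below @idx, skipping
 * states marked inefficient when @skip_inefficiencies is set (the domain
 * flag EM_PERF_DOMAIN_SKIP_INEFFICIENCIES in the real code), or -1 when
 * no state below @min_ps qualifies.
 */
static int pd_get_previous_state(const struct perf_state *table, int min_ps,
				 int idx, bool skip_inefficiencies)
{
	for (int i = idx - 1; i >= min_ps; i--) {
		if (skip_inefficiencies && table[i].inefficient)
			continue;
		return i;
	}

	return -1;
}
```

This avoids rescanning the whole table from the bottom each time the
caller only needs the state one step below the current one.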
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
include/linux/energy_model.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 78318d49276d..551e243b9c43 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -216,6 +216,26 @@ em_pd_get_efficient_state(struct em_perf_state *table,
return max_ps;
}
+static inline int
+em_pd_get_previous_state(struct em_perf_state *table,
+ struct em_perf_domain *pd, int idx)
+{
+ unsigned long pd_flags = pd->flags;
+ int min_ps = pd->min_perf_state;
+ struct em_perf_state *ps;
+ int i;
+
+ for (i = idx - 1; i >= min_ps; i--) {
+ ps = &table[i];
+ if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+ ps->flags & EM_PERF_STATE_INEFFICIENT)
+ continue;
+ return i;
+ }
+
+ return -1;
+}
+
/**
* em_cpu_energy() - Estimates the energy consumed by the CPUs of a
* performance domain
@@ -362,6 +382,18 @@ static inline struct em_perf_domain *em_pd_get(struct device *dev)
{
return NULL;
}
+static inline int
+em_pd_get_efficient_state(struct em_perf_state *table,
+ struct em_perf_domain *pd, unsigned long max_util)
+{
+ return 0;
+}
+static inline int
+em_pd_get_previous_state(struct em_perf_state *table,
+ struct em_perf_domain *pd, int idx)
+{
+ return -1;
+}
static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util,
unsigned long allowed_cpu_cap)
--
2.43.0
* [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
2025-03-02 21:05 ` [PATCH 1/7 v5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
2025-03-02 21:05 ` [PATCH 2/7 v5] energy model: Add a get previous state function Vincent Guittot
@ 2025-03-02 21:05 ` Vincent Guittot
2025-03-12 14:08 ` Pierre Gondois
2025-03-25 11:09 ` Pierre Gondois
2025-03-02 21:05 ` [PATCH 4/7 v5] energy model: Remove unused em_cpu_energy() Vincent Guittot
` (5 subsequent siblings)
8 siblings, 2 replies; 31+ messages in thread
From: Vincent Guittot @ 2025-03-02 21:05 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
feec() looks for the CPU with the highest spare capacity in a PD, assuming
it will be the best CPU from an energy-efficiency PoV because it will
require the smallest increase of OPP. Although this is generally true,
this policy also filters out other CPUs that would be just as efficient
because they use the same OPP.

In fact, we really care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and, as a result, the same energy cost. In
these cases, we can use other metrics to select the best CPU for the same
energy cost.

Rework feec() to look first for the lowest cost in a PD and then for the
most performant CPU among the candidates. The cost of the OPP remains the
only comparison criterion between performance domains.
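The new selection order can be illustrated with a hypothetical two-level
comparator (simplified; the real update_best_cpu() below also considers
the previous CPU and a runnable/capacity contention ratio): within a PD, a
cheaper OPP always wins, and at equal cost the CPU with fewer running
tasks wins.

```c
#include <assert.h>

/* Hypothetical candidate summary, loosely modelled on the patch's
 * struct energy_cpu_stat: OPP cost plus a load tie-breaker. */
struct candidate {
	unsigned long cost;		/* EM cost of the OPP the CPU would need */
	unsigned int nr_running;	/* runnable cfs tasks on the CPU */
};

/* Return 0 if @a is the better pick, 1 if @b is. */
static int pick_best(const struct candidate *a, const struct candidate *b)
{
	/* Lowest OPP cost is the primary criterion within a PD. */
	if (a->cost != b->cost)
		return a->cost < b->cost ? 0 : 1;

	/* Same OPP, hence same energy cost: prefer the least contended CPU. */
	if (a->nr_running != b->nr_running)
		return a->nr_running < b->nr_running ? 0 : 1;

	return 0;	/* further tie-breaks (prev CPU, runnable ratio) omitted */
}
```

The old policy would instead compare spare capacity directly, discarding
CPUs that share the cheapest OPP but happen to have slightly less spare
capacity.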
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
1 file changed, 246 insertions(+), 220 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3d1a2ba6b1a..a9b97bbc085f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8193,29 +8193,37 @@ unsigned long sched_cpu_util(int cpu)
}
/*
- * energy_env - Utilization landscape for energy estimation.
- * @task_busy_time: Utilization contribution by the task for which we test the
- * placement. Given by eenv_task_busy_time().
- * @pd_busy_time: Utilization of the whole perf domain without the task
- * contribution. Given by eenv_pd_busy_time().
- * @cpu_cap: Maximum CPU capacity for the perf domain.
- * @pd_cap: Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
- */
-struct energy_env {
- unsigned long task_busy_time;
- unsigned long pd_busy_time;
- unsigned long cpu_cap;
- unsigned long pd_cap;
+ * energy_cpu_stat - Utilization landscape for energy estimation.
+ * @idx : Index of the OPP in the performance domain
+ * @cost : Cost of the OPP
+ * @max_perf : Compute capacity of OPP
+ * @min_perf : Compute capacity of the previous OPP
+ * @capa : Capacity of the CPU
+ * @runnable : runnable_avg of the CPU
+ * @nr_running : Number of cfs running task
+ * @fits : Fits level of the CPU
+ * @cpu : Current best CPU
+ */
+struct energy_cpu_stat {
+ unsigned long idx;
+ unsigned long cost;
+ unsigned long max_perf;
+ unsigned long min_perf;
+ unsigned long capa;
+ unsigned long util;
+ unsigned long runnable;
+ unsigned int nr_running;
+ int fits;
+ int cpu;
};
/*
- * Compute the task busy time for compute_energy(). This time cannot be
- * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * Compute the task busy time for computing its energy impact. This time cannot
+ * be injected directly into effective_cpu_util() because of the IRQ scaling.
* The latter only makes sense with the most recent CPUs where the task has
* run.
*/
-static inline void eenv_task_busy_time(struct energy_env *eenv,
- struct task_struct *p, int prev_cpu)
+static inline unsigned long task_busy_time(struct task_struct *p, int prev_cpu)
{
unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
@@ -8225,124 +8233,153 @@ static inline void eenv_task_busy_time(struct energy_env *eenv,
else
busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
- eenv->task_busy_time = busy_time;
+ return busy_time;
}
-/*
- * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
- * utilization for each @pd_cpus, it however doesn't take into account
- * clamping since the ratio (utilization / cpu_capacity) is already enough to
- * scale the EM reported power consumption at the (eventually clamped)
- * cpu_capacity.
- *
- * The contribution of the task @p for which we want to estimate the
- * energy cost is removed (by cpu_util()) and must be calculated
- * separately (see eenv_task_busy_time). This ensures:
- *
- * - A stable PD utilization, no matter which CPU of that PD we want to place
- * the task on.
- *
- * - A fair comparison between CPUs as the task contribution (task_util())
- * will always be the same no matter which CPU utilization we rely on
- * (util_avg or util_est).
- *
- * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
- * exceed @eenv->pd_cap.
- */
-static inline void eenv_pd_busy_time(struct energy_env *eenv,
- struct cpumask *pd_cpus,
- struct task_struct *p)
+/* Estimate the utilization of the CPU that is then used to select the OPP */
+static unsigned long find_cpu_max_util(int cpu, struct task_struct *p, int dst_cpu)
{
- unsigned long busy_time = 0;
- int cpu;
+ unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+ unsigned long eff_util, min, max;
+
+ /*
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ */
+ eff_util = effective_cpu_util(cpu, util, &min, &max);
- for_each_cpu(cpu, pd_cpus) {
- unsigned long util = cpu_util(cpu, p, -1, 0);
+ /* Task's uclamp can modify min and max value */
+ if (uclamp_is_used() && cpu == dst_cpu) {
+ min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
- busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+ /*
+ * If there is no active max uclamp constraint,
+ * directly use task's one, otherwise keep max.
+ */
+ if (uclamp_rq_is_idle(cpu_rq(cpu)))
+ max = uclamp_eff_value(p, UCLAMP_MAX);
+ else
+ max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
}
- eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
+ eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+ return eff_util;
}
-/*
- * Compute the maximum utilization for compute_energy() when the task @p
- * is placed on the cpu @dst_cpu.
- *
- * Returns the maximum utilization among @eenv->cpus. This utilization can't
- * exceed @eenv->cpu_cap.
- */
-static inline unsigned long
-eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
- struct task_struct *p, int dst_cpu)
+/* Estimate the utilization of the CPU without the task */
+static unsigned long find_cpu_actual_util(int cpu, struct task_struct *p)
{
- unsigned long max_util = 0;
- int cpu;
+ unsigned long util = cpu_util(cpu, p, -1, 0);
+ unsigned long eff_util;
- for_each_cpu(cpu, pd_cpus) {
- struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
- unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
- unsigned long eff_util, min, max;
+ eff_util = effective_cpu_util(cpu, util, NULL, NULL);
- /*
- * Performance domain frequency: utilization clamping
- * must be considered since it affects the selection
- * of the performance domain frequency.
- * NOTE: in case RT tasks are running, by default the min
- * utilization can be max OPP.
- */
- eff_util = effective_cpu_util(cpu, util, &min, &max);
+ return eff_util;
+}
- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
+/* Find the cost of a performance domain for the estimated utilization */
+static inline void find_pd_cost(struct em_perf_domain *pd,
+ unsigned long max_util,
+ struct energy_cpu_stat *stat)
+{
+ struct em_perf_table *em_table;
+ struct em_perf_state *ps;
+ int i;
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
+ /*
+ * Find the lowest performance state of the Energy Model above the
+ * requested performance.
+ */
+ em_table = rcu_dereference(pd->em_table);
+ i = em_pd_get_efficient_state(em_table->state, pd, max_util);
+ ps = &em_table->state[i];
- eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
- max_util = max(max_util, eff_util);
+ /* Save the cost and performance range of the OPP */
+ stat->max_perf = ps->performance;
+ stat->cost = ps->cost;
+ i = em_pd_get_previous_state(em_table->state, pd, i);
+ if (i < 0)
+ stat->min_perf = 0;
+ else {
+ ps = &em_table->state[i];
+ stat->min_perf = ps->performance;
}
+}
+
+/* Check if the CPU can handle the waking task */
+static int check_cpu_with_task(struct task_struct *p, int cpu)
+{
+ unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
+ unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+ unsigned long util_min = p_util_min;
+ unsigned long util_max = p_util_max;
+ unsigned long util = cpu_util(cpu, p, cpu, 0);
+ struct rq *rq = cpu_rq(cpu);
- return min(max_util, eenv->cpu_cap);
+ /*
+ * Skip CPUs that cannot satisfy the capacity request.
+ * IOW, placing the task there would make the CPU
+ * overutilized. Take uclamp into account to see how
+ * much capacity we can get out of the CPU; this is
+ * aligned with sched_cpu_util().
+ */
+ if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
+ unsigned long rq_util_min, rq_util_max;
+ /*
+ * Open code uclamp_rq_util_with() except for
+ * the clamp() part. I.e.: apply max aggregation
+ * only. util_fits_cpu() logic requires to
+ * operate on non clamped util but must use the
+ * max-aggregated uclamp_{min, max}.
+ */
+ rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
+ rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
+ util_min = max(rq_util_min, p_util_min);
+ util_max = max(rq_util_max, p_util_max);
+ }
+ return util_fits_cpu(util, util_min, util_max, cpu);
}
/*
- * compute_energy(): Use the Energy Model to estimate the energy that @pd would
- * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
- * contribution is ignored.
- * For the same cost, select the CPU that will provide best performance for the
+ * task.
*/
-static inline unsigned long
-compute_energy(struct energy_env *eenv, struct perf_domain *pd,
- struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
+static bool update_best_cpu(struct energy_cpu_stat *target,
+ struct energy_cpu_stat *min,
+ int prev, struct sched_domain *sd)
{
- unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
- unsigned long busy_time = eenv->pd_busy_time;
- unsigned long energy;
-
- if (dst_cpu >= 0)
- busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
+ /* Select the one with the least number of running tasks */
+ if (target->nr_running < min->nr_running)
+ return true;
+ if (target->nr_running > min->nr_running)
+ return false;
- energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+ /* Favor previous CPU otherwise */
+ if (target->cpu == prev)
+ return true;
+ if (min->cpu == prev)
+ return false;
- trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
+ /*
+ * Choose CPU with lowest contention. One might want to consider load
+ * instead of runnable but we are supposed to not be overutilized so
+ * there is enough compute capacity for everybody.
+ */
+ if ((target->runnable * min->capa * sd->imbalance_pct) >=
+ (min->runnable * target->capa * 100))
+ return false;
- return energy;
+ return true;
}
/*
* find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
- * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
- * spare capacity in each performance domain and uses it as a potential
- * candidate to execute the task. Then, it uses the Energy Model to figure
- * out which of the CPU candidates is the most energy-efficient.
+ * waking task. find_energy_efficient_cpu() looks for the CPU with the lowest
+ * power cost (usually with maximum spare capacity but not always) in each
+ * performance domain and uses it as a potential candidate to execute the task.
+ * Then, it uses the Energy Model to figure out which of the CPU candidates is
+ * the most energy-efficient.
*
* The rationale for this heuristic is as follows. In a performance domain,
* all the most energy efficient CPU candidates (according to the Energy
@@ -8379,17 +8416,14 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
- unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
- unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
- int cpu, best_energy_cpu, target = -1;
- int prev_fits = -1, best_fits = -1;
- unsigned long best_actual_cap = 0;
- unsigned long prev_actual_cap = 0;
+ unsigned long best_nrg = ULONG_MAX;
+ unsigned long task_util;
struct sched_domain *sd;
struct perf_domain *pd;
- struct energy_env eenv;
+ int cpu, target = -1;
+ int best_fits = -1;
+ int best_cpu = -1;
rcu_read_lock();
pd = rcu_dereference(rd->pd);
@@ -8409,19 +8443,19 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;
sync_entity_load_avg(&p->se);
- if (!task_util_est(p) && p_util_min == 0)
- goto unlock;
-
- eenv_task_busy_time(&eenv, p, prev_cpu);
+ task_util = task_busy_time(p, prev_cpu);
for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
- unsigned long cpu_cap, cpu_actual_cap, util;
- long prev_spare_cap = -1, max_spare_cap = -1;
- unsigned long rq_util_min, rq_util_max;
- unsigned long cur_delta, base_energy;
- int max_spare_cap_cpu = -1;
- int fits, max_fits = -1;
+ unsigned long pd_actual_util = 0, delta_nrg = 0;
+ unsigned long cpu_actual_cap, max_cost = 0;
+ struct energy_cpu_stat target_stat;
+ struct energy_cpu_stat min_stat = {
+ .cost = ULONG_MAX,
+ .max_perf = ULONG_MAX,
+ .min_perf = ULONG_MAX,
+ .fits = -2,
+ .cpu = -1,
+ };
cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -8432,13 +8466,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
cpu = cpumask_first(cpus);
cpu_actual_cap = get_actual_cpu_capacity(cpu);
- eenv.cpu_cap = cpu_actual_cap;
- eenv.pd_cap = 0;
-
+ /* In a PD, the CPU with the lowest cost will be the most efficient */
for_each_cpu(cpu, cpus) {
- struct rq *rq = cpu_rq(cpu);
-
- eenv.pd_cap += cpu_actual_cap;
+ unsigned long target_perf;
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
@@ -8446,120 +8476,116 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
- util = cpu_util(cpu, p, cpu, 0);
- cpu_cap = capacity_of(cpu);
+ target_stat.fits = check_cpu_with_task(p, cpu);
+
+ if (!target_stat.fits)
+ continue;
+
+ /* 1st select the CPU that fits best */
+ if (target_stat.fits < min_stat.fits)
+ continue;
+
+ /* Then select the CPU with lowest cost */
+
+ /* Get the performance of the CPU w/ the waking task */
+ target_perf = find_cpu_max_util(cpu, p, cpu);
+ target_perf = min(target_perf, cpu_actual_cap);
+
+ /* Needing a higher OPP means a higher cost */
+ if (target_perf > min_stat.max_perf)
+ continue;
/*
- * Skip CPUs that cannot satisfy the capacity request.
- * IOW, placing the task there would make the CPU
- * overutilized. Take uclamp into account to see how
- * much capacity we can get out of the CPU; this is
- * aligned with sched_cpu_util().
+ * At this point, target's cost can be either equal or
+ * lower than the current minimum cost.
*/
- if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
- /*
- * Open code uclamp_rq_util_with() except for
- * the clamp() part. I.e.: apply max aggregation
- * only. util_fits_cpu() logic requires to
- * operate on non clamped util but must use the
- * max-aggregated uclamp_{min, max}.
- */
- rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
- util_min = max(rq_util_min, p_util_min);
- util_max = max(rq_util_max, p_util_max);
- }
+ /* Gather more statistics */
+ target_stat.cpu = cpu;
+ target_stat.runnable = cpu_runnable(cpu_rq(cpu));
+ target_stat.capa = capacity_of(cpu);
+ target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
- fits = util_fits_cpu(util, util_min, util_max, cpu);
- if (!fits)
+ /* If the target needs a lower OPP, then look up for
+ * the corresponding OPP and its associated cost.
+ * Otherwise at same cost level, select the CPU which
+ * provides best performance.
+ */
+ if (target_perf < min_stat.min_perf)
+ find_pd_cost(pd->em_pd, target_perf, &target_stat);
+ else if (!update_best_cpu(&target_stat, &min_stat, prev_cpu, sd))
continue;
- lsub_positive(&cpu_cap, util);
-
- if (cpu == prev_cpu) {
- /* Always use prev_cpu as a candidate. */
- prev_spare_cap = cpu_cap;
- prev_fits = fits;
- } else if ((fits > max_fits) ||
- ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
- /*
- * Find the CPU with the maximum spare capacity
- * among the remaining CPUs in the performance
- * domain.
- */
- max_spare_cap = cpu_cap;
- max_spare_cap_cpu = cpu;
- max_fits = fits;
- }
+ /* Save the new most efficient CPU of the PD */
+ min_stat = target_stat;
}
- if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+ if (min_stat.cpu == -1)
continue;
- eenv_pd_busy_time(&eenv, cpus, p);
- /* Compute the 'base' energy of the pd, without @p */
- base_energy = compute_energy(&eenv, pd, cpus, p, -1);
+ if (min_stat.fits < best_fits)
+ continue;
- /* Evaluate the energy impact of using prev_cpu. */
- if (prev_spare_cap > -1) {
- prev_delta = compute_energy(&eenv, pd, cpus, p,
- prev_cpu);
- /* CPU utilization has changed */
- if (prev_delta < base_energy)
- goto unlock;
- prev_delta -= base_energy;
- prev_actual_cap = cpu_actual_cap;
- best_delta = min(best_delta, prev_delta);
- }
+ /* Idle system costs nothing */
+ target_stat.max_perf = 0;
+ target_stat.cost = 0;
- /* Evaluate the energy impact of using max_spare_cap_cpu. */
- if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
- /* Current best energy cpu fits better */
- if (max_fits < best_fits)
- continue;
+ /* Estimate utilization and cost without p */
+ for_each_cpu(cpu, cpus) {
+ unsigned long target_util;
- /*
- * Both don't fit performance hint (i.e. uclamp_min)
- * but best energy cpu has better capacity.
- */
- if ((max_fits < 0) &&
- (cpu_actual_cap <= best_actual_cap))
- continue;
+ /* Accumulate actual utilization w/o task p */
+ pd_actual_util += find_cpu_actual_util(cpu, p);
- cur_delta = compute_energy(&eenv, pd, cpus, p,
- max_spare_cap_cpu);
- /* CPU utilization has changed */
- if (cur_delta < base_energy)
- goto unlock;
- cur_delta -= base_energy;
+ /* Get the max utilization of the CPU w/o task p */
+ target_util = find_cpu_max_util(cpu, p, -1);
+ target_util = min(target_util, cpu_actual_cap);
- /*
- * Both fit for the task but best energy cpu has lower
- * energy impact.
- */
- if ((max_fits > 0) && (best_fits > 0) &&
- (cur_delta >= best_delta))
+ /* Current OPP is enough */
+ if (target_util <= target_stat.max_perf)
continue;
- best_delta = cur_delta;
- best_energy_cpu = max_spare_cap_cpu;
- best_fits = max_fits;
- best_actual_cap = cpu_actual_cap;
+ /* Compute and save the cost of the OPP */
+ find_pd_cost(pd->em_pd, target_util, &target_stat);
+ max_cost = target_stat.cost;
}
- }
- rcu_read_unlock();
- if ((best_fits > prev_fits) ||
- ((best_fits > 0) && (best_delta < prev_delta)) ||
- ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
- target = best_energy_cpu;
+ /* Add the energy cost of p */
+ delta_nrg = task_util * min_stat.cost;
- return target;
+ /*
+ * Compute the energy cost of others running at higher OPP
+ * because of p.
+ */
+ if (min_stat.cost > max_cost)
+ delta_nrg += pd_actual_util * (min_stat.cost - max_cost);
+
+ /* Delta energy with p */
+ trace_sched_compute_energy_tp(p, min_stat.cpu, delta_nrg,
+ min_stat.max_perf, pd_actual_util + task_util);
+
+ /*
+ * The probability that delta energies are equals is almost
+ * null. PDs being sorted by max capacity, keep the one with
+ * highest max capacity if this happens.
+ * TODO: add a margin in energy cost and take into account
+ * other stats.
+ */
+ if ((min_stat.fits == best_fits) &&
+ (delta_nrg >= best_nrg))
+ continue;
+
+ best_fits = min_stat.fits;
+ best_nrg = delta_nrg;
+ best_cpu = min_stat.cpu;
+ }
unlock:
rcu_read_unlock();
+ if (best_cpu >= 0)
+ target = best_cpu;
+
return target;
}
--
2.43.0
* Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity
2025-03-02 21:05 ` [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
@ 2025-03-12 14:08 ` Pierre Gondois
2025-03-14 16:24 ` Vincent Guittot
2025-03-25 11:09 ` Pierre Gondois
1 sibling, 1 reply; 31+ messages in thread
From: Pierre Gondois @ 2025-03-12 14:08 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret
Hello Vincent,
On 3/2/25 22:05, Vincent Guittot wrote:
> feec() looks for the CPU with highest spare capacity in a PD assuming that
> it will be the best CPU from a energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is true generally
> speaking, this policy also filters some others CPUs which will be as
> efficients because of using the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
>
> Rework feec() to look 1st for the lowest cost in a PD and then the most
> performant CPU between CPUs. The cost of the OPP remains the only
> comparison criteria between Performance Domains.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> 1 file changed, 246 insertions(+), 220 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3d1a2ba6b1a..a9b97bbc085f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
[...]
> +static bool update_best_cpu(struct energy_cpu_stat *target,
> + struct energy_cpu_stat *min,
> + int prev, struct sched_domain *sd)
> {
> - unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> - unsigned long busy_time = eenv->pd_busy_time;
> - unsigned long energy;
> -
> - if (dst_cpu >= 0)
> - busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> + /* Select the one with the least number of running tasks */
> + if (target->nr_running < min->nr_running)
> + return true;
> + if (target->nr_running > min->nr_running)
> + return false;
>
> - energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> + /* Favor previous CPU otherwise */
> + if (target->cpu == prev)
> + return true;
> + if (min->cpu == prev)
> + return false;
>
> - trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> + /*
> + * Choose CPU with lowest contention. One might want to consider load
> + * instead of runnable but we are supposed to not be overutilized so
> + * there is enough compute capacity for everybody.
> + */
I'm not sure I understand the comment. With UCLAMP_MAX tasks, a CPU can lack
compute capacity while not being tagged as overutilized. IIUC this is actually
the goal of UCLAMP_MAX.
With the following workload:
- 2 tasks A with duty_cycle=30%, UCLAMP_MIN/MAX=(0,1), niceness=0
- 2 tasks B with duty_cycle=70%, UCLAMP_MIN/MAX=(0,1), niceness=-10
The workload runs on a Pixel6 with a reduced cpuset of [1,2,7], i.e. 2 little
CPUs (1,2) capa=160 and one big CPU (7) capa=1024.
CPU7 is avoided by the tasks as their UCLAMP_MAX setting make them fit on the
little CPUs.
select_best_cpu() prefers to place tasks based on nr_running. If the 2 tasks A
end up being placed on one little CPU, and the 2 tasks B are placed on the
other little CPU, feec() is theoretically unable to balance the workload.
In practice, a kworker ends up spawning on one of these 2 little CPUs and tasks
are shuffled, so the pattern breaks after ~30ms.
This pattern seems problematic as tasks A are:
- smaller (30% < 70%)
- nicer (0 > -10)
than tasks B. So I assume the correct task placement should be one task of each
type on each little CPU.
------
There are some comments in the load balancer code:
1.
/* Computing avg_load makes sense only when group is overloaded */
2.
/*
* Computing avg_load makes sense only when group is fully busy or
* overloaded
*/
IIUC, the load is only meaningful when there is not enough compute capacity; otherwise util_avg makes more sense to estimate the task size. It seems that
when it comes to UCLAMP_MAX tasks, CPUs are placed in this exact situation:
load_avg makes more sense than util_avg.
However, in this situation, energy computations also lose sense since they
are based on the util_avg values.
------
select_best_cpu() could check the CPU load before checking nr_running, but it
would be meaningless if there is enough CPU time for all the tasks.
Maybe CPU load should then be checked only if the system doesn't have enough
CPU time. But this would be equivalent to:
- remove UCLAMP_MAX in cpu_overutilized()
- when the system is overutilized (no UCLAMP_MAX involved), go back to the
load balancer
In other words, I don't really see how it is possible to reconcile UCLAMP_MAX
tasks with EAS, as EAS relies on util_avg values while UCLAMP_MAX forces
relying on load_avg rather than util_avg.
Regards,
Pierre
> + if ((target->runnable * min->capa * sd->imbalance_pct) >=
> + (min->runnable * target->capa * 100))
> + return false;
>
> - return energy;
> + return true;
> }
* Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity
2025-03-12 14:08 ` Pierre Gondois
@ 2025-03-14 16:24 ` Vincent Guittot
2025-03-16 20:21 ` Pierre Gondois
0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2025-03-14 16:24 UTC (permalink / raw)
To: Pierre Gondois
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret
On Wed, 12 Mar 2025 at 15:09, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 3/2/25 22:05, Vincent Guittot wrote:
> > feec() looks for the CPU with highest spare capacity in a PD assuming that
> > it will be the best CPU from a energy efficiency PoV because it will
> > require the smallest increase of OPP. Although this is true generally
> > speaking, this policy also filters some others CPUs which will be as
> > efficients because of using the same OPP.
> > In fact, we really care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result using the same energy cost. In
> > these cases, we can use other metrics to select the best CPU for the same
> > energy cost.
> >
> > Rework feec() to look 1st for the lowest cost in a PD and then the most
> > performant CPU between CPUs. The cost of the OPP remains the only
> > comparison criteria between Performance Domains.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> > 1 file changed, 246 insertions(+), 220 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d3d1a2ba6b1a..a9b97bbc085f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
>
> [...]
>
> > +static bool update_best_cpu(struct energy_cpu_stat *target,
> > + struct energy_cpu_stat *min,
> > + int prev, struct sched_domain *sd)
> > {
> > - unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> > - unsigned long busy_time = eenv->pd_busy_time;
> > - unsigned long energy;
> > -
> > - if (dst_cpu >= 0)
> > - busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> > + /* Select the one with the least number of running tasks */
> > + if (target->nr_running < min->nr_running)
> > + return true;
> > + if (target->nr_running > min->nr_running)
> > + return false;
> >
> > - energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> > + /* Favor previous CPU otherwise */
> > + if (target->cpu == prev)
> > + return true;
> > + if (min->cpu == prev)
> > + return false;
> >
> > - trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> > + /*
> > + * Choose CPU with lowest contention. One might want to consider load
> > + * instead of runnable but we are supposed to not be overutilized so
> > + * there is enough compute capacity for everybody.
> > + */
>
> I'm not sure I understand the comment. With UCLAMP_MAX tasks, a CPU can lack
> compute capacity while not being tagged as overutilized. IIUC this is actually
> the goal of UCLAMP_MAX.
the uclamp_max says that the task doesn't need more than a compute
capacity of 1 so there is no lack
>
> With the following workload:
> - 2 tasks A with duty_cycle=30%, UCLAMP_MIN/MAX=(0,1), niceness=0
> - 2 tasks B with duty_cycle=70%, UCLAMP_MIN/MAX=(0,1), niceness=-10
Does the duty cycle make any difference here? They won't be able to
run 30% or 70% anyway because of their uclamp_max, will they?
> The workload runs on a Pixel6 with a reduced cpuset of [1,2,7], i.e. 2 little
> CPUs (1,2) capa=160 and one big CPU (7) capa=1024.
> CPU7 is avoided by the tasks as their UCLAMP_MAX setting make them fit on the
> little CPUs.
>
> select_best_cpu() prefers to place tasks based on nr_running. If the 2 tasks A
> end up being placed on one little CPU, and the 2 tasks B are placed on the
> other little CPU, feec() is theoretically unable to balance the workload.
They will all have 80 compute capacity, which is more than their uclamp_max = 1
> In practice, a kworker ends up spawning on one of these 2 little CPUs and tasks
> are shuffled, so the pattern breaks after ~30ms.
>
> This pattern seems problematic as tasks A are:
> - smaller (30% < 70%)
> - nicer (0 > -10)
> than tasks B. So I assume the correct task placement should be one task of each
> type on each little CPU.
>
> ------
>
> There are some comments in the load balancer code:
> 1.
> /* Computing avg_load makes sense only when group is overloaded */
> 2.
> /*
> * Computing avg_load makes sense only when group is fully busy or
> * overloaded
> */
>
> IIUC, the load is only meaningful when there is not enough compute capacity
> to estimate the task size, otherwise util_avg makes more sense. It seems that
> when it comes to UCLAMP_MAX task, CPUs are placed in this exact situation:
> load_avg makes more sense that util_avg.
> However, in this situation, energy computations also lose sense since they
> are based on the util_avg values.
>
> ------
>
> select_best_cpu() could check the CPU load before checking nr_running, but it
> would be meaningless if there is enough CPU time for all the tasks.
>
> Maybe CPU load should then be checked only if the system doesn't have enough
> CPU time. But this would be equivalent to:
> - remove UCLAMP_MAX in cpu_overutilized()
> - when the system is overutilized (no UCLAMP_MAX involved), go back to the
> load balancer
>
> In other words, I don't really see how it is possible to reconciliate UCLAMP_MAX
> tasks with EAS as EAS relies on util_avg values, and UCLAMP_MAX forces to
> rely on load_avg value rather than util_avg.
>
> Regards,
> Pierre
>
>
> > + if ((target->runnable * min->capa * sd->imbalance_pct) >=
> > + (min->runnable * target->capa * 100))
> > + return false;
> >
> > - return energy;
> > + return true;
> > }
* Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity
2025-03-14 16:24 ` Vincent Guittot
@ 2025-03-16 20:21 ` Pierre Gondois
0 siblings, 0 replies; 31+ messages in thread
From: Pierre Gondois @ 2025-03-16 20:21 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret
On 3/14/25 17:24, Vincent Guittot wrote:
> On Wed, 12 Mar 2025 at 15:09, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> On 3/2/25 22:05, Vincent Guittot wrote:
>>> feec() looks for the CPU with highest spare capacity in a PD assuming that
>>> it will be the best CPU from a energy efficiency PoV because it will
>>> require the smallest increase of OPP. Although this is true generally
>>> speaking, this policy also filters some others CPUs which will be as
>>> efficients because of using the same OPP.
>>> In fact, we really care about the cost of the new OPP that will be
>>> selected to handle the waking task. In many cases, several CPUs will end
>>> up selecting the same OPP and as a result using the same energy cost. In
>>> these cases, we can use other metrics to select the best CPU for the same
>>> energy cost.
>>>
>>> Rework feec() to look 1st for the lowest cost in a PD and then the most
>>> performant CPU between CPUs. The cost of the OPP remains the only
>>> comparison criteria between Performance Domains.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>> kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>>> 1 file changed, 246 insertions(+), 220 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d3d1a2ba6b1a..a9b97bbc085f 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>
>> [...]
>>
>>> +static bool update_best_cpu(struct energy_cpu_stat *target,
>>> + struct energy_cpu_stat *min,
>>> + int prev, struct sched_domain *sd)
>>> {
>>> - unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
>>> - unsigned long busy_time = eenv->pd_busy_time;
>>> - unsigned long energy;
>>> -
>>> - if (dst_cpu >= 0)
>>> - busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
>>> + /* Select the one with the least number of running tasks */
>>> + if (target->nr_running < min->nr_running)
>>> + return true;
>>> + if (target->nr_running > min->nr_running)
>>> + return false;
>>>
>>> - energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
>>> + /* Favor previous CPU otherwise */
>>> + if (target->cpu == prev)
>>> + return true;
>>> + if (min->cpu == prev)
>>> + return false;
>>>
>>> - trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
>>> + /*
>>> + * Choose CPU with lowest contention. One might want to consider load
>>> + * instead of runnable but we are supposed to not be overutilized so
>>> + * there is enough compute capacity for everybody.
>>> + */
>>
>> I'm not sure I understand the comment. With UCLAMP_MAX tasks, a CPU can lack
>> compute capacity while not being tagged as overutilized. IIUC this is actually
>> the goal of UCLAMP_MAX.
>
> the uclamp_max says that the task doesn't need more than a compute
> capacity of 1 so there is no lack
If UCLAMP_MAX was a bandwidth hint and the task was requesting 1/1024 ~= 0.1%
of the bandwidth, then effectively there would be no lack.
However, if this is a performance hint, then the CPU will run at its lowest
available performance level. I don't think there is any guarantee on the
bandwidth in that case.
>
>>
>> With the following workload:
>> - 2 tasks A with duty_cycle=30%, UCLAMP_MIN/MAX=(0,1), niceness=0
>> - 2 tasks B with duty_cycle=70%, UCLAMP_MIN/MAX=(0,1), niceness=-10
>
> Does the duty cycle make any difference here ? they won't be able to
> run 30% or 70% anyway because of their uclamp_max, will they ?
No indeed, this will make no difference.
>
>> The workload runs on a Pixel6 with a reduced cpuset of [1,2,7], i.e. 2 little
>> CPUs (1,2) capa=160 and one big CPU (7) capa=1024.
>> CPU7 is avoided by the tasks as their UCLAMP_MAX setting make them fit on the
>> little CPUs.
>>
>> select_best_cpu() prefers to place tasks based on nr_running. If the 2 tasks A
>> end up being placed on one little CPU, and the 2 tasks B are placed on the
>> other little CPU, feec() is theoretically unable to balance the workload.
>
> They will all have 80 compute capacity which is more than their uclamp_max= 1
Cf. above, if UCLAMP_MAX is a performance hint, then there should be no guarantee
on the received bandwidth. If we take the same example with UCLAMP_MAX=81, this
should still hold, yet the received bandwidth will be 80 util.
(Small note that with UCLAMP_MAX=1, the little CPUs should run at a performance
level lower than 160. But the example should still hold if the little CPUs only
have one OPP at 160.)
------
To go further in that direction, balancing with the number of tasks + runnable
seems like a similar version of the load balancer's group_overloaded case. We're
trying to share the bandwidth equally among UCLAMP_MAX tasks, but without taking
into account the niceness of tasks.
EAS didn't have to handle the case of compute capacity shortage until UCLAMP_MAX
came in: EAS was only active when there was enough compute capacity due to the
overutilized threshold.
Doing load balancing among UCLAMP_MAX tasks in feec() seems like a complex
exercise: what if the pd topology is: 2 little CPUs + 2 little CPUs + 1 big CPU.
AFAICS, UCLAMP_MAX tasks could all start on one of the little CPUs' pd and never
be balanced toward the second little CPU's pd as we are looking for the CPU with
the lowest #tasks inside one pd (assuming that all little CPUs are running at
the same, lowest OPP).
>
>> In practice, a kworker ends up spawning on one of these 2 little CPUs and tasks
>> are shuffled, so the pattern breaks after ~30ms.
>>
>> This pattern seems problematic as tasks A are:
>> - smaller (30% < 70%)
>> - nicer (0 > -10)
>> than tasks B. So I assume the correct task placement should be one task of each
>> type on each little CPU.
>>
>> ------
>>
>> There are some comments in the load balancer code:
>> 1.
>> /* Computing avg_load makes sense only when group is overloaded */
>> 2.
>> /*
>> * Computing avg_load makes sense only when group is fully busy or
>> * overloaded
>> */
>>
>> IIUC, the load is only meaningful when there is not enough compute capacity
>> to estimate the task size, otherwise util_avg makes more sense. It seems that
>> when it comes to UCLAMP_MAX task, CPUs are placed in this exact situation:
>> load_avg makes more sense that util_avg.
>> However, in this situation, energy computations also lose sense since they
>> are based on the util_avg values.
>>
>> ------
>>
>> select_best_cpu() could check the CPU load before checking nr_running, but it
>> would be meaningless if there is enough CPU time for all the tasks.
>>
>> Maybe CPU load should then be checked only if the system doesn't have enough
>> CPU time. But this would be equivalent to:
>> - remove UCLAMP_MAX in cpu_overutilized()
>> - when the system is overutilized (no UCLAMP_MAX involved), go back to the
>> load balancer
>>
>> In other words, I don't really see how it is possible to reconciliate UCLAMP_MAX
>> tasks with EAS as EAS relies on util_avg values, and UCLAMP_MAX forces to
>> rely on load_avg value rather than util_avg.
>>
>> Regards,
>> Pierre
>>
>>
>>> + if ((target->runnable * min->capa * sd->imbalance_pct) >=
>>> + (min->runnable * target->capa * 100))
>>> + return false;
>>>
>>> - return energy;
>>> + return true;
>>> }
* Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity
2025-03-02 21:05 ` [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
2025-03-12 14:08 ` Pierre Gondois
@ 2025-03-25 11:09 ` Pierre Gondois
1 sibling, 0 replies; 31+ messages in thread
From: Pierre Gondois @ 2025-03-25 11:09 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret
Hello Vincent,
On 3/2/25 22:05, Vincent Guittot wrote:
> feec() looks for the CPU with highest spare capacity in a PD assuming that
> it will be the best CPU from a energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is true generally
> speaking, this policy also filters some others CPUs which will be as
> efficients because of using the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
>
> Rework feec() to look 1st for the lowest cost in a PD and then the most
> performant CPU between CPUs. The cost of the OPP remains the only
> comparison criteria between Performance Domains.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> 1 file changed, 246 insertions(+), 220 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3d1a2ba6b1a..a9b97bbc085f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
[...]
>
> - if ((best_fits > prev_fits) ||
> - ((best_fits > 0) && (best_delta < prev_delta)) ||
> - ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
> - target = best_energy_cpu;
> + /* Add the energy cost of p */
> + delta_nrg = task_util * min_stat.cost;
Not capping task_util to the target CPU's max capacity when computing energy
means it is considered acceptable to have fully-utilized CPUs.
Just to re-advertise what can happen with utilization values of CPUs that are
fully utilized:
On a Pixel6, placing 3 tasks with:
period=16ms, duty_cycle=100%, UCLAMP_MAX=700, cpuset=5,6
I.e. the tasks can run on one medium/big core. UCLAMP_MAX is set such as the
system doesn't become overutilized.
Tasks are going to bounce between the medium and big CPUs; if we follow one task:
1. The task will grow to reach a maximum utilization of ~340 (=1024/3 due to
the pressure of other tasks)
2. As the big CPU is less energy efficient than the medium CPU, the big CPU
will eventually reach an OPP where it will be better to run on the medium CPU
energy-wise
3. The task is migrated to the medium CPU. However the task can now grow
its utilization since there is no pressure from other tasks. So the
utilization of the task slowly grows.
4. The medium CPU reaches an OPP where it is more energy efficient to migrate
the task to the big CPU. We can go back to step 1.
As the utilization is only a reflection of how much CPU time the task received,
util_avg depends on the #tasks/niceness of the co-scheduled tasks. Thus, it is
really hard to make any assumption based on this value. UCLAMP_MAX puts tasks
in the exact situation where util_avg doesn't represent the size of the task
anymore.
------
In this example, niceness is not taken into account, but cf. the other thread,
other scenarios could be created where the task placement is incorrect/unstable.
EAS mainly relies on util_avg, so having correct task estimations should be the
priority.
------
As you and Christian, I think, briefly evoked as an idea, giving UCLAMP_MAX
tasks the SCHED_IDLE policy could maybe help. However, this implies:
1. not being able to set UCLAMP_MIN and UCLAMP_MAX at the same time for a task
Indeed, someone might want to keep SCHED_NORMAL for a task with UCLAMP_MIN set
2. not using feec() for UCLAMP_MAX tasks.
Indeed, SCHED_IDLE tasks will likely have wrong util_avg values since they don't
have much running time. So doing energy computation using these values would not
work.
Another advantage is that UCLAMP_MAX tasks will impact the util_avg of
non-UCLAMP_MAX tasks only lightly as their running time will be really low.
Balancing CPUs based on h_nr_idle rather than h_nr_runnable would also allow
to have ~the same number of UCLAMP_MAX tasks on all CPUs of the same
type/capacity.
------
At the risk of being a bit insistent, I don't see how UCLAMP_MAX tasks could
be placed using EAS. Either EAS or UCLAMP_MAX needs to change...
* [PATCH 4/7 v5] energy model: Remove unused em_cpu_energy()
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
` (2 preceding siblings ...)
2025-03-02 21:05 ` [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
@ 2025-03-02 21:05 ` Vincent Guittot
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
` (4 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2025-03-02 21:05 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
Remove the unused function em_cpu_energy()
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
include/linux/energy_model.h | 99 ------------------------------------
1 file changed, 99 deletions(-)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 551e243b9c43..d0adabba2c56 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -236,99 +236,6 @@ em_pd_get_previous_state(struct em_perf_state *table,
return -1;
}
-/**
- * em_cpu_energy() - Estimates the energy consumed by the CPUs of a
- * performance domain
- * @pd : performance domain for which energy has to be estimated
- * @max_util : highest utilization among CPUs of the domain
- * @sum_util : sum of the utilization of all CPUs in the domain
- * @allowed_cpu_cap : maximum allowed CPU capacity for the @pd, which
- * might reflect reduced frequency (due to thermal)
- *
- * This function must be used only for CPU devices. There is no validation,
- * i.e. if the EM is a CPU type and has cpumask allocated. It is called from
- * the scheduler code quite frequently and that is why there is not checks.
- *
- * Return: the sum of the energy consumed by the CPUs of the domain assuming
- * a capacity state satisfying the max utilization of the domain.
- */
-static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
- unsigned long max_util, unsigned long sum_util,
- unsigned long allowed_cpu_cap)
-{
- struct em_perf_table *em_table;
- struct em_perf_state *ps;
- int i;
-
-#ifdef CONFIG_SCHED_DEBUG
- WARN_ONCE(!rcu_read_lock_held(), "EM: rcu read lock needed\n");
-#endif
-
- if (!sum_util)
- return 0;
-
- /*
- * In order to predict the performance state, map the utilization of
- * the most utilized CPU of the performance domain to a requested
- * performance, like schedutil. Take also into account that the real
- * performance might be set lower (due to thermal capping). Thus, clamp
- * max utilization to the allowed CPU capacity before calculating
- * effective performance.
- */
- max_util = min(max_util, allowed_cpu_cap);
-
- /*
- * Find the lowest performance state of the Energy Model above the
- * requested performance.
- */
- em_table = rcu_dereference(pd->em_table);
- i = em_pd_get_efficient_state(em_table->state, pd, max_util);
- ps = &em_table->state[i];
-
- /*
- * The performance (capacity) of a CPU in the domain at the performance
- * state (ps) can be computed as:
- *
- * ps->freq * scale_cpu
- * ps->performance = -------------------- (1)
- * cpu_max_freq
- *
- * So, ignoring the costs of idle states (which are not available in
- * the EM), the energy consumed by this CPU at that performance state
- * is estimated as:
- *
- * ps->power * cpu_util
- * cpu_nrg = -------------------- (2)
- * ps->performance
- *
- * since 'cpu_util / ps->performance' represents its percentage of busy
- * time.
- *
- * NOTE: Although the result of this computation actually is in
- * units of power, it can be manipulated as an energy value
- * over a scheduling period, since it is assumed to be
- * constant during that interval.
- *
- * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
- * of two terms:
- *
- * ps->power * cpu_max_freq
- * cpu_nrg = ------------------------ * cpu_util (3)
- * ps->freq * scale_cpu
- *
- * The first term is static, and is stored in the em_perf_state struct
- * as 'ps->cost'.
- *
- * Since all CPUs of the domain have the same micro-architecture, they
- * share the same 'ps->cost', and the same CPU capacity. Hence, the
- * total energy of the domain (which is the simple sum of the energy of
- * all of its CPUs) can be factorized as:
- *
- * pd_nrg = ps->cost * \Sum cpu_util (4)
- */
- return ps->cost * sum_util;
-}
-
/**
* em_pd_nr_perf_states() - Get the number of performance states of a perf.
* domain
@@ -394,12 +301,6 @@ em_pd_get_previous_state(struct em_perf_state *table,
{
return -1;
}
-static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
- unsigned long max_util, unsigned long sum_util,
- unsigned long allowed_cpu_cap)
-{
- return 0;
-}
static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
{
return 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread* [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
` (3 preceding siblings ...)
2025-03-02 21:05 ` [PATCH 4/7 v5] energy model: Remove unused em_cpu_energy() Vincent Guittot
@ 2025-03-02 21:05 ` Vincent Guittot
2025-03-07 12:51 ` kernel test robot
` (6 more replies)
2025-03-02 21:05 ` [PATCH 6/7 v5] sched/fair: Add misfit case to push task mecanism " Vincent Guittot
` (3 subsequent siblings)
8 siblings, 7 replies; 31+ messages in thread
From: Vincent Guittot @ 2025-03-02 21:05 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
EAS is based on wakeup events to efficiently place tasks on the system, but
there are cases where a task no longer has wakeup events, or has them at far
too low a pace. For such situations, we can take advantage of the task being
put back in the enqueued list to check whether it should be pushed to another
CPU. When the task is alone on the CPU, it is never put back in the enqueued
list; in this special case, we use the tick to run the check.
Wakeup events remain the main way to migrate tasks, but we now detect
situations where a task is stuck on a CPU by checking that its utilization
is larger than the max available compute capacity (max CPU capacity or
uclamp max setting).
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 +
2 files changed, 222 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9b97bbc085f..c3e383b86808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
static void set_next_buddy(struct sched_entity *se);
/*
@@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
h_nr_idle = task_has_idle_policy(p);
if (task_sleep || task_delayed || !se->sched_delayed)
h_nr_runnable = 1;
+
+ fair_remove_pushable_task(rq, p);
} else {
cfs_rq = group_cfs_rq(se);
slice = cfs_rq_min_slice(cfs_rq);
@@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
return target;
}
+static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
+{
+ unsigned long max_capa, util;
+
+ max_capa = min(get_actual_cpu_capacity(cpu),
+ uclamp_eff_value(p, UCLAMP_MAX));
+ util = max(task_util_est(p), task_runnable(p));
+
+ /*
+ * Return true only if the task might not sleep/wakeup because of a low
+ * compute capacity. Tasks, which wake up regularly, will be handled by
+ * feec().
+ */
+ return (util > max_capa);
+}
+
+static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
+{
+ if (p->nr_cpus_allowed == 1)
+ return false;
+
+ if (is_rd_overutilized(rq->rd))
+ return false;
+
+ if (task_stuck_on_cpu(p, cpu_of(rq)))
+ return true;
+
+ return false;
+}
+
+static int active_load_balance_cpu_stop(void *data);
+
+static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
+{
+ int new_cpu, cpu = cpu_of(rq);
+
+ if (!sched_energy_enabled())
+ return;
+
+ if (WARN_ON(!p))
+ return;
+
+ if (WARN_ON(!task_current(rq, p)))
+ return;
+
+ if (is_migration_disabled(p))
+ return;
+
+ /* If there are several tasks, wait until the task is put back */
+ if (rq->nr_running > 1)
+ return;
+
+ if (!sched_energy_push_task(p, rq))
+ return;
+
+ new_cpu = find_energy_efficient_cpu(p, cpu);
+
+ if (new_cpu == cpu)
+ return;
+
+ /*
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
+ */
+ if (!rq->active_balance) {
+ rq->active_balance = 1;
+ rq->push_cpu = new_cpu;
+ } else
+ return;
+
+ raw_spin_rq_unlock(rq);
+ stop_one_cpu_nowait(cpu,
+ active_load_balance_cpu_stop, rq,
+ &rq->active_balance_work);
+ raw_spin_rq_lock(rq);
+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+ return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->cfs.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+
+ /*
+ * Remove the task from the pushable list as we try only once after
+ * the task has been put back in the enqueued list.
+ */
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+ return p;
+}
+
+/*
+ * See if the non-running fair tasks on this rq can be sent to other CPUs
+ * that fit better with their profile.
+ */
+static bool push_fair_task(struct rq *rq)
+{
+ struct task_struct *next_task;
+ int prev_cpu, new_cpu;
+ struct rq *new_rq;
+
+ next_task = pick_next_pushable_fair_task(rq);
+ if (!next_task)
+ return false;
+
+ if (is_migration_disabled(next_task))
+ return true;
+
+ /* We might release rq lock */
+ get_task_struct(next_task);
+
+ prev_cpu = rq->cpu;
+
+ new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
+
+ if (new_cpu == prev_cpu)
+ goto out;
+
+ new_rq = cpu_rq(new_cpu);
+
+ if (double_lock_balance(rq, new_rq)) {
+ /* The task has already migrated in between */
+ if (task_cpu(next_task) != rq->cpu) {
+ double_unlock_balance(rq, new_rq);
+ goto out;
+ }
+
+ deactivate_task(rq, next_task, 0);
+ set_task_cpu(next_task, new_cpu);
+ activate_task(new_rq, next_task, 0);
+
+ resched_curr(new_rq);
+
+ double_unlock_balance(rq, new_rq);
+ }
+
+out:
+ put_task_struct(next_task);
+
+ return true;
+}
+
+static void push_fair_tasks(struct rq *rq)
+{
+ /* push_fair_task() will return true if it moved a fair task */
+ while (push_fair_task(rq))
+ ;
+}
+
+static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
+
+static inline void fair_queue_pushable_tasks(struct rq *rq)
+{
+ if (!sched_energy_enabled() || !has_pushable_tasks(rq))
+ return;
+
+ queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
+}
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ if (sched_energy_enabled())
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+}
+
+static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ if (sched_energy_enabled() && task_on_rq_queued(p) && !p->se.sched_delayed) {
+ if (sched_energy_push_task(p, rq)) {
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+ plist_node_init(&p->pushable_tasks, p->prio);
+ plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+ }
+ }
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8758,6 +8952,10 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
return sched_balance_newidle(rq, rf) != 0;
}
#else
+static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
+static inline void fair_queue_pushable_tasks(struct rq *rq) {}
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
+static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
#endif /* CONFIG_SMP */
@@ -8947,6 +9145,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
put_prev_entity(cfs_rq, pse);
set_next_entity(cfs_rq, se);
+ /*
+ * The previous task might be eligible for being pushed on
+ * another cpu if it is still active.
+ */
+ fair_add_pushable_task(rq, prev);
+
__set_next_task_fair(rq, p, true);
}
@@ -9019,6 +9223,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
+
+ /*
+ * The previous task might be eligible for being pushed on another cpu
+ * if it is still active.
+ */
+ fair_add_pushable_task(rq, prev);
+
}
/*
@@ -13151,6 +13362,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+ check_pushable_task(curr, rq);
update_misfit_status(curr, rq);
check_update_overutilized_status(task_rq(curr));
@@ -13303,6 +13515,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
{
struct sched_entity *se = &p->se;
+ fair_remove_pushable_task(rq, p);
+
#ifdef CONFIG_SMP
if (task_on_rq_queued(p)) {
/*
@@ -13320,6 +13534,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
if (hrtick_enabled_fair(rq))
hrtick_start_fair(rq, p);
+ /*
+ * Try to push the prev task before checking misfit for the next task
+ * as the migration of prev can make next fit the CPU.
+ */
+ fair_queue_pushable_tasks(rq);
update_misfit_status(p, rq);
sched_fair_update_stop_tick(rq, p);
}
@@ -13350,6 +13569,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
cfs_rq->min_vruntime = (u64)(-(1LL << 20));
#ifdef CONFIG_SMP
+ plist_head_init(&cfs_rq->pushable_tasks);
raw_spin_lock_init(&cfs_rq->removed.lock);
#endif
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab16d3d0e51c..2db198dccf21 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -722,6 +722,8 @@ struct cfs_rq {
struct list_head leaf_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */
+ struct plist_head pushable_tasks;
+
/* Locally cached copy of our task_group's idle value */
int idle;
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
@ 2025-03-07 12:51 ` kernel test robot
2025-03-10 12:47 ` kernel test robot
` (5 subsequent siblings)
6 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2025-03-07 12:51 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: oe-kbuild-all, qyousef, hongyan.xia2, christian.loehle,
luis.machado, qperret, Vincent Guittot
Hi Vincent,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on peterz-queue/sched/core linus/master v6.14-rc5 next-20250306]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vincent-Guittot/sched-fair-Filter-false-overloaded_group-case-for-EAS/20250303-050850
base: tip/sched/core
patch link: https://lore.kernel.org/r/20250302210539.1563190-6-vincent.guittot%40linaro.org
patch subject: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
config: x86_64-randconfig-003-20250307 (https://download.01.org/0day-ci/archive/20250307/202503072035.8PEXiAFe-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250307/202503072035.8PEXiAFe-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503072035.8PEXiAFe-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/sched/fair.c: In function 'has_pushable_tasks':
kernel/sched/fair.c:8675:42: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8675 | return !plist_head_empty(&rq->cfs.pushable_tasks);
| ^
In file included from include/linux/kernel.h:22,
from include/linux/cpumask.h:11,
from include/linux/energy_model.h:4,
from kernel/sched/fair.c:23:
kernel/sched/fair.c: In function 'pick_next_pushable_fair_task':
kernel/sched/fair.c:8685:39: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ^
include/linux/container_of.h:19:33: note: in definition of macro 'container_of'
19 | void *__mptr = (void *)(ptr); \
| ^~~
kernel/sched/fair.c:8685:13: note: in expansion of macro 'plist_first_entry'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ^~~~~~~~~~~~~~~~~
In file included from include/linux/container_of.h:5,
from include/linux/kernel.h:22,
from include/linux/cpumask.h:11,
from include/linux/energy_model.h:4,
from kernel/sched/fair.c:23:
kernel/sched/fair.c:8685:39: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ^
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/plist.h:233:9: note: in expansion of macro 'container_of'
233 | container_of(plist_first(head), type, member)
| ^~~~~~~~~~~~
kernel/sched/fair.c:8685:13: note: in expansion of macro 'plist_first_entry'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ^~~~~~~~~~~~~~~~~
kernel/sched/fair.c:8685:39: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ^
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | __same_type(*(ptr), void), \
| ^~~~~~~~~~~
include/linux/plist.h:233:9: note: in expansion of macro 'container_of'
233 | container_of(plist_first(head), type, member)
| ^~~~~~~~~~~~
kernel/sched/fair.c:8685:13: note: in expansion of macro 'plist_first_entry'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ^~~~~~~~~~~~~~~~~
>> include/linux/compiler_types.h:483:27: error: expression in static assertion is not an integer
483 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:20:9: note: in expansion of macro 'static_assert'
20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:20:23: note: in expansion of macro '__same_type'
20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/plist.h:233:9: note: in expansion of macro 'container_of'
233 | container_of(plist_first(head), type, member)
| ^~~~~~~~~~~~
kernel/sched/fair.c:8685:13: note: in expansion of macro 'plist_first_entry'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ^~~~~~~~~~~~~~~~~
kernel/sched/fair.c:8697:47: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8697 | plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ^
kernel/sched/fair.c: In function 'fair_remove_pushable_task':
kernel/sched/fair.c:8772:55: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8772 | plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ^
kernel/sched/fair.c: In function 'fair_add_pushable_task':
kernel/sched/fair.c:8779:63: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8779 | plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ^
kernel/sched/fair.c:8781:63: error: 'struct cfs_rq' has no member named 'pushable_tasks'
8781 | plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ^
kernel/sched/fair.c: In function 'init_cfs_rq':
kernel/sched/fair.c:13572:32: error: 'struct cfs_rq' has no member named 'pushable_tasks'
13572 | plist_head_init(&cfs_rq->pushable_tasks);
| ^~
kernel/sched/fair.c: In function 'has_pushable_tasks':
kernel/sched/fair.c:8676:1: warning: control reaches end of non-void function [-Wreturn-type]
8676 | }
| ^
vim +483 include/linux/compiler_types.h
eb111869301e15 Rasmus Villemoes 2019-09-13 481
d15155824c5014 Will Deacon 2017-10-24 482 /* Are two types/vars the same type (ignoring qualifiers)? */
d15155824c5014 Will Deacon 2017-10-24 @483 #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
d15155824c5014 Will Deacon 2017-10-24 484
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
2025-03-07 12:51 ` kernel test robot
@ 2025-03-10 12:47 ` kernel test robot
2025-03-10 18:20 ` Shrikanth Hegde
` (4 subsequent siblings)
6 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2025-03-10 12:47 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: llvm, oe-kbuild-all, qyousef, hongyan.xia2, christian.loehle,
luis.machado, qperret, Vincent Guittot
Hi Vincent,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on peterz-queue/sched/core linus/master v6.14-rc6 next-20250307]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vincent-Guittot/sched-fair-Filter-false-overloaded_group-case-for-EAS/20250303-050850
base: tip/sched/core
patch link: https://lore.kernel.org/r/20250302210539.1563190-6-vincent.guittot%40linaro.org
patch subject: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
config: arm-realview_defconfig (https://download.01.org/0day-ci/archive/20250310/202503102022.MhverD5b-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250310/202503102022.MhverD5b-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503102022.MhverD5b-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/sched/fair.c:8675:36: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8675 | return !plist_head_empty(&rq->cfs.pushable_tasks);
| ~~~~~~~ ^
kernel/sched/fair.c:8685:33: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ~~~~~~~ ^
include/linux/plist.h:233:27: note: expanded from macro 'plist_first_entry'
233 | container_of(plist_first(head), type, member)
| ^~~~
include/linux/container_of.h:19:26: note: expanded from macro 'container_of'
19 | void *__mptr = (void *)(ptr); \
| ^~~
kernel/sched/fair.c:8685:33: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ~~~~~~~ ^
include/linux/plist.h:233:27: note: expanded from macro 'plist_first_entry'
233 | container_of(plist_first(head), type, member)
| ^~~~
include/linux/container_of.h:20:30: note: expanded from macro 'container_of'
20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~
include/linux/compiler_types.h:483:63: note: expanded from macro '__same_type'
483 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
| ^
include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
| ^~~~
include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
kernel/sched/fair.c:8685:33: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8685 | p = plist_first_entry(&rq->cfs.pushable_tasks,
| ~~~~~~~ ^
include/linux/plist.h:233:27: note: expanded from macro 'plist_first_entry'
233 | container_of(plist_first(head), type, member)
| ^~~~
include/linux/container_of.h:21:23: note: expanded from macro 'container_of'
21 | __same_type(*(ptr), void), \
| ^~~
include/linux/compiler_types.h:483:63: note: expanded from macro '__same_type'
483 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
| ^
include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
| ^~~~
include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
kernel/sched/fair.c:8697:41: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8697 | plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ~~~~~~~ ^
kernel/sched/fair.c:8772:42: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8772 | plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ~~~~~~~ ^
kernel/sched/fair.c:8779:43: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8779 | plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ~~~~~~~ ^
kernel/sched/fair.c:8781:43: error: no member named 'pushable_tasks' in 'struct cfs_rq'
8781 | plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
| ~~~~~~~ ^
kernel/sched/fair.c:13572:27: error: no member named 'pushable_tasks' in 'struct cfs_rq'
13572 | plist_head_init(&cfs_rq->pushable_tasks);
| ~~~~~~ ^
9 errors generated.
vim +8675 kernel/sched/fair.c
8672
8673 static inline int has_pushable_tasks(struct rq *rq)
8674 {
> 8675 return !plist_head_empty(&rq->cfs.pushable_tasks);
8676 }
8677
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
2025-03-07 12:51 ` kernel test robot
2025-03-10 12:47 ` kernel test robot
@ 2025-03-10 18:20 ` Shrikanth Hegde
2025-03-11 16:27 ` Vincent Guittot
2025-03-19 15:26 ` Valentin Schneider
` (3 subsequent siblings)
6 siblings, 1 reply; 31+ messages in thread
From: Shrikanth Hegde @ 2025-03-10 18:20 UTC (permalink / raw)
To: Vincent Guittot
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Hi Vincent, trying to understand this series. I see most of the places
are guarded by sched_energy_enabled() checks, so it shouldn't affect
non-EAS systems.
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore or at a far
> too low pace. For such situation, we can take advantage of the task being
> put back in the enqueued list to check if it should be pushed on another
> CPU. When the task is alone on the CPU, it's never put back in the enqueued
> list; In this special case, we use the tick to run the check.
>
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 2 +
> 2 files changed, 222 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a9b97bbc085f..c3e383b86808 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> static void set_next_buddy(struct sched_entity *se);
>
> /*
> @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> h_nr_idle = task_has_idle_policy(p);
> if (task_sleep || task_delayed || !se->sched_delayed)
> h_nr_runnable = 1;
> +
> + fair_remove_pushable_task(rq, p);
> } else {
> cfs_rq = group_cfs_rq(se);
> slice = cfs_rq_min_slice(cfs_rq);
> @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> return target;
> }
>
> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned long max_capa, util;
> +
> + max_capa = min(get_actual_cpu_capacity(cpu),
> + uclamp_eff_value(p, UCLAMP_MAX));
> + util = max(task_util_est(p), task_runnable(p));
> +
> + /*
> + * Return true only if the task might not sleep/wakeup because of a low
> + * compute capacity. Tasks, which wake up regularly, will be handled by
> + * feec().
> + */
> + return (util > max_capa);
> +}
> +
> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (p->nr_cpus_allowed == 1)
> + return false;
> +
> + if (is_rd_overutilized(rq->rd))
> + return false;
> +
> + if (task_stuck_on_cpu(p, cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +static int active_load_balance_cpu_stop(void *data);
> +
> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> +{
> + int new_cpu, cpu = cpu_of(rq);
> +
> + if (!sched_energy_enabled())
> + return;
> +
> + if (WARN_ON(!p))
> + return;
> +
> + if (WARN_ON(!task_current(rq, p)))
> + return;
> +
> + if (is_migration_disabled(p))
> + return;
> +
> + /* If there are several task, wait for being put back */
> + if (rq->nr_running > 1)
> + return;
> +
> + if (!sched_energy_push_task(p, rq))
> + return;
> +
> + new_cpu = find_energy_efficient_cpu(p, cpu);
> +
> + if (new_cpu == cpu)
> + return;
> +
> + /*
> + * ->active_balance synchronizes accesses to
> + * ->active_balance_work. Once set, it's cleared
> + * only after active load balance is finished.
> + */
> + if (!rq->active_balance) {
> + rq->active_balance = 1;
> + rq->push_cpu = new_cpu;
> + } else
> + return;
> +
Does this need preempt disable/enable guards similar to sched_balance_rq?
> + raw_spin_rq_unlock(rq);
> + stop_one_cpu_nowait(cpu,
> + active_load_balance_cpu_stop, rq,
> + &rq->active_balance_work);
> + raw_spin_rq_lock(rq);
> +}
> +
> +static inline int has_pushable_tasks(struct rq *rq)
> +{
> + return !plist_head_empty(&rq->cfs.pushable_tasks);
> +}
> +
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + p = plist_first_entry(&rq->cfs.pushable_tasks,
> + struct task_struct, pushable_tasks);
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> +
Isn't it better to print it every time? It could be a different process
each time, no?
> + /*
> + * Remove task from the pushable list as we try only once after that
> + * the task has been put back in enqueued list.
> + */
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> + return p;
> +}
> +
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
> + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> +
> + if (new_cpu == prev_cpu)
> + goto out;
> +
> + new_rq = cpu_rq(new_cpu);
> +
> + if (double_lock_balance(rq, new_rq)) {
> + /* The task has already migrated in between */
> + if (task_cpu(next_task) != rq->cpu) {
> + double_unlock_balance(rq, new_rq);
> + goto out;
> + }
> +
> + deactivate_task(rq, next_task, 0);
> + set_task_cpu(next_task, new_cpu);
> + activate_task(new_rq, next_task, 0);
> +
> + resched_curr(new_rq);
> +
> + double_unlock_balance(rq, new_rq);
> + }
> +
> +out:
> + put_task_struct(next_task);
> +
> + return true;
> +}
> +
> +static void push_fair_tasks(struct rq *rq)
> +{
> + /* push_fair_task() will return true if it moved a fair task */
> + while (push_fair_task(rq))
> + ;
> +}
> +
> +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> +
> +static inline void fair_queue_pushable_tasks(struct rq *rq)
> +{
> + if (!sched_energy_enabled() || !has_pushable_tasks(rq))
> + return;
has_pushable_tasks() can only have tasks if sched_energy_enabled(), so this
check may not be needed. But it shouldn't hurt, since it is a static key.
> +
> + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> +}
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_energy_enabled())
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +}
> +
> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_energy_enabled() && task_on_rq_queued(p) && !p->se.sched_delayed) {
> + if (sched_energy_push_task(p, rq)) {
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + plist_node_init(&p->pushable_tasks, p->prio);
> + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + }
> + }
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -8758,6 +8952,10 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> return sched_balance_newidle(rq, rf) != 0;
> }
> #else
> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
> +static inline void fair_queue_pushable_tasks(struct rq *rq) {}
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
> +static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
> static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
> #endif /* CONFIG_SMP */
>
> @@ -8947,6 +9145,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_entity(cfs_rq, pse);
> set_next_entity(cfs_rq, se);
>
> + /*
> + * The previous task might be eligible for being pushed on
> + * another cpu if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> __set_next_task_fair(rq, p, true);
> }
>
> @@ -9019,6 +9223,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> cfs_rq = cfs_rq_of(se);
> put_prev_entity(cfs_rq, se);
> }
> +
> + /*
> + * The previous task might be eligible for being pushed on another cpu
> + * if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> }
>
> /*
> @@ -13151,6 +13362,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> + check_pushable_task(curr, rq);
> update_misfit_status(curr, rq);
> check_update_overutilized_status(task_rq(curr));
>
> @@ -13303,6 +13515,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> {
> struct sched_entity *se = &p->se;
>
> + fair_remove_pushable_task(rq, p);
> +
> #ifdef CONFIG_SMP
> if (task_on_rq_queued(p)) {
> /*
> @@ -13320,6 +13534,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> if (hrtick_enabled_fair(rq))
> hrtick_start_fair(rq, p);
>
> + /*
> + * Try to push prev task before checking misfit for next task as
> + * the migration of prev can make next fitting the CPU
> + */
> + fair_queue_pushable_tasks(rq);
> update_misfit_status(p, rq);
> sched_fair_update_stop_tick(rq, p);
> }
> @@ -13350,6 +13569,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> cfs_rq->min_vruntime = (u64)(-(1LL << 20));
> #ifdef CONFIG_SMP
> + plist_head_init(&cfs_rq->pushable_tasks);
> raw_spin_lock_init(&cfs_rq->removed.lock);
> #endif
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ab16d3d0e51c..2db198dccf21 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -722,6 +722,8 @@ struct cfs_rq {
> struct list_head leaf_cfs_rq_list;
> struct task_group *tg; /* group that "owns" this runqueue */
>
> + struct plist_head pushable_tasks;
> +
> /* Locally cached copy of our task_group's idle value */
> int idle;
>
^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-10 18:20 ` Shrikanth Hegde
@ 2025-03-11 16:27 ` Vincent Guittot
0 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2025-03-11 16:27 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Hi Shrikanth,
On Mon, 10 Mar 2025 at 19:21, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
>
> Hi Vincent, trying to understand this series. I see most of the places
> are with sched_energy_enabled() checks. So it shouldn't affect non-EAS
> systems.
>
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore or at a far
> > too low pace. For such situation, we can take advantage of the task being
> > put back in the enqueued list to check if it should be pushed on another
> > CPU. When the task is alone on the CPU, it's never put back in the enqueued
> > list; In this special case, we use the tick to run the check.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/sched.h | 2 +
> > 2 files changed, 222 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a9b97bbc085f..c3e383b86808 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > }
> >
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> > static void set_next_buddy(struct sched_entity *se);
> >
> > /*
> > @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > h_nr_idle = task_has_idle_policy(p);
> > if (task_sleep || task_delayed || !se->sched_delayed)
> > h_nr_runnable = 1;
> > +
> > + fair_remove_pushable_task(rq, p);
> > } else {
> > cfs_rq = group_cfs_rq(se);
> > slice = cfs_rq_min_slice(cfs_rq);
> > @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > return target;
> > }
> >
> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned long max_capa, util;
> > +
> > + max_capa = min(get_actual_cpu_capacity(cpu),
> > + uclamp_eff_value(p, UCLAMP_MAX));
> > + util = max(task_util_est(p), task_runnable(p));
> > +
> > + /*
> > + * Return true only if the task might not sleep/wakeup because of a low
> > + * compute capacity. Tasks, which wake up regularly, will be handled by
> > + * feec().
> > + */
> > + return (util > max_capa);
> > +}
> > +
> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (p->nr_cpus_allowed == 1)
> > + return false;
> > +
> > + if (is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static int active_load_balance_cpu_stop(void *data);
> > +
> > +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> > +{
> > + int new_cpu, cpu = cpu_of(rq);
> > +
> > + if (!sched_energy_enabled())
> > + return;
> > +
> > + if (WARN_ON(!p))
> > + return;
> > +
> > + if (WARN_ON(!task_current(rq, p)))
> > + return;
> > +
> > + if (is_migration_disabled(p))
> > + return;
> > +
> > + /* If there are several task, wait for being put back */
> > + if (rq->nr_running > 1)
> > + return;
> > +
> > + if (!sched_energy_push_task(p, rq))
> > + return;
> > +
> > + new_cpu = find_energy_efficient_cpu(p, cpu);
> > +
> > + if (new_cpu == cpu)
> > + return;
> > +
> > + /*
> > + * ->active_balance synchronizes accesses to
> > + * ->active_balance_work. Once set, it's cleared
> > + * only after active load balance is finished.
> > + */
> > + if (!rq->active_balance) {
> > + rq->active_balance = 1;
> > + rq->push_cpu = new_cpu;
> > + } else
> > + return;
> > +
>
> Does this need preempt disable/enable guards similar to sched_balance_rq?
Pierre asked me about this in the RFC version [1]. Preempt
disable/enable was added by commit f0498d2a54e7 ("sched: Fix
stop_one_cpu_nowait() vs hotplug"), and AFAIK we are safe with regard to
the use case mentioned in that commit.
[1] https://lore.kernel.org/lkml/ccf4095f-5fca-42f4-b9fe-aa93e703016e@arm.com/
>
> > + raw_spin_rq_unlock(rq);
> > + stop_one_cpu_nowait(cpu,
> > + active_load_balance_cpu_stop, rq,
> > + &rq->active_balance_work);
> > + raw_spin_rq_lock(rq);
> > +}
> > +
> > +static inline int has_pushable_tasks(struct rq *rq)
> > +{
> > + return !plist_head_empty(&rq->cfs.pushable_tasks);
> > +}
> > +
> > +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *p;
> > +
> > + if (!has_pushable_tasks(rq))
> > + return NULL;
> > +
> > + p = plist_first_entry(&rq->cfs.pushable_tasks,
> > + struct task_struct, pushable_tasks);
> > +
> > + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> > + WARN_ON_ONCE(task_current(rq, p));
> > + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> > + WARN_ON_ONCE(!task_on_rq_queued(p));
> > +
>
> Isn't it better to print it every time? It could be a different process
> each time, no?
This should never happen, so warning once seems enough, and it prevents
flooding the log.
>
> > + /*
> > + * Remove task from the pushable list as we try only once after that
> > + * the task has been put back in enqueued list.
> > + */
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +
> > + return p;
> > +}
> > +
> > +/*
> > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > + * that fits better with their profile.
> > + */
> > +static bool push_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *next_task;
> > + int prev_cpu, new_cpu;
> > + struct rq *new_rq;
> > +
> > + next_task = pick_next_pushable_fair_task(rq);
> > + if (!next_task)
> > + return false;
> > +
> > + if (is_migration_disabled(next_task))
> > + return true;
> > +
> > + /* We might release rq lock */
> > + get_task_struct(next_task);
> > +
> > + prev_cpu = rq->cpu;
> > +
> > + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> > +
> > + if (new_cpu == prev_cpu)
> > + goto out;
> > +
> > + new_rq = cpu_rq(new_cpu);
> > +
> > + if (double_lock_balance(rq, new_rq)) {
> > + /* The task has already migrated in between */
> > + if (task_cpu(next_task) != rq->cpu) {
> > + double_unlock_balance(rq, new_rq);
> > + goto out;
> > + }
> > +
> > + deactivate_task(rq, next_task, 0);
> > + set_task_cpu(next_task, new_cpu);
> > + activate_task(new_rq, next_task, 0);
> > +
> > + resched_curr(new_rq);
> > +
> > + double_unlock_balance(rq, new_rq);
> > + }
> > +
> > +out:
> > + put_task_struct(next_task);
> > +
> > + return true;
> > +}
> > +
> > +static void push_fair_tasks(struct rq *rq)
> > +{
> > + /* push_fair_task() will return true if it moved a fair task */
> > + while (push_fair_task(rq))
> > + ;
> > +}
> > +
> > +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> > +
> > +static inline void fair_queue_pushable_tasks(struct rq *rq)
> > +{
> > + if (!sched_energy_enabled() || !has_pushable_tasks(rq))
> > + return;
>
> has_pushable_tasks() can only return true if sched_energy_enabled(), so this
> check may not be needed. But it shouldn't hurt, since it is a static key.
I didn't want to add a useless call to has_pushable_tasks(), even if it
should be cheap.
>
> > +
> > + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> > +}
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_energy_enabled())
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +}
> > +
> > +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_energy_enabled() && task_on_rq_queued(p) && !p->se.sched_delayed) {
> > + if (sched_energy_push_task(p, rq)) {
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + plist_node_init(&p->pushable_tasks, p->prio);
> > + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + }
> > + }
> > +}
> > +
> > /*
> > * select_task_rq_fair: Select target runqueue for the waking task in domains
> > * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -8758,6 +8952,10 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > return sched_balance_newidle(rq, rf) != 0;
> > }
> > #else
> > +static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
> > +static inline void fair_queue_pushable_tasks(struct rq *rq) {}
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
> > +static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
> > static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
> > #endif /* CONFIG_SMP */
> >
> > @@ -8947,6 +9145,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > put_prev_entity(cfs_rq, pse);
> > set_next_entity(cfs_rq, se);
> >
> > + /*
> > + * The previous task might be eligible for being pushed on
> > + * another cpu if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > __set_next_task_fair(rq, p, true);
> > }
> >
> > @@ -9019,6 +9223,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> > cfs_rq = cfs_rq_of(se);
> > put_prev_entity(cfs_rq, se);
> > }
> > +
> > + /*
> > + * The previous task might be eligible for being pushed on another cpu
> > + * if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > }
> >
> > /*
> > @@ -13151,6 +13362,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > if (static_branch_unlikely(&sched_numa_balancing))
> > task_tick_numa(rq, curr);
> >
> > + check_pushable_task(curr, rq);
> > update_misfit_status(curr, rq);
> > check_update_overutilized_status(task_rq(curr));
> >
> > @@ -13303,6 +13515,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > {
> > struct sched_entity *se = &p->se;
> >
> > + fair_remove_pushable_task(rq, p);
> > +
> > #ifdef CONFIG_SMP
> > if (task_on_rq_queued(p)) {
> > /*
> > @@ -13320,6 +13534,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > if (hrtick_enabled_fair(rq))
> > hrtick_start_fair(rq, p);
> >
> > + /*
> > + * Try to push prev task before checking misfit for next task as
> > + * the migration of prev can make next fitting the CPU
> > + */
> > + fair_queue_pushable_tasks(rq);
> > update_misfit_status(p, rq);
> > sched_fair_update_stop_tick(rq, p);
> > }
> > @@ -13350,6 +13569,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> > cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> > cfs_rq->min_vruntime = (u64)(-(1LL << 20));
> > #ifdef CONFIG_SMP
> > + plist_head_init(&cfs_rq->pushable_tasks);
> > raw_spin_lock_init(&cfs_rq->removed.lock);
> > #endif
> > }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ab16d3d0e51c..2db198dccf21 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -722,6 +722,8 @@ struct cfs_rq {
> > struct list_head leaf_cfs_rq_list;
> > struct task_group *tg; /* group that "owns" this runqueue */
> >
> > + struct plist_head pushable_tasks;
> > +
> > /* Locally cached copy of our task_group's idle value */
> > int idle;
> >
>
* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
` (2 preceding siblings ...)
2025-03-10 18:20 ` Shrikanth Hegde
@ 2025-03-19 15:26 ` Valentin Schneider
2025-03-24 16:34 ` Christian Loehle
` (2 subsequent siblings)
6 siblings, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2025-03-19 15:26 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, lukasz.luba, rafael.j.wysocki,
pierre.gondois, linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
On 02/03/25 22:05, Vincent Guittot wrote:
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + p = plist_first_entry(&rq->cfs.pushable_tasks,
> + struct task_struct, pushable_tasks);
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> +
> + /*
> + * Remove task from the pushable list as we try only once after that
> + * the task has been put back in enqueued list.
> + */
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> + return p;
I've only had a look at this patch per the OSPM newidle balance discussion;
coupled with something like RT/DL's overload cpumask, this could be a
viable newidle_balance() replacement.
Unfortunately this means we now have a third copy of the push mechanism
along with RT and DL, so a third place to manually patch whenever a bug is
fixed in one of them [1].
We could perhaps have a skeleton of the pushable list handling in
{enqueue,dequeue}_task() and put_prev_task(), with class-specific conditions
and backing storage (plist vs rbtree) handled via class callbacks.
Or even make the whole pushable enqueue/dequeue its own class callback,
which would simplify [2].
[1]: http://lore.kernel.org/r/20250304103001.0f89e953@gandalf.local.home
[2]: https://lore.kernel.org/lkml/20250312221147.1865364-7-jstultz@google.com/
* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
` (3 preceding siblings ...)
2025-03-19 15:26 ` Valentin Schneider
@ 2025-03-24 16:34 ` Christian Loehle
2025-03-25 11:16 ` Christian Loehle
2025-04-15 2:31 ` Xuewen Yan
6 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2025-03-24 16:34 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: qyousef, hongyan.xia2, luis.machado, qperret
On 3/2/25 21:05, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore or at a far
> too low pace. For such situation, we can take advantage of the task being
> put back in the enqueued list to check if it should be pushed on another
> CPU. When the task is alone on the CPU, it's never put back in the enqueued
> list; In this special case, we use the tick to run the check.
>
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
So during OSPM25 there was a discussion around this saving energy (even for
!UCLAMP_MAX tasks), because currently, when a task's compute demand
increases all of a sudden, we let it trigger the inefficient OPPs on little
and mid CPUs until it is misfit or the system is overutilized and CAS kicks
in.
In particular I've presented a workload (internal VideoScroller, which loads
a new video every 3s) which performs worse on power with EAS than with CAS.
Ignoring overutilized while attempting feec() does help a bit (-5% energy
with CAS, -2% energy with feec() during OU). This push mechanism was also
mentioned as a way to mitigate such situations.
In theory I agree, but I'm afraid it doesn't help in my testing.
Throughout various workloads where the described issue would appear, the
push mechanism is only triggered around once every 2 minutes (i.e.
absolutely negligible).
In particular within 1 hour of testing I've only seen 5 pushed tasks that
fit the described scenarios ("Going from inefficient OPPs on little/mid to
more efficient OPPs on the more capable CPUs"). The described scenario is
very common (triggering at least every few seconds during many workloads).
The vast majority of pushed tasks were pushed within a cluster.
This was on Pixel 6.
> ---
> kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 2 +
> 2 files changed, 222 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a9b97bbc085f..c3e383b86808 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> static void set_next_buddy(struct sched_entity *se);
>
> /*
> @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> h_nr_idle = task_has_idle_policy(p);
> if (task_sleep || task_delayed || !se->sched_delayed)
> h_nr_runnable = 1;
> +
> + fair_remove_pushable_task(rq, p);
> } else {
> cfs_rq = group_cfs_rq(se);
> slice = cfs_rq_min_slice(cfs_rq);
> @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> return target;
> }
>
> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned long max_capa, util;
> +
> + max_capa = min(get_actual_cpu_capacity(cpu),
> + uclamp_eff_value(p, UCLAMP_MAX));
> + util = max(task_util_est(p), task_runnable(p));
> +
> + /*
> + * Return true only if the task might not sleep/wakeup because of a low
> + * compute capacity. Tasks, which wake up regularly, will be handled by
> + * feec().
> + */
> + return (util > max_capa);
> +}
> +
> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (p->nr_cpus_allowed == 1)
> + return false;
> +
> + if (is_rd_overutilized(rq->rd))
> + return false;
> +
> + if (task_stuck_on_cpu(p, cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +static int active_load_balance_cpu_stop(void *data);
> +
> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> +{
> + int new_cpu, cpu = cpu_of(rq);
> +
> + if (!sched_energy_enabled())
> + return;
> +
> + if (WARN_ON(!p))
> + return;
> +
> + if (WARN_ON(!task_current(rq, p)))
> + return;
> +
> + if (is_migration_disabled(p))
> + return;
> +
> + /* If there are several task, wait for being put back */
> + if (rq->nr_running > 1)
> + return;
> +
> + if (!sched_energy_push_task(p, rq))
> + return;
> +
> + new_cpu = find_energy_efficient_cpu(p, cpu);
> +
> + if (new_cpu == cpu)
> + return;
> +
> + /*
> + * ->active_balance synchronizes accesses to
> + * ->active_balance_work. Once set, it's cleared
> + * only after active load balance is finished.
> + */
> + if (!rq->active_balance) {
> + rq->active_balance = 1;
> + rq->push_cpu = new_cpu;
> + } else
> + return;
> +
> + raw_spin_rq_unlock(rq);
> + stop_one_cpu_nowait(cpu,
> + active_load_balance_cpu_stop, rq,
> + &rq->active_balance_work);
> + raw_spin_rq_lock(rq);
> +}
> +
> +static inline int has_pushable_tasks(struct rq *rq)
> +{
> + return !plist_head_empty(&rq->cfs.pushable_tasks);
> +}
> +
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + p = plist_first_entry(&rq->cfs.pushable_tasks,
> + struct task_struct, pushable_tasks);
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> +
> + /*
> + * Remove task from the pushable list as we try only once after that
> + * the task has been put back in enqueued list.
> + */
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> + return p;
> +}
> +
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
> + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> +
> + if (new_cpu == prev_cpu)
> + goto out;
> +
> + new_rq = cpu_rq(new_cpu);
> +
> + if (double_lock_balance(rq, new_rq)) {
> + /* The task has already migrated in between */
> + if (task_cpu(next_task) != rq->cpu) {
> + double_unlock_balance(rq, new_rq);
> + goto out;
> + }
> +
> + deactivate_task(rq, next_task, 0);
> + set_task_cpu(next_task, new_cpu);
> + activate_task(new_rq, next_task, 0);
> +
> + resched_curr(new_rq);
> +
> + double_unlock_balance(rq, new_rq);
> + }
> +
> +out:
> + put_task_struct(next_task);
> +
> + return true;
> +}
> +
> +static void push_fair_tasks(struct rq *rq)
> +{
> + /* push_fair_task() will return true if it moved a fair task */
> + while (push_fair_task(rq))
> + ;
> +}
> +
> +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> +
> +static inline void fair_queue_pushable_tasks(struct rq *rq)
> +{
> + if (!sched_energy_enabled() || !has_pushable_tasks(rq))
> + return;
> +
> + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> +}
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_energy_enabled())
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +}
> +
> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_energy_enabled() && task_on_rq_queued(p) && !p->se.sched_delayed) {
> + if (sched_energy_push_task(p, rq)) {
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + plist_node_init(&p->pushable_tasks, p->prio);
> + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + }
> + }
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -8758,6 +8952,10 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> return sched_balance_newidle(rq, rf) != 0;
> }
> #else
> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
> +static inline void fair_queue_pushable_tasks(struct rq *rq) {}
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
> +static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
> static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
> #endif /* CONFIG_SMP */
>
> @@ -8947,6 +9145,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_entity(cfs_rq, pse);
> set_next_entity(cfs_rq, se);
>
> + /*
> + * The previous task might be eligible for being pushed on
> + * another cpu if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> __set_next_task_fair(rq, p, true);
> }
>
> @@ -9019,6 +9223,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> cfs_rq = cfs_rq_of(se);
> put_prev_entity(cfs_rq, se);
> }
> +
> + /*
> + * The previous task might be eligible for being pushed on another cpu
> + * if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> }
>
> /*
> @@ -13151,6 +13362,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> + check_pushable_task(curr, rq);
> update_misfit_status(curr, rq);
> check_update_overutilized_status(task_rq(curr));
>
> @@ -13303,6 +13515,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> {
> struct sched_entity *se = &p->se;
>
> + fair_remove_pushable_task(rq, p);
> +
> #ifdef CONFIG_SMP
> if (task_on_rq_queued(p)) {
> /*
> @@ -13320,6 +13534,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> if (hrtick_enabled_fair(rq))
> hrtick_start_fair(rq, p);
>
> + /*
> + * Try to push prev task before checking misfit for next task as
> + * the migration of prev can make next fitting the CPU
> + */
> + fair_queue_pushable_tasks(rq);
> update_misfit_status(p, rq);
> sched_fair_update_stop_tick(rq, p);
> }
> @@ -13350,6 +13569,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> cfs_rq->min_vruntime = (u64)(-(1LL << 20));
> #ifdef CONFIG_SMP
> + plist_head_init(&cfs_rq->pushable_tasks);
> raw_spin_lock_init(&cfs_rq->removed.lock);
> #endif
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ab16d3d0e51c..2db198dccf21 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -722,6 +722,8 @@ struct cfs_rq {
> struct list_head leaf_cfs_rq_list;
> struct task_group *tg; /* group that "owns" this runqueue */
>
> + struct plist_head pushable_tasks;
> +
> /* Locally cached copy of our task_group's idle value */
> int idle;
>
* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
` (4 preceding siblings ...)
2025-03-24 16:34 ` Christian Loehle
@ 2025-03-25 11:16 ` Christian Loehle
2025-04-15 13:52 ` Vincent Guittot
2025-04-15 2:31 ` Xuewen Yan
6 siblings, 1 reply; 31+ messages in thread
From: Christian Loehle @ 2025-03-25 11:16 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: qyousef, hongyan.xia2, luis.machado, qperret
On 3/2/25 21:05, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore or at a far
> too low pace. For such situation, we can take advantage of the task being
> put back in the enqueued list to check if it should be pushed on another
> CPU. When the task is alone on the CPU, it's never put back in the enqueued
> list; In this special case, we use the tick to run the check.
>
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 2 +
> 2 files changed, 222 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a9b97bbc085f..c3e383b86808 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> static void set_next_buddy(struct sched_entity *se);
>
> /*
> @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> h_nr_idle = task_has_idle_policy(p);
> if (task_sleep || task_delayed || !se->sched_delayed)
> h_nr_runnable = 1;
> +
> + fair_remove_pushable_task(rq, p);
> } else {
> cfs_rq = group_cfs_rq(se);
> slice = cfs_rq_min_slice(cfs_rq);
> @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> return target;
> }
>
> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned long max_capa, util;
> +
> + max_capa = min(get_actual_cpu_capacity(cpu),
> + uclamp_eff_value(p, UCLAMP_MAX));
> + util = max(task_util_est(p), task_runnable(p));
> +
> + /*
> + * Return true only if the task might not sleep/wakeup because of a low
> + * compute capacity. Tasks, which wake up regularly, will be handled by
> + * feec().
> + */
> + return (util > max_capa);
> +}
> +
> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (p->nr_cpus_allowed == 1)
> + return false;
> +
> + if (is_rd_overutilized(rq->rd))
> + return false;
> +
> + if (task_stuck_on_cpu(p, cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +static int active_load_balance_cpu_stop(void *data);
> +
> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> +{
> + int new_cpu, cpu = cpu_of(rq);
> +
> + if (!sched_energy_enabled())
> + return;
> +
> + if (WARN_ON(!p))
> + return;
> +
> + if (WARN_ON(!task_current(rq, p)))
> + return;
> +
> + if (is_migration_disabled(p))
> + return;
> +
> + /* If there are several task, wait for being put back */
> + if (rq->nr_running > 1)
> + return;
> +
> + if (!sched_energy_push_task(p, rq))
> + return;
> +
> + new_cpu = find_energy_efficient_cpu(p, cpu);
> +
> + if (new_cpu == cpu)
> + return;
> +
> + /*
> + * ->active_balance synchronizes accesses to
> + * ->active_balance_work. Once set, it's cleared
> + * only after active load balance is finished.
> + */
> + if (!rq->active_balance) {
> + rq->active_balance = 1;
> + rq->push_cpu = new_cpu;
> + } else
> + return;
> +
> + raw_spin_rq_unlock(rq);
> + stop_one_cpu_nowait(cpu,
> + active_load_balance_cpu_stop, rq,
> + &rq->active_balance_work);
> + raw_spin_rq_lock(rq);
> +}
> +
> +static inline int has_pushable_tasks(struct rq *rq)
> +{
> + return !plist_head_empty(&rq->cfs.pushable_tasks);
> +}
> +
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + p = plist_first_entry(&rq->cfs.pushable_tasks,
> + struct task_struct, pushable_tasks);
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> +
> + /*
> + * Remove task from the pushable list as we try only once after that
> + * the task has been put back in enqueued list.
> + */
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> + return p;
> +}
> +
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
> + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
We aren't gating this on an overutilized check at either call site of this patch,
unlike the other feec() call, and testing shows that this calls feec() when OU
relatively often.
Why would it be OK to call feec() here when it isn't on task placement?
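As a side note for readers following along, the task_stuck_on_cpu() test quoted earlier in this patch boils down to a single comparison between the task's demand and the clamped CPU capacity. A rough userspace model of just that predicate (the 0..1024 capacity units match the kernel's convention, but the helper names and signatures here are invented for illustration, not the kernel's API):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative capacity units: 1024 == full capacity of the biggest CPU. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Model of the task_stuck_on_cpu() predicate: a task is considered
 * stuck when its demand (the max of its estimated utilization and its
 * runnable contribution) exceeds what the CPU can actually deliver,
 * i.e. the CPU's current capacity further clamped by the task's
 * uclamp-max value.
 */
static bool task_stuck(unsigned long util_est, unsigned long runnable,
		       unsigned long cpu_capacity, unsigned long uclamp_max)
{
	unsigned long max_capa = min_ul(cpu_capacity, uclamp_max);
	unsigned long util = max_ul(util_est, runnable);

	return util > max_capa;
}
```

In these units, a task with utilization 300 on a little CPU of capacity 256 never gets to sleep and is flagged, while the same task on a big CPU is not; a uclamp-max of 50 flags the task regardless of CPU size.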
> +
> + if (new_cpu == prev_cpu)
> + goto out;
> +
> + new_rq = cpu_rq(new_cpu);
> +
> + if (double_lock_balance(rq, new_rq)) {
> + /* The task has already migrated in between */
> + if (task_cpu(next_task) != rq->cpu) {
> + double_unlock_balance(rq, new_rq);
> + goto out;
> + }
> +
> + deactivate_task(rq, next_task, 0);
> + set_task_cpu(next_task, new_cpu);
> + activate_task(new_rq, next_task, 0);
> +
> + resched_curr(new_rq);
> +
> + double_unlock_balance(rq, new_rq);
> + }
> +
> +out:
> + put_task_struct(next_task);
> +
> + return true;
> +}
> +
> +static void push_fair_tasks(struct rq *rq)
> +{
> + /* push_fair_task() will return true if it moved a fair task */
A bit of a nit: this isn't technically true, push_fair_task() will also
return true when the task it found wasn't movable.
[snip]
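To make the nit concrete: the drain loop's termination depends only on the pushable list emptying, not on a migration succeeding. A toy userspace model of that contract (all struct and function names here are invented stand-ins, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an entry on rq->cfs.pushable_tasks. */
struct toy_task {
	bool migration_disabled;
	int moved;                 /* incremented when "pushed" */
};

/* Toy stand-in for the rq with its pushable list. */
struct toy_rq {
	struct toy_task *tasks[8]; /* NULL-terminated set of pushable tasks */
	size_t head;
};

/*
 * Mirrors the contract under discussion: return false only when no
 * pushable task is left; return true after picking a task, whether or
 * not it could actually be moved (e.g. migration disabled).
 */
static bool toy_push_fair_task(struct toy_rq *rq)
{
	struct toy_task *p = rq->tasks[rq->head];

	if (!p)
		return false;
	rq->head++;                /* removed from the list either way */

	if (p->migration_disabled)
		return true;       /* picked but not movable */

	p->moved++;
	return true;
}

static void toy_push_fair_tasks(struct toy_rq *rq)
{
	/* Loops until the pushable list is drained, not until a move fails. */
	while (toy_push_fair_task(rq))
		;
}
```

A migration-disabled task therefore still makes the loop advance, which is why the comment "returns true if it moved a fair task" undersells the actual contract.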
* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-25 11:16 ` Christian Loehle
@ 2025-04-15 13:52 ` Vincent Guittot
2025-04-16 13:52 ` Christian Loehle
0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2025-04-15 13:52 UTC (permalink / raw)
To: Christian Loehle
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, luis.machado, qperret
On Tue, 25 Mar 2025 at 12:16, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 3/2/25 21:05, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore or at a far
> > too low pace. For such situation, we can take advantage of the task being
> > put back in the enqueued list to check if it should be pushed on another
> > CPU. When the task is alone on the CPU, it's never put back in the enqueued
> > list; In this special case, we use the tick to run the check.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/sched.h | 2 +
> > 2 files changed, 222 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a9b97bbc085f..c3e383b86808 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > }
> >
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> > static void set_next_buddy(struct sched_entity *se);
> >
> > /*
> > @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > h_nr_idle = task_has_idle_policy(p);
> > if (task_sleep || task_delayed || !se->sched_delayed)
> > h_nr_runnable = 1;
> > +
> > + fair_remove_pushable_task(rq, p);
> > } else {
> > cfs_rq = group_cfs_rq(se);
> > slice = cfs_rq_min_slice(cfs_rq);
> > @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > return target;
> > }
> >
> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned long max_capa, util;
> > +
> > + max_capa = min(get_actual_cpu_capacity(cpu),
> > + uclamp_eff_value(p, UCLAMP_MAX));
> > + util = max(task_util_est(p), task_runnable(p));
> > +
> > + /*
> > + * Return true only if the task might not sleep/wakeup because of a low
> > + * compute capacity. Tasks, which wake up regularly, will be handled by
> > + * feec().
> > + */
> > + return (util > max_capa);
> > +}
> > +
> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (p->nr_cpus_allowed == 1)
> > + return false;
> > +
> > + if (is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static int active_load_balance_cpu_stop(void *data);
> > +
> > +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> > +{
> > + int new_cpu, cpu = cpu_of(rq);
> > +
> > + if (!sched_energy_enabled())
> > + return;
> > +
> > + if (WARN_ON(!p))
> > + return;
> > +
> > + if (WARN_ON(!task_current(rq, p)))
> > + return;
> > +
> > + if (is_migration_disabled(p))
> > + return;
> > +
> > + /* If there are several task, wait for being put back */
> > + if (rq->nr_running > 1)
> > + return;
> > +
> > + if (!sched_energy_push_task(p, rq))
> > + return;
> > +
> > + new_cpu = find_energy_efficient_cpu(p, cpu);
> > +
> > + if (new_cpu == cpu)
> > + return;
> > +
> > + /*
> > + * ->active_balance synchronizes accesses to
> > + * ->active_balance_work. Once set, it's cleared
> > + * only after active load balance is finished.
> > + */
> > + if (!rq->active_balance) {
> > + rq->active_balance = 1;
> > + rq->push_cpu = new_cpu;
> > + } else
> > + return;
> > +
> > + raw_spin_rq_unlock(rq);
> > + stop_one_cpu_nowait(cpu,
> > + active_load_balance_cpu_stop, rq,
> > + &rq->active_balance_work);
> > + raw_spin_rq_lock(rq);
> > +}
> > +
> > +static inline int has_pushable_tasks(struct rq *rq)
> > +{
> > + return !plist_head_empty(&rq->cfs.pushable_tasks);
> > +}
> > +
> > +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *p;
> > +
> > + if (!has_pushable_tasks(rq))
> > + return NULL;
> > +
> > + p = plist_first_entry(&rq->cfs.pushable_tasks,
> > + struct task_struct, pushable_tasks);
> > +
> > + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> > + WARN_ON_ONCE(task_current(rq, p));
> > + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> > + WARN_ON_ONCE(!task_on_rq_queued(p));
> > +
> > + /*
> > + * Remove task from the pushable list as we try only once after that
> > + * the task has been put back in enqueued list.
> > + */
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +
> > + return p;
> > +}
> > +
> > +/*
> > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > + * that fits better with their profile.
> > + */
> > +static bool push_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *next_task;
> > + int prev_cpu, new_cpu;
> > + struct rq *new_rq;
> > +
> > + next_task = pick_next_pushable_fair_task(rq);
> > + if (!next_task)
> > + return false;
> > +
> > + if (is_migration_disabled(next_task))
> > + return true;
> > +
> > + /* We might release rq lock */
> > + get_task_struct(next_task);
> > +
> > + prev_cpu = rq->cpu;
> > +
> > + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
>
> We aren't gating this on a overutilized check for both call sites of this patch
The overutilized check has been done when adding the task to the list.
> like the other feec() call and testing shows that this calls feec when OU
> relatively often.
> Why would it be OK to call feec() here when it isn't on task placement?
>
> > +
> > + if (new_cpu == prev_cpu)
> > + goto out;
> > +
> > + new_rq = cpu_rq(new_cpu);
> > +
> > + if (double_lock_balance(rq, new_rq)) {
> > + /* The task has already migrated in between */
> > + if (task_cpu(next_task) != rq->cpu) {
> > + double_unlock_balance(rq, new_rq);
> > + goto out;
> > + }
> > +
> > + deactivate_task(rq, next_task, 0);
> > + set_task_cpu(next_task, new_cpu);
> > + activate_task(new_rq, next_task, 0);
> > +
> > + resched_curr(new_rq);
> > +
> > + double_unlock_balance(rq, new_rq);
> > + }
> > +
> > +out:
> > + put_task_struct(next_task);
> > +
> > + return true;
> > +}
> > +
> > +static void push_fair_tasks(struct rq *rq)
> > +{
> > + /* push_fair_task() will return true if it moved a fair task */
>
> This isn't technically true, a bit of a nit, push_fair_task() also
> will return true when the task found wasn't moveable.
>
> [snip]
* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-04-15 13:52 ` Vincent Guittot
@ 2025-04-16 13:52 ` Christian Loehle
0 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2025-04-16 13:52 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, luis.machado, qperret
On 4/15/25 14:52, Vincent Guittot wrote:
> On Tue, 25 Mar 2025 at 12:16, Christian Loehle <christian.loehle@arm.com> wrote:
>>
>> On 3/2/25 21:05, Vincent Guittot wrote:
>>> EAS is based on wakeup events to efficiently place tasks on the system, but
>>> there are cases where a task doesn't have wakeup events anymore or at a far
>>> too low pace. For such situation, we can take advantage of the task being
>>> put back in the enqueued list to check if it should be pushed on another
>>> CPU. When the task is alone on the CPU, it's never put back in the enqueued
>>> list; In this special case, we use the tick to run the check.
>>>
>>> Wake up events remain the main way to migrate tasks but we now detect
>>> situation where a task is stuck on a CPU by checking that its utilization
>>> is larger than the max available compute capacity (max cpu capacity or
>>> uclamp max setting)
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>> kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/sched.h | 2 +
>>> 2 files changed, 222 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index a9b97bbc085f..c3e383b86808 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> hrtick_update(rq);
>>> }
>>>
>>> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
>>> static void set_next_buddy(struct sched_entity *se);
>>>
>>> /*
>>> @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>>> h_nr_idle = task_has_idle_policy(p);
>>> if (task_sleep || task_delayed || !se->sched_delayed)
>>> h_nr_runnable = 1;
>>> +
>>> + fair_remove_pushable_task(rq, p);
>>> } else {
>>> cfs_rq = group_cfs_rq(se);
>>> slice = cfs_rq_min_slice(cfs_rq);
>>> @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>> return target;
>>> }
>>>
>>> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
>>> +{
>>> + unsigned long max_capa, util;
>>> +
>>> + max_capa = min(get_actual_cpu_capacity(cpu),
>>> + uclamp_eff_value(p, UCLAMP_MAX));
>>> + util = max(task_util_est(p), task_runnable(p));
>>> +
>>> + /*
>>> + * Return true only if the task might not sleep/wakeup because of a low
>>> + * compute capacity. Tasks, which wake up regularly, will be handled by
>>> + * feec().
>>> + */
>>> + return (util > max_capa);
>>> +}
>>> +
>>> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
>>> +{
>>> + if (p->nr_cpus_allowed == 1)
>>> + return false;
>>> +
>>> + if (is_rd_overutilized(rq->rd))
>>> + return false;
>>> +
>>> + if (task_stuck_on_cpu(p, cpu_of(rq)))
>>> + return true;
>>> +
>>> + return false;
>>> +}
>>> +
>>> +static int active_load_balance_cpu_stop(void *data);
>>> +
>>> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
>>> +{
>>> + int new_cpu, cpu = cpu_of(rq);
>>> +
>>> + if (!sched_energy_enabled())
>>> + return;
>>> +
>>> + if (WARN_ON(!p))
>>> + return;
>>> +
>>> + if (WARN_ON(!task_current(rq, p)))
>>> + return;
>>> +
>>> + if (is_migration_disabled(p))
>>> + return;
>>> +
>>> + /* If there are several task, wait for being put back */
>>> + if (rq->nr_running > 1)
>>> + return;
>>> +
>>> + if (!sched_energy_push_task(p, rq))
>>> + return;
>>> +
>>> + new_cpu = find_energy_efficient_cpu(p, cpu);
>>> +
>>> + if (new_cpu == cpu)
>>> + return;
>>> +
>>> + /*
>>> + * ->active_balance synchronizes accesses to
>>> + * ->active_balance_work. Once set, it's cleared
>>> + * only after active load balance is finished.
>>> + */
>>> + if (!rq->active_balance) {
>>> + rq->active_balance = 1;
>>> + rq->push_cpu = new_cpu;
>>> + } else
>>> + return;
>>> +
>>> + raw_spin_rq_unlock(rq);
>>> + stop_one_cpu_nowait(cpu,
>>> + active_load_balance_cpu_stop, rq,
>>> + &rq->active_balance_work);
>>> + raw_spin_rq_lock(rq);
>>> +}
>>> +
>>> +static inline int has_pushable_tasks(struct rq *rq)
>>> +{
>>> + return !plist_head_empty(&rq->cfs.pushable_tasks);
>>> +}
>>> +
>>> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
>>> +{
>>> + struct task_struct *p;
>>> +
>>> + if (!has_pushable_tasks(rq))
>>> + return NULL;
>>> +
>>> + p = plist_first_entry(&rq->cfs.pushable_tasks,
>>> + struct task_struct, pushable_tasks);
>>> +
>>> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
>>> + WARN_ON_ONCE(task_current(rq, p));
>>> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
>>> + WARN_ON_ONCE(!task_on_rq_queued(p));
>>> +
>>> + /*
>>> + * Remove task from the pushable list as we try only once after that
>>> + * the task has been put back in enqueued list.
>>> + */
>>> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>>> +
>>> + return p;
>>> +}
>>> +
>>> +/*
>>> + * See if the non running fair tasks on this rq can be sent on other CPUs
>>> + * that fits better with their profile.
>>> + */
>>> +static bool push_fair_task(struct rq *rq)
>>> +{
>>> + struct task_struct *next_task;
>>> + int prev_cpu, new_cpu;
>>> + struct rq *new_rq;
>>> +
>>> + next_task = pick_next_pushable_fair_task(rq);
>>> + if (!next_task)
>>> + return false;
>>> +
>>> + if (is_migration_disabled(next_task))
>>> + return true;
>>> +
>>> + /* We might release rq lock */
>>> + get_task_struct(next_task);
>>> +
>>> + prev_cpu = rq->cpu;
>>> +
>>> + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
>>
>> We aren't gating this on a overutilized check for both call sites of this patch
>
> The overutilized check has been done when adding the task to the list.
>
Right, but that was earlier?
Shouldn't we just clear the list on OU since lb is now active again?
(I do understand that this impacts the effectiveness here, but it seems
the correct thing to do?)
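A tiny model of that suggestion, purely to pin down the semantics being proposed (every name below is invented; a real change would touch the plist and the overutilized update path):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the per-rq pushable bookkeeping. */
struct toy_pushable {
	size_t nr;              /* number of queued pushable tasks */
};

/* Don't queue new pushable tasks while the root domain is OU:
 * regular load balancing is active again and will handle placement. */
static void toy_queue_pushable(struct toy_pushable *pl, bool rd_overutilized)
{
	if (rd_overutilized)
		return;
	pl->nr++;
}

/* The proposed behaviour: flush stale entries when OU is raised, so
 * push never races with an active load balancer over the same tasks. */
static void toy_on_overutilized(struct toy_pushable *pl)
{
	pl->nr = 0;
}
```

Under this model, entries queued before OU was raised never survive into the overutilized regime, at the cost of occasionally dropping a push that would still have been useful.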
* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
` (5 preceding siblings ...)
2025-03-25 11:16 ` Christian Loehle
@ 2025-04-15 2:31 ` Xuewen Yan
2025-04-15 13:51 ` Vincent Guittot
6 siblings, 1 reply; 31+ messages in thread
From: Xuewen Yan @ 2025-04-15 2:31 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, christian.loehle,
luis.machado, qperret
Hi Vincent,
On Mon, Mar 3, 2025 at 5:06 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task doesn't have wakeup events anymore or at a far
> too low pace. For such situation, we can take advantage of the task being
> put back in the enqueued list to check if it should be pushed on another
> CPU. When the task is alone on the CPU, it's never put back in the enqueued
> list; In this special case, we use the tick to run the check.
>
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 2 +
> 2 files changed, 222 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a9b97bbc085f..c3e383b86808 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> static void set_next_buddy(struct sched_entity *se);
>
> /*
> @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> h_nr_idle = task_has_idle_policy(p);
> if (task_sleep || task_delayed || !se->sched_delayed)
> h_nr_runnable = 1;
> +
> + fair_remove_pushable_task(rq, p);
> } else {
> cfs_rq = group_cfs_rq(se);
> slice = cfs_rq_min_slice(cfs_rq);
> @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> return target;
> }
>
> +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned long max_capa, util;
> +
> + max_capa = min(get_actual_cpu_capacity(cpu),
> + uclamp_eff_value(p, UCLAMP_MAX));
> + util = max(task_util_est(p), task_runnable(p));
> +
> + /*
> + * Return true only if the task might not sleep/wakeup because of a low
> + * compute capacity. Tasks, which wake up regularly, will be handled by
> + * feec().
> + */
I am carefully studying this series of patches and have some doubts
about this part.
Do we need to check the task state here, i.e.:
READ_ONCE(p->__state) != TASK_RUNNING;
since this check runs from the tick?
On the other hand, do we need to check sched_delayed?
put_prev_task_fair() also checks it.
Thanks!
> + return (util > max_capa);
> +}
> +
> +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> +{
> + if (p->nr_cpus_allowed == 1)
> + return false;
> +
> + if (is_rd_overutilized(rq->rd))
> + return false;
> +
> + if (task_stuck_on_cpu(p, cpu_of(rq)))
> + return true;
> +
> + return false;
> +}
> +
> +static int active_load_balance_cpu_stop(void *data);
> +
> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> +{
> + int new_cpu, cpu = cpu_of(rq);
> +
> + if (!sched_energy_enabled())
> + return;
> +
> + if (WARN_ON(!p))
> + return;
> +
> + if (WARN_ON(!task_current(rq, p)))
> + return;
> +
> + if (is_migration_disabled(p))
> + return;
> +
> + /* If there are several task, wait for being put back */
> + if (rq->nr_running > 1)
> + return;
> +
> + if (!sched_energy_push_task(p, rq))
> + return;
> +
> + new_cpu = find_energy_efficient_cpu(p, cpu);
> +
> + if (new_cpu == cpu)
> + return;
> +
> + /*
> + * ->active_balance synchronizes accesses to
> + * ->active_balance_work. Once set, it's cleared
> + * only after active load balance is finished.
> + */
> + if (!rq->active_balance) {
> + rq->active_balance = 1;
> + rq->push_cpu = new_cpu;
> + } else
> + return;
> +
> + raw_spin_rq_unlock(rq);
> + stop_one_cpu_nowait(cpu,
> + active_load_balance_cpu_stop, rq,
> + &rq->active_balance_work);
> + raw_spin_rq_lock(rq);
> +}
> +
> +static inline int has_pushable_tasks(struct rq *rq)
> +{
> + return !plist_head_empty(&rq->cfs.pushable_tasks);
> +}
> +
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + p = plist_first_entry(&rq->cfs.pushable_tasks,
> + struct task_struct, pushable_tasks);
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> +
> + /*
> + * Remove task from the pushable list as we try only once after that
> + * the task has been put back in enqueued list.
> + */
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> + return p;
> +}
> +
> +/*
> + * See if the non running fair tasks on this rq can be sent on other CPUs
> + * that fits better with their profile.
> + */
> +static bool push_fair_task(struct rq *rq)
> +{
> + struct task_struct *next_task;
> + int prev_cpu, new_cpu;
> + struct rq *new_rq;
> +
> + next_task = pick_next_pushable_fair_task(rq);
> + if (!next_task)
> + return false;
> +
> + if (is_migration_disabled(next_task))
> + return true;
> +
> + /* We might release rq lock */
> + get_task_struct(next_task);
> +
> + prev_cpu = rq->cpu;
> +
> + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> +
> + if (new_cpu == prev_cpu)
> + goto out;
> +
> + new_rq = cpu_rq(new_cpu);
> +
> + if (double_lock_balance(rq, new_rq)) {
> + /* The task has already migrated in between */
> + if (task_cpu(next_task) != rq->cpu) {
> + double_unlock_balance(rq, new_rq);
> + goto out;
> + }
> +
> + deactivate_task(rq, next_task, 0);
> + set_task_cpu(next_task, new_cpu);
> + activate_task(new_rq, next_task, 0);
> +
> + resched_curr(new_rq);
> +
> + double_unlock_balance(rq, new_rq);
> + }
> +
> +out:
> + put_task_struct(next_task);
> +
> + return true;
> +}
> +
> +static void push_fair_tasks(struct rq *rq)
> +{
> + /* push_fair_task() will return true if it moved a fair task */
> + while (push_fair_task(rq))
> + ;
> +}
> +
> +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> +
> +static inline void fair_queue_pushable_tasks(struct rq *rq)
> +{
> + if (!sched_energy_enabled() || !has_pushable_tasks(rq))
> + return;
> +
> + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> +}
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_energy_enabled())
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +}
> +
> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> + if (sched_energy_enabled() && task_on_rq_queued(p) && !p->se.sched_delayed) {
> + if (sched_energy_push_task(p, rq)) {
> + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + plist_node_init(&p->pushable_tasks, p->prio);
> + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> + }
> + }
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -8758,6 +8952,10 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> return sched_balance_newidle(rq, rf) != 0;
> }
> #else
> +static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
> +static inline void fair_queue_pushable_tasks(struct rq *rq) {}
> +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
> +static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
> static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
> #endif /* CONFIG_SMP */
>
> @@ -8947,6 +9145,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_entity(cfs_rq, pse);
> set_next_entity(cfs_rq, se);
>
> + /*
> + * The previous task might be eligible for being pushed on
> + * another cpu if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> __set_next_task_fair(rq, p, true);
> }
>
> @@ -9019,6 +9223,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> cfs_rq = cfs_rq_of(se);
> put_prev_entity(cfs_rq, se);
> }
> +
> + /*
> + * The previous task might be eligible for being pushed on another cpu
> + * if it is still active.
> + */
> + fair_add_pushable_task(rq, prev);
> +
> }
>
> /*
> @@ -13151,6 +13362,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> + check_pushable_task(curr, rq);
> update_misfit_status(curr, rq);
> check_update_overutilized_status(task_rq(curr));
>
> @@ -13303,6 +13515,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> {
> struct sched_entity *se = &p->se;
>
> + fair_remove_pushable_task(rq, p);
> +
> #ifdef CONFIG_SMP
> if (task_on_rq_queued(p)) {
> /*
> @@ -13320,6 +13534,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> if (hrtick_enabled_fair(rq))
> hrtick_start_fair(rq, p);
>
> + /*
> + * Try to push prev task before checking misfit for next task as
> + * the migration of prev can make next fitting the CPU
> + */
> + fair_queue_pushable_tasks(rq);
> update_misfit_status(p, rq);
> sched_fair_update_stop_tick(rq, p);
> }
> @@ -13350,6 +13569,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> cfs_rq->min_vruntime = (u64)(-(1LL << 20));
> #ifdef CONFIG_SMP
> + plist_head_init(&cfs_rq->pushable_tasks);
> raw_spin_lock_init(&cfs_rq->removed.lock);
> #endif
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ab16d3d0e51c..2db198dccf21 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -722,6 +722,8 @@ struct cfs_rq {
> struct list_head leaf_cfs_rq_list;
> struct task_group *tg; /* group that "owns" this runqueue */
>
> + struct plist_head pushable_tasks;
> +
> /* Locally cached copy of our task_group's idle value */
> int idle;
>
> --
> 2.43.0
>
>
* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-04-15 2:31 ` Xuewen Yan
@ 2025-04-15 13:51 ` Vincent Guittot
2025-04-16 2:03 ` Xuewen Yan
0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2025-04-15 13:51 UTC (permalink / raw)
To: Xuewen Yan
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, christian.loehle,
luis.machado, qperret
On Tue, 15 Apr 2025 at 04:31, Xuewen Yan <xuewen.yan94@gmail.com> wrote:
>
> Hi Vincent,
>
> On Mon, Mar 3, 2025 at 5:06 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task doesn't have wakeup events anymore or at a far
> > too low pace. For such situation, we can take advantage of the task being
> > put back in the enqueued list to check if it should be pushed on another
> > CPU. When the task is alone on the CPU, it's never put back in the enqueued
> > list; In this special case, we use the tick to run the check.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 220 +++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/sched.h | 2 +
> > 2 files changed, 222 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a9b97bbc085f..c3e383b86808 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7051,6 +7051,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > }
> >
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> > static void set_next_buddy(struct sched_entity *se);
> >
> > /*
> > @@ -7081,6 +7082,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > h_nr_idle = task_has_idle_policy(p);
> > if (task_sleep || task_delayed || !se->sched_delayed)
> > h_nr_runnable = 1;
> > +
> > + fair_remove_pushable_task(rq, p);
> > } else {
> > cfs_rq = group_cfs_rq(se);
> > slice = cfs_rq_min_slice(cfs_rq);
> > @@ -8589,6 +8592,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > return target;
> > }
> >
> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned long max_capa, util;
> > +
> > + max_capa = min(get_actual_cpu_capacity(cpu),
> > + uclamp_eff_value(p, UCLAMP_MAX));
> > + util = max(task_util_est(p), task_runnable(p));
> > +
> > + /*
> > + * Return true only if the task might not sleep/wakeup because of a low
> > + * compute capacity. Tasks, which wake up regularly, will be handled by
> > + * feec().
> > + */
> I am carefully studying this series of patches. I have some doubts
> about this part.
>
> Need we check the state?
> READ_ONCE(p->__state) != TASK_RUNNING;
> Because the tick will check it.
>
> On the other hand, need we check the sched_delayed?
> Because it also checks it in put_prev_task_fair().
In the case of the tick, the task is the current task and the only one running.
>
> Thanks!
>
> > + return (util > max_capa);
> > +}
> > +
> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> > +{
> > + if (p->nr_cpus_allowed == 1)
> > + return false;
> > +
> > + if (is_rd_overutilized(rq->rd))
> > + return false;
> > +
> > + if (task_stuck_on_cpu(p, cpu_of(rq)))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static int active_load_balance_cpu_stop(void *data);
> > +
> > +static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> > +{
> > + int new_cpu, cpu = cpu_of(rq);
> > +
> > + if (!sched_energy_enabled())
> > + return;
> > +
> > + if (WARN_ON(!p))
> > + return;
> > +
> > + if (WARN_ON(!task_current(rq, p)))
> > + return;
> > +
> > + if (is_migration_disabled(p))
> > + return;
> > +
> > + /* If there are several task, wait for being put back */
> > + if (rq->nr_running > 1)
> > + return;
> > +
> > + if (!sched_energy_push_task(p, rq))
> > + return;
> > +
> > + new_cpu = find_energy_efficient_cpu(p, cpu);
> > +
> > + if (new_cpu == cpu)
> > + return;
> > +
> > + /*
> > + * ->active_balance synchronizes accesses to
> > + * ->active_balance_work. Once set, it's cleared
> > + * only after active load balance is finished.
> > + */
> > + if (!rq->active_balance) {
> > + rq->active_balance = 1;
> > + rq->push_cpu = new_cpu;
> > + } else
> > + return;
> > +
> > + raw_spin_rq_unlock(rq);
> > + stop_one_cpu_nowait(cpu,
> > + active_load_balance_cpu_stop, rq,
> > + &rq->active_balance_work);
> > + raw_spin_rq_lock(rq);
> > +}
> > +
> > +static inline int has_pushable_tasks(struct rq *rq)
> > +{
> > + return !plist_head_empty(&rq->cfs.pushable_tasks);
> > +}
> > +
> > +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *p;
> > +
> > + if (!has_pushable_tasks(rq))
> > + return NULL;
> > +
> > + p = plist_first_entry(&rq->cfs.pushable_tasks,
> > + struct task_struct, pushable_tasks);
> > +
> > + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> > + WARN_ON_ONCE(task_current(rq, p));
> > + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> > + WARN_ON_ONCE(!task_on_rq_queued(p));
> > +
> > + /*
> > + * Remove task from the pushable list as we try only once after that
> > + * the task has been put back in enqueued list.
> > + */
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +
> > + return p;
> > +}
> > +
> > +/*
> > + * See if the non running fair tasks on this rq can be sent on other CPUs
> > + * that fits better with their profile.
> > + */
> > +static bool push_fair_task(struct rq *rq)
> > +{
> > + struct task_struct *next_task;
> > + int prev_cpu, new_cpu;
> > + struct rq *new_rq;
> > +
> > + next_task = pick_next_pushable_fair_task(rq);
> > + if (!next_task)
> > + return false;
> > +
> > + if (is_migration_disabled(next_task))
> > + return true;
> > +
> > + /* We might release rq lock */
> > + get_task_struct(next_task);
> > +
> > + prev_cpu = rq->cpu;
> > +
> > + new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> > +
> > + if (new_cpu == prev_cpu)
> > + goto out;
> > +
> > + new_rq = cpu_rq(new_cpu);
> > +
> > + if (double_lock_balance(rq, new_rq)) {
> > + /* The task has already migrated in between */
> > + if (task_cpu(next_task) != rq->cpu) {
> > + double_unlock_balance(rq, new_rq);
> > + goto out;
> > + }
> > +
> > + deactivate_task(rq, next_task, 0);
> > + set_task_cpu(next_task, new_cpu);
> > + activate_task(new_rq, next_task, 0);
> > +
> > + resched_curr(new_rq);
> > +
> > + double_unlock_balance(rq, new_rq);
> > + }
> > +
> > +out:
> > + put_task_struct(next_task);
> > +
> > + return true;
> > +}
> > +
> > +static void push_fair_tasks(struct rq *rq)
> > +{
> > + /* push_fair_task() will return true if it moved a fair task */
> > + while (push_fair_task(rq))
> > + ;
> > +}
> > +
> > +static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
> > +
> > +static inline void fair_queue_pushable_tasks(struct rq *rq)
> > +{
> > + if (!sched_energy_enabled() || !has_pushable_tasks(rq))
> > + return;
> > +
> > + queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
> > +}
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_energy_enabled())
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +}
> > +
> > +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> > +{
> > + if (sched_energy_enabled() && task_on_rq_queued(p) && !p->se.sched_delayed) {
> > + if (sched_energy_push_task(p, rq)) {
> > + plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + plist_node_init(&p->pushable_tasks, p->prio);
> > + plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > + }
> > + }
> > +}
> > +
> > /*
> > * select_task_rq_fair: Select target runqueue for the waking task in domains
> > * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -8758,6 +8952,10 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > return sched_balance_newidle(rq, rf) != 0;
> > }
> > #else
> > +static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
> > +static inline void fair_queue_pushable_tasks(struct rq *rq) {}
> > +static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
> > +static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
> > static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
> > #endif /* CONFIG_SMP */
> >
> > @@ -8947,6 +9145,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > put_prev_entity(cfs_rq, pse);
> > set_next_entity(cfs_rq, se);
> >
> > + /*
> > + * The previous task might be eligible for being pushed on
> > + * another cpu if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > __set_next_task_fair(rq, p, true);
> > }
> >
> > @@ -9019,6 +9223,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
> > cfs_rq = cfs_rq_of(se);
> > put_prev_entity(cfs_rq, se);
> > }
> > +
> > + /*
> > + * The previous task might be eligible for being pushed on another cpu
> > + * if it is still active.
> > + */
> > + fair_add_pushable_task(rq, prev);
> > +
> > }
> >
> > /*
> > @@ -13151,6 +13362,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > if (static_branch_unlikely(&sched_numa_balancing))
> > task_tick_numa(rq, curr);
> >
> > + check_pushable_task(curr, rq);
> > update_misfit_status(curr, rq);
> > check_update_overutilized_status(task_rq(curr));
> >
> > @@ -13303,6 +13515,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > {
> > struct sched_entity *se = &p->se;
> >
> > + fair_remove_pushable_task(rq, p);
> > +
> > #ifdef CONFIG_SMP
> > if (task_on_rq_queued(p)) {
> > /*
> > @@ -13320,6 +13534,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
> > if (hrtick_enabled_fair(rq))
> > hrtick_start_fair(rq, p);
> >
> > + /*
> > + * Try to push prev task before checking misfit for next task as
> > + * the migration of prev can make next fitting the CPU
> > + */
> > + fair_queue_pushable_tasks(rq);
> > update_misfit_status(p, rq);
> > sched_fair_update_stop_tick(rq, p);
> > }
> > @@ -13350,6 +13569,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> > cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> > cfs_rq->min_vruntime = (u64)(-(1LL << 20));
> > #ifdef CONFIG_SMP
> > + plist_head_init(&cfs_rq->pushable_tasks);
> > raw_spin_lock_init(&cfs_rq->removed.lock);
> > #endif
> > }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ab16d3d0e51c..2db198dccf21 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -722,6 +722,8 @@ struct cfs_rq {
> > struct list_head leaf_cfs_rq_list;
> > struct task_group *tg; /* group that "owns" this runqueue */
> >
> > + struct plist_head pushable_tasks;
> > +
> > /* Locally cached copy of our task_group's idle value */
> > int idle;
> >
> > --
> > 2.43.0
> >
> >
^ permalink raw reply [flat|nested] 31+ messages in thread

* Re: [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS
2025-04-15 13:51 ` Vincent Guittot
@ 2025-04-16 2:03 ` Xuewen Yan
0 siblings, 0 replies; 31+ messages in thread
From: Xuewen Yan @ 2025-04-16 2:03 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, christian.loehle,
luis.machado, qperret
Hi Vincent,
On Tue, Apr 15, 2025 at 9:51 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> > I am carefully studying this series of patches. I have some doubts
> > about this part.
> >
> > Need we check the state?
> > READ_ONCE(p->__state) != TASK_RUNNING;
> > Because the tick will check it.
> >
> > On the other hand, need we check the sched_delayed?
> > Because it also checks it in put_prev_task_fair().
>
> In the case of tick, the task is the current task and the only one running
>
Consider the following sequence:
set_current_state(TASK_INTERRUPTIBLE);
schedule();
  __schedule();
    local_irq_disable();
If the tick occurs between set_current_state() and local_irq_disable(),
the task is about to sleep, so maybe we do not need to migrate it.
BR
---
xuewen
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 6/7 v5] sched/fair: Add misfit case to push task mechanism for EAS
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
` (4 preceding siblings ...)
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
@ 2025-03-02 21:05 ` Vincent Guittot
2025-03-24 16:06 ` Christian Loehle
2025-03-02 21:05 ` [PATCH 7/7 v5] sched/fair: Update overutilized detection Vincent Guittot
` (2 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2025-03-02 21:05 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
Some task misfit cases can be handled directly by the push mechanism
instead of triggering an idle load balance to pull the task onto a better
CPU.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c3e383b86808..d21fe0a26633 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8508,6 +8508,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target_stat.runnable = cpu_runnable(cpu_rq(cpu));
target_stat.capa = capacity_of(cpu);
target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
+ if ((p->on_rq) && (!p->se.sched_delayed) && (cpu == prev_cpu))
+ target_stat.nr_running--;
/* If the target needs a lower OPP, then look up for
* the corresponding OPP and its associated cost.
@@ -8613,6 +8615,9 @@ static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
if (p->nr_cpus_allowed == 1)
return false;
+ if (!task_fits_cpu(p, cpu_of(rq)))
+ return true;
+
if (is_rd_overutilized(rq->rd))
return false;
@@ -8624,33 +8629,33 @@ static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
static int active_load_balance_cpu_stop(void *data);
-static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
+static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
{
int new_cpu, cpu = cpu_of(rq);
if (!sched_energy_enabled())
- return;
+ return false;
if (WARN_ON(!p))
- return;
+ return false;
if (WARN_ON(!task_current(rq, p)))
- return;
+ return false;
if (is_migration_disabled(p))
- return;
+ return false;
/* If there are several task, wait for being put back */
if (rq->nr_running > 1)
- return;
+ return false;
if (!sched_energy_push_task(p, rq))
- return;
+ return false;
new_cpu = find_energy_efficient_cpu(p, cpu);
if (new_cpu == cpu)
- return;
+ return false;
/*
* ->active_balance synchronizes accesses to
@@ -8661,13 +8666,15 @@ static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
rq->active_balance = 1;
rq->push_cpu = new_cpu;
} else
- return;
+ return false;
raw_spin_rq_unlock(rq);
stop_one_cpu_nowait(cpu,
active_load_balance_cpu_stop, rq,
&rq->active_balance_work);
raw_spin_rq_lock(rq);
+
+ return true;
}
static inline int has_pushable_tasks(struct rq *rq)
@@ -8952,7 +8959,11 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
return sched_balance_newidle(rq, rf) != 0;
}
#else
-static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
+static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
+{
+ return false;
+}
+
static inline void fair_queue_pushable_tasks(struct rq *rq) {}
static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
@@ -13362,9 +13373,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- check_pushable_task(curr, rq);
- update_misfit_status(curr, rq);
- check_update_overutilized_status(task_rq(curr));
+ if (!check_pushable_task(curr, rq)) {
+ update_misfit_status(curr, rq);
+ check_update_overutilized_status(task_rq(curr));
+ }
task_tick_core(rq, curr);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread

* Re: [PATCH 6/7 v5] sched/fair: Add misfit case to push task mechanism for EAS
2025-03-02 21:05 ` [PATCH 6/7 v5] sched/fair: Add misfit case to push task mechanism " Vincent Guittot
@ 2025-03-24 16:06 ` Christian Loehle
0 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2025-03-24 16:06 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: qyousef, hongyan.xia2, luis.machado, qperret
On 3/2/25 21:05, Vincent Guittot wrote:
> Some task misfit cases can be handled directly by the push mecanism
> instead of triggering an idle load balance to pull the task on a better
> CPU.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
So why is one better than the other?
In my testing most misfit migrations were still handled by load-balance,
simply because the push mechanism is only triggered when switching away from
the task (which is still running, often 'uncontended', for a while).
I can provide some numbers here if desired.
> ---
> kernel/sched/fair.c | 38 +++++++++++++++++++++++++-------------
> 1 file changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c3e383b86808..d21fe0a26633 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8508,6 +8508,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> target_stat.runnable = cpu_runnable(cpu_rq(cpu));
> target_stat.capa = capacity_of(cpu);
> target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
> + if ((p->on_rq) && (!p->se.sched_delayed) && (cpu == prev_cpu))
> + target_stat.nr_running--;
>
> /* If the target needs a lower OPP, then look up for
> * the corresponding OPP and its associated cost.
> @@ -8613,6 +8615,9 @@ static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> if (p->nr_cpus_allowed == 1)
> return false;
>
> + if (!task_fits_cpu(p, cpu_of(rq)))
> + return true;
> +
> if (is_rd_overutilized(rq->rd))
> return false;
>
> @@ -8624,33 +8629,33 @@ static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
>
> static int active_load_balance_cpu_stop(void *data);
>
> -static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> +static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
> {
> int new_cpu, cpu = cpu_of(rq);
>
> if (!sched_energy_enabled())
> - return;
> + return false;
>
> if (WARN_ON(!p))
> - return;
> + return false;
>
> if (WARN_ON(!task_current(rq, p)))
> - return;
> + return false;
>
> if (is_migration_disabled(p))
> - return;
> + return false;
>
> /* If there are several task, wait for being put back */
> if (rq->nr_running > 1)
> - return;
> + return false;
>
> if (!sched_energy_push_task(p, rq))
> - return;
> + return false;
>
> new_cpu = find_energy_efficient_cpu(p, cpu);
>
> if (new_cpu == cpu)
> - return;
> + return false;
>
> /*
> * ->active_balance synchronizes accesses to
> @@ -8661,13 +8666,15 @@ static inline void check_pushable_task(struct task_struct *p, struct rq *rq)
> rq->active_balance = 1;
> rq->push_cpu = new_cpu;
> } else
> - return;
> + return false;
>
> raw_spin_rq_unlock(rq);
> stop_one_cpu_nowait(cpu,
> active_load_balance_cpu_stop, rq,
> &rq->active_balance_work);
> raw_spin_rq_lock(rq);
> +
> + return true;
> }
>
> static inline int has_pushable_tasks(struct rq *rq)
> @@ -8952,7 +8959,11 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> return sched_balance_newidle(rq, rf) != 0;
> }
> #else
> -static inline void check_pushable_task(struct task_struct *p, struct rq *rq) {}
> +static inline bool check_pushable_task(struct task_struct *p, struct rq *rq)
> +{
> + return false;
> +}
> +
> static inline void fair_queue_pushable_tasks(struct rq *rq) {}
> static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
> static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
> @@ -13362,9 +13373,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> - check_pushable_task(curr, rq);
> - update_misfit_status(curr, rq);
> - check_update_overutilized_status(task_rq(curr));
> + if (!check_pushable_task(curr, rq)) {
> + update_misfit_status(curr, rq);
> + check_update_overutilized_status(task_rq(curr));
> + }
>
> task_tick_core(rq, curr);
> }
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 7/7 v5] sched/fair: Update overutilized detection
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
` (5 preceding siblings ...)
2025-03-02 21:05 ` [PATCH 6/7 v5] sched/fair: Add misfit case to push task mechanism " Vincent Guittot
@ 2025-03-02 21:05 ` Vincent Guittot
2025-03-24 16:41 ` [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Christian Loehle
2025-04-03 12:36 ` Christian Loehle
8 siblings, 0 replies; 31+ messages in thread
From: Vincent Guittot @ 2025-03-02 21:05 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel
Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
Vincent Guittot
Checking uclamp_min is useless and counterproductive for the overutilized
state, as misfit can now happen without being in the overutilized state.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d21fe0a26633..79f505492215 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6831,16 +6831,15 @@ static inline void hrtick_update(struct rq *rq)
#ifdef CONFIG_SMP
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min, rq_util_max;
+ unsigned long rq_util_max;
if (!sched_energy_enabled())
return false;
- rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
/* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu);
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread

* Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
` (6 preceding siblings ...)
2025-03-02 21:05 ` [PATCH 7/7 v5] sched/fair: Update overutilized detection Vincent Guittot
@ 2025-03-24 16:41 ` Christian Loehle
2025-04-03 12:36 ` Christian Loehle
8 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2025-03-24 16:41 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: qyousef, hongyan.xia2, luis.machado, qperret
On 3/2/25 21:05, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp. This
> series tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 creates a new EM interface that will be used by Patch 3
>
> Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
> whereas others might be a better choice. feec() looks for the CPU with the
> highest spare capacity in a PD, assuming that it will be the best CPU from
> an energy-efficiency PoV because it will require the smallest increase of
> OPP. This is often but not always true: this policy filters out some other
> CPUs which would be just as efficient because they use the same OPP but
> have fewer running tasks, for example.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result having the same energy cost. In
> such cases, we can use other metrics to select the best CPU with the same
> energy cost. Patch 3 reworks feec() to look first for the lowest cost in a
> PD and then for the most performant CPU among those. For now, this only
> tries to evenly spread the number of runnable tasks on CPUs, but this can
> be improved with other metrics like the sched slice duration in a follow-up
> series.
>
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
>
> 9 iterations of perf bench sched pipe -T -l 80000
> ops/sec stdev
> tip/sched/core 16634 (+/- 0.5%)
> + patches 1-3 17434 (+/- 1.2%) +4.8%
>
>
> Patch 4 removes the now unused em_cpu_energy()
>
> Patch 5 solves another problem: a task can be stuck on a CPU forever
> because it doesn't sleep anymore and as a result never wakes up and calls
> feec(). Such a task can be detected by comparing util_avg or runnable_avg
> with the compute capacity of the CPU. Once detected, we can call feec() to
> check if there is a better CPU for the stuck task. The call can be done in
> 2 places:
> - When the task is put back in the runnable list after its running slice,
> with the balance callback mechanism, similarly to the rt/dl push callback.
> - During the cfs tick when there is only 1 running task stuck on the CPU,
> in which case the balance callback can't be used.
>
> This push callback mechanism with the new feec() algorithm ensures that
> tasks always get a chance to migrate to the most suitable CPU and don't
> stay stuck on a CPU which is no longer the most suitable one. As examples:
> - A task waking on a big CPU with a uclamp max preventing it from sleeping
> and waking up can migrate to a smaller CPU once that is more power efficient.
> - The tasks are spread on CPUs in the PD when they target the same OPP.
>
> Patch 6 adds the task misfit migration case to the cfs tick and push
> callback mechanism to prevent waking up an idle CPU unnecessarily.
>
> Patch 7 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Compared to v4:
> - Fixed check_pushable_task for !SMP
>
> Compared to v3:
> - Fixed the empty functions
>
> Compared to v2:
> - Renamed the push and tick functions to ease understanding what they do.
> Both are kept in the same patch as they solve the same problem.
> - Created some helper functions
> - Fixing some typos and comments
> - The task_stuck_on_cpu() condition remains unchanged. Pierre suggested
> taking into account the min capacity of the CPU, but that is not directly
> available right now. It can trigger feec() when uclamp_max is very low
> compared to the min capacity of the CPU, but feec() should keep
> returning the same CPU. This can be handled in a follow-on patch.
>
> Compared to v1:
> - The call to feec() even when overutilized has been removed
> from this series and will be addressed in a separate series. Only the case
> of uclamp_min has been kept, as it is now handled by the push callback and
> tick mechanism.
> - The push mechanism has been cleaned up, fixed and simplified.
>
> This series implements some of the topics discussed at OSPM [1]. Other
> topics will be part of another series.
>
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
>
> Vincent Guittot (7):
> sched/fair: Filter false overloaded_group case for EAS
> energy model: Add a get previous state function
> sched/fair: Rework feec() to use cost instead of spare capacity
> energy model: Remove unused em_cpu_energy()
> sched/fair: Add push task mechanism for EAS
> sched/fair: Add misfit case to push task mechanism for EAS
> sched/fair: Update overutilized detection
>
> include/linux/energy_model.h | 111 ++----
> kernel/sched/fair.c | 721 ++++++++++++++++++++++++-----------
> kernel/sched/sched.h | 2 +
> 3 files changed, 518 insertions(+), 316 deletions(-)
>
Hi Vincent,
I'm giving this another go of reviewing after our OSPM discussions.
One thing which bothered me in the past is that it's just a lot going
on in this series, almost rewriting all of the EAS code in fair.c ;)
For easier reviewing I suggest splitting the series:
1. sched/fair: Filter false overloaded_group case for EAS
(Or actually just get this merged, no need to carry this around, is there?)
2. Rework feec to use more factors than just max_spare_cap to improve
responsiveness / reduce load (Patches 2,3,4)
3. Add push mechanism and make use of it for misfit migration (Patches
5,6,7)
In particular 2 & 3 could be separated, reviewed and tested on their own,
this would make it much easier to discuss what's being tackled here IMO.
Best regards,
Christian
^ permalink raw reply [flat|nested] 31+ messages in thread

* Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
` (7 preceding siblings ...)
2025-03-24 16:41 ` [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Christian Loehle
@ 2025-04-03 12:36 ` Christian Loehle
2025-04-15 13:49 ` Vincent Guittot
8 siblings, 1 reply; 31+ messages in thread
From: Christian Loehle @ 2025-04-03 12:36 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, lukasz.luba,
rafael.j.wysocki, pierre.gondois, linux-kernel
Cc: qyousef, hongyan.xia2, luis.machado, qperret
On 3/2/25 21:05, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp. This
> series tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 creates a new EM interface that will be used by Patch 3
>
> Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
> whereas others might be a better choice. feec() looks for the CPU with the
> highest spare capacity in a PD, assuming that it will be the best CPU from
> an energy-efficiency PoV because it will require the smallest increase of
> OPP. This is often but not always true: this policy filters out some other
> CPUs which would be just as efficient because they use the same OPP but
> have fewer running tasks, for example.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result having the same energy cost. In
> such cases, we can use other metrics to select the best CPU with the same
> energy cost. Patch 3 reworks feec() to look first for the lowest cost in a
> PD and then for the most performant CPU among those. For now, this only
> tries to evenly spread the number of runnable tasks on CPUs, but this can
> be improved with other metrics like the sched slice duration in a follow-up
> series.
>
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
>
> 9 iterations of perf bench sched pipe -T -l 80000
> ops/sec stdev
> tip/sched/core 16634 (+/- 0.5%)
> + patches 1-3 17434 (+/- 1.2%) +4.8%
>
>
> Patch 4 removes the now unused em_cpu_energy()
>
> Patch 5 solves another problem: a task can be stuck on a CPU forever
> because it doesn't sleep anymore and as a result never wakes up and calls
> feec(). Such a task can be detected by comparing util_avg or runnable_avg
> with the compute capacity of the CPU. Once detected, we can call feec() to
> check if there is a better CPU for the stuck task. The call can be done in
> 2 places:
> - When the task is put back in the runnable list after its running slice,
> with the balance callback mechanism, similarly to the rt/dl push callback.
> - During the cfs tick when there is only 1 running task stuck on the CPU,
> in which case the balance callback can't be used.
>
> This push callback mechanism with the new feec() algorithm ensures that
> tasks always get a chance to migrate to the most suitable CPU and don't
> stay stuck on a CPU which is no longer the most suitable one. As examples:
> - A task waking on a big CPU with a uclamp max preventing it from sleeping
> and waking up can migrate to a smaller CPU once that is more power efficient.
> - The tasks are spread on CPUs in the PD when they target the same OPP.
>
> Patch 6 adds the task misfit migration case to the cfs tick and push
> callback mechanism to prevent waking up an idle CPU unnecessarily.
>
> Patch 7 removes the need of testing uclamp_min in cpu_overutilized to
> trigger the active migration of a task on another CPU.
>
> Compared to v4:
> - Fixed check_pushable_task for !SMP
>
> Compared to v3:
> - Fixed the empty functions
>
> Compared to v2:
> - Renamed the push and tick functions to ease understanding what they do.
> Both are kept in the same patch as they solve the same problem.
> - Created some helper functions
> - Fixing some typos and comments
> - The task_stuck_on_cpu() condition remains unchanged. Pierre suggested
> taking into account the min capacity of the CPU, but that is not directly
> available right now. It can trigger feec() when uclamp_max is very low
> compared to the min capacity of the CPU, but feec() should keep
> returning the same CPU. This can be handled in a follow-on patch.
>
> Compared to v1:
> - The call to feec() even when overutilized has been removed
> from this series and will be addressed in a separate series. Only the case
> of uclamp_min has been kept, as it is now handled by the push callback and
> tick mechanism.
> - The push mechanism has been cleaned up, fixed and simplified.
>
> This series implements some of the topics discussed at OSPM [1]. Other
> topics will be part of another series.
>
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
>
> Vincent Guittot (7):
> sched/fair: Filter false overloaded_group case for EAS
> energy model: Add a get previous state function
> sched/fair: Rework feec() to use cost instead of spare capacity
> energy model: Remove unused em_cpu_energy()
> sched/fair: Add push task mechanism for EAS
> sched/fair: Add misfit case to push task mecanism for EAS
> sched/fair: Update overutilized detection
>
> include/linux/energy_model.h | 111 ++----
> kernel/sched/fair.c | 721 ++++++++++++++++++++++++-----------
> kernel/sched/sched.h | 2 +
> 3 files changed, 518 insertions(+), 316 deletions(-)
>
Hi Vincent,
so I've invested some time into running tests with the series.
To further narrow down which patch a change in behavior can be
attributed to, I've compared the following:
- Patches 1 to 3 applied, comparing your proposed feec() (B)
only to the baseline feec() (A).
- All patches applied, using a static branch to enable (C) and
disable (D) push mechanism for misfit tasks (if disabled only
the 'tasks stuck on CPU' mechanism triggers here).
I've looked at
1) YouTube 4K video playback
2) Dr.Arm (in-house ARM game)
3) VideoScroller which loads a new video every 3s
4) Idle screen on
5) Speedometer2.0 in Chromium
The device tested is the Pixel6 with 6.12 kernel + backported
scheduler patches.
For power measurements the onboard energy-meter is used [1].
Mainline feec() A is the baseline for all. All workloads are run for
10mins with the exception of Speedometer 2.0
(one iteration each for 5 iterations with cooldowns).
1) YouTube 4K video
+4.5% power with all other tested (the regression already shows with B,
no further change with C & D).
(cf. +18.5% power with CAS).
The power regression comes from increased average frequency on all
3 clusters.
No dropped frames in all tested A to D.
2) Dr.Arm (in-house ARM game)
+9.9% power with all other tested (the regression already shows with B,
no further change with C & D).
(cf. +3.7% power with CAS, new feec() performs worse than CAS here.)
The power regression comes from increased average frequency on all
3 clusters.
3) VideoScroller
No difference in terms of power for A to D.
Specifically even the push mechanism with misfit enabled/disabled
doesn't make a noticeable difference in per-cluster energy numbers.
4) Idle screen on
No difference in power for A to D.
5) Speedometer2.0 in Chromium
Both power and score comparable for A to D.
As mentioned in the thread already the push mechanism
(without misfit tasks) (D) triggers only once every 2-20 minutes,
depending on the workload (all tested here were without any
UCLAMP_MAX tasks).
I also used the device manually just to check if I'm not missing
anything here, I wasn't.
This push task mechanism shouldn't make any difference without
UCLAMP_MAX.
The increased average frequency in 1) and 2) is caused by the
deviation from max-spare-cap in feec(), which previously ensured
as much headroom as possible until we have to raise the OPP of the
cluster.
So all in all this regresses power on some crucial EAS workloads.
I couldn't find a real-world workload where the
'less co-scheduling/contention' strategy of feec() showed a benefit.
Did you have a specific workload for this in mind?
[1]
https://tooling.sites.arm.com/lisa/latest/sections/api/generated/lisa.analysis.pixel6.Pixel6Analysis.html#lisa.analysis.pixel6.Pixel6Analysis.df_power_meter
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
2025-04-03 12:36 ` Christian Loehle
@ 2025-04-15 13:49 ` Vincent Guittot
2025-04-16 10:51 ` Christian Loehle
0 siblings, 1 reply; 31+ messages in thread
From: Vincent Guittot @ 2025-04-15 13:49 UTC (permalink / raw)
To: Christian Loehle
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, luis.machado, qperret
Hi Christian,
On Thu, 3 Apr 2025 at 14:37, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 3/2/25 21:05, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > become more and more visible with features like uclamp, as an example. This
> > series tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
...
> >
> > include/linux/energy_model.h | 111 ++----
> > kernel/sched/fair.c | 721 ++++++++++++++++++++++++-----------
> > kernel/sched/sched.h | 2 +
> > 3 files changed, 518 insertions(+), 316 deletions(-)
> >
>
> Hi Vincent,
> so I've invested some time into running tests with the series.
> To further narrow down which patch we can attribute a change in
> behavior I've compared the following:
> - Patches 1 to 3 applied, comparing your proposed feec() (B)
> only to the baseline feec() (A).
> - All patches applied, using a static branch to enable (C) and
> disable (D) push mechanism for misfit tasks (if disabled only
> the 'tasks stuck on CPU' mechanism triggers here).
>
> I've looked at
> 1) YouTube 4K video playback
> 2) Dr.Arm (in-house ARM game)
> 3) VideoScroller which loads a new video every 3s
> 4) Idle screen on
> 5) Speedometer2.0 in Chromium
>
> The device tested is the Pixel6 with 6.12 kernel + backported
> scheduler patches.
What do you mean by "6.12 kernel + backported scheduler patches"? Do
you mean android mainline v6.12?
I ran my tests with android mainline v6.13 + scheduler patches for
v6.14 and v6.15-rc1. Do you mean the same? v6.12 misses a number of
important patches with regard to thread accounting.
> For power measurements the onboard energy-meter is used [1].
same for me
>
> Mainline feec() A is the baseline for all. All workloads are run for
> 10mins with the exception of Speedometer 2.0
> (one iteration each for 5 iterations with cooldowns).
What do you mean exactly by "one iteration each for 5 iterations with
cooldowns"?
>
> 1) YouTube 4K video
I'd like to reproduce this use case because my test with 4k video
playback shows similar or slightly better power consumption (2%) with
this patch.
Do you have details about this use case that you can share ?
> +4.5% power with all other tested (the regression already shows with B,
> no further change with C & D).
> (cf. +18.5% power with CAS).
> The power regression comes from increased average frequency on all
> 3 clusters.
I'm interested to understand why the average frequency increases, as
the OPP remains the 1st level of selection, and in lightly loaded
use cases we should not see much difference. That's what I see in my
4k video playback use case.
And I will also look at why CAS is better in your case.
> No dropped frames in all tested A to D.
>
> 2) Dr.Arm (in-house ARM game)
> +9.9% power with all other tested (the regression already shows with B,
> no further change with C & D).
> (cf. +3.7% power with CAS, new feec() performs worse than CAS here.)
> The power regression comes from increased average frequency on all
> 3 clusters.
I suppose that I won't be able to reproduce this one.
>
> 3) VideoScroller
> No difference in terms of power for A to D.
> Specifically even the push mechanism with misfit enabled/disabled
> doesn't make a noticeable difference in per-cluster energy numbers.
>
> 4) Idle screen on
> No difference in power for A to D.
I see a difference here, mainly in DDR power consumption, with a 7%
saving compared to mainline and 2% on the CPU clusters.
>
> 5) Speedometer2.0 in Chromium
> Both power and score comparable for A to D.
>
> As mentioned in the thread already the push mechanism
> (without misfit tasks) (D) triggers only once every 2-20 minutes,
> depending on the workload (all tested here were without any
> UCLAMP_MAX tasks).
> I also used the device manually just to check if I'm not missing
> anything here, I wasn't.
> This push task mechanism shouldn't make any difference without
> UCLAMP_MAX.
On the push mechanism side, I'm surprised that you don't get more than
one push every 2-20 minutes. On Speedometer, I've got around 170
push fair and 600 check pushable which end with a task migration
during the 75 seconds of the test, and many more calls that end on
the same CPU. This also needs to be compared with the 70% of the
75 seconds spent in the overutilized state, during which we don't
push. In lightly loaded cases, the condition is currently too
conservative to trigger the push task mechanism, but that's also
expected in order to be conservative.
The fact that OU triggers too quickly limits the impact of the push and
feec rework.
uclamp_max sees a difference with the push mechanism, which is another
argument for using it.
And this 1st step is quite conservative before extending the cases
which can benefit from the push and feec rework, as explained at OSPM.
>
> The increased average frequency in 1) and 2) is caused by the
> deviation from max-spare-cap in feec(), which previously ensured
> as much headroom as possible until we have to raise the OPP of the
> cluster.
>
> So all in all this regresses power on some crucial EAS workloads.
> I couldn't find a real-world workload where the
> 'less co-scheduling/contention' strategy of feec() showed a benefit.
> Did you have a specific workload for this in mind?
>
> [1]
> https://tooling.sites.arm.com/lisa/latest/sections/api/generated/lisa.analysis.pixel6.Pixel6Analysis.html#lisa.analysis.pixel6.Pixel6Analysis.df_power_meter
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
2025-04-15 13:49 ` Vincent Guittot
@ 2025-04-16 10:51 ` Christian Loehle
0 siblings, 0 replies; 31+ messages in thread
From: Christian Loehle @ 2025-04-16 10:51 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
linux-kernel, qyousef, hongyan.xia2, luis.machado, qperret
On 4/15/25 14:49, Vincent Guittot wrote:
> Hi Christian,
>
> On Thu, 3 Apr 2025 at 14:37, Christian Loehle <christian.loehle@arm.com> wrote:
>>
>> On 3/2/25 21:05, Vincent Guittot wrote:
>>> The current Energy Aware Scheduler has some known limitations which have
>>> become more and more visible with features like uclamp, as an example. This
>>> series tries to fix some of those issues:
>>> - tasks stacked on the same CPU of a PD
>>> - tasks stuck on the wrong CPU.
>>>
>
> ...
>
>>>
>>> include/linux/energy_model.h | 111 ++----
>>> kernel/sched/fair.c | 721 ++++++++++++++++++++++++-----------
>>> kernel/sched/sched.h | 2 +
>>> 3 files changed, 518 insertions(+), 316 deletions(-)
>>>
>>
>> Hi Vincent,
>> so I've invested some time into running tests with the series.
>> To further narrow down which patch we can attribute a change in
>> behavior I've compared the following:
>> - Patches 1 to 3 applied, comparing your proposed feec() (B)
>> only to the baseline feec() (A).
>> - All patches applied, using a static branch to enable (C) and
>> disable (D) push mechanism for misfit tasks (if disabled only
>> the 'tasks stuck on CPU' mechanism triggers here).
>>
>> I've looked at
>> 1) YouTube 4K video playback
>> 2) Dr.Arm (in-house ARM game)
>> 3) VideoScroller which loads a new video every 3s
>> 4) Idle screen on
>> 5) Speedometer2.0 in Chromium
>>
>> The device tested is the Pixel6 with 6.12 kernel + backported
>> scheduler patches.
>
> What do you mean by "6.12 kernel + backported scheduler patches" ? Do
> you mean android mainline v6.12 ?
Yes, in particular with the following patches backported:
(This series is here in the shortlog)
PM: EM: Add min/max available performance state limits
sched/fair: Fix variable declaration position
sched/fair: Do not try to migrate delayed dequeue task
sched/fair: Rename cfs_rq.nr_running into nr_queued
sched/fair: Remove unused cfs_rq.idle_nr_running
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
sched/fair: Removed unsued cfs_rq.h_nr_delayed
sched/fair: Use the new cfs_rq.h_nr_runnable
sched/fair: Add new cfs_rq.h_nr_runnable
sched/fair: Rename h_nr_running into h_nr_queued
sched/eevdf: More PELT vs DELAYED_DEQUEUE
sched/fair: Fix sched_can_stop_tick() for fair tasks
sched/fair: optimize the PLACE_LAG when se->vlag is zero
>
> I ran my tests with android mainline v6.13 + scheduler patches for
> v6.14 and v6.15-rc1. Do you mean the same? v6.12 misses a number of
> important patches with regard to thread accounting.
Which ones in particular do you think are critical?
I'm also happy to just use your branch for testing, so we align on
a common base, if you're willing to share it.
I'm not happy about having to test on backported kernels either, but
as long as this is necessary we might as well share branches of
Android mainline kernel backports for EAS patches; we all do the
backports anyway.
>
>> For power measurements the onboard energy-meter is used [1].
>
> same for me
>
>>
>> Mainline feec() A is the baseline for all. All workloads are run for
>> 10mins with the exception of Speedometer 2.0
>> (one iteration each for 5 iterations with cooldowns).
>
> What do you mean exactly by (one iteration each for 5 iterations with
> cooldowns) ?
So for Speedometer 2.0 I do:
Run one iteration.
Wait until device is cooled down (all temp sensors <30C).
Repeat 5x.
>
>>
>> 1) YouTube 4K video
>
> I'd like to reproduce this use case because my test with 4k video
> playback shows similar or slightly better power consumption (2%) with
> this patch.
>
> Do you have details about this use case that you can share ?
Sure, in that case it's just a 5-hour-long sample video without
ads in between. I then static-branch between e.g. the two feec()s
to collect the numbers, with 1m of stabilising between static branch
switches where energy numbers are disregarded.
>
>
>> +4.5% power with all other tested (the regression already shows with B,
>> no further change with C & D).
>> (cf. +18.5% power with CAS).
>> The power regression comes from increased average frequency on all
>> 3 clusters.
>
> I'm interested to understand why the average frequency increases, as
> the OPP remains the 1st level of selection, and in lightly loaded
> use cases we should not see much difference. That's what I see in my
> 4k video playback use case.
Well, the OPPs may be quite far apart, and while the max-spare-cap strategy
optimally balances the util within the cluster, this series deviates
from that, so you will raise the OPP earlier once the util of the CPUs in
the cluster grows.
For illustration here's the OPP table for the tested Pixel 6:
CPU Freq (kHz) ΔFreq Capacity ΔCap
cpu0 300000 0 26 0
cpu0 574000 274000 50 24
cpu0 738000 164000 65 15
cpu0 930000 192000 82 17
cpu0 1098000 168000 97 15
cpu0 1197000 99000 106 9
cpu0 1328000 131000 117 11
cpu0 1401000 73000 124 7
cpu0 1598000 197000 141 17
cpu0 1704000 106000 151 10
cpu0 1803000 99000 160 9
cpu4 400000 0 88 0
cpu4 553000 153000 122 34
cpu4 696000 143000 153 31
cpu4 799000 103000 176 23
cpu4 910000 111000 201 25
cpu4 1024000 114000 226 25
cpu4 1197000 173000 264 38
cpu4 1328000 131000 293 29
cpu4 1491000 163000 329 36
cpu4 1663000 172000 367 38
cpu4 1836000 173000 405 38
cpu4 1999000 163000 441 36
cpu4 2130000 131000 470 29
cpu4 2253000 123000 498 28
cpu6 500000 0 182 0
cpu6 851000 351000 311 129
cpu6 984000 133000 359 48
cpu6 1106000 122000 404 45
cpu6 1277000 171000 466 62
cpu6 1426000 149000 521 55
cpu6 1582000 156000 578 57
cpu6 1745000 163000 637 59
cpu6 1826000 81000 667 30
cpu6 2048000 222000 748 81
cpu6 2188000 140000 799 51
cpu6 2252000 64000 823 24
cpu6 2401000 149000 877 54
cpu6 2507000 106000 916 39
cpu6 2630000 123000 961 45
cpu6 2704000 74000 988 27
cpu6 2802000 98000 1024 36
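As an aside, the Capacity column above follows linearly from frequency. A
minimal sketch (my own illustration, not kernel code; the helper name is
invented) reproduces it with truncating integer arithmetic:

```python
# Illustration only: each Capacity value in the table is the frequency
# scaled linearly against the CPU's maximum frequency and max capacity,
# truncated as in kernel integer arithmetic.
def capacity(freq_khz: int, freq_max_khz: int, cap_max: int) -> int:
    """Linear frequency-to-capacity mapping with integer truncation."""
    return freq_khz * cap_max // freq_max_khz

# Spot-check against the Pixel 6 table: little (cpu0), mid (cpu4), big (cpu6).
print(capacity(574000, 1803000, 160))    # 50
print(capacity(1197000, 2253000, 498))   # 264
print(capacity(851000, 2802000, 1024))   # 311
```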
A hypothetical util distribution on the little cluster at OPP0
would be:
0:5 1:16 2:17 3:18
When now placing a util=2 task, max-spare-cap will obviously
pick CPU0, while you may deviate from that and pick any of
CPU1-3 instead. For CPU3, even a single util increase will raise
the OPP of the cluster.
As utils are never that stable, the balancing effect of
max-spare-cap helps preserve energy.
On the big (CPU6), the OPP0 -> OPP1 step is even worse, if the
util numbers above are too small to be convincing.
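The distribution above can be sketched numerically. This is a hypothetical
model, not kernel code: the 1.25 headroom factor stands in for schedutil's
~20% frequency margin, and all names are invented for illustration:

```python
# Hypothetical model of the little cluster at OPP0 (capacity 26, see table).
OPP0_CAP = 26       # per-CPU capacity at the lowest OPP
HEADROOM = 1.25     # stand-in for schedutil's ~20% frequency headroom

def max_spare_cap_cpu(utils, cap=OPP0_CAP):
    """Pick the CPU with the most spare capacity, as baseline placement does."""
    return max(range(len(utils)), key=lambda i: cap - utils[i])

def opp_must_rise(util, cap=OPP0_CAP):
    """True once util plus headroom no longer fits the current OPP."""
    return util * HEADROOM > cap

utils = [5, 16, 17, 18]                 # the hypothetical distribution above

# Max-spare-cap places a util=2 task on CPU0; the cluster stays at OPP0.
best = max_spare_cap_cpu(utils)
print(best, opp_must_rise(utils[best] + 2))    # 0 False

# Deviating to CPU3 leaves it one util bump away from raising the OPP:
print(opp_must_rise(utils[3] + 2))             # 20 * 1.25 = 25.0  -> False
print(opp_must_rise(utils[3] + 2 + 1))         # 21 * 1.25 = 26.25 -> True
```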
>
> And I will also look at why the CAS is better in your case
>
>> No dropped frames in all tested A to D.
>>
>> 2) Dr.Arm (in-house ARM game)
>> +9.9% power with all other tested (the regression already shows with B,
>> no further change with C & D).
>> (cf. +3.7% power with CAS, new feec() performs worse than CAS here.)
>> The power regression comes from increased average frequency on all
>> 3 clusters.
>
> I suppose that I won't be able to reproduce this one.
Not really, although given that the YT case is similar, I don't
think this would be a one-off. Probably any comparable 3D action
game will do (our in-house one is just really nice to automate,
obviously).
>
>>
>> 3) VideoScroller
>> No difference in terms of power for A to D.
>> Specifically even the push mechanism with misfit enabled/disabled
>> doesn't make a noticeable difference in per-cluster energy numbers.
>>
>> 4) Idle screen on
>> No difference in power for A to D.
>
> I see a difference here mainly for DDR power consumption with 7%
> saving compared to mainline and 2% on the CPU clusters
Honestly, the stddev on these is so high that something needs to go
quite badly wrong to show anything significant here; I just wanted to
include it.
>
>>
>> 5) Speedometer2.0 in Chromium
>> Both power and score comparable for A to D.
>>
>> As mentioned in the thread already the push mechanism
>> (without misfit tasks) (D) triggers only once every 2-20 minutes,
>> depending on the workload (all tested here were without any
>> UCLAMP_MAX tasks).
>> I also used the device manually just to check if I'm not missing
>> anything here, I wasn't.
>> This push task mechanism shouldn't make any difference without
>> UCLAMP_MAX.
>
> On the push mechanism side, I'm surprised that you don't get more than
> one push every 2-20 minutes. On Speedometer, I've got around 170
> push fair and 600 check pushable which end with a task migration
> during the 75 seconds of the test, and many more calls that end on
> the same CPU. This also needs to be compared with the 70% of the
> 75 seconds spent in the overutilized state, during which we don't
> push. In lightly loaded cases, the condition is currently too
> conservative to trigger the push task mechanism, but that's also
> expected in order to be conservative.
Does that include misfit pushes? I'd be interested if our results
vastly differ here. Just to reiterate, this is without misfit pushes,
only the "stuck on CPU" case introduced by 5/7.
>
> The fact that OU triggers too quickly limits the impact of the push and feec rework.
I'm working on a series here :)
>
> uclamp_max sees a difference with the push mechanism which is another
> argument for using it.
I don't doubt that, but there's little to test with real-world use-cases
really...
>
> And this 1st step is quite conservative before extending the cases
> which can benefit from the push and feec rework, as explained at OSPM.
>
Right, I actually do see the appeal of having the push mechanism in fair/EAS,
but of course the series introducing it should also have sufficiently
convincing benefits.
^ permalink raw reply [flat|nested] 31+ messages in thread