[PATCH 0/5] sched/fair: Rework EAS to handle more cases

* [PATCH 0/5] sched/fair: Rework EAS to handle more cases
@ 2024-08-30 13:03 Vincent Guittot
  2024-08-30 13:03 ` [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
                   ` (6 more replies)
  0 siblings, 7 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-08-30 13:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, Vincent Guittot

The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features such as uclamp. This series
tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is only capped to a lower compute capacity. This wrong
classification can prevent the periodic load balancer from selecting a
group_misfit_task CPU because group_overloaded has higher priority.

Patch 2 creates a new EM interface that will be used by Patch 3.

Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
whereas others might be a better choice. feec() looks for the CPU with the
highest spare capacity in a PD, assuming that it will be the best CPU from
an energy efficiency PoV because it will require the smallest increase of
OPP. This is often but not always true: this policy filters out some other
CPUs which would be as efficient because they use the same OPP but with,
for example, fewer running tasks.
In fact, we only care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and as a result having the same energy cost. In
such cases, we can use other metrics to select the best CPU with the same
energy cost. Patch 3 reworks feec() to look first for the lowest cost in a
PD and then for the most performant CPU among those CPUs.
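
As an illustration only (this is not the code from patch 3), the "lowest
cost first, then another metric" ordering can be sketched in user space as
below. The struct, its fields and the choice of the number of running tasks
as the tie-break are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical per-CPU candidate; names are illustrative, not kernel API. */
struct cpu_candidate {
	int cpu;
	unsigned long cost;      /* cost of the OPP needed if the task lands here */
	unsigned int nr_running; /* tasks already running on this CPU */
};

/*
 * Return the index of the preferred candidate: lowest OPP cost first,
 * then (assumed tie-break) the fewest running tasks.
 * Returns -1 when there is no candidate.
 */
static int pick_candidate(const struct cpu_candidate *c, size_t n)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (best < 0 ||
		    c[i].cost < c[best].cost ||
		    (c[i].cost == c[best].cost &&
		     c[i].nr_running < c[best].nr_running))
			best = (int)i;
	}

	return best;
}
```

With two CPUs needing the same OPP cost, the one with fewer running tasks
wins, which a pure "highest spare capacity" policy would not guarantee.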

perf bench sched pipe on a dragonboard rb5 has been used to compare the
overhead of the new feec() vs the current implementation.
Side note: delayed dequeue has been disabled for all tests.

9 iterations of perf bench sched pipe -T -l 80000
                ops/sec  stdev 
tip/sched/core  13490    (+/- 1.7%)
+ patches 1-3   14095    (+/- 1.7%)  +4.5%

When overutilized, the scheduler stops looking for an energy efficient CPU
and falls back to the default performance mode. Although this is the best
choice when a system is fully overutilized, it also breaks energy
efficiency when one CPU becomes overutilized for a short time because of,
for example, kworker and/or background activity.
Patch 4 calls feec() every time instead of skipping it when overutilized,
and falls back to the default performance mode only when feec() can't find
a suitable CPU. The main advantage is that the task placement remains more
stable, especially when there is a short and transient overutilized state.
The drawback is that the overhead can be significant for some CPU intensive
use cases.
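
A minimal user-space model of the change in control flow described above,
under stated assumptions: the function names are hypothetical, a negative
eas_cpu models "feec() found no suitable CPU", and perf_cpu models the result
of the default performance path.

```c
#include <stdbool.h>

/* Before patch 4 (model): the energy-aware result is ignored when overutilized. */
static int select_cpu_before(bool overutilized, int eas_cpu, int perf_cpu)
{
	if (!overutilized && eas_cpu >= 0)
		return eas_cpu;

	return perf_cpu;
}

/* With patch 4 (model): always try the energy-aware result first. */
static int select_cpu_after(int eas_cpu, int perf_cpu)
{
	return eas_cpu >= 0 ? eas_cpu : perf_cpu;
}
```

The only behavioural difference in this model is the overutilized case, where
the energy-aware choice is now kept whenever one exists.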

The overhead of patch 4 has been stressed with hackbench on a dragonboard rb5:

                               tip/sched/core        + patches 1-4
			       Time    stdev         Time    stdev
hackbench -l 5120 -g 1         0.724   +/-1.3%       0.765   +/-3.0% (-5.7%)
hackbench -l 1280 -g 4         0.740   +/-1.1%       0.768   +/-1.8% (-3.8%)
hackbench -l 640  -g 8         0.792   +/-1.3%       0.812   +/-1.6% (-2.6%)
hackbench -l 320  -g 16        0.847   +/-1.4%       0.852   +/-1.8% (-0.6%)

hackbench -p -l 5120 -g 1      0.878   +/-1.9%       1.115   +/-3.0% (-27%)
hackbench -p -l 1280 -g 4      0.789   +/-2.6%       0.862   +/-5.0% (-9.2%)
hackbench -p -l 640  -g 8      0.732   +/-1.9%       0.801   +/-4.3% (-9.4%)
hackbench -p -l 320  -g 16     0.710   +/-4.7%       0.767   +/-4.9% (-8.1%)

hackbench -T -l 5120 -g 1      0.756   +/-3.9%       0.772   +/-1.63 (-2.0%)
hackbench -T -l 1280 -g 4      0.725   +/-1.4%       0.737   +/-2.0% (-1.3%)
hackbench -T -l 640  -g 8      0.767   +/-1.5%       0.809   +/-2.6% (-5.5%)
hackbench -T -l 320  -g 16     0.812   +/-1.2%       0.823   +/-2.2% (-1.4%)

hackbench -T -p -l 5120 -g 1   0.941   +/-2.5%       1.190   +/-1.6% (-26%) 
hackbench -T -p -l 1280 -g 4   0.869   +/-2.5%       0.931   +/-4.9% (-7.2%)
hackbench -T -p -l 640  -g 8   0.819   +/-2.4%       0.895   +/-4.6% (-9.3%)
hackbench -T -p -l 320  -g 16  0.763   +/-2.6%       0.863   +/-5.0% (-13%)

Side note: Both the new feec() and the current feec() give similar overheads
with patch 4.

Although the highest reachable CPU throughput is not the only target of EAS,
the overhead can be significant in some cases as shown in the hackbench
results above. That being said, I still think it's worth the benefit for the
stability of task placement and a better control of the power.

Patch 5 solves another problem: a task can be stuck on a CPU forever
because it doesn't sleep anymore and as a result never wakes up and calls
feec(). Such a task can be detected by comparing util_avg or runnable_avg
with the compute capacity of the CPU. Once detected, we can call feec() to
check if there is a better CPU for the stuck task. The call can be done in
2 places:
- When the task is put back in the runnable list after its running slice,
  with the balance callback mechanism, similarly to the rt/dl push callback.
- During the cfs tick when there is only 1 running task stuck on the CPU, in
  which case the balance callback can't be used.
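
The detection step can be pictured with the sketch below. It is not the
check implemented in patch 5: the function name is hypothetical and the
threshold (either signal reaching the CPU's compute capacity, with no margin)
is an assumption made for the example.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: a task is considered "stuck" when its util_avg or
 * runnable_avg has grown up to the compute capacity of its current CPU,
 * i.e. it no longer leaves idle time and therefore never goes through
 * the wakeup path.
 */
static bool task_looks_stuck(unsigned long util_avg,
			     unsigned long runnable_avg,
			     unsigned long cpu_capacity)
{
	unsigned long busiest = util_avg > runnable_avg ? util_avg : runnable_avg;

	/* Assumed threshold: no margin is applied in this sketch. */
	return busiest >= cpu_capacity;
}
```

Looking at runnable_avg in addition to util_avg matters for clamped tasks,
whose utilization alone may stay below the capacity of the CPU.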

This push callback doesn't replace the current misfit task mechanism which
is already implemented, but this could be considered as a follow-up series.

This push callback mechanism with the new feec() algorithm ensures that
tasks always get a chance to migrate to the most suitable CPU and don't
stay stuck on a CPU which is no longer the most suitable one. As examples:
- A task waking on a big CPU with a uclamp max preventing it from sleeping
  and waking up can migrate to a smaller CPU once it's more power efficient.
- The tasks are spread across the CPUs of the PD when they target the same OPP.

This series implements some of the topics discussed at OSPM [1]. Other
topics will be part of another series.

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

Vincent Guittot (5):
  sched/fair: Filter false overloaded_group case for EAS
  energy model: Add a get previous state function
  sched/fair: Rework feec() to use cost instead of spare capacity
  sched/fair: Use EAS also when overutilized
  sched/fair: Add push task callback for EAS

 include/linux/energy_model.h |  18 +
 kernel/sched/fair.c          | 693 +++++++++++++++++++++++------------
 kernel/sched/sched.h         |   2 +
 3 files changed, 488 insertions(+), 225 deletions(-)

-- 
2.34.1


* [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
  2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
@ 2024-08-30 13:03 ` Vincent Guittot
  2024-09-02  9:01   ` Hongyan Xia
  2024-09-13 13:21   ` Pierre Gondois
  2024-08-30 13:03 ` [PATCH 2/5] energy model: Add a get previous state function Vincent Guittot
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-08-30 13:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, Vincent Guittot

With EAS, a group should be set overloaded if at least 1 CPU in the group
is overutilized, but it can happen that a CPU is fully utilized by tasks
because the compute capacity of the CPU is clamped. In such a case, the CPU
is not overutilized and as a result the group should not be set overloaded
either.

group_overloaded having a higher priority than group_misfit, such a group
can be selected as the busiest group instead of a group with a misfit task,
which prevents load_balance from selecting the CPU with the misfit task and
pulling the latter onto a fitting CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea057b311f6..e67d6029b269 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9806,6 +9806,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned long group_overutilized;	/* No CPU is overutilized in the group */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
+	/*
+	 * With EAS and uclamp, 1 CPU in the group must be overutilized to
+	 * consider the group overloaded.
+	 */
+	if (sched_energy_enabled() && !sgs->group_overutilized)
+		return false;
+
 	if (sgs->sum_nr_running <= sgs->group_weight)
 		return false;
 
@@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*sg_overloaded = 1;
 
-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
 			*sg_overutilized = 1;
+			sgs->group_overutilized = 1;
+		}
 
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
-- 
2.34.1



* Re: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
  2024-08-30 13:03 ` [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2024-09-02  9:01   ` Hongyan Xia
  2024-09-06  6:51     ` Vincent Guittot
  2024-09-13 13:21   ` Pierre Gondois
  1 sibling, 1 reply; 62+ messages in thread
From: Hongyan Xia @ 2024-09-02  9:01 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef

On 30/08/2024 14:03, Vincent Guittot wrote:
> With EAS, a group should be set overloaded if at least 1 CPU in the group
> is overutilized, but it can happen that a CPU is fully utilized by tasks
> because the compute capacity of the CPU is clamped. In such a case, the CPU
> is not overutilized and as a result the group should not be set overloaded
> either.
> 
> group_overloaded having a higher priority than group_misfit, such a group
> can be selected as the busiest group instead of a group with a misfit task,
> which prevents load_balance from selecting the CPU with the misfit task and
> pulling the latter onto a fitting CPU.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea057b311f6..e67d6029b269 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9806,6 +9806,7 @@ struct sg_lb_stats {
>   	enum group_type group_type;
>   	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
>   	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
> +	unsigned long group_overutilized;	/* No CPU is overutilized in the group */

Does this have to be unsigned long? I think a shorter width like bool 
(or int to be consistent with other fields) expresses the intention.

Also the comment to me is a bit confusing. All other fields are positive 
but this one's comment is in a negative tone.

>   	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
>   #ifdef CONFIG_NUMA_BALANCING
>   	unsigned int nr_numa_running;
> @@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   static inline bool
>   group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   {
> +	/*
> +	 * With EAS and uclamp, 1 CPU in the group must be overutilized to
> +	 * consider the group overloaded.
> +	 */
> +	if (sched_energy_enabled() && !sgs->group_overutilized)
> +		return false;
> +
>   	if (sgs->sum_nr_running <= sgs->group_weight)
>   		return false;
>   
> @@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		if (nr_running > 1)
>   			*sg_overloaded = 1;
>   
> -		if (cpu_overutilized(i))
> +		if (cpu_overutilized(i)) {
>   			*sg_overutilized = 1;
> +			sgs->group_overutilized = 1;
> +		}
>   
>   #ifdef CONFIG_NUMA_BALANCING
>   		sgs->nr_numa_running += rq->nr_numa_running;


* Re: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
  2024-09-02  9:01   ` Hongyan Xia
@ 2024-09-06  6:51     ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-06  6:51 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef

On Mon, 2 Sept 2024 at 11:01, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> On 30/08/2024 14:03, Vincent Guittot wrote:
> > With EAS, a group should be set overloaded if at least 1 CPU in the group
> > is overutilized, but it can happen that a CPU is fully utilized by tasks
> > because the compute capacity of the CPU is clamped. In such a case, the CPU
> > is not overutilized and as a result the group should not be set overloaded
> > either.
> >
> > group_overloaded having a higher priority than group_misfit, such a group
> > can be selected as the busiest group instead of a group with a misfit task,
> > which prevents load_balance from selecting the CPU with the misfit task and
> > pulling the latter onto a fitting CPU.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c | 12 +++++++++++-
> >   1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fea057b311f6..e67d6029b269 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9806,6 +9806,7 @@ struct sg_lb_stats {
> >       enum group_type group_type;
> >       unsigned int group_asym_packing;        /* Tasks should be moved to preferred CPU */
> >       unsigned int group_smt_balance;         /* Task on busy SMT be moved */
> > +     unsigned long group_overutilized;       /* No CPU is overutilized in the group */
>
> Does this have to be unsigned long? I think a shorter width like bool
> (or int to be consistent with other fields) expresses the intention.

Yes, an unsigned int is enough.

>
> Also the comment to me is a bit confusing. All other fields are positive
> but this one's comment is in a negative tone.

This comes from the first way I implemented it; I then forgot to update
the comment. It should be:
/* At least one CPU is overutilized in the group */

>
> >       unsigned long group_misfit_task_load;   /* A CPU has a task too big for its capacity */
> >   #ifdef CONFIG_NUMA_BALANCING
> >       unsigned int nr_numa_running;
> > @@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> >   static inline bool
> >   group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> >   {
> > +     /*
> > +      * With EAS and uclamp, 1 CPU in the group must be overutilized to
> > +      * consider the group overloaded.
> > +      */
> > +     if (sched_energy_enabled() && !sgs->group_overutilized)
> > +             return false;
> > +
> >       if (sgs->sum_nr_running <= sgs->group_weight)
> >               return false;
> >
> > @@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >               if (nr_running > 1)
> >                       *sg_overloaded = 1;
> >
> > -             if (cpu_overutilized(i))
> > +             if (cpu_overutilized(i)) {
> >                       *sg_overutilized = 1;
> > +                     sgs->group_overutilized = 1;
> > +             }
> >
> >   #ifdef CONFIG_NUMA_BALANCING
> >               sgs->nr_numa_running += rq->nr_numa_running;


* Re: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
  2024-08-30 13:03 ` [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
  2024-09-02  9:01   ` Hongyan Xia
@ 2024-09-13 13:21   ` Pierre Gondois
  1 sibling, 0 replies; 62+ messages in thread
From: Pierre Gondois @ 2024-09-13 13:21 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

Hello Vincent,

I have been trying this patch with the following workload, on a Pixel6
(4 littles, 2 mid, 2 big):
a. 5 tasks with: [UCLAMP_MIN:0, UCLAMP_MAX:1, duty_cycle=100%, cpuset:0-2]
b. 1 task with: [duty_cycle=100%, cpuset:0-7] but starting on CPU4

a.
There are several UCLAMP_MAX tasks so that the group also gets past the
following check and can be tagged as overloaded:
group_is_overloaded()
\-(sgs->sum_nr_running <= sgs->group_weight)
These tasks should put

b. to see if a CPU-bound task is migrated to the big cluster.

---
- Without patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- Without this patch
The migration is effectively due to the load_balancer selecting the
little cluster over the mid cluster.
The little cluster put the system in an overutilized state.

---
- Without patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- With this patch
The load_balancer effectively selects the medium cluster over the little
cluster (since none of the little CPUs is overutilized). The load_balancer
migrates the task b. to a big CPU.

Note:
This is true most of the time, but whenever a non-UCLAMP_MAX task wakes up
on one of CPU0-3 (where the UCLAMP_MAX tasks are pinned), the cluster becomes
overutilized and the new mechanism is bypassed.
Same thing if a task with [UCLAMP_MIN:0, UCLAMP_MAX:1024, duty_cycle=100%, cpuset:0]
is added to the workload.

---
- With patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- Without this patch

The task b. gets an opportunity to migrate to a big CPU through the sched_tick.
However, when both patches are applied, the migration is triggered by the
load_balancer.

---
So FWIW, from a mechanism PoV and independently from patch 5:
Tested-by: Pierre Gondois <pierre.gondois@arm.com>


On 8/30/24 15:03, Vincent Guittot wrote:
> With EAS, a group should be set overloaded if at least 1 CPU in the group
> is overutilized, but it can happen that a CPU is fully utilized by tasks
> because the compute capacity of the CPU is clamped. In such a case, the CPU
> is not overutilized and as a result the group should not be set overloaded
> either.
> 
> group_overloaded having a higher priority than group_misfit, such a group
> can be selected as the busiest group instead of a group with a misfit task,
> which prevents load_balance from selecting the CPU with the misfit task and
> pulling the latter onto a fitting CPU.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea057b311f6..e67d6029b269 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9806,6 +9806,7 @@ struct sg_lb_stats {
>   	enum group_type group_type;
>   	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
>   	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
> +	unsigned long group_overutilized;	/* No CPU is overutilized in the group */
>   	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
>   #ifdef CONFIG_NUMA_BALANCING
>   	unsigned int nr_numa_running;
> @@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   static inline bool
>   group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   {
> +	/*
> +	 * With EAS and uclamp, 1 CPU in the group must be overutilized to
> +	 * consider the group overloaded.
> +	 */
> +	if (sched_energy_enabled() && !sgs->group_overutilized)
> +		return false;
> +
>   	if (sgs->sum_nr_running <= sgs->group_weight)
>   		return false;
>   
> @@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		if (nr_running > 1)
>   			*sg_overloaded = 1;
>   
> -		if (cpu_overutilized(i))
> +		if (cpu_overutilized(i)) {
>   			*sg_overutilized = 1;
> +			sgs->group_overutilized = 1;
> +		}
>   
>   #ifdef CONFIG_NUMA_BALANCING
>   		sgs->nr_numa_running += rq->nr_numa_running;


* [PATCH 2/5] energy model: Add a get previous state function
  2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
  2024-08-30 13:03 ` [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
@ 2024-08-30 13:03 ` Vincent Guittot
  2024-09-05  9:21   ` Lukasz Luba
  2024-08-30 13:03 ` [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-08-30 13:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, Vincent Guittot

Instead of parsing the whole EM table every time, add a function to get the
previous state.

Will be used in the scheduler feec() function.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/energy_model.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 1ff52020cf75..ea8ea7e031c0 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -207,6 +207,24 @@ em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
 	return nr_perf_states - 1;
 }
 
+static inline int
+em_pd_get_previous_state(struct em_perf_state *table, int nr_perf_states,
+			  int idx, unsigned long pd_flags)
+{
+	struct em_perf_state *ps;
+	int i;
+
+	for (i = idx - 1; i >= 0; i--) {
+		ps = &table[i];
+		if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+		    ps->flags & EM_PERF_STATE_INEFFICIENT)
+			continue;
+		return i;
+	}
+
+	return -1;
+}
+
 /**
  * em_cpu_energy() - Estimates the energy consumed by the CPUs of a
  *		performance domain
-- 
2.34.1



* Re: [PATCH 2/5] energy model: Add a get previous state function
  2024-08-30 13:03 ` [PATCH 2/5] energy model: Add a get previous state function Vincent Guittot
@ 2024-09-05  9:21   ` Lukasz Luba
  2024-09-06  6:55     ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Lukasz Luba @ 2024-09-05  9:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: qyousef, hongyan.xia2, mingo, mgorman, peterz, dietmar.eggemann,
	bsegall, vschneid, rostedt, rafael.j.wysocki, linux-kernel,
	juri.lelli

Hi Vincent,

On 8/30/24 14:03, Vincent Guittot wrote:
> Instead of parsing the whole EM table every time, add a function to get the
> previous state.
> 
> Will be used in the scheduler feec() function.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   include/linux/energy_model.h | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 1ff52020cf75..ea8ea7e031c0 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -207,6 +207,24 @@ em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
>   	return nr_perf_states - 1;
>   }
>   
> +static inline int
> +em_pd_get_previous_state(struct em_perf_state *table, int nr_perf_states,
> +			  int idx, unsigned long pd_flags)
> +{
> +	struct em_perf_state *ps;
> +	int i;
> +
> +	for (i = idx - 1; i >= 0; i--) {
> +		ps = &table[i];
> +		if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
> +		    ps->flags & EM_PERF_STATE_INEFFICIENT)
> +			continue;
> +		return i;
> +	}

Would you mind adding a comment on top of that for loop?
Or maybe a bit more detail in the patch header what would you like to
find (e.g. 1st efficient OPP which is lower).

It's looking for a first OPP (don't forget it's ascending 'table') which
is lower or equal to the 'idx' state.

If uclamp_max is set and that OPP is inefficient, don't we choose
a higher OPP which is efficient?

I'm not against this function.

BTW, I wonder if this design is still valid with the uclamp_max.

Regards,
Lukasz


* Re: [PATCH 2/5] energy model: Add a get previous state function
  2024-09-05  9:21   ` Lukasz Luba
@ 2024-09-06  6:55     ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-06  6:55 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: qyousef, hongyan.xia2, mingo, mgorman, peterz, dietmar.eggemann,
	bsegall, vschneid, rostedt, rafael.j.wysocki, linux-kernel,
	juri.lelli

On Thu, 5 Sept 2024 at 11:20, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Vincent,
>
> On 8/30/24 14:03, Vincent Guittot wrote:
> > Instead of parsing the whole EM table every time, add a function to get the
> > previous state.
> >
> > Will be used in the scheduler feec() function.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   include/linux/energy_model.h | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> > index 1ff52020cf75..ea8ea7e031c0 100644
> > --- a/include/linux/energy_model.h
> > +++ b/include/linux/energy_model.h
> > @@ -207,6 +207,24 @@ em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
> >       return nr_perf_states - 1;
> >   }
> >
> > +static inline int
> > +em_pd_get_previous_state(struct em_perf_state *table, int nr_perf_states,
> > +                       int idx, unsigned long pd_flags)
> > +{
> > +     struct em_perf_state *ps;
> > +     int i;
> > +
> > +     for (i = idx - 1; i >= 0; i--) {
> > +             ps = &table[i];
> > +             if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
> > +                 ps->flags & EM_PERF_STATE_INEFFICIENT)
> > +                     continue;
> > +             return i;
> > +     }
>
> Would you mind adding a comment on top of that for loop?

Yes I will

> Or maybe a bit more detail in the patch header what would you like to
> find (e.g. 1st efficient OPP which is lower).
>
> It's looking for a first OPP (don't forget it's ascending 'table') which
> is lower or equal to the 'idx' state.
>
> If uclamp_max is set and that OPP is inefficient, don't we choose
> a higher OPP which is efficient?

I use this function to get the capacity range of an OPP at index idx.
uclamp has already been checked before when selecting OPP idx. Now we
want the capacity range to know when we need to look for a lower OPP.
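
To make the "capacity range of an OPP" idea concrete, here is a user-space
sketch. It mirrors the backward loop of em_pd_get_previous_state() from the
patch, but struct perf_state, PS_INEFFICIENT and opp_capacity_range() are
reduced stand-ins invented for the example, and using 0 as the lower bound
when there is no previous state is an assumption.

```c
#include <stddef.h>

#define PS_INEFFICIENT 0x1UL /* stand-in for EM_PERF_STATE_INEFFICIENT */

/* Reduced stand-in for struct em_perf_state; the table is ascending. */
struct perf_state {
	unsigned long performance; /* compute capacity delivered at this OPP */
	unsigned long flags;
};

/* Nearest lower state, optionally skipping inefficient ones; -1 if none. */
static int previous_state(const struct perf_state *table, int idx,
			  int skip_inefficient)
{
	int i;

	for (i = idx - 1; i >= 0; i--) {
		if (skip_inefficient && (table[i].flags & PS_INEFFICIENT))
			continue;
		return i;
	}

	return -1;
}

/*
 * Capacity range covered by OPP idx: above the previous usable OPP's
 * capacity (assumed 0 when there is none) and up to its own capacity.
 */
static void opp_capacity_range(const struct perf_state *table, int idx,
			       int skip_inefficient,
			       unsigned long *min_perf, unsigned long *max_perf)
{
	int prev = previous_state(table, idx, skip_inefficient);

	*min_perf = prev >= 0 ? table[prev].performance : 0;
	*max_perf = table[idx].performance;
}
```

In this model, a utilization that drops to min_perf or below is the point at
which a lower OPP becomes worth looking for.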

>
> I'm not against this function.
>
> BTW, I wonder if this design is still valid with the uclamp_max.
>
> Regards,
> Lukasz


* [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
  2024-08-30 13:03 ` [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
  2024-08-30 13:03 ` [PATCH 2/5] energy model: Add a get previous state function Vincent Guittot
@ 2024-08-30 13:03 ` Vincent Guittot
  2024-09-02  9:11   ` kernel test robot
                     ` (3 more replies)
  2024-08-30 13:03 ` [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized Vincent Guittot
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-08-30 13:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, Vincent Guittot

feec() looks for the CPU with the highest spare capacity in a PD, assuming
that it will be the best CPU from an energy efficiency PoV because it will
require the smallest increase of OPP. Although this is true generally
speaking, this policy also filters out some other CPUs which would be as
efficient because they use the same OPP.
In fact, we really care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and as a result having the same energy cost. In
these cases, we can use other metrics to select the best CPU for the same
energy cost.

Rework feec() to look first for the lowest cost in a PD and then for the
most performant CPU among those CPUs.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
 1 file changed, 244 insertions(+), 222 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e67d6029b269..2273eecf6086 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8081,29 +8081,37 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 }
 
 /*
- * energy_env - Utilization landscape for energy estimation.
- * @task_busy_time: Utilization contribution by the task for which we test the
- *                  placement. Given by eenv_task_busy_time().
- * @pd_busy_time:   Utilization of the whole perf domain without the task
- *                  contribution. Given by eenv_pd_busy_time().
- * @cpu_cap:        Maximum CPU capacity for the perf domain.
- * @pd_cap:         Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
- */
-struct energy_env {
-	unsigned long task_busy_time;
-	unsigned long pd_busy_time;
-	unsigned long cpu_cap;
-	unsigned long pd_cap;
+ * energy_cpu_stat - Utilization landscape for energy estimation.
+ * @idx :        Index of the OPP in the performance domain
+ * @cost :       Cost of the OPP
+ * @max_perf :   Compute capacity of OPP
+ * @min_perf :   Compute capacity of the previous OPP
+ * @capa :       Capacity of the CPU
+ * @runnable :   runnable_avg of the CPU
+ * @nr_running : number of cfs running task
+ * @fits :       Fits level of the CPU
+ * @cpu :        current best CPU
+ */
+struct energy_cpu_stat {
+	unsigned long idx;
+	unsigned long cost;
+	unsigned long max_perf;
+	unsigned long min_perf;
+	unsigned long capa;
+	unsigned long util;
+	unsigned long runnable;
+	unsigned int nr_running;
+	int fits;
+	int cpu;
 };
 
 /*
- * Compute the task busy time for compute_energy(). This time cannot be
- * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * Compute the task busy time for computing its energy impact. This time cannot
+ * be injected directly into effective_cpu_util() because of the IRQ scaling.
  * The latter only makes sense with the most recent CPUs where the task has
  * run.
  */
-static inline void eenv_task_busy_time(struct energy_env *eenv,
-				       struct task_struct *p, int prev_cpu)
+static inline unsigned long task_busy_time(struct task_struct *p, int prev_cpu)
 {
 	unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
 	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
@@ -8113,124 +8121,152 @@ static inline void eenv_task_busy_time(struct energy_env *eenv,
 	else
 		busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
 
-	eenv->task_busy_time = busy_time;
+	return busy_time;
 }
 
-/*
- * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
- * utilization for each @pd_cpus, it however doesn't take into account
- * clamping since the ratio (utilization / cpu_capacity) is already enough to
- * scale the EM reported power consumption at the (eventually clamped)
- * cpu_capacity.
- *
- * The contribution of the task @p for which we want to estimate the
- * energy cost is removed (by cpu_util()) and must be calculated
- * separately (see eenv_task_busy_time). This ensures:
- *
- *   - A stable PD utilization, no matter which CPU of that PD we want to place
- *     the task on.
- *
- *   - A fair comparison between CPUs as the task contribution (task_util())
- *     will always be the same no matter which CPU utilization we rely on
- *     (util_avg or util_est).
- *
- * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
- * exceed @eenv->pd_cap.
- */
-static inline void eenv_pd_busy_time(struct energy_env *eenv,
-				     struct cpumask *pd_cpus,
-				     struct task_struct *p)
+/* Estimate the utilization of the CPU that is then used to select the OPP */
+static unsigned long find_cpu_max_util(int cpu, struct task_struct *p, int dst_cpu)
 {
-	unsigned long busy_time = 0;
-	int cpu;
+	unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+	unsigned long eff_util, min, max;
+
+	/*
+	 * Performance domain frequency: utilization clamping
+	 * must be considered since it affects the selection
+	 * of the performance domain frequency.
+	 */
+	eff_util = effective_cpu_util(cpu, util, &min, &max);
 
-	for_each_cpu(cpu, pd_cpus) {
-		unsigned long util = cpu_util(cpu, p, -1, 0);
+	/* Task's uclamp can modify min and max value */
+	if (uclamp_is_used() && cpu == dst_cpu) {
+		min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		/*
+		 * If there is no active max uclamp constraint,
+		 * directly use task's one, otherwise keep max.
+		 */
+		if (uclamp_rq_is_idle(cpu_rq(cpu)))
+			max = uclamp_eff_value(p, UCLAMP_MAX);
+		else
+			max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
 	}
 
-	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
+	eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+	return eff_util;
 }
 
-/*
- * Compute the maximum utilization for compute_energy() when the task @p
- * is placed on the cpu @dst_cpu.
- *
- * Returns the maximum utilization among @eenv->cpus. This utilization can't
- * exceed @eenv->cpu_cap.
- */
-static inline unsigned long
-eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
-		 struct task_struct *p, int dst_cpu)
+/* Estimate the utilization of the CPU without the task */
+static unsigned long find_cpu_actual_util(int cpu, struct task_struct *p)
 {
-	unsigned long max_util = 0;
-	int cpu;
+	unsigned long util = cpu_util(cpu, p, -1, 0);
+	unsigned long eff_util;
 
-	for_each_cpu(cpu, pd_cpus) {
-		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
-		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
-		unsigned long eff_util, min, max;
+	eff_util = effective_cpu_util(cpu, util, NULL, NULL);
 
-		/*
-		 * Performance domain frequency: utilization clamping
-		 * must be considered since it affects the selection
-		 * of the performance domain frequency.
-		 * NOTE: in case RT tasks are running, by default the min
-		 * utilization can be max OPP.
-		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
+	return eff_util;
+}
 
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
+/* Find the cost of a performance domain for the estimated utilization */
+static inline void find_pd_cost(struct em_perf_domain *pd,
+				unsigned long max_util,
+				struct energy_cpu_stat *stat)
+{
+	struct em_perf_table *em_table;
+	struct em_perf_state *ps;
+	int i;
 
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
+	/*
+	 * Find the lowest performance state of the Energy Model above the
+	 * requested performance.
+	 */
+	em_table = rcu_dereference(pd->em_table);
+	i = em_pd_get_efficient_state(em_table->state, pd->nr_perf_states,
+				      max_util, pd->flags);
+	ps = &em_table->state[i];
 
-		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
-		max_util = max(max_util, eff_util);
+	/* Save the cost and performance range of the OPP */
+	stat->max_perf = ps->performance;
+	stat->cost = ps->cost;
+	i = em_pd_get_previous_state(em_table->state, pd->nr_perf_states,
+				      i, pd->flags);
+	if (i < 0)
+		stat->min_perf = 0;
+	else {
+		ps = &em_table->state[i];
+		stat->min_perf = ps->performance;
 	}
-
-	return min(max_util, eenv->cpu_cap);
 }
 
-/*
- * compute_energy(): Use the Energy Model to estimate the energy that @pd would
- * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
- * contribution is ignored.
- */
-static inline unsigned long
-compute_energy(struct energy_env *eenv, struct perf_domain *pd,
-	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
+/* Check if the CPU can handle the waking task */
+static int check_cpu_with_task(struct task_struct *p, int cpu)
 {
-	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
-	unsigned long busy_time = eenv->pd_busy_time;
-	unsigned long energy;
+	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
+	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+	unsigned long util_min = p_util_min;
+	unsigned long util_max = p_util_max;
+	unsigned long util = cpu_util(cpu, p, cpu, 0);
+	struct rq *rq = cpu_rq(cpu);
 
-	if (dst_cpu >= 0)
-		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
+	/*
+	 * Skip CPUs that cannot satisfy the capacity request.
+	 * IOW, placing the task there would make the CPU
+	 * overutilized. Take uclamp into account to see how
+	 * much capacity we can get out of the CPU; this is
+	 * aligned with sched_cpu_util().
+	 */
+	if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
+		unsigned long rq_util_min, rq_util_max;
+		/*
+		 * Open code uclamp_rq_util_with() except for
+		 * the clamp() part. I.e.: apply max aggregation
+		 * only. util_fits_cpu() logic requires to
+		 * operate on non clamped util but must use the
+		 * max-aggregated uclamp_{min, max}.
+		 */
+		rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
+		rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
+		util_min = max(rq_util_min, p_util_min);
+		util_max = max(rq_util_max, p_util_max);
+	}
+	return util_fits_cpu(util, util_min, util_max, cpu);
+}
 
-	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+/* For the same cost, select the CPU that will provide the best performance for the task */
+static bool select_best_cpu(struct energy_cpu_stat *target,
+			    struct energy_cpu_stat *min,
+			    int prev, struct sched_domain *sd)
+{
+	/*  Select the one with the least number of running tasks */
+	if (target->nr_running < min->nr_running)
+		return true;
+	if (target->nr_running > min->nr_running)
+		return false;
 
-	trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
+	/* Favor previous CPU otherwise */
+	if (target->cpu == prev)
+		return true;
+	if (min->cpu == prev)
+		return false;
 
-	return energy;
+	/*
+	 * Choose CPU with lowest contention. One might want to consider load instead of
+	 * runnable but we are supposed to not be overutilized so there is enough compute
+	 * capacity for everybody.
+	 */
+	if ((target->runnable * min->capa * sd->imbalance_pct) >=
+			(min->runnable * target->capa * 100))
+		return false;
+
+	return true;
 }
 
 /*
  * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
- * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
- * spare capacity in each performance domain and uses it as a potential
- * candidate to execute the task. Then, it uses the Energy Model to figure
- * out which of the CPU candidates is the most energy-efficient.
+ * waking task. find_energy_efficient_cpu() looks for the CPU with the lowest
+ * power cost (usually with maximum spare capacity but not always) in each
+ * performance domain and uses it as a potential candidate to execute the task.
+ * Then, it uses the Energy Model to figure out which of the CPU candidates is
+ * the most energy-efficient.
  *
  * The rationale for this heuristic is as follows. In a performance domain,
  * all the most energy efficient CPU candidates (according to the Energy
@@ -8267,17 +8303,14 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
 static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
-	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
-	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
-	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+	unsigned long task_util;
+	unsigned long best_nrg = ULONG_MAX;
+	int best_fits = -1;
+	int best_cpu = -1;
 	struct root_domain *rd = this_rq()->rd;
-	int cpu, best_energy_cpu, target = -1;
-	int prev_fits = -1, best_fits = -1;
-	unsigned long best_actual_cap = 0;
-	unsigned long prev_actual_cap = 0;
+	int cpu, target = -1;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
-	struct energy_env eenv;
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
@@ -8296,20 +8329,21 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 	target = prev_cpu;
 
-	sync_entity_load_avg(&p->se);
-	if (!task_util_est(p) && p_util_min == 0)
-		goto unlock;
 
-	eenv_task_busy_time(&eenv, p, prev_cpu);
+	sync_entity_load_avg(&p->se);
+	task_util = task_busy_time(p, prev_cpu);
 
 	for (; pd; pd = pd->next) {
-		unsigned long util_min = p_util_min, util_max = p_util_max;
-		unsigned long cpu_cap, cpu_actual_cap, util;
-		long prev_spare_cap = -1, max_spare_cap = -1;
-		unsigned long rq_util_min, rq_util_max;
-		unsigned long cur_delta, base_energy;
-		int max_spare_cap_cpu = -1;
-		int fits, max_fits = -1;
+		unsigned long cpu_actual_cap, max_cost = 0;
+		unsigned long pd_actual_util = 0, delta_nrg = 0;
+		struct energy_cpu_stat target_stat;
+		struct energy_cpu_stat min_stat = {
+			.cost = ULONG_MAX,
+			.max_perf = ULONG_MAX,
+			.min_perf = ULONG_MAX,
+			.fits = -2,
+			.cpu = -1,
+		};
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -8320,13 +8354,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		cpu = cpumask_first(cpus);
 		cpu_actual_cap = get_actual_cpu_capacity(cpu);
 
-		eenv.cpu_cap = cpu_actual_cap;
-		eenv.pd_cap = 0;
-
+		/* In a PD, the CPU with the lowest cost will be the most efficient */
 		for_each_cpu(cpu, cpus) {
-			struct rq *rq = cpu_rq(cpu);
-
-			eenv.pd_cap += cpu_actual_cap;
+			unsigned long target_perf;
 
 			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
 				continue;
@@ -8334,120 +8364,112 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
-			util = cpu_util(cpu, p, cpu, 0);
-			cpu_cap = capacity_of(cpu);
+			target_stat.fits = check_cpu_with_task(p, cpu);
+
+			if (!target_stat.fits)
+				continue;
+
+			/* 1st select the CPU that fits best */
+			if (target_stat.fits < min_stat.fits)
+				continue;
+
+			/* Then select the CPU with lowest cost */
+
+			/* Get the performance of the CPU w/ waking task. */
+			target_perf = find_cpu_max_util(cpu, p, cpu);
+			target_perf = min(target_perf, cpu_actual_cap);
+
+			/* Needing a higher OPP means a higher cost */
+			if (target_perf > min_stat.max_perf)
+				continue;
 
 			/*
-			 * Skip CPUs that cannot satisfy the capacity request.
-			 * IOW, placing the task there would make the CPU
-			 * overutilized. Take uclamp into account to see how
-			 * much capacity we can get out of the CPU; this is
-			 * aligned with sched_cpu_util().
+			 * At this point, target's cost can be either equal or
+			 * lower than the current minimum cost.
 			 */
-			if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
-				/*
-				 * Open code uclamp_rq_util_with() except for
-				 * the clamp() part. I.e.: apply max aggregation
-				 * only. util_fits_cpu() logic requires to
-				 * operate on non clamped util but must use the
-				 * max-aggregated uclamp_{min, max}.
-				 */
-				rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
-				rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
 
-				util_min = max(rq_util_min, p_util_min);
-				util_max = max(rq_util_max, p_util_max);
-			}
+			/* Gather more statistics */
+			target_stat.cpu = cpu;
+			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
+			target_stat.capa = capacity_of(cpu);
+			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_running;
 
-			fits = util_fits_cpu(util, util_min, util_max, cpu);
-			if (!fits)
+			/* If the target needs a lower OPP, then look up
+			 * the corresponding OPP and its associated cost.
+			 * Otherwise, at the same cost level, select the CPU
+			 * which provides the best performance.
+			 */
+			if (target_perf < min_stat.min_perf)
+				find_pd_cost(pd->em_pd, target_perf, &target_stat);
+			else if (!select_best_cpu(&target_stat, &min_stat, prev_cpu, sd))
 				continue;
 
-			lsub_positive(&cpu_cap, util);
-
-			if (cpu == prev_cpu) {
-				/* Always use prev_cpu as a candidate. */
-				prev_spare_cap = cpu_cap;
-				prev_fits = fits;
-			} else if ((fits > max_fits) ||
-				   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
-				/*
-				 * Find the CPU with the maximum spare capacity
-				 * among the remaining CPUs in the performance
-				 * domain.
-				 */
-				max_spare_cap = cpu_cap;
-				max_spare_cap_cpu = cpu;
-				max_fits = fits;
-			}
+			/* Save the new most efficient CPU of the PD */
+			min_stat = target_stat;
 		}
 
-		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+		if (min_stat.cpu == -1)
 			continue;
 
-		eenv_pd_busy_time(&eenv, cpus, p);
-		/* Compute the 'base' energy of the pd, without @p */
-		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
+		if (min_stat.fits < best_fits)
+			continue;
 
-		/* Evaluate the energy impact of using prev_cpu. */
-		if (prev_spare_cap > -1) {
-			prev_delta = compute_energy(&eenv, pd, cpus, p,
-						    prev_cpu);
-			/* CPU utilization has changed */
-			if (prev_delta < base_energy)
-				goto unlock;
-			prev_delta -= base_energy;
-			prev_actual_cap = cpu_actual_cap;
-			best_delta = min(best_delta, prev_delta);
-		}
+		/* Idle system costs nothing */
+		target_stat.max_perf = 0;
+		target_stat.cost = 0;
 
-		/* Evaluate the energy impact of using max_spare_cap_cpu. */
-		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
-			/* Current best energy cpu fits better */
-			if (max_fits < best_fits)
-				continue;
+		/* Estimate utilization and cost without p */
+		for_each_cpu(cpu, cpus) {
+			unsigned long target_util;
 
-			/*
-			 * Both don't fit performance hint (i.e. uclamp_min)
-			 * but best energy cpu has better capacity.
-			 */
-			if ((max_fits < 0) &&
-			    (cpu_actual_cap <= best_actual_cap))
-				continue;
+			/* Accumulate actual utilization w/o task p */
+			pd_actual_util += find_cpu_actual_util(cpu, p);
 
-			cur_delta = compute_energy(&eenv, pd, cpus, p,
-						   max_spare_cap_cpu);
-			/* CPU utilization has changed */
-			if (cur_delta < base_energy)
-				goto unlock;
-			cur_delta -= base_energy;
+			/* Get the max utilization of the CPU w/o task p */
+			target_util = find_cpu_max_util(cpu, p, -1);
+			target_util = min(target_util, cpu_actual_cap);
 
-			/*
-			 * Both fit for the task but best energy cpu has lower
-			 * energy impact.
-			 */
-			if ((max_fits > 0) && (best_fits > 0) &&
-			    (cur_delta >= best_delta))
+			/* Current OPP is enough */
+			if (target_util <= target_stat.max_perf)
 				continue;
 
-			best_delta = cur_delta;
-			best_energy_cpu = max_spare_cap_cpu;
-			best_fits = max_fits;
-			best_actual_cap = cpu_actual_cap;
+			/* Compute and save the cost of the OPP */
+			find_pd_cost(pd->em_pd, target_util, &target_stat);
+			max_cost = target_stat.cost;
 		}
-	}
-	rcu_read_unlock();
 
-	if ((best_fits > prev_fits) ||
-	    ((best_fits > 0) && (best_delta < prev_delta)) ||
-	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
-		target = best_energy_cpu;
+		/* Add the NRG cost of p */
+		delta_nrg = task_util * min_stat.cost;
 
-	return target;
+		/* Compute the NRG cost of others running at higher OPP because of p */
+		if (min_stat.cost > max_cost)
+			delta_nrg += pd_actual_util * (min_stat.cost - max_cost);
+
+		/* nrg with p */
+		trace_sched_compute_energy_tp(p, min_stat.cpu, delta_nrg,
+				min_stat.max_perf, pd_actual_util + task_util);
+
+		/*
+		 * The probability that two delta NRGs are equal is almost zero. PDs being
+		 * sorted by max capacity, keep the one with the highest max capacity if
+		 * this happens.
+		 * TODO: add a margin in nrg cost and take into account other stats
+		 */
+		if ((min_stat.fits == best_fits) &&
+		    (delta_nrg >= best_nrg))
+			continue;
+
+		best_fits = min_stat.fits;
+		best_nrg = delta_nrg;
+		best_cpu = min_stat.cpu;
+	}
 
 unlock:
 	rcu_read_unlock();
 
+	if (best_cpu >= 0)
+		target = best_cpu;
+
 	return target;
 }
 
-- 
2.34.1
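
As a reading aid, the per-PD energy delta computed at the end of the loop in
find_energy_efficient_cpu() can be modelled outside the kernel. This is a
minimal sketch, assuming plain unsigned long arithmetic and no overflow
handling. struct pd_estimate and pd_delta_energy() are hypothetical names;
the fields mirror task_util, pd_actual_util, max_cost and min_stat.cost from
the patch above.

```c
#include <assert.h>

/* Hypothetical container for the per-PD values used in the patch. */
struct pd_estimate {
	unsigned long task_util;      /* busy time of the waking task */
	unsigned long pd_actual_util; /* summed CPU utilization of the PD without the task */
	unsigned long cost_without;   /* cost of the OPP needed without the task (max_cost) */
	unsigned long cost_with;      /* cost of the OPP needed with the task (min_stat.cost) */
};

/*
 * Energy delta of placing the task in this PD: the task itself is charged
 * at the new OPP cost and, if the OPP rises, the already running load is
 * charged the cost difference.
 */
static unsigned long pd_delta_energy(const struct pd_estimate *e)
{
	unsigned long delta;

	assert(e);

	delta = e->task_util * e->cost_with;
	if (e->cost_with > e->cost_without)
		delta += e->pd_actual_util * (e->cost_with - e->cost_without);

	return delta;
}
```

With task_util = 100, pd_actual_util = 400 and a cost going from 2 to 3, the
delta is 100 * 3 + 400 * (3 - 2) = 700. If the OPP does not change (cost 2
before and after), only the task's own 100 * 2 = 200 is charged.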


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-08-30 13:03 ` [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
@ 2024-09-02  9:11   ` kernel test robot
  2024-09-02 11:03   ` Hongyan Xia
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 62+ messages in thread
From: kernel test robot @ 2024-09-02  9:11 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: llvm, oe-kbuild-all, qyousef, hongyan.xia2, Vincent Guittot

Hi Vincent,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on peterz-queue/sched/core linus/master v6.11-rc6 next-20240830]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vincent-Guittot/sched-fair-Filter-false-overloaded_group-case-for-EAS/20240830-210826
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20240830130309.2141697-4-vincent.guittot%40linaro.org
patch subject: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
config: s390-allnoconfig (https://download.01.org/0day-ci/archive/20240902/202409021606.CwIU0HB8-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 6f682c26b04f0b349c4c473756cb8625b4f37c6d)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240902/202409021606.CwIU0HB8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409021606.CwIU0HB8-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/sched/fair.c:23:
   In file included from include/linux/energy_model.h:5:
   In file included from include/linux/device.h:32:
   In file included from include/linux/device/driver.h:21:
   In file included from include/linux/module.h:19:
   In file included from include/linux/elf.h:6:
   In file included from arch/s390/include/asm/elf.h:181:
   In file included from arch/s390/include/asm/mmu_context.h:11:
   In file included from arch/s390/include/asm/pgalloc.h:18:
   In file included from include/linux/mm.h:2228:
   include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     514 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   In file included from kernel/sched/fair.c:38:
   In file included from include/linux/sched/isolation.h:7:
   In file included from include/linux/tick.h:8:
   In file included from include/linux/clockchips.h:14:
   In file included from include/linux/clocksource.h:22:
   In file included from arch/s390/include/asm/io.h:93:
   include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     548 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     561 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:37:59: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
         |                                                           ^
   include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
     102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
         |                                                      ^
   In file included from kernel/sched/fair.c:38:
   In file included from include/linux/sched/isolation.h:7:
   In file included from include/linux/tick.h:8:
   In file included from include/linux/clockchips.h:14:
   In file included from include/linux/clocksource.h:22:
   In file included from arch/s390/include/asm/io.h:93:
   include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     574 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:35:59: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
         |                                                           ^
   include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
     115 | #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
         |                                                      ^
   In file included from kernel/sched/fair.c:38:
   In file included from include/linux/sched/isolation.h:7:
   In file included from include/linux/tick.h:8:
   In file included from include/linux/clockchips.h:14:
   In file included from include/linux/clocksource.h:22:
   In file included from arch/s390/include/asm/io.h:93:
   include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     585 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     595 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     605 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     693 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     701 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     709 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     718 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     727 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     736 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
>> kernel/sched/fair.c:8183:6: error: call to undeclared function 'em_pd_get_efficient_state'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    8183 |         i = em_pd_get_efficient_state(em_table->state, pd->nr_perf_states,
         |             ^
>> kernel/sched/fair.c:8190:6: error: call to undeclared function 'em_pd_get_previous_state'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    8190 |         i = em_pd_get_previous_state(em_table->state, pd->nr_perf_states,
         |             ^
   13 warnings and 2 errors generated.
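
A likely reading of the two errors, stated here as an assumption rather than
a confirmed diagnosis: the helpers are only declared when CONFIG_ENERGY_MODEL
is set, and s390-allnoconfig leaves it unset. One common way to handle this
is a pair of static inline fallbacks. The sketch below is a self-contained
model, not kernel code: struct em_perf_state is a simplified stand-in, the
parameter types are inferred from the call sites in find_pd_cost(), the
return values of the stubs are hypothetical, and previous_perf() is a
hypothetical helper.

```c
#include <assert.h>

/* Simplified stand-in for the kernel structure (assumption). */
struct em_perf_state {
	unsigned long performance;
	unsigned long cost;
};

/* CONFIG_ENERGY_MODEL is deliberately left undefined, as in allnoconfig. */
#ifndef CONFIG_ENERGY_MODEL
/* Hypothetical fallback: always report the first performance state. */
static inline int em_pd_get_efficient_state(struct em_perf_state *table,
					    int nr_perf_states,
					    unsigned long max_util,
					    unsigned long pd_flags)
{
	(void)table; (void)nr_perf_states; (void)max_util; (void)pd_flags;
	return 0;
}

/* Hypothetical fallback: report that there is no previous state. */
static inline int em_pd_get_previous_state(struct em_perf_state *table,
					   int nr_perf_states, int idx,
					   unsigned long pd_flags)
{
	(void)table; (void)nr_perf_states; (void)idx; (void)pd_flags;
	return -1;
}
#endif

/* Mirrors the min_perf handling in find_pd_cost(): negative index means 0. */
static unsigned long previous_perf(struct em_perf_state *table,
				   int nr_perf_states, int idx)
{
	int prev;

	assert(idx >= 0 && idx < nr_perf_states);

	prev = em_pd_get_previous_state(table, nr_perf_states, idx, 0);
	return prev < 0 ? 0 : table[prev].performance;
}
```

With these fallbacks the caller compiles without the Energy Model and
previous_perf() returns 0, the same value find_pd_cost() stores in min_perf
when no previous state exists.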


vim +/em_pd_get_efficient_state +8183 kernel/sched/fair.c

  8168	
  8169	/* Find the cost of a performance domain for the estimated utilization */
  8170	static inline void find_pd_cost(struct em_perf_domain *pd,
  8171					unsigned long max_util,
  8172					struct energy_cpu_stat *stat)
  8173	{
  8174		struct em_perf_table *em_table;
  8175		struct em_perf_state *ps;
  8176		int i;
  8177	
  8178		/*
  8179		 * Find the lowest performance state of the Energy Model above the
  8180		 * requested performance.
  8181		 */
  8182		em_table = rcu_dereference(pd->em_table);
> 8183		i = em_pd_get_efficient_state(em_table->state, pd->nr_perf_states,
  8184					      max_util, pd->flags);
  8185		ps = &em_table->state[i];
  8186	
  8187		/* Save the cost and performance range of the OPP */
  8188		stat->max_perf = ps->performance;
  8189		stat->cost = ps->cost;
> 8190		i = em_pd_get_previous_state(em_table->state, pd->nr_perf_states,
  8191					      i, pd->flags);
  8192		if (i < 0)
  8193			stat->min_perf = 0;
  8194		else {
  8195			ps = &em_table->state[i];
  8196			stat->min_perf = ps->performance;
  8197		}
  8198	}
  8199	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-08-30 13:03 ` [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
  2024-09-02  9:11   ` kernel test robot
@ 2024-09-02 11:03   ` Hongyan Xia
  2024-09-06  7:08     ` Vincent Guittot
  2024-09-04 15:07   ` Pierre Gondois
  2024-09-11 14:02   ` Pierre Gondois
  3 siblings, 1 reply; 62+ messages in thread
From: Hongyan Xia @ 2024-09-02 11:03 UTC (permalink / raw)
  To: Vincent Guittot, linux-kernel
  Cc: qyousef, mingo, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, vschneid, lukasz.luba, mgorman, rafael.j.wysocki

On 30/08/2024 14:03, Vincent Guittot wrote:
> feec() looks for the CPU with highest spare capacity in a PD assuming that
> it will be the best CPU from a energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is true generally
> speaking, this policy also filters some others CPUs which will be as
> efficients because of using the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
> 
> Rework feec() to look 1st for the lowest cost in a PD and then the most
> performant CPU between CPUs.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>   1 file changed, 244 insertions(+), 222 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e67d6029b269..2273eecf6086 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> [...]
>   
> -	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> +/* For the same cost, select the CPU that will provide the best performance for the task */
> +static bool select_best_cpu(struct energy_cpu_stat *target,
> +			    struct energy_cpu_stat *min,
> +			    int prev, struct sched_domain *sd)
> +{
> +	/*  Select the one with the least number of running tasks */
> +	if (target->nr_running < min->nr_running)
> +		return true;
> +	if (target->nr_running > min->nr_running)
> +		return false;
>   
This makes me a bit worried about systems with coarse-grained OPPs. All
my dev boards and one of my old phones have <= 3 OPPs. On my Juno board,
the lowest OPP on the big core spans 512 utilization, half of the full
capacity. Assume a scenario with 4 tasks whose utilizations are 300, 100,
100 and 100. The placement should be the 300 task on one core and the
three 100 tasks on another, but the nr_running check here would put 2
tasks (300 + 100) on one CPU and 2 tasks (100 + 100) on another, because
both CPUs are still under the lowest OPP on Juno. The second CPU will
also finish faster and idle more than the first one.

To give an extreme example, assuming the system has only one OPP (such a 
system is dumb to begin with, but just to make a point), before this 
patch EAS would still work okay in task placement, but after this patch, 
EAS would just balance on the number of tasks, regardless of utilization 
of tasks on wake-up.

I wonder if there is a way to still take total utilization as a factor. 
It used to be 100% of the decision making, but maybe now it is only 60%, 
and the other 40% are things like number of tasks and contention.
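
The 300/100/100/100 scenario can be sketched as a toy model. This is not
kernel code and rests on several assumptions: two CPUs, tasks waking in the
order listed, every placement staying under a single OPP that covers 512
utilization, and ties going to the lower CPU index (the patch favours the
previous CPU and then runnable instead). All names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS        2
#define LOWEST_OPP_CAP 512UL /* utilization covered by the lowest OPP */

struct cpu_state {
	unsigned long util;      /* summed utilization of placed tasks */
	unsigned int nr_running; /* number of placed tasks */
};

enum policy { POLICY_NR_RUNNING, POLICY_SPARE_CAPACITY };

/* Pick a CPU; with strict comparisons, ties go to the lower index. */
static int pick_cpu(const struct cpu_state *cpus, enum policy policy)
{
	int best = 0;

	for (int cpu = 1; cpu < NR_CPUS; cpu++) {
		if (policy == POLICY_NR_RUNNING) {
			/* Same OPP cost everywhere: fewest running tasks wins. */
			if (cpus[cpu].nr_running < cpus[best].nr_running)
				best = cpu;
		} else if (cpus[cpu].util < cpus[best].util) {
			/* Equal capacities: lowest util means most spare capacity. */
			best = cpu;
		}
	}
	return best;
}

static void place_tasks(struct cpu_state *cpus, const unsigned long *task_util,
			size_t nr_tasks, enum policy policy)
{
	for (size_t i = 0; i < nr_tasks; i++) {
		int cpu = pick_cpu(cpus, policy);

		cpus[cpu].util += task_util[i];
		cpus[cpu].nr_running++;
		/* Premise of the model: nobody leaves the lowest OPP. */
		assert(cpus[cpu].util <= LOWEST_OPP_CAP);
	}
}
```

Under these assumptions the nr_running policy ends with 400 and 200
utilization (two tasks per CPU), while the spare-capacity policy ends with
300 on each CPU (one task and three tasks).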

> -	trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> +	/* Favor previous CPU otherwise */
> +	if (target->cpu == prev)
> +		return true;
> +	if (min->cpu == prev)
> +		return false;
>   
> -	return energy;
> +	/*
> +	 * Choose CPU with lowest contention. One might want to consider load instead of
> +	 * runnable but we are supposed to not be overutilized so there is enough compute
> +	 * capacity for everybody.
> +	 */
> +	if ((target->runnable * min->capa * sd->imbalance_pct) >=
> +			(min->runnable * target->capa * 100))
> +		return false;
> +
> +	return true;
>   }
> [...]


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-09-02 11:03   ` Hongyan Xia
@ 2024-09-06  7:08     ` Vincent Guittot
  2024-09-06 15:32       ` Hongyan Xia
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-09-06  7:08 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: linux-kernel, qyousef, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, vschneid, lukasz.luba,
	mgorman, rafael.j.wysocki

On Mon, 2 Sept 2024 at 13:03, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> On 30/08/2024 14:03, Vincent Guittot wrote:
> > feec() looks for the CPU with highest spare capacity in a PD assuming that
> > it will be the best CPU from a energy efficiency PoV because it will
> > require the smallest increase of OPP. Although this is true generally
> > speaking, this policy also filters some others CPUs which will be as
> > efficients because of using the same OPP.
> > In fact, we really care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result using the same energy cost. In
> > these cases, we can use other metrics to select the best CPU for the same
> > energy cost.
> >
> > Rework feec() to look 1st for the lowest cost in a PD and then the most
> > performant CPU between CPUs.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> >   1 file changed, 244 insertions(+), 222 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e67d6029b269..2273eecf6086 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > [...]
> >
> > -     energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> > +/* For a same cost, select the CPU that will povide best performance for the task */
> > +static bool select_best_cpu(struct energy_cpu_stat *target,
> > +                         struct energy_cpu_stat *min,
> > +                         int prev, struct sched_domain *sd)
> > +{
> > +     /*  Select the one with the least number of running tasks */
> > +     if (target->nr_running < min->nr_running)
> > +             return true;
> > +     if (target->nr_running > min->nr_running)
> > +             return false;
> >
> This makes me a bit worried about systems with coarse-grained OPPs. All
> my dev boards and one of my old phones have <= 3 OPPs. On my Juno board,
> the lowest OPP on the big core spans across 512 utilization, half of the
> full capacity. Assuming a scenario where there are 4 tasks, each with
> 300, 100, 100, 100 utilization, the placement should be 300 on one core
> and 3 tasks with 100 on another, but the nr_running check here would
> give 2 tasks (300 + 100) on one CPU and 2 tasks (100 + 100) on another
> because they are still under the lowest OPP on Juno. The second CPU will
> also finish faster and idle more than the first one.

By balancing the number of tasks on each CPU, I try to minimize the
scheduling latency. In your case above, a task will wait for no more
than one slice before running, whereas it might have to wait up to 2
slices if I put all the (100 utilization) tasks on the same CPU.
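
The slice argument above can be sketched as follows. This is only an
illustrative model, assuming equal slices and that every other runnable
task on the CPU runs for one full slice first; worst_case_wait_slices()
is a hypothetical helper, not a kernel function.

```c
/*
 * Hypothetical helper (not kernel code): worst-case number of slices a
 * task waits before running on a CPU that holds nr_running runnable
 * tasks in total, the task itself included. Assumes equal slices and
 * that each other task consumes one full slice first.
 */
static unsigned int worst_case_wait_slices(unsigned int nr_running)
{
	return nr_running ? nr_running - 1 : 0;
}
```

With two tasks per CPU (300 + 100 and 100 + 100) this gives one slice;
with the three 100 tasks stacked on one CPU it gives two.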

>
> To give an extreme example, assuming the system has only one OPP (such a
> system is dumb to begin with, but just to make a point), before this
> patch EAS would still work okay in task placement, but after this patch,

Not sure what you mean by "would still work okay". Do you have an
example in mind that would not work correctly?

> EAS would just balance on the number of tasks, regardless of utilization
> of tasks on wake-up.

You have to keep in mind that utilization is already taken into
account to check whether the task fits the CPU and to select the OPP
(which is a no-op in the case of one OPP). So we know that there is
enough capacity for the waking task.

>
> I wonder if there is a way to still take total utilization as a factor.

Utilization is still used to check that the task utilization fits with
the current CPU utilization and then to select the OPP. At this step we
know that there is enough capacity for everybody.

> It used to be 100% of the decision making, but maybe now it is only 60%,
> and the other 40% are things like number of tasks and contention.
>
> > -     trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> > +     /* Favor previous CPU otherwise */
> > +     if (target->cpu == prev)
> > +             return true;
> > +     if (min->cpu == prev)
> > +             return false;
> >
> > -     return energy;
> > +     /*
> > +      * Choose CPU with lowest contention. One might want to consider load instead of
> > +      * runnable but we are supposed to not be overutilized so there is enough compute
> > +      * capacity for everybody.
> > +      */
> > +     if ((target->runnable * min->capa * sd->imbalance_pct) >=
> > +                     (min->runnable * target->capa * 100))
> > +             return false;
> > +
> > +     return true;
> >   }
> > [...]
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-09-06  7:08     ` Vincent Guittot
@ 2024-09-06 15:32       ` Hongyan Xia
  2024-09-12 12:12         ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Hongyan Xia @ 2024-09-06 15:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, qyousef, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, vschneid, lukasz.luba,
	mgorman, rafael.j.wysocki

On 06/09/2024 08:08, Vincent Guittot wrote:
> On Mon, 2 Sept 2024 at 13:03, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>
>> On 30/08/2024 14:03, Vincent Guittot wrote:
>>> feec() looks for the CPU with highest spare capacity in a PD assuming that
>>> it will be the best CPU from a energy efficiency PoV because it will
>>> require the smallest increase of OPP. Although this is true generally
>>> speaking, this policy also filters some others CPUs which will be as
>>> efficients because of using the same OPP.
>>> In fact, we really care about the cost of the new OPP that will be
>>> selected to handle the waking task. In many cases, several CPUs will end
>>> up selecting the same OPP and as a result using the same energy cost. In
>>> these cases, we can use other metrics to select the best CPU for the same
>>> energy cost.
>>>
>>> Rework feec() to look 1st for the lowest cost in a PD and then the most
>>> performant CPU between CPUs.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>    kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>>>    1 file changed, 244 insertions(+), 222 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e67d6029b269..2273eecf6086 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> [...]
>>>
>>> -     energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
>>> +/* For a same cost, select the CPU that will povide best performance for the task */
>>> +static bool select_best_cpu(struct energy_cpu_stat *target,
>>> +                         struct energy_cpu_stat *min,
>>> +                         int prev, struct sched_domain *sd)
>>> +{
>>> +     /*  Select the one with the least number of running tasks */
>>> +     if (target->nr_running < min->nr_running)
>>> +             return true;
>>> +     if (target->nr_running > min->nr_running)
>>> +             return false;
>>>
>> This makes me a bit worried about systems with coarse-grained OPPs. All
>> my dev boards and one of my old phones have <= 3 OPPs. On my Juno board,
>> the lowest OPP on the big core spans across 512 utilization, half of the
>> full capacity. Assuming a scenario where there are 4 tasks, each with
>> 300, 100, 100, 100 utilization, the placement should be 300 on one core
>> and 3 tasks with 100 on another, but the nr_running check here would
>> give 2 tasks (300 + 100) on one CPU and 2 tasks (100 + 100) on another
>> because they are still under the lowest OPP on Juno. The second CPU will
>> also finish faster and idle more than the first one.
> 
> By balancing the number of tasks on each cpu, I try to minimize the
> scheduling latency. In your case above, tasks will wait for no more
> than a slice before running whereas it might have to wait up to 2
> slices if I put all the (100 utilization) tasks on the same CPU.

Viewed from another angle, we are now asking the 300 task (which
potentially has a heavier workload to finish) to compete with a 100
task, so one core finishes faster and the other takes longer,
making the overall execution time longer.

>>
>> To give an extreme example, assuming the system has only one OPP (such a
>> system is dumb to begin with, but just to make a point), before this
>> patch EAS would still work okay in task placement, but after this patch,
> 
> Not sure what you mean by would still work okay. Do you have an
> example in mind that would not work correctly ?

With only one OPP, this patch will balance task placement purely on the 
number of tasks without considering utilization, and I don't think 
that's entirely acceptable (I actually need to deal with such a device 
with only one OPP in real life, although that's the fault of that 
device). Before, we were still balancing on total utilization, which
results in the lowest execution time.

> 
>> EAS would just balance on the number of tasks, regardless of utilization
>> of tasks on wake-up.
> 
> You have to keep in mind that utilization is already taken into
> account to check if the task fits the CPU and by selecting the OPP
> (which is a nope in case of one OPP). So we know that there is enough
> capacity for the waking task

Still, take my Juno board as an example, where the 1st OPP is at
utilization 512. Assuming no 25% margin, four tasks with utilization
200, 200, 50, 50 and two CPUs, I would strongly favor 200 + 50 on one
CPU and the same on the other over 200 + 200 on one and 50 + 50 on the
other. However, with this patch, these two scheduling decisions are the
same, as long as both are under the 512 OPP.

Of course, this becomes less of a problem with fine-grained OPPs. On my 
Pixel 6 with 18 OPPs on one CPU, I don't have such concerns.
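
The coarse-OPP point can be illustrated with a small lookup. This is a
sketch under stated assumptions: the two-entry table below is
hypothetical (only the 512 figure and the 1024 full capacity come from
the Juno description above), the 25% margin is ignored as in the
example, and opp_capa_for_util() is not a kernel function.

```c
#include <stddef.h>

/* Hypothetical OPP table in capacity units; only 512 is from the thread. */
static const unsigned long opp_capa[] = { 512, 1024 };

/* Return the capacity of the lowest OPP able to serve @util. */
static unsigned long opp_capa_for_util(unsigned long util)
{
	size_t n = sizeof(opp_capa) / sizeof(opp_capa[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (util <= opp_capa[i])
			return opp_capa[i];
	}

	/* Above the table: saturate at the highest OPP. */
	return opp_capa[n - 1];
}
```

Both placements map to the same OPP: 200 + 50 = 250 on each CPU selects
512, and so do 200 + 200 = 400 and 50 + 50 = 100. A cost-only
comparison therefore cannot tell the two placements apart.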

>>
>> I wonder if there is a way to still take total utilization as a factor.
> 
> utilization is still used to check that the task utilization fits with
> current cpu utilization and then to select the OPP. At this step we
> know that there is enough capacity for everybody
> 
>> It used to be 100% of the decision making, but maybe now it is only 60%,
>> and the other 40% are things like number of tasks and contention.
>>
>>> -     trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
>>> +     /* Favor previous CPU otherwise */
>>> +     if (target->cpu == prev)
>>> +             return true;
>>> +     if (min->cpu == prev)
>>> +             return false;
>>>
>>> -     return energy;
>>> +     /*
>>> +      * Choose CPU with lowest contention. One might want to consider load instead of
>>> +      * runnable but we are supposed to not be overutilized so there is enough compute
>>> +      * capacity for everybody.
>>> +      */
>>> +     if ((target->runnable * min->capa * sd->imbalance_pct) >=
>>> +                     (min->runnable * target->capa * 100))
>>> +             return false;
>>> +
>>> +     return true;
>>>    }
>>> [...]
>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-09-06 15:32       ` Hongyan Xia
@ 2024-09-12 12:12         ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-12 12:12 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: linux-kernel, qyousef, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, vschneid, lukasz.luba,
	mgorman, rafael.j.wysocki

On Fri, 6 Sept 2024 at 17:32, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> On 06/09/2024 08:08, Vincent Guittot wrote:
> > On Mon, 2 Sept 2024 at 13:03, Hongyan Xia <hongyan.xia2@arm.com> wrote:
> >>
> >> On 30/08/2024 14:03, Vincent Guittot wrote:
> >>> feec() looks for the CPU with highest spare capacity in a PD assuming that
> >>> it will be the best CPU from a energy efficiency PoV because it will
> >>> require the smallest increase of OPP. Although this is true generally
> >>> speaking, this policy also filters some others CPUs which will be as
> >>> efficients because of using the same OPP.
> >>> In fact, we really care about the cost of the new OPP that will be
> >>> selected to handle the waking task. In many cases, several CPUs will end
> >>> up selecting the same OPP and as a result using the same energy cost. In
> >>> these cases, we can use other metrics to select the best CPU for the same
> >>> energy cost.
> >>>
> >>> Rework feec() to look 1st for the lowest cost in a PD and then the most
> >>> performant CPU between CPUs.
> >>>
> >>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>> ---
> >>>    kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> >>>    1 file changed, 244 insertions(+), 222 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index e67d6029b269..2273eecf6086 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> [...]
> >>>
> >>> -     energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> >>> +/* For a same cost, select the CPU that will povide best performance for the task */
> >>> +static bool select_best_cpu(struct energy_cpu_stat *target,
> >>> +                         struct energy_cpu_stat *min,
> >>> +                         int prev, struct sched_domain *sd)
> >>> +{
> >>> +     /*  Select the one with the least number of running tasks */
> >>> +     if (target->nr_running < min->nr_running)
> >>> +             return true;
> >>> +     if (target->nr_running > min->nr_running)
> >>> +             return false;
> >>>
> >> This makes me a bit worried about systems with coarse-grained OPPs. All
> >> my dev boards and one of my old phones have <= 3 OPPs. On my Juno board,
> >> the lowest OPP on the big core spans across 512 utilization, half of the
> >> full capacity. Assuming a scenario where there are 4 tasks, each with
> >> 300, 100, 100, 100 utilization, the placement should be 300 on one core
> >> and 3 tasks with 100 on another, but the nr_running check here would
> >> give 2 tasks (300 + 100) on one CPU and 2 tasks (100 + 100) on another
> >> because they are still under the lowest OPP on Juno. The second CPU will
> >> also finish faster and idle more than the first one.
> >
> > By balancing the number of tasks on each cpu, I try to minimize the
> > scheduling latency. In your case above, tasks will wait for no more
> > than a slice before running whereas it might have to wait up to 2
> > slices if I put all the (100 utilization) tasks on the same CPU.
>
> If viewed in another angle, we are now asking the 300 task (which
> potentially has a heavier workload to finish) to compete with a 100
> task, and now one core finishes faster and the other takes longer time,
> making the overall execution time longer.

The main problem with utilization is that it also reflects the recent
past, and it can screw up task placement as well, as I presented at
OSPM. Imagine that a long-running "400 task" just went back to sleep
before placing the "300 task" and the 3 "100 tasks": then you can end
up putting 3 tasks on one core as well.
The goal here is to optimize scheduling latency, which is a problem
that has never really been taken into account so far: several tasks
being stacked on the same CPU increases the scheduling latency. A next
step after this patchset will be to take into account the sched slice
in addition to the number of tasks, to optimize the scheduling latency
of some tasks. The fact that a CPU will run longer should be taken into
account in the energy model when we compute the energy cost, which is
not the case for now because of the complexity of knowing when CPUs
will really be idle and which idle state will be selected.

>
> >>
> >> To give an extreme example, assuming the system has only one OPP (such a
> >> system is dumb to begin with, but just to make a point), before this
> >> patch EAS would still work okay in task placement, but after this patch,
> >
> > Not sure what you mean by would still work okay. Do you have an
> > example in mind that would not work correctly ?
>
> With only one OPP, this patch will balance task placement purely on the
> number of tasks without considering utilization, and I don't think
> that's entirely acceptable (I actually need to deal with such a device
> with only one OPP in real life, although that's the fault of that
> device). Before, we are still balancing on total utilization, which
> results in the lowest execution time.
>
> >
> >> EAS would just balance on the number of tasks, regardless of utilization
> >> of tasks on wake-up.
> >
> > You have to keep in mind that utilization is already taken into
> > account to check if the task fits the CPU and by selecting the OPP
> > (which is a nope in case of one OPP). So we know that there is enough
> > capacity for the waking task
>
> Still, taking my Juno board as an example where the 1st OPP is at
> utilization 512. Assuming no 25% margin, four tasks with utilization
> 200, 200, 50, 50 and two CPUs, I would strongly favor 200 + 50 on one
> CPU and same on the other, than 200 + 200 on one, 50 + 50 on the other.
> However, with this patch, these two scheduling decisions are the same,
> as long as both are under the 512 OPP.

The runnable avg test should handle this: when there is the same
number of tasks on both CPUs, we select the one with the lowest
contention, so one 200 task should end up on each CPU.
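
A standalone model of the tie-break order in the quoted
select_best_cpu() hunk may help to follow this. It is a sketch only:
struct cpu_stat is a hypothetical stand-in for struct energy_cpu_stat,
imbalance_pct is passed directly instead of coming from the
sched_domain, and the value 117 used below is an illustrative
assumption. Runnable is assumed to be close to the sum of the task
utilizations already on each CPU.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the patch's struct energy_cpu_stat. */
struct cpu_stat {
	int cpu;
	unsigned long nr_running;
	unsigned long runnable;
	unsigned long capa;
};

/* Return true if @target should replace @min as the best candidate. */
static bool prefer_target(const struct cpu_stat *target,
			  const struct cpu_stat *min,
			  int prev, unsigned long imbalance_pct)
{
	/* 1) Fewest running tasks wins. */
	if (target->nr_running < min->nr_running)
		return true;
	if (target->nr_running > min->nr_running)
		return false;

	/* 2) Otherwise favor the previous CPU. */
	if (target->cpu == prev)
		return true;
	if (min->cpu == prev)
		return false;

	/* 3) Otherwise pick the lowest capacity-scaled runnable. */
	if (target->runnable * min->capa * imbalance_pct >=
	    min->runnable * target->capa * 100)
		return false;

	return true;
}
```

With one 200 task on CPU0 and one 50 task on CPU1 (same nr_running,
same capacity of 1024, prev being neither CPU), the second 200 task
compares 50 * 1024 * 117 against 200 * 1024 * 100 and picks CPU1. Note
that the prev check comes before the runnable comparison, so a prev CPU
with equal nr_running is still favored regardless of contention.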


>
> Of course, this becomes less of a problem with fine-grained OPPs. On my
> Pixel 6 with 18 OPPs on one CPU, I don't have such concerns.
>
> >>
> >> I wonder if there is a way to still take total utilization as a factor.
> >
> > utilization is still used to check that the task utilization fits with
> > current cpu utilization and then to select the OPP. At this step we
> > know that there is enough capacity for everybody
> >
> >> It used to be 100% of the decision making, but maybe now it is only 60%,
> >> and the other 40% are things like number of tasks and contention.
> >>
> >>> -     trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> >>> +     /* Favor previous CPU otherwise */
> >>> +     if (target->cpu == prev)
> >>> +             return true;
> >>> +     if (min->cpu == prev)
> >>> +             return false;
> >>>
> >>> -     return energy;
> >>> +     /*
> >>> +      * Choose CPU with lowest contention. One might want to consider load instead of
> >>> +      * runnable but we are supposed to not be overutilized so there is enough compute
> >>> +      * capacity for everybody.
> >>> +      */
> >>> +     if ((target->runnable * min->capa * sd->imbalance_pct) >=
> >>> +                     (min->runnable * target->capa * 100))
> >>> +             return false;
> >>> +
> >>> +     return true;
> >>>    }
> >>> [...]
> >>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-08-30 13:03 ` [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
  2024-09-02  9:11   ` kernel test robot
  2024-09-02 11:03   ` Hongyan Xia
@ 2024-09-04 15:07   ` Pierre Gondois
  2024-09-06  7:08     ` Vincent Guittot
  2024-09-11 14:02   ` Pierre Gondois
  3 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-09-04 15:07 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2



On 8/30/24 15:03, Vincent Guittot wrote:
> feec() looks for the CPU with highest spare capacity in a PD assuming that
> it will be the best CPU from a energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is true generally
> speaking, this policy also filters some others CPUs which will be as
> efficients because of using the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
> 
> Rework feec() to look 1st for the lowest cost in a PD and then the most
> performant CPU between CPUs.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>   1 file changed, 244 insertions(+), 222 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e67d6029b269..2273eecf6086 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> -/*
> - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
> - * contribution is ignored.
> - */
> -static inline unsigned long
> -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> -	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> +/*Check if the CPU can handle the waking task */
> +static int check_cpu_with_task(struct task_struct *p, int cpu)
>   {
> -	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> -	unsigned long busy_time = eenv->pd_busy_time;
> -	unsigned long energy;
> +	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> +	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> +	unsigned long util_min = p_util_min;
> +	unsigned long util_max = p_util_max;
> +	unsigned long util = cpu_util(cpu, p, cpu, 0);
> +	struct rq *rq = cpu_rq(cpu);
>   
> -	if (dst_cpu >= 0)
> -		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> +	/*
> +	 * Skip CPUs that cannot satisfy the capacity request.
> +	 * IOW, placing the task there would make the CPU
> +	 * overutilized. Take uclamp into account to see how
> +	 * much capacity we can get out of the CPU; this is
> +	 * aligned with sched_cpu_util().
> +	 */
> +	if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> +		unsigned long rq_util_min, rq_util_max;
> +		/*
> +		 * Open code uclamp_rq_util_with() except for
> +		 * the clamp() part. I.e.: apply max aggregation
> +		 * only. util_fits_cpu() logic requires to
> +		 * operate on non clamped util but must use the
> +		 * max-aggregated uclamp_{min, max}.
> +		 */
> +		rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> +		rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> +		util_min = max(rq_util_min, p_util_min);
> +		util_max = max(rq_util_max, p_util_max);
> +	}
> +	return util_fits_cpu(util, util_min, util_max, cpu);
> +}
>   
> -	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);

I think em_cpu_energy() would need to be removed with this patch,
if there are no more references to it.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-09-04 15:07   ` Pierre Gondois
@ 2024-09-06  7:08     ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-06  7:08 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Wed, 4 Sept 2024 at 17:07, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
>
>
> On 8/30/24 15:03, Vincent Guittot wrote:
> > feec() looks for the CPU with highest spare capacity in a PD assuming that
> > it will be the best CPU from a energy efficiency PoV because it will
> > require the smallest increase of OPP. Although this is true generally
> > speaking, this policy also filters some others CPUs which will be as
> > efficients because of using the same OPP.
> > In fact, we really care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result using the same energy cost. In
> > these cases, we can use other metrics to select the best CPU for the same
> > energy cost.
> >
> > Rework feec() to look 1st for the lowest cost in a PD and then the most
> > performant CPU between CPUs.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> >   1 file changed, 244 insertions(+), 222 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e67d6029b269..2273eecf6086 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > -/*
> > - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> > - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
> > - * contribution is ignored.
> > - */
> > -static inline unsigned long
> > -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> > -            struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> > +/*Check if the CPU can handle the waking task */
> > +static int check_cpu_with_task(struct task_struct *p, int cpu)
> >   {
> > -     unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> > -     unsigned long busy_time = eenv->pd_busy_time;
> > -     unsigned long energy;
> > +     unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> > +     unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> > +     unsigned long util_min = p_util_min;
> > +     unsigned long util_max = p_util_max;
> > +     unsigned long util = cpu_util(cpu, p, cpu, 0);
> > +     struct rq *rq = cpu_rq(cpu);
> >
> > -     if (dst_cpu >= 0)
> > -             busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> > +     /*
> > +      * Skip CPUs that cannot satisfy the capacity request.
> > +      * IOW, placing the task there would make the CPU
> > +      * overutilized. Take uclamp into account to see how
> > +      * much capacity we can get out of the CPU; this is
> > +      * aligned with sched_cpu_util().
> > +      */
> > +     if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> > +             unsigned long rq_util_min, rq_util_max;
> > +             /*
> > +              * Open code uclamp_rq_util_with() except for
> > +              * the clamp() part. I.e.: apply max aggregation
> > +              * only. util_fits_cpu() logic requires to
> > +              * operate on non clamped util but must use the
> > +              * max-aggregated uclamp_{min, max}.
> > +              */
> > +             rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> > +             rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> > +             util_min = max(rq_util_min, p_util_min);
> > +             util_max = max(rq_util_max, p_util_max);
> > +     }
> > +     return util_fits_cpu(util, util_min, util_max, cpu);
> > +}
> >
> > -     energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
>
> I think em_cpu_energy() would need to be removed with this patch,
> if there are no more references to it.

Yes, I will add a patch to clean up the unused function.

>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-08-30 13:03 ` [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
                     ` (2 preceding siblings ...)
  2024-09-04 15:07   ` Pierre Gondois
@ 2024-09-11 14:02   ` Pierre Gondois
  2024-09-11 16:51     ` Pierre Gondois
  2024-09-12 12:22     ` Vincent Guittot
  3 siblings, 2 replies; 62+ messages in thread
From: Pierre Gondois @ 2024-09-11 14:02 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

Hello Vincent,

On 8/30/24 15:03, Vincent Guittot wrote:
> feec() looks for the CPU with highest spare capacity in a PD assuming that
> it will be the best CPU from a energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is true generally
> speaking, this policy also filters some others CPUs which will be as
> efficients because of using the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
> 
> Rework feec() to look 1st for the lowest cost in a PD and then the most
> performant CPU between CPUs.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>   1 file changed, 244 insertions(+), 222 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e67d6029b269..2273eecf6086 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[snip]

>   
> -/*
> - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
> - * contribution is ignored.
> - */
> -static inline unsigned long
> -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> -	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> +/*Check if the CPU can handle the waking task */
> +static int check_cpu_with_task(struct task_struct *p, int cpu)
>   {
> -	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> -	unsigned long busy_time = eenv->pd_busy_time;
> -	unsigned long energy;
> +	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> +	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> +	unsigned long util_min = p_util_min;
> +	unsigned long util_max = p_util_max;
> +	unsigned long util = cpu_util(cpu, p, cpu, 0);
> +	struct rq *rq = cpu_rq(cpu);
>   
> -	if (dst_cpu >= 0)
> -		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);

I think you should mention that the energy computation is not capped anymore.
With:
pd_util: sum of the CPUs' util for the pd considered, without task_util
pd_cap: sum of the CPUs' capacity for the pd

it used to be:

(a)
busy_time = min(pd_cap, pd_util);
prev_energy = busy_time * OPP[prev_max_util].cost;

busy_time = min(pd_cap, pd_util + task_util);
new_energy = busy_time * OPP[new_max_util].cost;

delta_energy = new_energy - prev_energy;

Note that when computing busy_time, task_util is not capped to one CPU's
max_cap. This means that in an empty pd, a task can 'steal' capacity from
CPUs during the energy computation.
Cf. [1]

and it is now:
(b)
delta_energy = task_util * OPP[new_max_util].cost;
delta_energy += pd_util * (OPP[new_max_util].cost - OPP[prev_max_util].cost);

Note that the busy_time (task_util + pd_util) is now not capped by anything.
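
The difference between (a) and (b) can be checked numerically with the
sketch below. The function names and the sample values are assumptions
made for illustration; the costs are passed in directly instead of being
looked up as OPP[max_util].cost.

```c
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* (a): busy_time is capped by pd_cap before being multiplied by the cost. */
static long delta_energy_capped(unsigned long pd_cap, unsigned long pd_util,
				unsigned long task_util,
				unsigned long prev_cost, unsigned long new_cost)
{
	unsigned long prev_energy = min_ul(pd_cap, pd_util) * prev_cost;
	unsigned long new_energy = min_ul(pd_cap, pd_util + task_util) * new_cost;

	return (long)new_energy - (long)prev_energy;
}

/* (b): task_util and pd_util are used as-is, nothing is capped. */
static long delta_energy_uncapped(unsigned long pd_util,
				  unsigned long task_util,
				  unsigned long prev_cost, unsigned long new_cost)
{
	return (long)(task_util * new_cost) +
	       (long)pd_util * ((long)new_cost - (long)prev_cost);
}
```

When pd_util + task_util stays below pd_cap, both forms agree: with
pd_cap = 2048, pd_util = 500, task_util = 300 and a cost going from 10
to 12, each gives 4600. Once the sum exceeds pd_cap they diverge: with
pd_util = 1800, task_util = 1000 and an unchanged cost of 10, (a) gives
2480 whereas (b) gives 10000.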

---

Not capping the task_util is discussed in [1][3] and [2] (at 21:15).
IIUC, UCLAMP_MAX tasks are the only case leveraging this. Indeed,
non-clamped tasks will not fit and be placed on a bigger CPU. Same for
UCLAMP_MIN tasks.
FWIU, not capping the utilization of tasks during the energy computation
makes it possible to reflect that a task will (likely) run longer. However,
the task's performance would likely decrease as the other tasks on the target
CPU are not taken into account (it is assumed the task to be placed will
receive the compute power it requires).

---
Case A:

Assuming we have an empty system with:
- 4 little CPUs (max_capa=512, first OPP as [capa=256, cost=10])
- 2 big CPUs (max_capa=1024, first OPP as [capa=512, cost=10])
i.e. the first OPP of all the CPUs consumes the same amount of energy.
And a task with: [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]

Then feec() would have no reason to prefer a big CPU over a little CPU,
even though the big CPU would provide more performance.

---
Case B:

(This is not especially related to this patch.)
Another case that might be problematic:
- 4 little CPUs (max_capa=512, first OPP as [capa=256])
- 2 big CPUs (max_capa=1024, first OPP as [capa=512])
- little CPUs consume less than big CPUs, but the highest OPP
   of the little CPUs consumes more than the lowest of the big CPUs.
And tasks:
- 3 tasks with [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
- 1 task with [UCLAMP_MIN=0, UCLAMP_MAX=1024, util = 50]

Then
- the 3 UCLAMP_MAX tasks will be placed on the little CPUs. Indeed,
   due to the last patch of the series, these tasks now have an opportunity
   to run feec() and be placed on a more energy efficient CPU.
- the 'normal' task will be placed on a big CPU. Indeed, placing
   it on a little CPU would raise the OPP of the little cluster.

This means that the 'normal' task is prevented from running on the remaining
little CPU even though:
- the little CPU can provide the compute capacity
- the little CPU would consume less energy

In other words, using UCLAMP_MAX on some tasks makes the system consume
more energy.

---

In my opinion, this last case comes from the difficulty of defining UCLAMP_MAX.
From sched-util-clamp.rst (about UCLAMP_MAX):
- Uclamp is a hinting mechanism that allows the scheduler to understand the
   performance requirements and restrictions of the tasks
- The right way to view util clamp is as a mechanism to make request or hint on
   performance constraints.
- some tasks should be restricted from consuming too
   much resources and should not go above a specific performance point.
-
Another example is in Android where tasks are classified as background,
foreground, top-app, etc. Util clamp can be used to constrain how much
resources background tasks are consuming by capping the performance point they
can run at. This constraint helps reserve resources for important tasks, like
the ones belonging to the currently active app (top-app group). Beside this
helps in limiting how much power they consume. This can be more obvious in
heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
background tasks to stay on the little cores which will ensure that:

         1. The big cores are free to run top-app tasks immediately. top-app
            tasks are the tasks the user is currently interacting with, hence
            the most important tasks in the system.
         2. They don't run on a power hungry core and drain battery even if they
            are CPU intensive tasks.
-
For example, it can be handy to limit performance when running low on battery
or when the system wants to limit access to more energy hungry performance
levels when it's in idle state or screen is off.

"""
This constraint helps reserve resources for important tasks, like
the ones belonging to the currently active app (top-app group).
"""
It doesn't seem that UCLAMP_MAX does this. This looks more like bandwidth
control.

"""
They don't run on a power hungry core and drain battery even if they
are CPU intensive tasks.
"""
Avoiding mid/big CPUs could be done with tasksets.

I can understand that one might want to run a task at a higher OPP due to
timing constraints for instance. However, I don't see why someone would want
to run a task at a lower OPP when considering only the performance and not
the energy consumption. It thus means that UCLAMP_MAX is an energy hint
rather than a performance hint.

UCLAMP_MAX could be set for a task to make it spend less energy, but the
loss in performance would be unknown.
A case Hongyan mentioned in his uclamp sum aggregation series [4] is that
an infinite task with [UCLAMP_MIN=0, UCLAMP_MAX=1, util = 1000] could fit
in a little CPU. The energy consumption would indeed be very low, but the
performance would also be very low.

With Hongyan's sum aggregation series [5]:
- case B would not happen as the 'normal' task would not raise the OPP of
   the whole cluster.
- the 'infinite UCLAMP_MAX tasks' case would not happen as each task would
   account for 1 util
- case A would still happen, but could be solved in any case.

I know Hongyan's patchset has already been discussed, but I still don't
understand why max aggregation is preferred over sum aggregation.
With sum aggregation, the definition of UCLAMP_MAX values seems clearer and
in effect results in a simpler implementation and fewer corner cases. In
simple words:
"When estimating the CPU frequency to set, for this task,
account for at most X util."
rather than:
"When estimating the CPU frequency to set, the task with the highest
UCLAMP_MAX of the CPU will cap the requested frequency."
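
The two sentences above can be sketched in C as follows. This is a
deliberately simplified model of one runqueue, not the kernel
implementation nor the code of Hongyan's series: the struct and function
names are hypothetical, and UCLAMP_MIN, PELT and CPU capacity are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a runnable task. */
struct task_sketch {
	unsigned long util;
	unsigned long uclamp_max;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Max aggregation: the utilization of all tasks is summed, then capped by
 * the highest UCLAMP_MAX found among the tasks of the CPU.
 */
static unsigned long freq_util_max_aggr(const struct task_sketch *tasks,
					size_t nr)
{
	unsigned long sum = 0, rq_uclamp_max = 0;
	size_t i;

	assert(tasks || nr == 0);

	for (i = 0; i < nr; i++) {
		sum += tasks[i].util;
		if (tasks[i].uclamp_max > rq_uclamp_max)
			rq_uclamp_max = tasks[i].uclamp_max;
	}

	return min_ul(sum, rq_uclamp_max);
}

/*
 * Sum aggregation: each task accounts for at most its own UCLAMP_MAX, and
 * the clamped contributions are summed.
 */
static unsigned long freq_util_sum_aggr(const struct task_sketch *tasks,
					size_t nr)
{
	unsigned long sum = 0;
	size_t i;

	assert(tasks || nr == 0);

	for (i = 0; i < nr; i++)
		sum += min_ul(tasks[i].util, tasks[i].uclamp_max);

	return sum;
}
```

For a CPU running one task with [UCLAMP_MAX=10, util = 1000] and one task
with [UCLAMP_MAX=1024, util = 50], this model requests 1024 with max
aggregation and 60 with sum aggregation.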

Note that actually I think it's a good idea to run feec() regularly
and to take into account other parameters like nr_running. I just think
that UCLAMP_MAX's max aggregation creates corner cases that are difficult
to solve altogether.

Thanks in advance for the time you will take answering,
Regards,
Pierre

[1] https://lore.kernel.org/all/20240606070645.3295-1-xuewen.yan@unisoc.com/
[2] https://www.youtube.com/watch?v=PHEBAyxeM_M
[3] https://lore.kernel.org/all/CAKfTPtDPCPYvCi1c_Nh+Cn01ZVS7E=tAHQeNX-mArBt3BXdjYw@mail.gmail.com/
[4] https://lore.kernel.org/all/b81a5b1c-14de-4232-bee9-ee647355dd8c@arm.com/
[5] https://lore.kernel.org/all/cover.1706792708.git.hongyan.xia2@arm.com/#t


* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-09-11 14:02   ` Pierre Gondois
@ 2024-09-11 16:51     ` Pierre Gondois
  2024-09-12 12:22     ` Vincent Guittot
  1 sibling, 0 replies; 62+ messages in thread
From: Pierre Gondois @ 2024-09-11 16:51 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

(edit)

On 9/11/24 16:02, Pierre Gondois wrote:
> Hello Vincent,
> 
> On 8/30/24 15:03, Vincent Guittot wrote:
>> feec() looks for the CPU with highest spare capacity in a PD assuming that
>> it will be the best CPU from a energy efficiency PoV because it will
>> require the smallest increase of OPP. Although this is true generally
>> speaking, this policy also filters some others CPUs which will be as
>> efficients because of using the same OPP.
>> In fact, we really care about the cost of the new OPP that will be
>> selected to handle the waking task. In many cases, several CPUs will end
>> up selecting the same OPP and as a result using the same energy cost. In
>> these cases, we can use other metrics to select the best CPU for the same
>> energy cost.
>>
>> Rework feec() to look 1st for the lowest cost in a PD and then the most
>> performant CPU between CPUs.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>    kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>>    1 file changed, 244 insertions(+), 222 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e67d6029b269..2273eecf6086 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
> 
> [snip]
> 
>>    
>> -/*
>> - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
>> - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
>> - * contribution is ignored.
>> - */
>> -static inline unsigned long
>> -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>> -	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
>> +/*Check if the CPU can handle the waking task */
>> +static int check_cpu_with_task(struct task_struct *p, int cpu)
>>    {
>> -	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
>> -	unsigned long busy_time = eenv->pd_busy_time;
>> -	unsigned long energy;
>> +	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
>> +	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
>> +	unsigned long util_min = p_util_min;
>> +	unsigned long util_max = p_util_max;
>> +	unsigned long util = cpu_util(cpu, p, cpu, 0);
>> +	struct rq *rq = cpu_rq(cpu);
>>    
>> -	if (dst_cpu >= 0)
>> -		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> 
> I think you should mention that the energy computation is not capped anymore.
> It used to be:
> pd_util: sum of the CPU's util for the pd considered, without task_util
> pd_cap: sum of the CPU's capacity for the pd
> 
> (a)
> busy_time = min(pd_cap, pd_util);
> prev_energy = busy_time * OPP[prev_max_util].cost;
> 
> busy_time = min(pd_cap, pd_util + task_util);
> new_energy = busy_time * OPP[new_max_util].cost;
> 
> delta_energy = new_energy - prev_energy;
> 
> Note that when computing busy_time, task_util is not capped to one CPU's
> max_cap. This means that in empty pd, a task can 'steal' capacity from
> CPUs during the energy computation.
> Cf. [1]
> 
> and it is now:
> (b)
> delta_energy = task_util * OPP[new_max_util].cost;
> delta_energy += pd_util * (OPP[new_max_util].cost - OPP[prev_max_util].cost);
> 
> Note that the busy_time (task_util + pd_util) is now not capped by anything.
> 
> ---
> 
> Not capping the task_util is discussed in [1][3] and [2] (at 21:15).
> IIUC, UCLAMP_MAX tasks are the only case leveraging this. Indeed,
> non-clamped tasks will not fit and be placed on a bigger CPU. Same for
> UCLAMP_MIN tasks.
> FWIU, not capping the utilization of tasks during the energy computation
> allows to reflect that a task will (likely) run longer. However the
> task's performance would likely decrease as the other tasks on the target
> CPU are not taken into account (it is assumed the task to be placed will
> receive the compute power it requires).
> 
> ---
> Case A:
> 
> Assuming we have an empty system with:
> - 4 little CPUs (max_capa=512, first OPP as [capa=256, cost=10])
> - 2 big CPUs (max_capa=1024, first OPP as [capa=512, cost=10])
> i.e. the first OPP of all the CPU consumes the same amount of energy.
> And a task with: [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
> 
> Then feec() would have no reason to prefer a big CPU over a little CPU,
> even though the big CPU would provide more performance.
> 
> ---
> Case B:
> 
> (This is not especially related to this patch.)
> Another case that might be problematic:
> - 4 little CPUs (max_capa=512, first OPP as [capa=256])
> - 2 big CPUs (max_capa=1024, first OPP as [capa=512])
> - little CPUs consume less than big CPUs, but the highest OPP
>     of the little CPUs consume more than the lowest of the big CPUs.
> And tasks:
> - 3 tasks with [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
> - 1 task with [UCLAMP_MIN=0, UCLAMP_MAX=1024, util = 50]
> 
> Then
> - the 3 UCLAMP_MAX tasks will be placed on the little CPUs. Indeed,
>     due to the last patch of the serie, these tasks have now an opportunity
>     to run feec() and be placed on a more energy efficient CPU.
> - the 'normal' task will be placed on a big CPU. Indeed, placing
>     it on a little CPU would raise the OPP of the little cluster.
> 
> This means that the 'normal' task is prevented to run the remaining little
> CPU even though:
> - the little CPU can provide the compute capacity

This behaviour is actually due to the little CPUs not being able to provide
the compute capacity for the normal task without raising the OPP of the cluster.
So this behaviour is expected.

I am providing another case instead:
Case B':
- 4 little CPUs (max_capa=512, first OPP as [capa=256])
- 2 big CPUs (max_capa=1024, first OPP as [capa=512])
And tasks:
- 4 tasks with [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
- 1 task with [UCLAMP_MIN=0, UCLAMP_MAX=1024, util = 50]

The normal task will not fit any of the little CPUs as the rq's UCLAMP_MAX
value would rise from 10 to 1024. If I'm not mistaken (this time), the
normal task should be placed on a little CPU as:
- it consumes less power
- even though UCLAMP_MAX tasks consume the least power they can, they make
   other tasks consume more

Theoretically, placing the 4 UCLAMP_MAX tasks on one little CPU and using
another CPU for the normal task would:
- consume less energy
- satisfy the UCLAMP_MAX constraint
even though the performance of the workload would be lower. This is a bit
hard for me to conceive.


> - the little CPU would consume less energy
> 
> In other terms, using UCLAMP_MAX on some tasks makes the system consume
> more energy.
> 
> ---
> 
> In my opinion, this last case comes from the difficulty of defining UCLAMP_MAX.
>   From sched-util-clamp.rst (about UCLAMP_MAX):
> - Uclamp is a hinting mechanism that allows the scheduler to understand the
>     performance requirements and restrictions of the tasks
> - The right way to view util clamp is as a mechanism to make request or hint on
>     performance constraints.
> - some tasks should be restricted from consuming too
>     much resources and should not go above a specific performance point.
> -
> Another example is in Android where tasks are classified as background,
> foreground, top-app, etc. Util clamp can be used to constrain how much
> resources background tasks are consuming by capping the performance point they
> can run at. This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group). Beside this
> helps in limiting how much power they consume. This can be more obvious in
> heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
> background tasks to stay on the little cores which will ensure that:
> 
>           1. The big cores are free to run top-app tasks immediately. top-app
>              tasks are the tasks the user is currently interacting with, hence
>              the most important tasks in the system.
>           2. They don't run on a power hungry core and drain battery even if they
>              are CPU intensive tasks.
> -
> For example, it can be handy to limit performance when running low on battery
> or when the system wants to limit access to more energy hungry performance
> levels when it's in idle state or screen is off.
> 
> """
> This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group).
> """
> It doesn't seem that UCLAMP_MAX does this. This looks more like bandwidth
> control.
> 
> """
> They don't run on a power hungry core and drain battery even if they
> are CPU intensive tasks.
> """
> Avoiding mid/big CPUs could be done with tasksets,
> 
> I can understand that one might want to run a task at a higher OPP due to
> timing constraints for instance. However I don't see why someone would want
> to run a task at a lower OPP, regarding only the performance and not the
> energy consumption. It thus means that UCLAMP_MAX is an energy hint rather
> of a performance hint.
> 
> UCLAMP_MAX could be set for a task to make it spend less energy, but the
> loss in performance would be unknown.
> A case Hongyan mentioned in his uclamp sum aggregation serie [4] is that
> an infinite task with [UCLAMP_MIN=0, UCLAMP_MAX=1, util = 1000] could fit
> in a little CPU. The energy consumption would indeed be very low, but the
> performance would also be very low.
> 
> With Hongyan's sum aggregation serie [5]:
> - case B would not happen as the 'normal' task would not raise the OPP of
>     the whole cluster.

Cf. above

> - the 'infinite UCLAMP_MAX tasks' case would not happen as each task would
>     account for 1 util
> - case A would still happen, but could be solved in any case.
> 
> I know Hongyan's patchset has already been discussed, but I still don't
> understand why max aggregation is preferred over sum aggregation.
> The definition of UCLAMP_MAX values seems clearer and in effect results in
> a simpler implementation and less corner cases. In simple words:
> "When estimating the CPU frequency to set, for this task,
> account for at most X util."
> rather than:
> "When estimating the CPU frequency to set, the task with the highest
> UCLAMP_MAX of the CPU will cap the requested frequency."
> 
> Note that actually I think it's a good idea to run feec() regularly
> and to take into account other parameters like nr_running. I just think
> that UCLAMP_MAX's max aggregation creates corner cases that are difficult
> to solve altogether.
> 
> Thanks in advance for the time you will take answering,
> Regards,
> Pierre
> 
> [1] https://lore.kernel.org/all/20240606070645.3295-1-xuewen.yan@unisoc.com/
> [2] https://www.youtube.com/watch?v=PHEBAyxeM_M
> [3] https://lore.kernel.org/all/CAKfTPtDPCPYvCi1c_Nh+Cn01ZVS7E=tAHQeNX-mArBt3BXdjYw@mail.gmail.com/
> [4] https://lore.kernel.org/all/b81a5b1c-14de-4232-bee9-ee647355dd8c@arm.com/
> [5] https://lore.kernel.org/all/cover.1706792708.git.hongyan.xia2@arm.com/#t
> 


* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-09-11 14:02   ` Pierre Gondois
  2024-09-11 16:51     ` Pierre Gondois
@ 2024-09-12 12:22     ` Vincent Guittot
  2024-12-05 16:23       ` Pierre Gondois
  1 sibling, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-09-12 12:22 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 8/30/24 15:03, Vincent Guittot wrote:
> > feec() looks for the CPU with highest spare capacity in a PD assuming that
> > it will be the best CPU from a energy efficiency PoV because it will
> > require the smallest increase of OPP. Although this is true generally
> > speaking, this policy also filters some others CPUs which will be as
> > efficients because of using the same OPP.
> > In fact, we really care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result using the same energy cost. In
> > these cases, we can use other metrics to select the best CPU for the same
> > energy cost.
> >
> > Rework feec() to look 1st for the lowest cost in a PD and then the most
> > performant CPU between CPUs.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> >   1 file changed, 244 insertions(+), 222 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e67d6029b269..2273eecf6086 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
>
> [snip]
>
> >
> > -/*
> > - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> > - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
> > - * contribution is ignored.
> > - */
> > -static inline unsigned long
> > -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> > -            struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> > +/*Check if the CPU can handle the waking task */
> > +static int check_cpu_with_task(struct task_struct *p, int cpu)
> >   {
> > -     unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> > -     unsigned long busy_time = eenv->pd_busy_time;
> > -     unsigned long energy;
> > +     unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> > +     unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> > +     unsigned long util_min = p_util_min;
> > +     unsigned long util_max = p_util_max;
> > +     unsigned long util = cpu_util(cpu, p, cpu, 0);
> > +     struct rq *rq = cpu_rq(cpu);
> >
> > -     if (dst_cpu >= 0)
> > -             busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
>
> I think you should mention that the energy computation is not capped anymore.
> It used to be:
> pd_util: sum of the CPU's util for the pd considered, without task_util
> pd_cap: sum of the CPU's capacity for the pd
>
> (a)
> busy_time = min(pd_cap, pd_util);
> prev_energy = busy_time * OPP[prev_max_util].cost;
>
> busy_time = min(pd_cap, pd_util + task_util);
> new_energy = busy_time * OPP[new_max_util].cost;
>
> delta_energy = new_energy - prev_energy;
>
> Note that when computing busy_time, task_util is not capped to one CPU's
> max_cap. This means that in empty pd, a task can 'steal' capacity from
> CPUs during the energy computation.
> Cf. [1]
>
> and it is now:
> (b)
> delta_energy = task_util * OPP[new_max_util].cost;
> delta_energy += pd_util * (OPP[new_max_util].cost - OPP[prev_max_util].cost);
>
> Note that the busy_time (task_util + pd_util) is now not capped by anything.

As discussed in [1], capping utilization with max capacity is a
mistake because we lose the information that the task will run longer.

>
> ---
>
> Not capping the task_util is discussed in [1][3] and [2] (at 21:15).
> IIUC, UCLAMP_MAX tasks are the only case leveraging this. Indeed,
> non-clamped tasks will not fit and be placed on a bigger CPU. Same for
> UCLAMP_MIN tasks.
> FWIU, not capping the utilization of tasks during the energy computation
> allows to reflect that a task will (likely) run longer. However the
> task's performance would likely decrease as the other tasks on the target
> CPU are not taken into account (it is assumed the task to be placed will
> receive the compute power it requires).
>
> ---
> Case A:
>
> Assuming we have an empty system with:
> - 4 little CPUs (max_capa=512, first OPP as [capa=256, cost=10])
> - 2 big CPUs (max_capa=1024, first OPP as [capa=512, cost=10])

Quite an inefficient hardware design here, where the big core provides
twice the capacity of the little one for the same cost at their 1st OPP!!!

> i.e. the first OPP of all the CPU consumes the same amount of energy.
> And a task with: [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
>
> Then feec() would have no reason to prefer a big CPU over a little CPU,
> even though the big CPU would provide more performance.

As mentioned in the comment of feec(), I don't expect to face a
situation where the delta energy is equal for 2 PDs, especially with
the uWatt that has been introduced to prevent such a situation. But if
this should happen, it is in the TODO to add a margin and to take other
stats into account, like compute capacity. Also, the PDs are sorted by
max capacity, so we currently keep the one with the highest capacity,
i.e. the big CPU in your case.

>
> ---
> Case B:
>
> (This is not especially related to this patch.)
> Another case that might be problematic:
> - 4 little CPUs (max_capa=512, first OPP as [capa=256])
> - 2 big CPUs (max_capa=1024, first OPP as [capa=512])
> - little CPUs consume less than big CPUs, but the highest OPP
>    of the little CPUs consume more than the lowest of the big CPUs.
> And tasks:
> - 3 tasks with [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
> - 1 task with [UCLAMP_MIN=0, UCLAMP_MAX=1024, util = 50]
>
> Then
> - the 3 UCLAMP_MAX tasks will be placed on the little CPUs. Indeed,
>    due to the last patch of the serie, these tasks have now an opportunity
>    to run feec() and be placed on a more energy efficient CPU.
> - the 'normal' task will be placed on a big CPU. Indeed, placing
>    it on a little CPU would raise the OPP of the little cluster.
>
> This means that the 'normal' task is prevented to run the remaining little
> CPU even though:
> - the little CPU can provide the compute capacity
> - the little CPU would consume less energy
>
> In other terms, using UCLAMP_MAX on some tasks makes the system consume
> more energy.

You have probably noticed that this patchset doesn't make any
assumption about uclamp_max/min behavior, and the below doesn't seem to
be related to this patchset but to the uclamp_max behavior, so I don't
think it's the right place to discuss this. A talk or a BoF at LPC
would have been a better place.

>
> ---
>
> In my opinion, this last case comes from the difficulty of defining UCLAMP_MAX.
>  From sched-util-clamp.rst (about UCLAMP_MAX):
> - Uclamp is a hinting mechanism that allows the scheduler to understand the
>    performance requirements and restrictions of the tasks
> - The right way to view util clamp is as a mechanism to make request or hint on
>    performance constraints.
> - some tasks should be restricted from consuming too
>    much resources and should not go above a specific performance point.
> -
> Another example is in Android where tasks are classified as background,
> foreground, top-app, etc. Util clamp can be used to constrain how much
> resources background tasks are consuming by capping the performance point they
> can run at. This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group). Beside this
> helps in limiting how much power they consume. This can be more obvious in
> heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
> background tasks to stay on the little cores which will ensure that:
>
>          1. The big cores are free to run top-app tasks immediately. top-app
>             tasks are the tasks the user is currently interacting with, hence
>             the most important tasks in the system.
>          2. They don't run on a power hungry core and drain battery even if they
>             are CPU intensive tasks.
> -
> For example, it can be handy to limit performance when running low on battery
> or when the system wants to limit access to more energy hungry performance
> levels when it's in idle state or screen is off.
>
> """
> This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group).
> """
> It doesn't seem that UCLAMP_MAX does this. This looks more like bandwidth
> control.
>
> """
> They don't run on a power hungry core and drain battery even if they
> are CPU intensive tasks.
> """
> Avoiding mid/big CPUs could be done with tasksets,
>
> I can understand that one might want to run a task at a higher OPP due to
> timing constraints for instance. However I don't see why someone would want
> to run a task at a lower OPP, regarding only the performance and not the
> energy consumption. It thus means that UCLAMP_MAX is an energy hint rather
> of a performance hint.
>
> UCLAMP_MAX could be set for a task to make it spend less energy, but the
> loss in performance would be unknown.
> A case Hongyan mentioned in his uclamp sum aggregation serie [4] is that
> an infinite task with [UCLAMP_MIN=0, UCLAMP_MAX=1, util = 1000] could fit
> in a little CPU. The energy consumption would indeed be very low, but the
> performance would also be very low.
>
> With Hongyan's sum aggregation serie [5]:
> - case B would not happen as the 'normal' task would not raise the OPP of
>    the whole cluster.
> - the 'infinite UCLAMP_MAX tasks' case would not happen as each task would
>    account for 1 util
> - case A would still happen, but could be solved in any case.
>
> I know Hongyan's patchset has already been discussed, but I still don't
> understand why max aggregation is preferred over sum aggregation.
> The definition of UCLAMP_MAX values seems clearer and in effect results in
> a simpler implementation and less corner cases. In simple words:
> "When estimating the CPU frequency to set, for this task,
> account for at most X util."
> rather than:
> "When estimating the CPU frequency to set, the task with the highest
> UCLAMP_MAX of the CPU will cap the requested frequency."
>
> Note that actually I think it's a good idea to run feec() regularly
> and to take into account other parameters like nr_running. I just think
> that UCLAMP_MAX's max aggregation creates corner cases that are difficult
> to solve altogether.
>
> Thanks in advance for the time you will take answering,
> Regards,
> Pierre
>
> [1] https://lore.kernel.org/all/20240606070645.3295-1-xuewen.yan@unisoc.com/
> [2] https://www.youtube.com/watch?v=PHEBAyxeM_M
> [3] https://lore.kernel.org/all/CAKfTPtDPCPYvCi1c_Nh+Cn01ZVS7E=tAHQeNX-mArBt3BXdjYw@mail.gmail.com/
> [4] https://lore.kernel.org/all/b81a5b1c-14de-4232-bee9-ee647355dd8c@arm.com/
> [5] https://lore.kernel.org/all/cover.1706792708.git.hongyan.xia2@arm.com/#t


* Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
  2024-09-12 12:22     ` Vincent Guittot
@ 2024-12-05 16:23       ` Pierre Gondois
  0 siblings, 0 replies; 62+ messages in thread
From: Pierre Gondois @ 2024-12-05 16:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, Christian Loehle

Hello Vincent,

On 9/12/24 14:22, Vincent Guittot wrote:
> On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> On 8/30/24 15:03, Vincent Guittot wrote:
>>> feec() looks for the CPU with highest spare capacity in a PD assuming that
>>> it will be the best CPU from a energy efficiency PoV because it will
>>> require the smallest increase of OPP. Although this is true generally
>>> speaking, this policy also filters some others CPUs which will be as
>>> efficients because of using the same OPP.
>>> In fact, we really care about the cost of the new OPP that will be
>>> selected to handle the waking task. In many cases, several CPUs will end
>>> up selecting the same OPP and as a result using the same energy cost. In
>>> these cases, we can use other metrics to select the best CPU for the same
>>> energy cost.
>>>
>>> Rework feec() to look 1st for the lowest cost in a PD and then the most
>>> performant CPU between CPUs.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>    kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>>>    1 file changed, 244 insertions(+), 222 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e67d6029b269..2273eecf6086 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>
>> [snip]
>>
>>>
>>> -/*
>>> - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
>>> - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
>>> - * contribution is ignored.
>>> - */
>>> -static inline unsigned long
>>> -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>>> -            struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
>>> +/*Check if the CPU can handle the waking task */
>>> +static int check_cpu_with_task(struct task_struct *p, int cpu)
>>>    {
>>> -     unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
>>> -     unsigned long busy_time = eenv->pd_busy_time;
>>> -     unsigned long energy;
>>> +     unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
>>> +     unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
>>> +     unsigned long util_min = p_util_min;
>>> +     unsigned long util_max = p_util_max;
>>> +     unsigned long util = cpu_util(cpu, p, cpu, 0);
>>> +     struct rq *rq = cpu_rq(cpu);
>>>
>>> -     if (dst_cpu >= 0)
>>> -             busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
>>
>> I think you should mention that the energy computation is not capped anymore.
>> It used to be:
>> pd_util: sum of the CPU's util for the pd considered, without task_util
>> pd_cap: sum of the CPU's capacity for the pd
>>
>> (a)
>> busy_time = min(pd_cap, pd_util);
>> prev_energy = busy_time * OPP[prev_max_util].cost;
>>
>> busy_time = min(pd_cap, pd_util + task_util);
>> new_energy = busy_time * OPP[new_max_util].cost;
>>
>> delta_energy = new_energy - prev_energy;
>>
>> Note that when computing busy_time, task_util is not capped to one CPU's
>> max_cap. This means that in empty pd, a task can 'steal' capacity from
>> CPUs during the energy computation.
>> Cf. [1]
>>
>> and it is now:
>> (b)
>> delta_energy = task_util * OPP[new_max_util].cost;
>> delta_energy += pd_util * (OPP[new_max_util].cost - OPP[prev_max_util].cost);
>>
>> Note that the busy_time (task_util + pd_util) is now not capped by anything.
> 
> As discussed in [1], capping utilization with max capacity is a
> mistake because we lose the information that this will run longer
> 


I think this comes down to the fact that uClampMax tasks are force-fitted into
CPUs that don't have the required spare capacity [1].

On a little CPU with capacity=200, a task with util=600 will run longer,
but it will eventually finish. Such a task would not normally fit a little CPU.
By setting UCLAMP_MAX=100 on the task, the task now fits the little CPU
and consumes less energy.
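
As a rough illustration of (a) versus (b) quoted above, here is a small
standalone sketch in plain userspace C. It is not kernel code: the function
names, the single-PD setup and the cost values are assumptions made up for
the example, only the two formulas come from the discussion.

```c
/* Illustrative sketch only, not the kernel implementation. */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* (a) previous computation: busy time is capped to the PD capacity. */
unsigned long delta_energy_capped(unsigned long pd_cap,
				  unsigned long pd_util,
				  unsigned long task_util,
				  unsigned long prev_cost,
				  unsigned long new_cost)
{
	unsigned long prev_energy = min_ul(pd_cap, pd_util) * prev_cost;
	unsigned long new_energy = min_ul(pd_cap, pd_util + task_util) * new_cost;

	return new_energy - prev_energy;
}

/*
 * (b) new computation: task_util + pd_util is not capped by anything.
 * Assumes new_cost >= prev_cost (adding a task never lowers the OPP).
 */
unsigned long delta_energy_uncapped(unsigned long pd_util,
				    unsigned long task_util,
				    unsigned long prev_cost,
				    unsigned long new_cost)
{
	return task_util * new_cost + pd_util * (new_cost - prev_cost);
}
```

For an empty PD with pd_cap=200, a task with util=600 and a constant cost
of 1, (a) accounts for 200 units and (b) for 600. As long as
pd_util + task_util stays below pd_cap, both give the same result.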

As Quentin mentioned, I think, EAS can place tasks if utilization values are
correct. The initial condition was to have a 20% margin on the CPU capacity,
but it seems the intent was more to detect situations where the CPU is always
running (i.e. there is no idle time anymore).

With [1], having a 20% margin (with uClampMax tasks) no longer means that there
is idle time. When there is no idle time anymore, utilization values
reflect the niceness of the tasks, i.e. the utilization represents
how much time the CFS scheduler allowed them to run rather than how much
compute power the tasks require.

------- Example start -------

In this condition, EAS cannot be expected to always make accurate task
placements. For instance, on a platform with one big CPU (capa=1024) and one
little CPU (capa=200), and with X tasks such as:
- duty-cycle=60%
- UCLAMP_MAX=800
Since the tasks are CPU demanding, they are placed on the big CPU. Each task
will have a utilization of 1024/X. The system is not overutilized since the
tasks have a UCLAMP_MAX setting.
The bigger X, the lower the task utilization. Eventually, the tasks' utilization
will be low enough for feec() to migrate one of them to the little CPU.
The system then becomes overutilized.

With the present patchset, the task is migrated back to the big CPU. Indeed:
task_tick_fair()
\-check_update_overutilized_status() --> // setting overutilized=1
\-check_misfit_cpu()
   \-find_energy_efficient_cpu()      --> task doesn't fit the little CPU anymore,
                                      --> migrate back to the big CPU
                                      --> // resetting overutilized=0 later

So these UCLAMP_MAX tasks will bounce on the little CPU, transiently activating
the overutilized state.

Similarly, when there is no idle time, it is possible to play with the task
niceness to make the utilization smaller/bigger.

------- Example end -------
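
To put a number on "low enough" in the example above, here is a minimal
sketch in plain userspace C. It assumes that only the usual 20% margin check
decides whether a task fits (uclamp handling is ignored) and that each of the
X tasks really ends up with a utilization of 1024/X. The function names are
made up for the illustration.

```c
/* Illustrative sketch only, not the kernel implementation. */

/* 20% margin check: util fits if it stays below 80% of the capacity. */
int fits_with_margin(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

/*
 * Smallest number of tasks X for which a per-task utilization of
 * big_cap / X fits on a CPU of capacity little_cap.
 * Returns 0 if no such X exists.
 */
unsigned int first_fitting_task_count(unsigned long big_cap,
				      unsigned long little_cap)
{
	unsigned int x;

	for (x = 1; x <= big_cap; x++) {
		if (fits_with_margin(big_cap / x, little_cap))
			return x;
	}
	return 0;
}
```

Under these assumptions, with big_cap=1024 and little_cap=200, a task fits
the little CPU once its utilization drops below 160, which first happens
for X=7 (1024/7 = 146).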

This behaviour already existed before this patchset due to [1]. However,
it was not really visible since feec() only ran during task wakeup.

It seems that the condition in update_idle_rq_clock_pelt() would be a better
way to tag a CPU as overutilized.

Down migrating UCLAMP_MAX tasks makes sense IMO, but to avoid making a CPU
overutilized, throttling these tasks (like the bandwidth control seems to do)
could be a solution.

[1] https://lore.kernel.org/lkml/20220629194632.1117723-1-qais.yousef@arm.com/

Regards,
Pierre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (2 preceding siblings ...)
  2024-08-30 13:03 ` [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
@ 2024-08-30 13:03 ` Vincent Guittot
  2024-09-17 20:24   ` Christian Loehle
  2024-09-20 16:17   ` Quentin Perret
  2024-08-30 13:03 ` [RFC PATCH 5/5] sched/fair: Add push task callback for EAS Vincent Guittot
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-08-30 13:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, Vincent Guittot

Keep looking for an energy-efficient CPU even when the system is
overutilized and use the CPU returned by feec() if it has been able to find
one. Otherwise, fall back to the default performance and spread mode of the
scheduler.
A system can become overutilized for a short time when workers of a
workqueue wake up for short background work like a vmstat update.
Continuing to look for an energy-efficient CPU avoids breaking the
power packing of tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2273eecf6086..e46af2416159 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		    cpumask_test_cpu(cpu, p->cpus_ptr))
 			return cpu;
 
-		if (!is_rd_overutilized(this_rq()->rd)) {
+		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
 				return new_cpu;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-08-30 13:03 ` [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized Vincent Guittot
@ 2024-09-17 20:24   ` Christian Loehle
  2024-09-19  8:25     ` Pierre Gondois
  2024-09-25 13:07     ` Vincent Guittot
  2024-09-20 16:17   ` Quentin Perret
  1 sibling, 2 replies; 62+ messages in thread
From: Christian Loehle @ 2024-09-17 20:24 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

On 8/30/24 14:03, Vincent Guittot wrote:
> Keep looking for an energy-efficient CPU even when the system is
> overutilized and use the CPU returned by feec() if it has been able to find
> one. Otherwise, fall back to the default performance and spread mode of the
> scheduler.
> A system can become overutilized for a short time when workers of a
> workqueue wake up for short background work like a vmstat update.
> Continuing to look for an energy-efficient CPU avoids breaking the
> power packing of tasks.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2273eecf6086..e46af2416159 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  		    cpumask_test_cpu(cpu, p->cpus_ptr))
>  			return cpu;
>  
> -		if (!is_rd_overutilized(this_rq()->rd)) {
> +		if (sched_energy_enabled()) {
>  			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>  			if (new_cpu >= 0)
>  				return new_cpu;

Super quick testing on pixel6:
for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
with patch 5/5 only:
Time: 19.433
Time: 19.657
Time: 19.851
Time: 19.789
Time: 19.857
Time: 20.092
Time: 19.973

mainline:
Time: 18.836
Time: 18.718
Time: 18.781
Time: 19.015
Time: 19.061
Time: 18.950
Time: 19.166
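
For reference, the two series above can be summarized with a small helper.
This is only a sketch in plain userspace C; the function and array names are
made up for the illustration, the values are the ones listed above.

```c
#include <stddef.h>

/* Arithmetic mean of n samples (n must be > 0). */
double mean(const double *samples, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += samples[i];
	return sum / (double)n;
}

/* Relative slowdown of 'patched' versus 'baseline', in percent. */
double slowdown_pct(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/* hackbench times in seconds, copied from the two runs above. */
const double with_patch[] = { 19.433, 19.657, 19.851, 19.789,
			      19.857, 20.092, 19.973 };
const double mainline[]   = { 18.836, 18.718, 18.781, 19.015,
			      19.061, 18.950, 19.166 };
```

With these numbers the means are roughly 19.81s with the patch and 18.93s on
mainline, i.e. about 4.6% slower with the patch.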


The reason we didn't always have this enabled is the belief that
this costs us too much performance in the scenarios where we most need it,
while at best making subpar EAS decisions anyway (in an
overutilized state).
I'd be open to questioning that, but why the change of mind?
And why is this necessary in your series if the EAS selection
isn't 'final' (until the next sleep) anymore (Patch 5/5)?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-09-17 20:24   ` Christian Loehle
@ 2024-09-19  8:25     ` Pierre Gondois
  2024-09-25 13:28       ` Vincent Guittot
  2024-09-25 13:07     ` Vincent Guittot
  1 sibling, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-09-19  8:25 UTC (permalink / raw)
  To: Christian Loehle, Vincent Guittot
  Cc: qyousef, hongyan.xia2, mingo, peterz, linux-kernel,
	rafael.j.wysocki, lukasz.luba, vschneid, mgorman, bsegall,
	rostedt, dietmar.eggemann, juri.lelli

Hello Vincent,
I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
with patches 1-4/5 using these workloads:
---
A.
a. 8 tasks at 2%/5%/10% for 1s
b. 1 task:
    - sleeping for 0.3s
    - at 100% for 0.3s
    - sleeping for 0.3s

b. is used to reach the overutilized state for a limited amount of time.
EAS operates first, then the load balancer does the task placement, then EAS
operates again.

B.
a. 8 tasks at 2%/5%/10% for 1s
b. 1 task:
    - at 100% for 1s

---
I'm seeing the energy consumption increase in some cases. This seems to be
due to feec() migrating tasks more often than the load balancer does
for this workload. This leads to utilization 'spikes' and then frequency
'spikes', increasing the overall energy consumption.
This is not entirely related to this patch though, as the task placement seems
correct. I.e. feec() effectively does an optimal placement given the EM and
task utilization. The task placement is just a bit less stable than with
the load balancer.

---
Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
each of the little/mid/big CPU groups (without the config, these groups would
not exist).

I see a significant regression in the results.
I replaced the condition to run feec() through select_task_rq_fair() with:
   if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {
     new_cpu = find_energy_efficient_cpu(p, prev_cpu);
     ...
   }
and obtained better results.

Indeed, for such an intensive workload:
- EAS would not have any energy benefit, so it is better to prioritize
   performance (as Christian mentioned)
- EAS would not be able to find a fitting CPU, so running feec() should be
   avoided
- as you mention in the commit message, shuffling tasks when one CPU becomes
   momentarily overutilized is inefficient energy-wise (even though I don't have
   the numbers, it should make sense).
So detecting when the system is overloaded should be a better compromise, I
assume. The condition in sched_balance_find_src_group() to let the load balancer
operate might also need to be updated.

Note:
- base: with patches 1-4/5
- _ou: run feec() when not overutilized
- _ol: run feec() when not overloaded
- mean: hackbench execution time in s.
- delta: negative is better. Value is in percentage.
┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
│ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
│ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
│ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
│ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
│ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
│ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
│ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
│ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
│ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
│ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
│ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
│ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
│ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
│ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
│ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
│ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
│ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
└─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
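
The delta columns are consistent with a relative change against mean_base.
The sketch below, in plain userspace C, shows that reading; the formula is
inferred from the table values and not stated explicitly above, and
exceeds_noise() is a hypothetical helper added only to compare a gap with
the reported standard deviations.

```c
/* Illustrative sketch only; names are made up for the example. */

/* Relative change of 'mean' versus 'base', in percent (negative is better). */
double delta_pct(double base, double mean)
{
	return (mean - base) / base * 100.0;
}

/*
 * Hypothetical helper: is the gap between the two means larger than
 * the sum of their standard deviations?
 */
int exceeds_noise(double base, double std_base, double mean, double std)
{
	double gap = mean > base ? mean - base : base - mean;

	return gap > std_base + std;
}
```

For instance, row 1 gives delta_pct(1.9786, 3.0856), about 55.95, which
matches delta_ou. For row 16, the gap between mean_base and mean_ol is
smaller than the sum of the two standard deviations.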


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-09-19  8:25     ` Pierre Gondois
@ 2024-09-25 13:28       ` Vincent Guittot
  2024-10-07  7:03         ` Pierre Gondois
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-09-25 13:28 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: Christian Loehle, qyousef, hongyan.xia2, mingo, peterz,
	linux-kernel, rafael.j.wysocki, lukasz.luba, vschneid, mgorman,
	bsegall, rostedt, dietmar.eggemann, juri.lelli

On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
> with patches 1-4/5 using these workloads:
> ---
> A.
> a. 8 tasks at 2%/5%/10% during 1s
> b. 1 task:
>     - sleeping during 0.3s
>     - at 100% during 0.3s
>     - sleeping during 0.3s
>
> b. is used to reach the overutilized state during a limited amount of time.
> EAS is then operating, then the load balancer does the task placement, then EAS
> is operating again.
>
> B.
> a. 8 tasks at 2%/5%/10% during 1s
> b. 1 task:
>     - at 100% during 1s
>
> ---
> I'm seeing the energy consumption increase in some cases. This seems to be
> due to feec() migrating tasks more often than what the load balancer does
> for this workload. This leads to utilization 'spikes' and then frequency
> 'spikes', increasing the overall energy consumption.
> This is not entirely related to this patch though, as the task placement seems
> correct. I.e. feec() effectively does an optimal placement given the EM and
> task utilization. The task placement is just a bit less stable than with
> the load balancer.

Would patch 5 help to keep things better placed? In particular if
task b is misplaced at some point because of load balancing?

I agree that load balancing might still contribute to migrating tasks in a
way that is not energy efficient.

>
> ---
> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
> each little/mid/big CPUs (without the config, these groups would not exist).

Why did you do this? All CPUs are expected to be in the same sched domain
as they share their LLC.

>
> I see an important regression in the result.
> I replaced the condition to run feec() through select_task_rq_fair() by:
>    if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {

overloaded is set when more than 1 task runs on a CPU, whatever the
utilization.
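
To make the difference concrete, here is a simplified sketch in plain
userspace C. The struct and function names are made up for the illustration,
uclamp is ignored, and the overutilized test is reduced to the usual 20%
margin check; the kernel's actual conditions are more involved.

```c
/* Illustrative sketch only, not the kernel implementation. */

struct cpu_state {
	unsigned int nr_running;	/* runnable tasks on the CPU */
	unsigned long util;		/* summed utilization, 0..1024 scale */
	unsigned long capacity;		/* CPU capacity, 0..1024 scale */
};

/* "Overloaded" as described above: more than one runnable task. */
int cpu_overloaded(const struct cpu_state *c)
{
	return c->nr_running > 1;
}

/* "Overutilized": utilization no longer fits within 80% of the capacity. */
int cpu_overutilized(const struct cpu_state *c)
{
	return !(c->util * 1280 < c->capacity * 1024);
}
```

Two tiny tasks on a big CPU (nr_running=2, util=20, capacity=1024) are
overloaded but not overutilized, whereas a single CPU hog (nr_running=1,
util=1000, capacity=1024) is overutilized but not overloaded.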

>      new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>      ...
>    }
> and obtained better results.
>
> Indeed, for such intensive workload:
> - EAS would not have any energy benefit, so better prioritize performance
>    (as Christian mentioned)
> - EAS would not be able to find a fitting CPU, so running feec() should be
>    avoided
> - as you mention in the commit message, shuffling tasks when one CPU becomes
>    momentarily overutilized is inefficient energy-wise (even though I don't have
>    the numbers, it should make sense).
> So detecting when the system is overloaded should be a better compromise I
> assume. The condition in sched_balance_find_src_group() to let the load balancer
> operate might also need to be updated.
>
> Note:
> - base: with patches 1-4/5
> - _ou: run feec() when not overutilized
> - _ol: run feec() when not overloaded
> - mean: hackbench execution time in s.
> - delta: negative is better. Value is in percentage.

Could you share your command line? As explained in the cover letter, I
have seen some perf regressions but not in the range that you have
below.

What is your base? tip/sched/core?

> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-09-25 13:28       ` Vincent Guittot
@ 2024-10-07  7:03         ` Pierre Gondois
  2024-10-09  8:53           ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-10-07  7:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Christian Loehle, qyousef, hongyan.xia2, mingo, peterz,
	linux-kernel, rafael.j.wysocki, lukasz.luba, vschneid, mgorman,
	bsegall, rostedt, dietmar.eggemann, juri.lelli

Hello Vincent,

Sorry for the delay:

On 9/25/24 15:28, Vincent Guittot wrote:
> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
>> with patches 1-4/5 using these workloads:
>> ---
>> A.
>> a. 8 tasks at 2%/5%/10% during 1s
>> b. 1 task:
>>      - sleeping during 0.3s
>>      - at 100% during 0.3s
>>      - sleeping during 0.3s
>>
>> b. is used to reach the overutilized state during a limited amount of time.
>> EAS is then operating, then the load balancer does the task placement, then EAS
>> is operating again.
>>
>> B.
>> a. 8 tasks at 2%/5%/10% during 1s
>> b. 1 task:
>>      - at 100% during 1s
>>
>> ---
>> I'm seeing the energy consumption increase in some cases. This seems to be
>> due to feec() migrating tasks more often than what the load balancer does
>> for this workload. This leads to utilization 'spikes' and then frequency
>> 'spikes', increasing the overall energy consumption.
>> This is not entirely related to this patch though, as the task placement seems
>> correct. I.e. feec() effectively does an optimal placement given the EM and
>> task utilization. The task placement is just a bit less stable than with
>> the load balancer.
> 
> Would patch 5 help to keep things better placed ? in particular if
> task b is misplaced at some point because of load balance ?

I assume so, but it would require more testing on my side.

> 
> I agree that load balance might still contribute to migrate task in a
> not energy efficient way
> 
>>
>> ---
>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
>> each little/mid/big CPUs (without the config, these groups would not exist).
> 
> Why did you do this ? All cpus are expected to be in same sched domain
> as they share their LLC

I did this to observe the load balancer a bit more carefully while reviewing
the first patch:
   sched/fair: Filter false overloaded_group case for EAS
I've left this configuration enabled, but indeed it should not bring anything more.


> 
>>
>> I see an important regression in the result.
>> I replaced the condition to run feec() through select_task_rq_fair() by:
>>     if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {
> 
> overloaded is enable when more than 1 task runs on a cpu whatever the
> utilization

Yes, right, this idea makes little sense.

> 
>>       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>       ...
>>     }
>> and obtained better results.
>>
>> Indeed, for such intensive workload:
>> - EAS would not have any energy benefit, so better prioritize performance
>>     (as Christian mentioned)
>> - EAS would not be able to find a fitting CPU, so running feec() should be
>>     avoided
>> - as you mention in the commit message, shuffling tasks when one CPU becomes
>>     momentarily overutilized is inefficient energy-wise (even though I don't have
>>     the numbers, it should make sense).
>> So detecting when the system is overloaded should be a better compromise I
>> assume. The condition in sched_balance_find_src_group() to let the load balancer
>> operate might also need to be updated.
>>
>> Note:
>> - base: with patches 1-4/5
>> - _ou: run feec() when not overutilized
>> - _ol: run feec() when not overloaded
>> - mean: hackbench execution time in s.
>> - delta: negative is better. Value is in percentage.
> 
> Could you share your command line ? As explained in the cover letter I
> have seen some perf regressions but not in the range that you have
> below
> 
> What is your base ? tip/sched/core ?

I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
to be able to apply your patches cleanly.

The mapping id -> command line is as follows:
(1) hackbench -l 5120 -g 1
(2) hackbench -l 1280 -g 4
(3) hackbench -l 640  -g 8
(4) hackbench -l 320  -g 16

(5) hackbench -p -l 5120 -g 1
(6) hackbench -p -l 1280 -g 4
(7) hackbench -p -l 640  -g 8
(8) hackbench -p -l 320  -g 16

(9) hackbench -T -l 5120 -g 1
(10) hackbench -T -l 1280 -g 4
(11) hackbench -T -l 640  -g 8
(12) hackbench -T -l 320  -g 16

(13) hackbench -T -p -l 5120 -g 1
(14) hackbench -T -p -l 1280 -g 4
(15) hackbench -T -p -l 640  -g 8
(16) hackbench -T -p -l 320  -g 16


> 
>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-07  7:03         ` Pierre Gondois
@ 2024-10-09  8:53           ` Vincent Guittot
  2024-10-11 12:52             ` Pierre Gondois
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-10-09  8:53 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: Christian Loehle, qyousef, hongyan.xia2, mingo, peterz,
	linux-kernel, rafael.j.wysocki, lukasz.luba, vschneid, mgorman,
	bsegall, rostedt, dietmar.eggemann, juri.lelli

Hi Pierre,

On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> Sorry for the delay:
>
> On 9/25/24 15:28, Vincent Guittot wrote:
> > On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>
> >> Hello Vincent,
> >> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
> >> with patches 1-4/5 using these workloads:
> >> ---
> >> A.
> >> a. 8 tasks at 2%/5%/10% during 1s
> >> b. 1 task:
> >>      - sleeping during 0.3s
> >>      - at 100% during 0.3s
> >>      - sleeping during 0.3s
> >>
> >> b. is used to reach the overutilized state during a limited amount of time.
> >> EAS is then operating, then the load balancer does the task placement, then EAS
> >> is operating again.
> >>
> >> B.
> >> a. 8 tasks at 2%/5%/10% during 1s
> >> b. 1 task:
> >>      - at 100% during 1s
> >>
> >> ---
> >> I'm seeing the energy consumption increase in some cases. This seems to be
> >> due to feec() migrating tasks more often than what the load balancer does
> >> for this workload. This leads to utilization 'spikes' and then frequency
> >> 'spikes', increasing the overall energy consumption.
> >> This is not entirely related to this patch though, as the task placement seems
> >> correct. I.e. feec() effectively does an optimal placement given the EM and
> >> task utilization. The task placement is just a bit less stable than with
> >> the load balancer.
> >
> > Would patch 5 help to keep things better placed ? in particular if
> > task b is misplaced at some point because of load balance ?
>
> I assume so, it would require more testing on my side
>
> >
> > I agree that load balance might still contribute to migrate task in a
> > not energy efficient way
> >
> >>
> >> ---
> >> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
> >> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
> >> each of the little/mid/big CPU groups (without the config, these groups would not exist).
> >
> > Why did you do this ? All cpus are expected to be in same sched domain
> > as they share their LLC
>
> I did this to observe the load balancer a bit more carefully while reviewing
> the first patch:
>    sched/fair: Filter false overloaded_group case for EAS
> I've kept this configuration, but it should not bring anything more in practice.
>
>
> >
> >>
> >> I see an important regression in the result.
> >> I replaced the condition to run feec() through select_task_rq_fair() by:
> >>     if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {
> >
> > overloaded is set when more than 1 task runs on a CPU, whatever the
> > utilization
>
> Yes, right, this idea makes little sense.
>
> >
> >>       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>       ...
> >>     }
> >> and obtained better results.
> >>
> >> Indeed, for such intensive workload:
> >> - EAS would not have any energy benefit, so better prioritize performance
> >>     (as Christian mentioned)
> >> - EAS would not be able to find a fitting CPU, so running feec() should be
> >>     avoided
> >> - as you mention in the commit message, shuffling tasks when one CPU becomes
> >>     momentarily overutilized is inefficient energy-wise (even though I don't have
> >>     the numbers, it should make sense).
> >> So detecting when the system is overloaded should be a better compromise I
> >> assume. The condition in sched_balance_find_src_group() to let the load balancer
> >> operate might also need to be updated.
> >>
> >> Note:
> >> - base: with patches 1-4/5
> >> - _ou: run feec() when not overutilized
> >> - _ol: run feec() when not overloaded
> >> - mean: hackbench execution time in s.
> >> - delta: negative is better. Value is in percentage.
> >
> > Could you share your command line ? As explained in the cover letter I
> > have seen some perf regressions but not in the range that you have
> > below
> >
> > What is your base ? tip/sched/core ?
>
> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
> to be able to apply your patches cleanly.

TBH, I'm always cautious with this kind of Frankenstein kernel,
especially with all the changes that have happened in the scheduler since
v6.8 compared to tip/sched/core.

>
> The mapping id -> command line is as:

Thanks for the command details, I'm going to have a look.

> (1) hackbench -l 5120 -g 1
> (2) hackbench -l 1280 -g 4
> (3) hackbench -l 640  -g 8
> (4) hackbench -l 320  -g 16
>
> (5) hackbench -p -l 5120 -g 1
> (6) hackbench -p -l 1280 -g 4
> (7) hackbench -p -l 640  -g 8
> (8) hackbench -p -l 320  -g 16
>
> (9) hackbench -T -l 5120 -g 1
> (10) hackbench -T -l 1280 -g 4
> (11) hackbench -T -l 640  -g 8
> (12) hackbench -T -l 320  -g 16
>
> (13) hackbench -T -p -l 5120 -g 1
> (14) hackbench -T -p -l 1280 -g 4
> (15) hackbench -T -p -l 640  -g 8
> (16) hackbench -T -p -l 320  -g 16
>
>
> >
> >> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
> >> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
> >> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
> >> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │

I might have misunderstood your results above last time.
mean_base results include patches 1 to 4 and mean_ou reverts patch 4.
Does it mean that it is 55% better with patch 4? I originally thought
there was a regression with patch 4 but I'm not sure that I understood
correctly after rereading the table.

> >> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
> >> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
> >> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
> >> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
> >> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
> >> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
> >> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
> >> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
> >> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
> >> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
> >> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
> >> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
> >> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
> >> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
> >> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
> >> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
> >>
> >> On 9/17/24 22:24, Christian Loehle wrote:
> >>> On 8/30/24 14:03, Vincent Guittot wrote:
> >>>> Keep looking for an energy efficient CPU even when the system is
> >>>> overutilized and use the CPU returned by feec() if it has been able to find
> >>>> one. Otherwise fallback to the default performance and spread mode of the
> >>>> scheduler.
> >>>> A system can become overutilized for a short time when workers of a
> >>>> workqueue wake up for a short background work like vmstat update.
> >>>> Continuing to look for a energy efficient CPU will prevent to break the
> >>>> power packing of tasks.
> >>>>
> >>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>>> ---
> >>>>    kernel/sched/fair.c | 2 +-
> >>>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 2273eecf6086..e46af2416159 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>>>                   cpumask_test_cpu(cpu, p->cpus_ptr))
> >>>>                       return cpu;
> >>>>
> >>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
> >>>> +            if (sched_energy_enabled()) {
> >>>>                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>>>                       if (new_cpu >= 0)
> >>>>                               return new_cpu;
> >>>
> >>> Super quick testing on pixel6:
> >>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> >>> with patch 5/5 only:
> >>> Time: 19.433
> >>> Time: 19.657
> >>> Time: 19.851
> >>> Time: 19.789
> >>> Time: 19.857
> >>> Time: 20.092
> >>> Time: 19.973
> >>>
> >>> mainline:
> >>> Time: 18.836
> >>> Time: 18.718
> >>> Time: 18.781
> >>> Time: 19.015
> >>> Time: 19.061
> >>> Time: 18.950
> >>> Time: 19.166
> >>>
> >>>
> >>> The reason we didn't always have this enabled is the belief that
> >>> this costs us too much performance in scenarios we most need it
> >>> while at best making subpar EAS decisions anyway (in an
> >>> overutilized state).
> >>> I'd be open for questioning that, but why the change of mind?
> >>> And why is this necessary in your series if the EAS selection
> >>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
> >>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-09  8:53           ` Vincent Guittot
@ 2024-10-11 12:52             ` Pierre Gondois
  2024-10-15 12:47               ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-10-11 12:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Christian Loehle, qyousef, hongyan.xia2, mingo, peterz,
	linux-kernel, rafael.j.wysocki, lukasz.luba, vschneid, mgorman,
	bsegall, rostedt, dietmar.eggemann, juri.lelli

Hello Vincent,

On 10/9/24 10:53, Vincent Guittot wrote:
> Hi Pierre,
> 
> On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> Sorry for the delay:
>>
>> On 9/25/24 15:28, Vincent Guittot wrote:
>>> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>
>>>> Hello Vincent,
>>>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
>>>> with patches 1-4/5 using these workloads:
>>>> ---
>>>> A.
>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>> b. 1 task:
>>>>       - sleeping during 0.3s
>>>>       - at 100% during 0.3s
>>>>       - sleeping during 0.3s
>>>>
>>>> b. is used to reach the overutilized state during a limited amount of time.
>>>> EAS is then operating, then the load balancer does the task placement, then EAS
>>>> is operating again.
>>>>
>>>> B.
>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>> b. 1 task:
>>>>       - at 100% during 1s
>>>>
>>>> ---
>>>> I'm seeing the energy consumption increase in some cases. This seems to be
>>>> due to feec() migrating tasks more often than what the load balancer does
>>>> for this workload. This leads to utilization 'spikes' and then frequency
>>>> 'spikes', increasing the overall energy consumption.
>>>> This is not entirely related to this patch though, as the task placement seems
>>>> correct. I.e. feec() effectively does an optimal placement given the EM and
>>>> task utilization. The task placement is just a bit less stable than with
>>>> the load balancer.
>>>
>>> Would patch 5 help to keep things better placed ? in particular if
>>> task b is misplaced at some point because of load balance ?
>>
>> I assume so, it would require more testing on my side
>>
>>>
>>> I agree that load balance might still contribute to migrate task in a
>>> not energy efficient way
>>>
>>>>
>>>> ---
>>>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
>>>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
>>>> each of the little/mid/big CPU groups (without the config, these groups would not exist).
>>>
>>> Why did you do this ? All cpus are expected to be in same sched domain
>>> as they share their LLC
>>
>> I did this to observe the load balancer a bit more carefully while reviewing
>> the first patch:
>>     sched/fair: Filter false overloaded_group case for EAS
>> I've kept this configuration, but it should not bring anything more in practice.
>>
>>
>>>
>>>>
>>>> I see an important regression in the result.
>>>> I replaced the condition to run feec() through select_task_rq_fair() by:
>>>>      if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {
>>>
>>> overloaded is set when more than 1 task runs on a CPU, whatever the
>>> utilization
>>
>> Yes, right, this idea makes little sense.
>>
>>>
>>>>        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>        ...
>>>>      }
>>>> and obtained better results.
>>>>
>>>> Indeed, for such intensive workload:
>>>> - EAS would not have any energy benefit, so better prioritize performance
>>>>      (as Christian mentioned)
>>>> - EAS would not be able to find a fitting CPU, so running feec() should be
>>>>      avoided
>>>> - as you mention in the commit message, shuffling tasks when one CPU becomes
>>>>      momentarily overutilized is inefficient energy-wise (even though I don't have
>>>>      the numbers, it should make sense).
>>>> So detecting when the system is overloaded should be a better compromise I
>>>> assume. The condition in sched_balance_find_src_group() to let the load balancer
>>>> operate might also need to be updated.
>>>>
>>>> Note:
>>>> - base: with patches 1-4/5
>>>> - _ou: run feec() when not overutilized
>>>> - _ol: run feec() when not overloaded
>>>> - mean: hackbench execution time in s.
>>>> - delta: negative is better. Value is in percentage.
>>>
>>> Could you share your command line ? As explained in the cover letter I
>>> have seen some perf regressions but not in the range that you have
>>> below
>>>
>>> What is your base ? tip/sched/core ?
>>
>> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
>> to be able to apply your patches cleanly.
> 
> TBH, I'm always cautious with this kind of Frankenstein kernel,
> especially with all the changes that have happened in the scheduler since
> v6.8 compared to tip/sched/core.

Yes I understand, I'll re-test it on a Juno with a newer kernel.

> 
>>
>> The mapping id -> command line is as:
> 
> Thanks for the command details, I'm going to have a look.
> 
>> (1) hackbench -l 5120 -g 1
>> (2) hackbench -l 1280 -g 4
>> (3) hackbench -l 640  -g 8
>> (4) hackbench -l 320  -g 16
>>
>> (5) hackbench -p -l 5120 -g 1
>> (6) hackbench -p -l 1280 -g 4
>> (7) hackbench -p -l 640  -g 8
>> (8) hackbench -p -l 320  -g 16
>>
>> (9) hackbench -T -l 5120 -g 1
>> (10) hackbench -T -l 1280 -g 4
>> (11) hackbench -T -l 640  -g 8
>> (12) hackbench -T -l 320  -g 16
>>
>> (13) hackbench -T -p -l 5120 -g 1
>> (14) hackbench -T -p -l 1280 -g 4
>> (15) hackbench -T -p -l 640  -g 8
>> (16) hackbench -T -p -l 320  -g 16
>>
>>
>>>
>>>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
>>>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
>>>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
>>>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
> 
> I might have misunderstood your results above last time.
> mean_base results include patches 1 to 4 and mean_ou reverts patch 4.
> Does it mean that it is 55% better with patch 4? I originally thought
> there was a regression with patch 4 but I'm not sure that I understood
> correctly after rereading the table.

The columns are:
- the _base configuration disables EAS/feec() when in the overutilized state,
   i.e. patches 1-3 are applied.
- the _ou configuration keeps running EAS/feec() when in the overutilized state,
   i.e. patches 1-4 are applied.
- the _ol configuration should be ignored, as previously established.
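
As a sanity check on how the delta columns appear to be derived (this is
inferred from the numbers in the table, it is not stated anywhere in the
thread), the relative change of each mean against mean_base reproduces the
reported values. A minimal Python sketch using rows 1 and 13; delta_pct is a
hypothetical helper name, not part of any tool used here:

```python
def delta_pct(mean_base: float, mean_other: float) -> float:
    """Relative change of mean_other versus mean_base, in percent.

    A positive value means the hackbench run took longer than _base.
    Assumption: this is how delta_ou/delta_ol were computed.
    """
    return round((mean_other - mean_base) / mean_base * 100, 2)


# (id, mean_base, mean_ou, mean_ol) copied from the table in this thread
rows = [
    (1, 1.9786, 3.0856, 2.1734),
    (13, 1.7107, 3.4443, 1.8275),
]

for row_id, base, ou, ol in rows:
    # Print id, delta_ou and delta_ol for comparison with the table
    print(row_id, delta_pct(base, ou), delta_pct(base, ol))
```

Read this way, a positive delta_ou means the _ou configuration (patches 1-4)
is slower than _base (patches 1-3) on this setup.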


> 
>>>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
>>>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
>>>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
>>>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
>>>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
>>>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
>>>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
>>>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
>>>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
>>>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
>>>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
>>>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
>>>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
>>>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
>>>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
>>>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>>>>
>>>> On 9/17/24 22:24, Christian Loehle wrote:
>>>>> On 8/30/24 14:03, Vincent Guittot wrote:
>>>>>> Keep looking for an energy efficient CPU even when the system is
>>>>>> overutilized and use the CPU returned by feec() if it has been able to find
>>>>>> one. Otherwise fallback to the default performance and spread mode of the
>>>>>> scheduler.
>>>>>> A system can become overutilized for a short time when workers of a
>>>>>> workqueue wake up for a short background work like vmstat update.
>>>>>> Continuing to look for a energy efficient CPU will prevent to break the
>>>>>> power packing of tasks.
>>>>>>
>>>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>>>> ---
>>>>>>     kernel/sched/fair.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 2273eecf6086..e46af2416159 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>>>>                    cpumask_test_cpu(cpu, p->cpus_ptr))
>>>>>>                        return cpu;
>>>>>>
>>>>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
>>>>>> +            if (sched_energy_enabled()) {
>>>>>>                        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>>>                        if (new_cpu >= 0)
>>>>>>                                return new_cpu;
>>>>>
>>>>> Super quick testing on pixel6:
>>>>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
>>>>> with patch 5/5 only:
>>>>> Time: 19.433
>>>>> Time: 19.657
>>>>> Time: 19.851
>>>>> Time: 19.789
>>>>> Time: 19.857
>>>>> Time: 20.092
>>>>> Time: 19.973
>>>>>
>>>>> mainline:
>>>>> Time: 18.836
>>>>> Time: 18.718
>>>>> Time: 18.781
>>>>> Time: 19.015
>>>>> Time: 19.061
>>>>> Time: 18.950
>>>>> Time: 19.166
>>>>>
>>>>>
>>>>> The reason we didn't always have this enabled is the belief that
>>>>> this costs us too much performance in scenarios we most need it
>>>>> while at best making subpar EAS decisions anyway (in an
>>>>> overutilized state).
>>>>> I'd be open for questioning that, but why the change of mind?
>>>>> And why is this necessary in your series if the EAS selection
>>>>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
>>>>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-11 12:52             ` Pierre Gondois
@ 2024-10-15 12:47               ` Vincent Guittot
  2024-10-31 15:21                 ` Pierre Gondois
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-10-15 12:47 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: Christian Loehle, qyousef, hongyan.xia2, mingo, peterz,
	linux-kernel, rafael.j.wysocki, lukasz.luba, vschneid, mgorman,
	bsegall, rostedt, dietmar.eggemann, juri.lelli

On Fri, 11 Oct 2024 at 14:52, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 10/9/24 10:53, Vincent Guittot wrote:
> > Hi Pierre,
> >
> > On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>
> >> Hello Vincent,
> >>
> >> Sorry for the delay:
> >>
> >> On 9/25/24 15:28, Vincent Guittot wrote:
> >>> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>>>
> >>>> Hello Vincent,
> >>>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
> >>>> with patches 1-4/5 using these workloads:
> >>>> ---
> >>>> A.
> >>>> a. 8 tasks at 2%/5%/10% during 1s
> >>>> b. 1 task:
> >>>>       - sleeping during 0.3s
> >>>>       - at 100% during 0.3s
> >>>>       - sleeping during 0.3s
> >>>>
> >>>> b. is used to reach the overutilized state during a limited amount of time.
> >>>> EAS is then operating, then the load balancer does the task placement, then EAS
> >>>> is operating again.
> >>>>
> >>>> B.
> >>>> a. 8 tasks at 2%/5%/10% during 1s
> >>>> b. 1 task:
> >>>>       - at 100% during 1s
> >>>>
> >>>> ---
> >>>> I'm seeing the energy consumption increase in some cases. This seems to be
> >>>> due to feec() migrating tasks more often than what the load balancer does
> >>>> for this workload. This leads to utilization 'spikes' and then frequency
> >>>> 'spikes', increasing the overall energy consumption.
> >>>> This is not entirely related to this patch though, as the task placement seems
> >>>> correct. I.e. feec() effectively does an optimal placement given the EM and
> >>>> task utilization. The task placement is just a bit less stable than with
> >>>> the load balancer.
> >>>
> >>> Would patch 5 help to keep things better placed ? in particular if
> >>> task b is misplaced at some point because of load balance ?
> >>
> >> I assume so, it would require more testing on my side
> >>
> >>>
> >>> I agree that load balance might still contribute to migrate task in a
> >>> not energy efficient way
> >>>
> >>>>
> >>>> ---
> >>>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
> >>>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
> >>>> each of the little/mid/big CPU groups (without the config, these groups would not exist).
> >>>
> >>> Why did you do this ? All cpus are expected to be in same sched domain
> >>> as they share their LLC
> >>
> >> I did this to observe the load balancer a bit more carefully while reviewing
> >> the first patch:
> >>     sched/fair: Filter false overloaded_group case for EAS
> >> I've kept this configuration, but it should not bring anything more in practice.
> >>
> >>
> >>>
> >>>>
> >>>> I see an important regression in the result.
> >>>> I replaced the condition to run feec() through select_task_rq_fair() by:
> >>>>      if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {
> >>>
> >>> overloaded is set when more than 1 task runs on a CPU, whatever the
> >>> utilization
> >>
> >> Yes, right, this idea makes little sense.
> >>
> >>>
> >>>>        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>>>        ...
> >>>>      }
> >>>> and obtained better results.
> >>>>
> >>>> Indeed, for such intensive workload:
> >>>> - EAS would not have any energy benefit, so better prioritize performance
> >>>>      (as Christian mentioned)
> >>>> - EAS would not be able to find a fitting CPU, so running feec() should be
> >>>>      avoided
> >>>> - as you mention in the commit message, shuffling tasks when one CPU becomes
> >>>>      momentarily overutilized is inefficient energy-wise (even though I don't have
> >>>>      the numbers, it should make sense).
> >>>> So detecting when the system is overloaded should be a better compromise I
> >>>> assume. The condition in sched_balance_find_src_group() to let the load balancer
> >>>> operate might also need to be updated.
> >>>>
> >>>> Note:
> >>>> - base: with patches 1-4/5
> >>>> - _ou: run feec() when not overutilized
> >>>> - _ol: run feec() when not overloaded
> >>>> - mean: hackbench execution time in s.
> >>>> - delta: negative is better. Value is in percentage.
> >>>
> >>> Could you share your command line ? As explained in the cover letter I
> >>> have seen some perf regressions but not in the range that you have
> >>> below
> >>>
> >>> What is your base ? tip/sched/core ?
> >>
> >> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
> >> to be able to apply your patches cleanly.
> >
> > TBH, I'm always cautious with this kind of Frankenstein kernel,
> > especially with all the changes that have happened in the scheduler since
> > v6.8 compared to tip/sched/core.
>
> Yes I understand, I'll re-test it on a Juno with a newer kernel.
>
> >
> >>
> >> The mapping id -> command line is as:
> >
> > Thanks for the command details, I'm going to have a look.
> >
> >> (1) hackbench -l 5120 -g 1
> >> (2) hackbench -l 1280 -g 4
> >> (3) hackbench -l 640  -g 8
> >> (4) hackbench -l 320  -g 16
> >>
> >> (5) hackbench -p -l 5120 -g 1
> >> (6) hackbench -p -l 1280 -g 4
> >> (7) hackbench -p -l 640  -g 8
> >> (8) hackbench -p -l 320  -g 16
> >>
> >> (9) hackbench -T -l 5120 -g 1
> >> (10) hackbench -T -l 1280 -g 4
> >> (11) hackbench -T -l 640  -g 8
> >> (12) hackbench -T -l 320  -g 16
> >>
> >> (13) hackbench -T -p -l 5120 -g 1
> >> (14) hackbench -T -p -l 1280 -g 4
> >> (15) hackbench -T -p -l 640  -g 8
> >> (16) hackbench -T -p -l 320  -g 16
> >>
> >>
> >>>
> >>>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
> >>>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
> >>>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
> >>>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
> >
> > I might have misunderstood your results above last time.
> > mean_base results include patches 1 to 4 and mean_ou reverts patch 4.
> > Does it mean that it is 55% better with patch 4? I originally thought
> > there was a regression with patch 4 but I'm not sure that I understood
> > correctly after rereading the table.
>
> The columns are:
> - the _base configuration disables EAS/feec() when in the overutilized state,
>    i.e. patches 1-3 are applied.

Your original description
"
 - base: with patches 1-4/5
 - _ou: run feec() when not overutilized
 - _ol: run feec() when not overloaded
"
was quite confusing :-)

Thanks for the clarification

> - the _ou configuration keeps running EAS/feec() when in the overutilized state
>    i.e. patches 1-4 are applied
> - the _ol configuration should be ignored as previously established
>
>
> >
> >>>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
> >>>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
> >>>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
> >>>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
> >>>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
> >>>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
> >>>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
> >>>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
> >>>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
> >>>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
> >>>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
> >>>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
> >>>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
> >>>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
> >>>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
> >>>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
> >>>>
> >>>> On 9/17/24 22:24, Christian Loehle wrote:
> >>>>> On 8/30/24 14:03, Vincent Guittot wrote:
> >>>>>> Keep looking for an energy efficient CPU even when the system is
> >>>>>> overutilized and use the CPU returned by feec() if it has been able to find
> >>>>>> one. Otherwise fallback to the default performance and spread mode of the
> >>>>>> scheduler.
> >>>>>> A system can become overutilized for a short time when workers of a
> >>>>>> workqueue wake up for a short background work like vmstat update.
> >>>>>> Continuing to look for a energy efficient CPU will prevent to break the
> >>>>>> power packing of tasks.
> >>>>>>
> >>>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>>>>> ---
> >>>>>>     kernel/sched/fair.c | 2 +-
> >>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>>> index 2273eecf6086..e46af2416159 100644
> >>>>>> --- a/kernel/sched/fair.c
> >>>>>> +++ b/kernel/sched/fair.c
> >>>>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>>>>>                    cpumask_test_cpu(cpu, p->cpus_ptr))
> >>>>>>                        return cpu;
> >>>>>>
> >>>>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
> >>>>>> +            if (sched_energy_enabled()) {
> >>>>>>                        new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>>>>>                        if (new_cpu >= 0)
> >>>>>>                                return new_cpu;
> >>>>>
> >>>>> Super quick testing on pixel6:
> >>>>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> >>>>> with patch 5/5 only:
> >>>>> Time: 19.433
> >>>>> Time: 19.657
> >>>>> Time: 19.851
> >>>>> Time: 19.789
> >>>>> Time: 19.857
> >>>>> Time: 20.092
> >>>>> Time: 19.973
> >>>>>
> >>>>> mainline:
> >>>>> Time: 18.836
> >>>>> Time: 18.718
> >>>>> Time: 18.781
> >>>>> Time: 19.015
> >>>>> Time: 19.061
> >>>>> Time: 18.950
> >>>>> Time: 19.166
> >>>>>
> >>>>>
> >>>>> The reason we didn't always have this enabled is the belief that
> >>>>> this costs us too much performance in scenarios we most need it
> >>>>> while at best making subpar EAS decisions anyway (in an
> >>>>> overutilized state).
> >>>>> I'd be open for questioning that, but why the change of mind?
> >>>>> And why is this necessary in your series if the EAS selection
> >>>>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
> >>>>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-15 12:47               ` Vincent Guittot
@ 2024-10-31 15:21                 ` Pierre Gondois
  0 siblings, 0 replies; 62+ messages in thread
From: Pierre Gondois @ 2024-10-31 15:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Christian Loehle, qyousef, hongyan.xia2, mingo, peterz,
	linux-kernel, rafael.j.wysocki, lukasz.luba, vschneid, mgorman,
	bsegall, rostedt, dietmar.eggemann, juri.lelli



On 10/15/24 14:47, Vincent Guittot wrote:
> On Fri, 11 Oct 2024 at 14:52, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> On 10/9/24 10:53, Vincent Guittot wrote:
>>> Hi Pierre,
>>>
>>> On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>
>>>> Hello Vincent,
>>>>
>>>> Sorry for the delay:
>>>>
>>>> On 9/25/24 15:28, Vincent Guittot wrote:
>>>>> On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>>>>
>>>>>> Hello Vincent,
>>>>>> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
>>>>>> with patches 1-4/5 using these workloads:
>>>>>> ---
>>>>>> A.
>>>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>>>> b. 1 task:
>>>>>>        - sleeping during 0.3s
>>>>>>        - at 100% during 0.3s
>>>>>>        - sleeping during 0.3s
>>>>>>
>>>>>> b. is used to reach the overutilized state during a limited amount of time.
>>>>>> EAS is then operating, then the load balancer does the task placement, then EAS
>>>>>> is operating again.
>>>>>>
>>>>>> B.
>>>>>> a. 8 tasks at 2%/5%/10% during 1s
>>>>>> b. 1 task:
>>>>>>        - at 100% during 1s
>>>>>>
>>>>>> ---
>>>>>> I'm seeing the energy consumption increase in some cases. This seems to be
>>>>>> due to feec() migrating tasks more often than what the load balancer does
>>>>>> for this workload. This leads to utilization 'spikes' and then frequency
>>>>>> 'spikes', increasing the overall energy consumption.
>>>>>> This is not entirely related to this patch though, as the task placement seems
>>>>>> correct. I.e. feec() effectively does an optimal placement given the EM and
>>>>>> task utilization. The task placement is just a bit less stable than with
>>>>>> the load balancer.
>>>>>
>>>>> Would patch 5 help to keep things better placed ? in particular if
>>>>> task b is misplaced at some point because of load balance ?
>>>>
>>>> I assume so, it would require more testing on my side
>>>>
>>>>>
>>>>> I agree that load balance might still contribute to migrate task in a
>>>>> not energy efficient way
>>>>>
>>>>>>
>>>>>> ---
>>>>>> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
>>>>>> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
>>>>>> each little/mid/big CPUs (without the config, these groups would not exist).
>>>>>
>>>>> Why did you do this ? All cpus are expected to be in same sched domain
>>>>> as they share their LLC
>>>>
>>>> I did this to observe the load balancer a bit more carefully while reviewing
>>>> the first patch:
>>>>      sched/fair: Filter false overloaded_group case for EAS
>>>> I've let this configuration, but effectively this should not bring anything more.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> I see an important regression in the result.
>>>>>> I replaced the condition to run feec() through select_task_rq_fair() by:
>>>>>>       if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {
>>>>>
>>>>> overloaded is enabled when more than 1 task runs on a cpu whatever the
>>>>> utilization
>>>>
>>>> Yes right, this idea has little sense.
>>>>
>>>>>
>>>>>>         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>>>         ...
>>>>>>       }
>>>>>> and obtained better results.
>>>>>>
>>>>>> Indeed, for such intensive workload:
>>>>>> - EAS would not have any energy benefit, so better prioritize performance
>>>>>>       (as Christian mentioned)
>>>>>> - EAS would not be able to find a fitting CPU, so running feec() should be
>>>>>>       avoided
>>>>>> - as you mention in the commit message, shuffling tasks when one CPU becomes
>>>>>>       momentarily overutilized is inefficient energy-wise (even though I don't have
>>>>>>       the numbers, it should make sense).
>>>>>> So detecting when the system is overloaded should be a better compromise I
>>>>>> assume. The condition in sched_balance_find_src_group() to let the load balancer
>>>>>> operate might also need to be updated.
>>>>>>
>>>>>> Note:
>>>>>> - base: with patches 1-4/5
>>>>>> - _ou: run feec() when not overutilized
>>>>>> - _ol: run feec() when not overloaded
>>>>>> - mean: hackbench execution time in s.
>>>>>> - delta: negative is better. Value is in percentage.
>>>>>
>>>>> Could you share your command line ? As explained in the cover letter I
>>>>> have seen some perf regressions but not in the range that you have
>>>>> below
>>>>>
>>>>> What is your base ? tip/sched/core ?
>>>>
>>>> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
>>>> to be able to apply your patches cleanly.
>>>
>>> TBH, I'm always cautious with those kind of frankenstein kernel
>>> especially with all changes that have happened on the scheduler since
>>> v6.8 compared to tip/sched/core
>>
>> Yes I understand, I'll re-test it on a Juno with a newer kernel.

For the record, I re-ran the same tests, still on a Pixel6 and supposedly with
the same setup. The results indeed show a smaller regression than
the first results I shared. I assume I didn't cool the Pixel6 enough in
the first experiment ...

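The suffix '_w' is for the results with the present patch applied, i.e.
running EAS when in the overutilized state. mean and std are hackbench
execution times in s; ratio is (mean_w - mean) / mean, so positive is a
regression.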
+---------------------+--------+----------+--------+----------+---------+
|                 cmd |   mean |   std    | mean_w | std_w    |   ratio |
+---------------------+--------+----------+--------+----------+---------+
|       -l 5120 -g 1  | 1.9266 | 0.044848 | 2.1028 | 0.06441  |  9.15%  |
|       -l 1280 -g 4  | 1.89   | 0.080833 | 1.9588 | 0.040227 |  3.64%  |
|       -l 640  -g 8  | 1.8882 | 0.069197 | 1.918  | 0.06837  |  1.58%  |
|       -l 320  -g 16 | 1.9324 | 0.011194 | 1.9154 | 0.044998 | -0.88%  |
|    -p -l 5120 -g 1  | 1.4012 | 0.029811 | 1.6178 | 0.04027  | 15.46%  |
|    -p -l 1280 -g 4  | 1.3432 | 0.036949 | 1.5022 | 0.073346 | 11.84%  |
|    -p -l 640  -g 8  | 1.2944 | 0.022143 | 1.4468 | 0.013882 | 11.77%  |
|    -p -l 320  -g 16 | 1.2824 | 0.045873 | 1.3668 | 0.024448 |  6.58%  |
| -T    -l 5120 -g 1  | 1.9198 | 0.054897 | 2.0318 | 0.059222 |  5.83%  |
| -T    -l 1280 -g 4  | 1.8342 | 0.089015 | 1.9572 | 0.007328 |  6.71%  |
| -T    -l 640  -g 8  | 1.8986 | 0.023469 | 1.937  | 0.068044 |  2.02%  |
| -T    -l 320  -g 16 | 1.825  | 0.060634 | 1.9278 | 0.038206 |  5.63%  |
| -T -p -l 5120 -g 1  | 1.4424 | 0.007956 | 1.6474 | 0.035536 | 14.21%  |
| -T -p -l 1280 -g 4  | 1.3796 | 0.029305 | 1.5106 | 0.031533 |  9.50%  |
| -T -p -l 640  -g 8  | 1.3306 | 0.024347 | 1.4662 | 0.064224 | 10.19%  |
| -T -p -l 320  -g 16 | 1.2886 | 0.031437 | 1.389  | 0.033083 |  7.79%  |
+---------------------+--------+----------+--------+----------+---------+

>>
>>>
>>>>
>>>> The mapping id -> command line is as:
>>>
>>> Thanks for the commands details, I'm going to have a look
>>>
>>>> (1) hackbench -l 5120 -g 1
>>>> (2) hackbench -l 1280 -g 4
>>>> (3) hackbench -l 640  -g 8
>>>> (4) hackbench -l 320  -g 16
>>>>
>>>> (5) hackbench -p -l 5120 -g 1
>>>> (6) hackbench -p -l 1280 -g 4
>>>> (7) hackbench -p -l 640  -g 8
>>>> (8) hackbench -p -l 320  -g 16
>>>>
>>>> (9) hackbench -T -l 5120 -g 1
>>>> (10) hackbench -T -l 1280 -g 4
>>>> (11) hackbench -T -l 640  -g 8
>>>> (12) hackbench -T -l 320  -g 16
>>>>
>>>> (13) hackbench -T -p -l 5120 -g 1
>>>> (14) hackbench -T -p -l 1280 -g 4
>>>> (15) hackbench -T -p -l 640  -g 8
>>>> (16) hackbench -T -p -l 320  -g 16
>>>>
>>>>
>>>>>
>>>>>> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
>>>>>> │ id  ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou   ┆ mean_ol ┆ std_ol   ┆ delta_ou ┆ delta_ol │
>>>>>> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
>>>>>> │ 1   ┆ 1.9786    ┆ 0.04719  ┆ 3.0856  ┆ 0.122209 ┆ 2.1734  ┆ 0.045203 ┆ 55.95    ┆ 9.85     │
>>>
>>> I might have misunderstood your results above last time.
>>> mean_base results include patches 1 to 4 and  mean_ou revert patch 4.
>>> Does it mean that it is 55% better with patch 4 ? I originally thought
>>> there was a regression with patch 4 but I'm not sure that I understood
>>> correctly after re reading the table.
>>
>> The columns are:
>> - the _base configuration disables EAS/feec() when in the overutilized state,
>>     i.e. patches 1-3 are applied.
> 
> your original description
> "
>   - base: with patches 1-4/5
>   - _ou: run feec() when not overutilized
>   - _ol: run feec() when not overloaded
> "
> was quite confusing :-)
> 
> Thanks for the clarification
> 
>> - the _ou configuration keeps running EAS/feec() when in the overutilized state
>>     i.e. patches 1-4 are applied
>> - the _ol configuration should be ignored as previously established
>>
>>
>>>
>>>>>> │ 2   ┆ 1.8991    ┆ 0.019768 ┆ 2.6672  ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45    ┆ 4.72     │
>>>>>> │ 3   ┆ 1.9053    ┆ 0.014795 ┆ 2.5761  ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21    ┆ 8.34     │
>>>>>> │ 4   ┆ 1.9586    ┆ 0.023439 ┆ 2.5823  ┆ 0.110399 ┆ 2.0955  ┆ 0.053818 ┆ 31.84    ┆ 6.99     │
>>>>>> │ 5   ┆ 1.746     ┆ 0.055676 ┆ 3.3437  ┆ 0.279107 ┆ 1.88    ┆ 0.038184 ┆ 91.51    ┆ 7.67     │
>>>>>> │ 6   ┆ 1.5476    ┆ 0.050131 ┆ 2.6835  ┆ 0.140497 ┆ 1.5645  ┆ 0.081644 ┆ 73.4     ┆ 1.09     │
>>>>>> │ 7   ┆ 1.4562    ┆ 0.062457 ┆ 2.3568  ┆ 0.119213 ┆ 1.48425 ┆ 0.06212  ┆ 61.85    ┆ 1.93     │
>>>>>> │ 8   ┆ 1.3554    ┆ 0.031757 ┆ 2.0609  ┆ 0.112869 ┆ 1.4085  ┆ 0.036601 ┆ 52.05    ┆ 3.92     │
>>>>>> │ 9   ┆ 2.0391    ┆ 0.035732 ┆ 3.4045  ┆ 0.277307 ┆ 2.2155  ┆ 0.019053 ┆ 66.96    ┆ 8.65     │
>>>>>> │ 10  ┆ 1.9247    ┆ 0.056472 ┆ 2.6605  ┆ 0.119417 ┆ 2.02775 ┆ 0.05795  ┆ 38.23    ┆ 5.35     │
>>>>>> │ 11  ┆ 1.8923    ┆ 0.038222 ┆ 2.8113  ┆ 0.120623 ┆ 2.089   ┆ 0.025259 ┆ 48.57    ┆ 10.39    │
>>>>>> │ 12  ┆ 1.9444    ┆ 0.034856 ┆ 2.6675  ┆ 0.219585 ┆ 2.1035  ┆ 0.076514 ┆ 37.19    ┆ 8.18     │
>>>>>> │ 13  ┆ 1.7107    ┆ 0.04874  ┆ 3.4443  ┆ 0.154481 ┆ 1.8275  ┆ 0.036665 ┆ 101.34   ┆ 6.83     │
>>>>>> │ 14  ┆ 1.5565    ┆ 0.056595 ┆ 2.8241  ┆ 0.158643 ┆ 1.5515  ┆ 0.040813 ┆ 81.44    ┆ -0.32    │
>>>>>> │ 15  ┆ 1.4932    ┆ 0.085256 ┆ 2.6841  ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75    ┆ 0.77     │
>>>>>> │ 16  ┆ 1.4263    ┆ 0.067666 ┆ 2.3971  ┆ 0.145928 ┆ 1.414   ┆ 0.061422 ┆ 68.06    ┆ -0.86    │
>>>>>> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
>>>>>>
>>>>>> On 9/17/24 22:24, Christian Loehle wrote:
>>>>>>> On 8/30/24 14:03, Vincent Guittot wrote:
>>>>>>>> Keep looking for an energy efficient CPU even when the system is
>>>>>>>> overutilized and use the CPU returned by feec() if it has been able to find
>>>>>>>> one. Otherwise fallback to the default performance and spread mode of the
>>>>>>>> scheduler.
>>>>>>>> A system can become overutilized for a short time when workers of a
>>>>>>>> workqueue wake up for a short background work like vmstat update.
>>>>>>>> Continuing to look for a energy efficient CPU will prevent to break the
>>>>>>>> power packing of tasks.
>>>>>>>>
>>>>>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>>>>>> ---
>>>>>>>>      kernel/sched/fair.c | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>> index 2273eecf6086..e46af2416159 100644
>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>>>>>>                     cpumask_test_cpu(cpu, p->cpus_ptr))
>>>>>>>>                         return cpu;
>>>>>>>>
>>>>>>>> -            if (!is_rd_overutilized(this_rq()->rd)) {
>>>>>>>> +            if (sched_energy_enabled()) {
>>>>>>>>                         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>>>>>>>                         if (new_cpu >= 0)
>>>>>>>>                                 return new_cpu;
>>>>>>>
>>>>>>> Super quick testing on pixel6:
>>>>>>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
>>>>>>> with patch 5/5 only:
>>>>>>> Time: 19.433
>>>>>>> Time: 19.657
>>>>>>> Time: 19.851
>>>>>>> Time: 19.789
>>>>>>> Time: 19.857
>>>>>>> Time: 20.092
>>>>>>> Time: 19.973
>>>>>>>
>>>>>>> mainline:
>>>>>>> Time: 18.836
>>>>>>> Time: 18.718
>>>>>>> Time: 18.781
>>>>>>> Time: 19.015
>>>>>>> Time: 19.061
>>>>>>> Time: 18.950
>>>>>>> Time: 19.166
>>>>>>>
>>>>>>>
>>>>>>> The reason we didn't always have this enabled is the belief that
>>>>>>> this costs us too much performance in scenarios we most need it
>>>>>>> while at best making subpar EAS decisions anyway (in an
>>>>>>> overutilized state).
>>>>>>> I'd be open for questioning that, but why the change of mind?
>>>>>>> And why is this necessary in your series if the EAS selection
>>>>>>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
>>>>>>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-09-17 20:24   ` Christian Loehle
  2024-09-19  8:25     ` Pierre Gondois
@ 2024-09-25 13:07     ` Vincent Guittot
  1 sibling, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-25 13:07 UTC (permalink / raw)
  To: Christian Loehle
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Tue, 17 Sept 2024 at 22:24, Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/30/24 14:03, Vincent Guittot wrote:
> > Keep looking for an energy efficient CPU even when the system is
> > overutilized and use the CPU returned by feec() if it has been able to find
> > one. Otherwise fallback to the default performance and spread mode of the
> > scheduler.
> > A system can become overutilized for a short time when workers of a
> > workqueue wake up for a short background work like vmstat update.
> > Continuing to look for a energy efficient CPU will prevent to break the
> > power packing of tasks.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2273eecf6086..e46af2416159 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> >                       return cpu;
> >
> > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > +             if (sched_energy_enabled()) {
> >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >                       if (new_cpu >= 0)
> >                               return new_cpu;
>
> Super quick testing on pixel6:
> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> with patch 5/5 only:

Do you mean 4/5 ?

> Time: 19.433
> Time: 19.657
> Time: 19.851
> Time: 19.789
> Time: 19.857
> Time: 20.092
> Time: 19.973
>
> mainline:
> Time: 18.836
> Time: 18.718
> Time: 18.781
> Time: 19.015
> Time: 19.061
> Time: 18.950
> Time: 19.166
>

As mentioned in the cover letter, patch 4/5 has an impact on performance.
Your 4.6% regression is in the range of what I have for these tests.
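
For reference, the 4.6% figure can be rechecked from the seven "Time:" values
quoted above. This is only a sketch of the arithmetic (mean of each run set,
then the relative difference); the function names are made up for the
illustration and are not part of hackbench or of the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples (n must be non-zero). */
double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change of 'patched' versus 'base', in percent. */
double delta_pct(double base, double patched)
{
	return (patched - base) / base * 100.0;
}

/* hackbench "Time:" values quoted above, in seconds. */
static const double with_patch[] = {
	19.433, 19.657, 19.851, 19.789, 19.857, 20.092, 19.973,
};
static const double mainline[] = {
	18.836, 18.718, 18.781, 19.015, 19.061, 18.950, 19.166,
};

/* Regression of the patched kernel relative to mainline, in percent. */
double hackbench_regression_pct(void)
{
	return delta_pct(mean(mainline, 7), mean(with_patch, 7));
}
```

The means come out at roughly 18.93 s for mainline and 19.81 s with the
patch, which is where the ~4.6% comes from.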

>
> The reason we didn't always have this enabled is the belief that
> this costs us too much performance in scenarios we most need it
> while at best making subpar EAS decisions anyway (in an
> overutilized state).
> I'd be open for questioning that, but why the change of mind?

Several reasons:
- The rework of EAS in patches 1, 2 and 3 of this patchset adds some
performance hints into the selection of an energy efficient CPU.
- Although some initial proposals of the overutilized state were per sched
domain, to prevent destroying the whole placement if only a subpart of the
system was overutilized, the current implementation is binary: whole
system or nothing. As shown during [1], a short kworker wakeup can
destroy all task placement by putting the whole system overutilized.
But even when overutilized, there are a lot of possibilities to do
correct feec() task placement. The overutilized state is too
aggressive.
- feec() has been reworked since the original version to be less
complex, as described by commit 5b77261c5510 ("sched/topology: Remove
the EM_MAX_COMPLEXITY limit").

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

> And why is this necessary in your series if the EAS selection
> isn't 'final' (until the next sleep) anymore (Patch 5/5)?

To prevent destroying everything without good reason. feec() will
select a CPU only if it can find one that fits the task; otherwise
we fall back to the full performance one.
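
A rough sketch of that "use feec() if it finds a fitting CPU, otherwise fall
back" flow, heavily simplified: pick_fitting_cpu() and select_cpu() are
hypothetical stand-ins, not the kernel functions, and the energy comparison
done through the Energy Model is left out entirely. Only the ~20% headroom
test mirrors the kernel's fits_capacity() macro.

```c
#include <assert.h>

/* Same ~20% headroom test as the kernel's fits_capacity() macro. */
#define fits_capacity(util, cap) ((util) * 1280UL < (cap) * 1024UL)

/*
 * Hypothetical, heavily simplified stand-in for feec(): return the CPU with
 * the most spare capacity among those the task fits on, or -1 if none fits.
 * The real function also compares energy costs through the Energy Model.
 */
int pick_fitting_cpu(const unsigned long *cpu_util, const unsigned long *cpu_cap,
		     int nr_cpus, unsigned long task_util)
{
	unsigned long best_spare = 0;
	int best = -1;

	assert(nr_cpus > 0);
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned long util = cpu_util[cpu] + task_util;

		/* Skip CPUs where the task would eat into the headroom. */
		if (!fits_capacity(util, cpu_cap[cpu]))
			continue;
		if (best < 0 || cpu_cap[cpu] - util > best_spare) {
			best_spare = cpu_cap[cpu] - util;
			best = cpu;
		}
	}
	return best;
}

/* Wake-up path: use the energy-aware choice if any, else the fallback CPU. */
int select_cpu(const unsigned long *cpu_util, const unsigned long *cpu_cap,
	       int nr_cpus, unsigned long task_util, int fallback_cpu)
{
	int cpu = pick_fitting_cpu(cpu_util, cpu_cap, nr_cpus, task_util);

	return cpu >= 0 ? cpu : fallback_cpu;
}
```

Here fallback_cpu stands for whatever the default performance and spread
path would have picked.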

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-08-30 13:03 ` [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized Vincent Guittot
  2024-09-17 20:24   ` Christian Loehle
@ 2024-09-20 16:17   ` Quentin Perret
  2024-09-25 13:27     ` Vincent Guittot
  1 sibling, 1 reply; 62+ messages in thread
From: Quentin Perret @ 2024-09-20 16:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

Hi Vincent,

On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> Keep looking for an energy efficient CPU even when the system is
> overutilized and use the CPU returned by feec() if it has been able to find
> one. Otherwise fallback to the default performance and spread mode of the
> scheduler.
> A system can become overutilized for a short time when workers of a
> workqueue wake up for a short background work like vmstat update.
> Continuing to look for a energy efficient CPU will prevent to break the
> power packing of tasks.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2273eecf6086..e46af2416159 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  		    cpumask_test_cpu(cpu, p->cpus_ptr))
>  			return cpu;
>  
> -		if (!is_rd_overutilized(this_rq()->rd)) {
> +		if (sched_energy_enabled()) {

As mentioned during LPC, when there is no idle time on a CPU, the
utilization value of the tasks running on it is no longer a good
approximation for how much the tasks want, it becomes an image of how
much CPU time they were given. That is particularly problematic in the
co-scheduling case, but not just.

IOW, when we're OU, the util values are bogus, so using feec() is frankly
wrong IMO. If we don't have a good idea of how long tasks want to run,
the EM just can't help us with anything so we should stay away from it.

I understand how just plain bailing out as we do today is sub-optimal,
but whatever we do to improve on that can't be doing utilization-based
task placement.

Have you considered making the default (non-EAS) wake-up path a little
more reluctant to migrations when EAS is enabled? That should allow us
to maintain a somewhat stable task placement when OU is only transient
(e.g. due to misfit), but without using util values when we really
shouldn't.

Thoughts?

Thanks,
Quentin

>  			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>  			if (new_cpu >= 0)
>  				return new_cpu;
> -- 
> 2.34.1
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-09-20 16:17   ` Quentin Perret
@ 2024-09-25 13:27     ` Vincent Guittot
  2024-09-26  9:10       ` Quentin Perret
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-09-25 13:27 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Fri, 20 Sept 2024 at 18:17, Quentin Perret <qperret@google.com> wrote:
>
> Hi Vincent,
>
> On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> > Keep looking for an energy efficient CPU even when the system is
> > overutilized and use the CPU returned by feec() if it has been able to find
> > one. Otherwise fallback to the default performance and spread mode of the
> > scheduler.
> > A system can become overutilized for a short time when workers of a
> > workqueue wake up for a short background work like vmstat update.
> > Continuing to look for a energy efficient CPU will prevent to break the
> > power packing of tasks.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2273eecf6086..e46af2416159 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> >                       return cpu;
> >
> > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > +             if (sched_energy_enabled()) {
>
> As mentioned during LPC, when there is no idle time on a CPU, the
> utilization value of the tasks running on it is no longer a good
> approximation for how much the tasks want, it becomes an image of how
> much CPU time they were given. That is particularly problematic in the
> co-scheduling case, but not just.

Yes, but this is not always true when overutilized; it only becomes true
after a certain amount of time. When a CPU is fully utilized without any
idle time anymore, feec() will not find a CPU for the task.

>
> IOW, when we're OU, the util values are bogus, so using feec() is frankly
> wrong IMO. If we don't have a good idea of how long tasks want to run,

Except that the CPU is not already fully busy without idle time when
the system is overutilized. We have a ~20% margin on each CPU, which
means that the system is overutilized as soon as one CPU is more than 80%
utilized, which is far from not having idle time anymore. So even when
OU, it doesn't mean that all CPUs don't have idle time; most of the
time the opposite happens and feec() can still make a useful decision.
Also, when there is no idle time on a CPU, the task doesn't fit and
feec() doesn't return a CPU.
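
To make the 80% point concrete, here is a minimal sketch of the threshold.
The macro is the kernel's fits_capacity() headroom test; the helper functions
are simplified for illustration (no uclamp, no capacity pressure) and are not
the kernel implementations.

```c
#include <assert.h>

/* Kernel's headroom check: util fits if it is below ~80% of capacity. */
#define fits_capacity(util, cap) ((util) * 1280UL < (cap) * 1024UL)

/* Simplified per-CPU check: ignores uclamp and capacity pressure. */
int cpu_overutilized(unsigned long util, unsigned long capacity)
{
	return !fits_capacity(util, capacity);
}

/* The root domain is flagged as soon as any single CPU is overutilized. */
int rd_overutilized(const unsigned long *util, const unsigned long *capacity,
		    int nr_cpus)
{
	assert(nr_cpus > 0);
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		if (cpu_overutilized(util[cpu], capacity[cpu]))
			return 1;
	return 0;
}

/* Capacity still unused on a CPU, i.e. the idle time that is left. */
unsigned long spare_capacity(unsigned long util, unsigned long capacity)
{
	return util < capacity ? capacity - util : 0;
}
```

With a capacity of 1024, a utilization of 819 still fits and 820 does not, so
one CPU at 830 flags the whole root domain while it still has 194 of spare
capacity and the other CPUs may be almost idle.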

Then, the old way to compute invariant utilization was particularly
sensitive to the overutilized state because the utilization was capped
and asymptotically converging to the max cpu compute capacity, but this is
not true with the new pelt: we can go above the compute capacity of the
cpu and remain correct as long as we are able to increase the compute
capacity before there is no idle time left. In theory, the utilization
"could" be correct until we reach 1024 (for utilization or runnable)
and there is no way to catch up the temporary under compute capacity.

> the EM just can't help us with anything so we should stay away from it.
>
> I understand how just plain bailing out as we do today is sub-optimal,
> but whatever we do to improve on that can't be doing utilization-based
> task placement.
>
> Have you considered making the default (non-EAS) wake-up path a little
> more reluctant to migrations when EAS is enabled? That should allow us
> to maintain a somewhat stable task placement when OU is only transient
> (e.g. due to misfit), but without using util values when we really
> shouldn't.
>
> Thoughts?

As mentioned above, OU doesn't mean no idle time anymore, and in this
case utilization is still relevant. I would be in favor of adding
more performance related decisions into feec(), similarly to what is
done in patch 3; for example, if a cpu doesn't fit
we could still return a CPU with more performance focus.

>
> Thanks,
> Quentin
>
> >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >                       if (new_cpu >= 0)
> >                               return new_cpu;
> > --
> > 2.34.1
> >
> >

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-09-25 13:27     ` Vincent Guittot
@ 2024-09-26  9:10       ` Quentin Perret
  2024-10-01 16:20         ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Quentin Perret @ 2024-09-26  9:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

Hi Vincent,

On Wednesday 25 Sep 2024 at 15:27:45 (+0200), Vincent Guittot wrote:
> On Fri, 20 Sept 2024 at 18:17, Quentin Perret <qperret@google.com> wrote:
> >
> > Hi Vincent,
> >
> > On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> > > Keep looking for an energy efficient CPU even when the system is
> > > overutilized and use the CPU returned by feec() if it has been able to find
> > > one. Otherwise fallback to the default performance and spread mode of the
> > > scheduler.
> > > A system can become overutilized for a short time when workers of a
> > > workqueue wake up for a short background work like vmstat update.
> > > Continuing to look for a energy efficient CPU will prevent to break the
> > > power packing of tasks.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 2273eecf6086..e46af2416159 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> > >                       return cpu;
> > >
> > > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > > +             if (sched_energy_enabled()) {
> >
> > As mentioned during LPC, when there is no idle time on a CPU, the
> > utilization value of the tasks running on it is no longer a good
> > approximation for how much the tasks want, it becomes an image of how
> > much CPU time they were given. That is particularly problematic in the
> > co-scheduling case, but not just.
> 
> Yes, this is not always true when overutilized and  true after a
> certain amount of time. When a CPU is fully utilized without any idle
> time anymore, feec() will not find a CPU for the task

Well the problem is that it might actually find a CPU for the task -- a
co-scheduled task can obviously look arbitrarily small from a util PoV.
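
A toy model of that effect, under loud assumptions: 1 ms periods, a decay
factor with a 32-period half-life as in PELT, strict round-robin between
always-runnable tasks, and no frequency or CPU invariance scaling. The
function name is invented for the illustration.

```c
#include <assert.h>

/* Decay factor per period, chosen so that PELT_Y^32 == 0.5. */
#define PELT_Y 0.97857206

/*
 * Utilization of one task after 'ms' periods when 'nr_tasks' always-runnable
 * tasks share one CPU round-robin, one period each. Hypothetical model:
 * 1 ms periods, no frequency or CPU invariance scaling.
 */
double coscheduled_task_util(int nr_tasks, int ms)
{
	double util = 0.0;

	assert(nr_tasks > 0);
	for (int t = 0; t < ms; t++) {
		int running = (t % nr_tasks) == 0;

		/* Decay the history, then add this period's contribution. */
		util = util * PELT_Y + (running ? 1024.0 * (1.0 - PELT_Y) : 0.0);
	}
	return util;
}
```

Alone on the CPU the task converges towards 1024, but with two such tasks
each settles around 512 and with four around 256, even though every one of
them would use a full CPU if it could. The util value tracks the CPU time
received, not the demand.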

> >
> > IOW, when we're OU, the util values are bogus, so using feec() is frankly
> > wrong IMO. If we don't have a good idea of how long tasks want to run,
> 
> Except that the CPU is not already fully busy without idle time when
> the system is overutilized. We have a ~20% margin on each CPU, which
> means that the system is overutilized as soon as one CPU is more than 80%
> utilized which is far from not having idle time anymore. So even when
> OU, it doesn't mean that all CPUs don't have idle time and most of the
> time the opposite happens and feec() can still make a useful decision.

My problem with the proposed change here is that it doesn't at all
distinguish the truly overloaded case (when we have more compute
demand than resources) from a system with a stable-ish utilization at
90%. If you're worried about the latter, then perhaps we should think
about redefining the OU threshold some other way (either by simply
making it higher or configurable, or changing its nature to look at the
last time we actually got idle time in the system). But I'm still rather
opinionated that util-based placement is wrong for the former.

And for what it's worth, in my experience if any of the big CPUs get
anywhere near the top of their OPP range, given that the power/perf
curve is exponential it's being penny-wise and pound-foolish to
micro-optimise the placement of the other smaller tasks from an energy
PoV at the same time. But if we can show that it helps real use-cases,
then why not.

> Also, when there is no idle time on a CPU, the task doesn't fit and
> feec() doesn't return a CPU.

It doesn't fit on that CPU but might still (incorrectly) fit on another
CPU right?

> Then, the old way to compute invariant utilization was particularly
> sensitive to the overutilized state because the utilization was capped
> and asymptotically converging to max cpu compute capacity but this is
> not true with the new pelt and we can go above compute capacity of the
> cpu and remain correct as long as we are able to increase the compute
> capacity before that there is no idle time. In theory, the utilization
> "could" be correct until we reach 1024 (for utilization or runnable)
> and there is no way to catch up the temporary under compute capacity.
> 
> > the EM just can't help us with anything so we should stay away from it.
> >
> > I understand how just plain bailing out as we do today is sub-optimal,
> > but whatever we do to improve on that can't be doing utilization-based
> > task placement.
> >
> > Have you considered making the default (non-EAS) wake-up path a little
> > more reluctant to migrations when EAS is enabled? That should allow us
> > to maintain a somewhat stable task placement when OU is only transient
> > (e.g. due to misfit), but without using util values when we really
> > shouldn't.
> >
> > Thoughts?
> 
> As mentioned above OU doesn't mean no idle time anymore and in this
> case utilization is still relevant

OK, but please distinguish this from the truly overloaded case somehow,
I really don't think we can 'break' it just to help with the corner case
when we've got 90% ish util.

> I would be in favor of adding
> more performance related decisions into feec(), similarly to what is
> done in patch 3; for example, if a cpu doesn't fit
> we could still return a CPU with more performance focus

Fine with me in principle as long as we stop using utilization as a
proxy for how much a task wants when it really isn't that any more.

Thanks!
Quentin

> >
> > Thanks,
> > Quentin
> >
> > >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > >                       if (new_cpu >= 0)
> > >                               return new_cpu;
> > > --
> > > 2.34.1
> > >
> > >

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-09-26  9:10       ` Quentin Perret
@ 2024-10-01 16:20         ` Vincent Guittot
  2024-10-01 17:50           ` Quentin Perret
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-10-01 16:20 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Thu, 26 Sept 2024 at 11:10, Quentin Perret <qperret@google.com> wrote:
>
> Hi Vincent,
>
> On Wednesday 25 Sep 2024 at 15:27:45 (+0200), Vincent Guittot wrote:
> > On Fri, 20 Sept 2024 at 18:17, Quentin Perret <qperret@google.com> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> > > > Keep looking for an energy efficient CPU even when the system is
> > > > overutilized and use the CPU returned by feec() if it has been able to find
> > > > one. Otherwise fallback to the default performance and spread mode of the
> > > > scheduler.
> > > > A system can become overutilized for a short time when workers of a
> > > > workqueue wake up for a short background work like vmstat update.
> > > > Continuing to look for a energy efficient CPU will prevent to break the
> > > > power packing of tasks.
> > > >
> > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 2273eecf6086..e46af2416159 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > >                   cpumask_test_cpu(cpu, p->cpus_ptr))
> > > >                       return cpu;
> > > >
> > > > -             if (!is_rd_overutilized(this_rq()->rd)) {
> > > > +             if (sched_energy_enabled()) {
> > >
> > > As mentioned during LPC, when there is no idle time on a CPU, the
> > > utilization value of the tasks running on it is no longer a good
> > > approximation for how much the tasks want, it becomes an image of how
> > > much CPU time they were given. That is particularly problematic in the
> > > co-scheduling case, but not just.
> >
> > Yes, this is not always true when overutilized and  true after a
> > certain amount of time. When a CPU is fully utilized without any idle
> > time anymore, feec() will not find a CPU for the task
>
> Well the problem is that is might actually find a CPU for the task -- a
> co-scheduled task can obviously look arbitrarily small from a util PoV.

With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
utilization"), the util_est remains set to the value it had before the
task had to share the cpu with other tasks, which means that the util_est
remains correct even if its util_avg decreases because of sharing the cpu
with other tasks. This has been done to cover the cases that you mention
above, where both util_avg and util_est were decreasing when tasks
started to share the CPU bandwidth with others.
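
A minimal userspace sketch of the idea described above, not the kernel's
actual code: the struct, the function name and the margin value are
illustrative assumptions, and the real update also maintains an EWMA that
is left out here. The estimate is only refreshed when the task got
roughly all the CPU time it asked for.

```c
#include <stdbool.h>

/* Illustrative margin on the 0..1024 capacity scale (assumption). */
#define UTIL_EST_MARGIN_SKETCH 10

/* Hypothetical, simplified per-task signals on the 0..1024 scale. */
struct task_signals {
	unsigned int util_avg;     /* time spent running */
	unsigned int runnable_avg; /* time spent running or waiting for a CPU */
	unsigned int util_est;     /* estimate kept across dequeues */
};

/* Returns true if util_est was refreshed at dequeue time. */
static bool util_est_update_sketch(struct task_signals *t)
{
	/*
	 * The task was runnable for noticeably longer than it ran, so it
	 * did not get all the CPU time it wanted and util_avg would
	 * underestimate it: keep the previous estimate.
	 */
	if (t->util_avg + UTIL_EST_MARGIN_SKETCH < t->runnable_avg)
		return false;

	t->util_est = t->util_avg;
	return true;
}
```

With util_avg = 300 and runnable_avg = 600 the old estimate is kept; with
runnable_avg = 305 it is refreshed to 300.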

>
> > >
> > > IOW, when we're OU, the util values are bogus, so using feec() is frankly
> > > wrong IMO. If we don't have a good idea of how long tasks want to run,
> >
> > Except that the CPU is not already fully busy without idle time when
> > the system is overutilized. We have  ~20% margin on each CPU which
> > means that system are overutilized as soon as one CPU is more than 80%
> > utilized which is far from not having idle time anymore. So even when
> > OU, it doesn't mean that all CPUs don't have idle time and most of the
> > time the opposite happens and feec() can still make a useful decision.
>
> My problem with the proposed change here is that it doesn't at all
> distinguish between the truly overloaded case (when we have more compute
> demand that resources) from a system with a stable-ish utilization at
> 90%. If you're worried about the latter, then perhaps we should think
> about redefining the OU threshold some other way (either by simply
> making higher or configurable, or changing its nature to look at the

We can definitely increase the OU threshold, but we would still have
cases where a CPU is truly overutilized yet its utilization values are
still good.

> last time we actually got idle time in the system). But I'm still rather
> opinionated that util-based placement is wrong for the former.

And feec() will return -1 for that case because util_est remains high

>
> And for what it's worth, in my experience if any of the big CPUs get
> anywhere near the top of their OPP range, given that the power/perf
> curve is exponential it's being penny-wise and pound-foolish to
> micro-optimise the placement of the other smaller tasks from an energy
> PoV at the same time. But if we can show that it helps real use-cases,
> then why not.

Thermal mitigation and/or power budget policies quickly reduce the
max compute capacity of such big CPUs, which then become overutilized at
a lower OPP; this reduces the difference between big/medium/little.

>
> > Also, when there is no idle time on a CPU, the task doesn't fit and
> > feec() doesn't return a CPU.
>
> It doesn't fit on that CPU but might still (incorrectly) fit on another
> CPU right?

The commit that I mentioned above covers those cases, and the task will
not incorrectly fit on another, smaller CPU because its util_est is
preserved during the overutilized phase.

>
> > Then, the old way to compute invariant utilization was particularly
> > sensible to the overutilized state because the utilization was capped
> > and asymptotically converging to max cpu compute capacity but this is
> > not true with the new pelt and we can go above compute capacity of the
> > cpu and remain correct as long as we are able to increase the compute
> > capacity before that there is no idle time. In theory, the utilization
> > "could" be correct until we reach 1024 (for utilization or runnable)
> > and there is no way to catch up the temporary under compute capacity.
> >
> > > the EM just can't help us with anything so we should stay away from it.
> > >
> > > I understand how just plain bailing out as we do today is sub-optimal,
> > > but whatever we do to improve on that can't be doing utilization-based
> > > task placement.
> > >
> > > Have you considered making the default (non-EAS) wake-up path a little
> > > more reluctant to migrations when EAS is enabled? That should allow us
> > > to maintain a somewhat stable task placement when OU is only transient
> > > (e.g. due to misfit), but without using util values when we really
> > > shouldn't.
> > >
> > > Thoughts?
> >
> > As mentioned above OU doesn't mean no idle time anymore and in this
> > case utilization is still relevant
>
> OK, but please distinguish this from the truly overloaded case somehow,
> I really don't think we can 'break' it just to help with the corner case
> when we've got 90% ish util.
>
> > In would be in favor of adding
> > more performance related decision into feec() similarly to have is
> > done in patch 3 which would be for example that if a cpu doesn't fit
> > we could still return  a CPU with more performance focus
>
> Fine with me in principle as long as we stop using utilization as a
> proxy for how much a task wants when it really isn't that any more.
>
> Thanks!
> Quentin
>
> > >
> > > Thanks,
> > > Quentin
> > >
> > > >                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > > >                       if (new_cpu >= 0)
> > > >                               return new_cpu;
> > > > --
> > > > 2.34.1
> > > >
> > > >

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-01 16:20         ` Vincent Guittot
@ 2024-10-01 17:50           ` Quentin Perret
  2024-10-02  7:11             ` Lukasz Luba
  2024-10-03  6:27             ` Vincent Guittot
  0 siblings, 2 replies; 62+ messages in thread
From: Quentin Perret @ 2024-10-01 17:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> utilization"), the util_est remains set the value before having to
> share the cpu with other tasks which means that the util_est remains
> correct even if its util_avg decrease because of sharing the cpu with
> other task. This has been done to cover the cases that you mention
> above whereboth util_avg and util_est where decreasing when tasks
> starts to  share  the CPU bandwidth with others

I don't think I agree about the correctness of that util_est value at
all. The above patch only makes it arbitrarily out of date in the truly
overcommitted case. All the util-based heuristics we have in the
scheduler are based around the assumption that the close future will
look like the recent past, so using an arbitrarily old util-est is still
incorrect. I can understand how this may work OK in RT-app or other
use-cases with perfectly periodic tasks for their entire lifetime and
such, but this doesn't work at all in the general case.

> And feec() will return -1 for that case because util_est remains high

And again, checking that a task fits is broken to start with if we don't
know how big the task is. When we have reasons to believe that the util
values are no longer correct (and the absence of idle time is a very
good reason for that) we just need to give up on them. The fact that we
have to resort to using out-of-date data to sort of make that work is
just another proof that this is not a good idea in the general case.

> the commit that I mentioned above covers those cases and the task will
> not incorrectly fit to another smaller CPU because its util_est is
> preserved during the overutilized phase

There are other reasons why a task may look like it fits, e.g. two tasks
coscheduled on a big CPU get 50% util each, then we migrate one away, the
CPU looks half empty. Is it half empty? We've got no way to tell until
we see idle time. The current util_avg and old util_est value are just
not helpful, they're both bad signals and we should just discard them.
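
To illustrate the co-scheduling example, here is a small userspace model
of a PELT-like geometric average (a sketch under simplifying
assumptions: 1ms steps, floating point, and a hypothetical function
name; the kernel uses fixed-point arithmetic and finer-grained
accounting). A task that runs 1ms out of every 2ms converges to about
half of 1024, whether it only wanted 50% of the CPU or was sharing the
CPU with another always-runnable task.

```c
/*
 * Utilization (0..1024 scale) of a task that runs `run_ms` out of every
 * `period_ms`, after `total_ms` of history. The decay factor y is chosen
 * so that y^32 ~= 0.5, i.e. a contribution halves after 32ms.
 */
static double pelt_util_sketch(int run_ms, int period_ms, int total_ms)
{
	const double y = 0.97857206;
	double util = 0.0;

	for (int ms = 0; ms < total_ms; ms++) {
		int running = (ms % period_ms) < run_ms;

		/* Decay the history, then add this 1ms step if the task ran. */
		util = util * y + (running ? 1024.0 * (1.0 - y) : 0.0);
	}
	return util;
}
```

pelt_util_sketch(1, 2, 1000) ends up close to 512 while
pelt_util_sketch(1, 1, 1000) ends up close to 1024; the signal alone
cannot tell which of the two the co-scheduled task would have been
without the sharing.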

So again I do feel like the best way forward would be to change the
nature of the OU threshold to actually ask cpuidle 'when was the last
time there was idle time?' (or possibly cache that in the idle task
directly). And then based on that we can decide whether we want to enter
feec() and do util-based decision, or to kick the push-pull mechanism in
your other patches, things like that. That would solve/avoid the problem
I mentioned in the previous paragraph and make the OU detection more
robust. We could also consider using different thresholds in different
places to re-enable load-balancing earlier, and give up on feec() a bit
later to avoid messing up the entire task placement when we're only
transiently OU because of misfit. But eventually, we really need to just
give up on util values altogether when we're really overcommitted, it's
really an invariant we need to keep.
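
A purely hypothetical sketch of that kind of gate; none of these names
exist in the kernel, and where the last-idle timestamp would come from
(cpuidle, or a value cached by the idle task) is left open, as in the
text above. Timestamps are assumed to come from a monotonic clock.

```c
#include <stdint.h>

/* Hypothetical outcome of the wake-up path decision. */
enum wakeup_path_sketch {
	WAKEUP_PATH_EAS,     /* util values trusted: enter feec() */
	WAKEUP_PATH_DEFAULT, /* no recent idle time: stop using util */
};

/*
 * Trust util-based placement only if the CPU has seen idle time within
 * the last `max_busy_ns`. Assumes last_idle_ns <= now_ns.
 */
static enum wakeup_path_sketch
pick_wakeup_path_sketch(uint64_t now_ns, uint64_t last_idle_ns,
			uint64_t max_busy_ns)
{
	if (now_ns - last_idle_ns <= max_busy_ns)
		return WAKEUP_PATH_EAS;

	return WAKEUP_PATH_DEFAULT;
}
```

Different callers could pass different values of max_busy_ns, which is
one way to express the "different thresholds in different places" idea.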

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-01 17:50           ` Quentin Perret
@ 2024-10-02  7:11             ` Lukasz Luba
  2024-10-02  7:55               ` Quentin Perret
  2024-10-03  6:27             ` Vincent Guittot
  1 sibling, 1 reply; 62+ messages in thread
From: Lukasz Luba @ 2024-10-02  7:11 UTC (permalink / raw)
  To: Quentin Perret, Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, rafael.j.wysocki, linux-kernel, qyousef,
	hongyan.xia2

Hi Quentin and Vincent,

On 10/1/24 18:50, Quentin Perret wrote:
> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>> utilization"), the util_est remains set the value before having to
>> share the cpu with other tasks which means that the util_est remains
>> correct even if its util_avg decrease because of sharing the cpu with
>> other task. This has been done to cover the cases that you mention
>> above whereboth util_avg and util_est where decreasing when tasks
>> starts to  share  the CPU bandwidth with others
> 
> I don't think I agree about the correctness of that util_est value at
> all. The above patch only makes it arbitrarily out of date in the truly
> overcommitted case. All the util-based heuristic we have in the
> scheduler are based around the assumption that the close future will
> look like the recent past, so using an arbitrarily old util-est is still
> incorrect. I can understand how this may work OK in RT-app or other
> use-cases with perfectly periodic tasks for their entire lifetime and
> such, but this doesn't work at all in the general case.

I remember the commit Vincent mentioned above. That was from a web
browser test 'Speedometer', not rt-app. The browser has to run the
same 'computation problem' but with quite a lot of JavaScript
frameworks. Those frameworks mainly run in the browser main thread,
with some helper threads in the background.

So it was not purely RT-app or another perfectly periodic task.
Although, IIRC Vincent was able to build a model based on rt-app
to tackle that issue.

That patch helped to better reflect the situation in the OS.

For this particular _subject_ I don't think it's relevant, though.
It was actually helping to show that the situation is worse, so
closer to OU because the task was bigger (and we avoid EAS).

> 
>> And feec() will return -1 for that case because util_est remains high
> 
> And again, checking that a task fits is broken to start with if we don't
> know how big the task is. When we have reasons to believe that the util
> values are no longer correct (and the absence of idle time is a very
> good reason for that) we just need to give up on them. The fact that we
> have to resort to using out-of-date data to sort of make that work is
> just another proof that this is not a good idea in the general case.
> 
>> the commit that I mentioned above covers those cases and the task will
>> not incorrectly fit to another smaller CPU because its util_est is
>> preserved during the overutilized phase
> 
> There are other reasons why a task may look like it fits, e.g. two tasks
> coscheduled on a big CPU get 50% util each, then we migrate one away, the
> CPU looks half empty. Is it half empty? We've got no way to tell until
> we see idle time. The current util_avg and old util_est value are just
> not helpful, they're both bad signals and we should just discard them.

So would you then reset them to 0? Or leave them as they are?
What about the other signals (cpu runqueue) which are derived from them?
That sounds like a really heavy change or an inconsistency in many places.

> 
> So again I do feel like the best way forward would be to change the
> nature of the OU threshold to actually ask cpuidle 'when was the last
> time there was idle time?' (or possibly cache that in the idle task
> directly). And then based on that we can decide whether we want to enter
> feec() and do util-based decision, or to kick the push-pull mechanism in
> your other patches, things like that. That would solve/avoid the problem
> I mentioned in the previous paragraph and make the OU detection more
> robust. We could also consider using different thresholds in different
> places to re-enable load-balancing earlier, and give up on feec() a bit
> later to avoid messing the entire task placement when we're only
> transiently OU because of misfit. But eventually, we really need to just
> give up on util values altogether when we're really overcommitted, it's
> really an invariant we need to keep.

IMHO the problem here with OU was amplified recently due to the
Uclamp_max setting + 'Max aggregation policy' + aggressive frequency
capping + fast freq switching.

Now we are in the situation where we complain about util metrics...

I've been warning Qais and Vincent that this usage of Uclamp_max
in such an environment is dangerous and might explode.

If one background task is capped hard in CPU freq, but does computation
'all the time', leaving that CPU with no idle time, then IMO
this is not good scheduling. This is a recipe for starvation.
You probably won't find any better metric.

I would suggest that we stop making the OU situation worse and more
frequent with this 'artificial starvation with uclamp_max'.

I understand we want to save energy, but uclamp_max in its current shape
has too many side effects IMO.

Why haven't we invested in the 'Bandwidth controller' (it has been there
for many years), e.g. to make it big.LITTLE aware (if that could be a
problem)?

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-02  7:11             ` Lukasz Luba
@ 2024-10-02  7:55               ` Quentin Perret
  2024-10-02  9:54                 ` Lukasz Luba
  0 siblings, 1 reply; 62+ messages in thread
From: Quentin Perret @ 2024-10-02  7:55 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, rafael.j.wysocki,
	linux-kernel, qyousef, hongyan.xia2

Hey Lukasz,

On Wednesday 02 Oct 2024 at 08:11:06 (+0100), Lukasz Luba wrote:
> Hi Quentin and Vincent,
> 
> On 10/1/24 18:50, Quentin Perret wrote:
> > On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> > > With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> > > utilization"), the util_est remains set the value before having to
> > > share the cpu with other tasks which means that the util_est remains
> > > correct even if its util_avg decrease because of sharing the cpu with
> > > other task. This has been done to cover the cases that you mention
> > > above whereboth util_avg and util_est where decreasing when tasks
> > > starts to  share  the CPU bandwidth with others
> > 
> > I don't think I agree about the correctness of that util_est value at
> > all. The above patch only makes it arbitrarily out of date in the truly
> > overcommitted case. All the util-based heuristic we have in the
> > scheduler are based around the assumption that the close future will
> > look like the recent past, so using an arbitrarily old util-est is still
> > incorrect. I can understand how this may work OK in RT-app or other
> > use-cases with perfectly periodic tasks for their entire lifetime and
> > such, but this doesn't work at all in the general case.
> 
> I remember that commit Vincent mentioned above. That was from a web
> browser test 'Speedometer', not rt-app. The browser has to run the
> same 'computation problem' but with quite a lot of JavaScript
> frameworks. Those frameworks mainly run in the browser main thread,
> with some helper threads in background.
> 
> So it was not purely RT-app or other perfectly periodic task.
> Although, IIRC Vincent was able to build a model based on rt-app
> to tackle that issue.
> 
> That patch helped to better reflect the situation in the OS.

Sure thing, I'm absolutely ready to believe that an old util-est value
will be better in certain use-cases, but again I don't think we should
conflate this with the general case. In particular a util-est that was
measured when the system was lightly loaded is absolutely not guaranteed
to be valid while it is overcommitted. Freshness matters in many cases.

> For this particular _subject_ I don't think it's relevant, though.
> It was actually helping to show that the situation is worse, so
> closer to OU because the task was bigger (and we avoid EAS).
> 
> > 
> > > And feec() will return -1 for that case because util_est remains high
> > 
> > And again, checking that a task fits is broken to start with if we don't
> > know how big the task is. When we have reasons to believe that the util
> > values are no longer correct (and the absence of idle time is a very
> > good reason for that) we just need to give up on them. The fact that we
> > have to resort to using out-of-date data to sort of make that work is
> > just another proof that this is not a good idea in the general case.
> > 
> > > the commit that I mentioned above covers those cases and the task will
> > > not incorrectly fit to another smaller CPU because its util_est is
> > > preserved during the overutilized phase
> > 
> > There are other reasons why a task may look like it fits, e.g. two tasks
> > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> > CPU looks half empty. Is it half empty? We've got no way to tell until
> > we see idle time. The current util_avg and old util_est value are just
> > not helpful, they're both bad signals and we should just discard them.
> 
> So would you then reset them to 0? Or leave them as they are?
> What about the other signals (cpu runqueue) which are derived from them?
> That sounds like really heavy change or inconsistency in many places.

I would just leave them as they are, but not look at them, pretty much
like we do today. In the overcommitted case, load is a superior signal
because it accounts for runnable time and the task weights, so we really
ought to use that instead of util.

> > 
> > So again I do feel like the best way forward would be to change the
> > nature of the OU threshold to actually ask cpuidle 'when was the last
> > time there was idle time?' (or possibly cache that in the idle task
> > directly). And then based on that we can decide whether we want to enter
> > feec() and do util-based decision, or to kick the push-pull mechanism in
> > your other patches, things like that. That would solve/avoid the problem
> > I mentioned in the previous paragraph and make the OU detection more
> > robust. We could also consider using different thresholds in different
> > places to re-enable load-balancing earlier, and give up on feec() a bit
> > later to avoid messing the entire task placement when we're only
> > transiently OU because of misfit. But eventually, we really need to just
> > give up on util values altogether when we're really overcommitted, it's
> > really an invariant we need to keep.
> 
> IMHO the problem here with OU was amplified recently due to the
> Uclamp_max setting

Ack.

> 'Max aggregation policy'

Ack.

> aggressive frequency capping

What do you mean by that?

> fast freq switching.

And not sure what fast switching has to do with the issue here?

> Now we are in the situation where we complain about util metrics...
> 
> I've been warning Qais and Vincent that this usage of Uclamp_max
> in such environment is dangerous and might explode.

I absolutely agree that uclamp max makes a huge mess of things, and util
in particular :-(

> If one background task is capped hard in CPU freq, but does computation
> 'all the time' making that CPU to have no idle time - then IMO
> this is not a good scheduling. This is a receipt for starvation.
> You probably won't find any better metric.
> 
> I would suggest to stop making the OU situation worse and more
> frequent with this 'artificial starvation with uclamp_max'.
> 
> I understand we want to safe energy, but uclamp_max in current shape
> has too many side effects IMO.
> 
> Why we haven't invested in the 'Bandwidth controller', e.g. to make
> it big.Little aware (if that could be a problem)(they were there for
> many years)?

Bandwidth control is a different thing really, not sure it can be used
interchangeably with uclamp_max in general. Running all the time at low
frequency is often going to be better from a power perspective than
running uncapped for a fixed period of time.

I think the intention of uclamp max is really to say 'these tasks have
low QoS, use spare cycles at low-ish frequency to run them'. What we
found was that it was best to use cpu.shares in conjunction with
uclamp.max to implement the 'use spare cycles' part of the previous
statement, but that was its own can of worms and caused a lot of
priority inversion problems. Hopefully the proxy exec stuff will solve
that...

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-02  7:55               ` Quentin Perret
@ 2024-10-02  9:54                 ` Lukasz Luba
  0 siblings, 0 replies; 62+ messages in thread
From: Lukasz Luba @ 2024-10-02  9:54 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, rafael.j.wysocki,
	linux-kernel, qyousef, hongyan.xia2



On 10/2/24 08:55, Quentin Perret wrote:
> Hey Lukasz,
> 
> On Wednesday 02 Oct 2024 at 08:11:06 (+0100), Lukasz Luba wrote:
>> Hi Quentin and Vincent,
>>
>> On 10/1/24 18:50, Quentin Perret wrote:
>>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>>> utilization"), the util_est remains set the value before having to
>>>> share the cpu with other tasks which means that the util_est remains
>>>> correct even if its util_avg decrease because of sharing the cpu with
>>>> other task. This has been done to cover the cases that you mention
>>>> above whereboth util_avg and util_est where decreasing when tasks
>>>> starts to  share  the CPU bandwidth with others
>>>
>>> I don't think I agree about the correctness of that util_est value at
>>> all. The above patch only makes it arbitrarily out of date in the truly
>>> overcommitted case. All the util-based heuristic we have in the
>>> scheduler are based around the assumption that the close future will
>>> look like the recent past, so using an arbitrarily old util-est is still
>>> incorrect. I can understand how this may work OK in RT-app or other
>>> use-cases with perfectly periodic tasks for their entire lifetime and
>>> such, but this doesn't work at all in the general case.
>>
>> I remember that commit Vincent mentioned above. That was from a web
>> browser test 'Speedometer', not rt-app. The browser has to run the
>> same 'computation problem' but with quite a lot of JavaScript
>> frameworks. Those frameworks mainly run in the browser main thread,
>> with some helper threads in background.
>>
>> So it was not purely RT-app or other perfectly periodic task.
>> Although, IIRC Vincent was able to build a model based on rt-app
>> to tackle that issue.
>>
>> That patch helped to better reflect the situation in the OS.
> 
> Sure thing, I'm absolutely ready to believe that an old util-est value
> will be better in certain use-cases, but again I don't think we should
> conflate this for the general case. In particular a util-est that was
> measured when the system was lightly loaded is absolutely not guaranteed
> to be valid while it is overcommitted. Freshness matters in many cases.

I think I got your point, fair enough.

> 
>> For this particular _subject_ I don't think it's relevant, though.
>> It was actually helping to show that the situation is worse, so
>> closer to OU because the task was bigger (and we avoid EAS).
>>
>>>
>>>> And feec() will return -1 for that case because util_est remains high
>>>
>>> And again, checking that a task fits is broken to start with if we don't
>>> know how big the task is. When we have reasons to believe that the util
>>> values are no longer correct (and the absence of idle time is a very
>>> good reason for that) we just need to give up on them. The fact that we
>>> have to resort to using out-of-date data to sort of make that work is
>>> just another proof that this is not a good idea in the general case.
>>>
>>>> the commit that I mentioned above covers those cases and the task will
>>>> not incorrectly fit to another smaller CPU because its util_est is
>>>> preserved during the overutilized phase
>>>
>>> There are other reasons why a task may look like it fits, e.g. two tasks
>>> coscheduled on a big CPU get 50% util each, then we migrate one away, the
>>> CPU looks half empty. Is it half empty? We've got no way to tell until
>>> we see idle time. The current util_avg and old util_est value are just
>>> not helpful, they're both bad signals and we should just discard them.
>>
>> So would you then reset them to 0? Or leave them as they are?
>> What about the other signals (cpu runqueue) which are derived from them?
>> That sounds like really heavy change or inconsistency in many places.
> 
> I would just leave them as they are, but not look at them, pretty much
> like we do today. In the overcommitted case, load is a superior signal
> because it accounts for runnable time and the task weights, so we really
> ought to use that instead of util.

OK, makes sense, thanks. Sounds like a valid plan to try then.

> 
>>>
>>> So again I do feel like the best way forward would be to change the
>>> nature of the OU threshold to actually ask cpuidle 'when was the last
>>> time there was idle time?' (or possibly cache that in the idle task
>>> directly). And then based on that we can decide whether we want to enter
>>> feec() and do util-based decision, or to kick the push-pull mechanism in
>>> your other patches, things like that. That would solve/avoid the problem
>>> I mentioned in the previous paragraph and make the OU detection more
>>> robust. We could also consider using different thresholds in different
>>> places to re-enable load-balancing earlier, and give up on feec() a bit
>>> later to avoid messing the entire task placement when we're only
>>> transiently OU because of misfit. But eventually, we really need to just
>>> give up on util values altogether when we're really overcommitted, it's
>>> really an invariant we need to keep.
>>
>> IMHO the problem here with OU was amplified recently due to the
>> Uclamp_max setting
> 
> Ack.
> 
>> 'Max aggregation policy'
> 
> Ack.
> 
>> aggressive frequency capping
> 
> What do you mean by that?
> 
>> fast freq switching.
> 
> And not sure what fast switching has to do with the issue here?

I mean, with some recent changes flying around on LKML we are heading
towards a kind of 'per-task DVFS', like switching the frequency 'just for
that task' when it's scheduled. This was concerning me. I think we tried
to have a 'planning' view in the scheduler of the performance that will
be requested from the CPUs in the future. The future is hard to predict;
sometimes even this +20% CPU freq margin was helping us (when we ran a
bit longer than our prediction).

With this approach of tackling all of the 'safety margins' to save
more power, I'm worried about harming normal general scheduling
and performance.

I'm a big fan of saving energy, but not of pushing this so hard that
the general scheduling concept might suffer.
E.g. this _subject_, EAS when OU, is where I'm careful.
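
For reference, a small sketch of how such a frequency margin can be
applied, assuming the headroom is computed as util + util/4 (so that
util sits at roughly 80% of the selected performance level) and that
frequency scales linearly with the request. The function names are
hypothetical and the actual governor logic has more inputs than this.

```c
#include <stdint.h>

/* Add ~25% headroom on top of the measured utilization. */
static uint64_t add_headroom_sketch(uint64_t util)
{
	return util + (util >> 2);
}

/* Map utilization (0..max_cap) to a frequency in kHz, capped at max. */
static uint64_t next_freq_sketch(uint64_t util, uint64_t max_cap,
				 uint64_t max_freq_khz)
{
	uint64_t boosted = add_headroom_sketch(util);

	if (boosted > max_cap)
		boosted = max_cap;

	return max_freq_khz * boosted / max_cap;
}
```

With util = 400 on a 1024-capacity CPU, the request becomes 500, so a
task that runs somewhat longer than predicted still finishes without the
CPU running out of idle time at the selected frequency.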


> 
>> Now we are in the situation where we complain about util metrics...
>>
>> I've been warning Qais and Vincent that this usage of Uclamp_max
>> in such environment is dangerous and might explode.
> 
> I absolutely agree that uclamp max makes a huge mess of things, and util
> in particular :-(
> 
>> If one background task is capped hard in CPU freq, but does computation
>> 'all the time' making that CPU to have no idle time - then IMO
>> this is not a good scheduling. This is a receipt for starvation.
>> You probably won't find any better metric.
>>
>> I would suggest to stop making the OU situation worse and more
>> frequent with this 'artificial starvation with uclamp_max'.
>>
>> I understand we want to safe energy, but uclamp_max in current shape
>> has too many side effects IMO.
>>
>> Why we haven't invested in the 'Bandwidth controller', e.g. to make
>> it big.Little aware (if that could be a problem)(they were there for
>> many years)?
> 
> Bandwidth control is a different thing really, not sure it can be used
> interchangeably with uclamp_max in general. Running all the time at low
> frequency is often going to be better from a power perspective than
> running uncapped for a fixed period of time.
> 
> I think the intention of uclamp max is really to say 'these tasks have
> low QoS, use spare cycles at low-ish frequency to run them'. What we
> found was that it was best to use cpu.shares in conjunction with
> uclamp.max to implement the 'use spare cycles' part of the previous
> statement, but that was its own can of worms and caused a lot of
> priority inversion problems. Hopefully the proxy exec stuff will solve
> that...
> 

Yes, I see your point. It looks like some new ideas are very welcome.

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-01 17:50           ` Quentin Perret
  2024-10-02  7:11             ` Lukasz Luba
@ 2024-10-03  6:27             ` Vincent Guittot
  2024-10-03  8:15               ` Lukasz Luba
                                 ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-10-03  6:27 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
>
> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> > With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> > utilization"), the util_est remains set to the value before having to
> > share the cpu with other tasks which means that the util_est remains
> > correct even if its util_avg decreases because of sharing the cpu with
> > other tasks. This has been done to cover the cases that you mention
> > above where both util_avg and util_est were decreasing when tasks
> > start to share the CPU bandwidth with others
>
> I don't think I agree about the correctness of that util_est value at
> all. The above patch only makes it arbitrarily out of date in the truly
> overcommitted case. All the util-based heuristic we have in the
> scheduler are based around the assumption that the close future will
> look like the recent past, so using an arbitrarily old util-est is still
> incorrect. I can understand how this may work OK in RT-app or other

This fixes a real use case on Android devices.

> use-cases with perfectly periodic tasks for their entire lifetime and
> such, but this doesn't work at all in the general case.
>
> > And feec() will return -1 for that case because util_est remains high
>
> And again, checking that a task fits is broken to start with if we don't
> know how big the task is. When we have reasons to believe that the util
> values are no longer correct (and the absence of idle time is a very
> good reason for that) we just need to give up on them. The fact that we
> have to resort to using out-of-date data to sort of make that work is
> just another proof that this is not a good idea in the general case.

That's where I disagree: this is not an out-of-date value, it is the
last correct one before sharing the CPU.

>
> > the commit that I mentioned above covers those cases and the task will
> > not incorrectly fit to another smaller CPU because its util_est is
> > preserved during the overutilized phase
>
> There are other reasons why a task may look like it fits, e.g. two tasks
> coscheduled on a big CPU get 50% util each, then we migrate one away, the

50% of what? Not the CPU capacity. I think you are missing one piece of
the recent PELT behavior here. I fully agree that when the system is
overcommitted the util-based task placement is not correct, but I also
think that feec() can't find a CPU in such a case.

> CPU looks half empty. Is it half empty? We've got no way to tell until

Same here: it doesn't look half empty, thanks to util_est.

> we see idle time. The current util_avg and old util_est value are just
> not helpful, they're both bad signals and we should just discard them.
>
> So again I do feel like the best way forward would be to change the
> nature of the OU threshold to actually ask cpuidle 'when was the last
> time there was idle time?' (or possibly cache that in the idle task
> directly). And then based on that we can decide whether we want to enter
> feec() and do util-based decision, or to kick the push-pull mechanism in
> your other patches, things like that. That would solve/avoid the problem
> I mentioned in the previous paragraph and make the OU detection more
> robust. We could also consider using different thresholds in different
> places to re-enable load-balancing earlier, and give up on feec() a bit
> later to avoid messing the entire task placement when we're only
> transiently OU because of misfit. But eventually, we really need to just
> give up on util values altogether when we're really overcommitted, it's
> really an invariant we need to keep.

For now, I will raise the OU threshold to the CPU capacity to reduce
false overutilized states caused by misfit tasks, which is what I
really care about. The redesign of OU will come in a different series
as it implies more rework. IIUC your point, we are more interested
in the prev CPU than in the current one.
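
As a rough illustration of the threshold change described above, the
sketch below contrasts a margin-based overutilized check with one that
only trips once utilization reaches the full CPU capacity. The helper
names are hypothetical and the 1280/1024 margin (roughly 20% headroom)
is an assumption made for the example; this is not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Hypothetical helper: the CPU is flagged overutilized as soon as
 * util no longer leaves ~20% headroom (util * 1.25 >= capacity).
 */
static bool ou_with_margin(unsigned long util, unsigned long capacity)
{
	assert(capacity > 0 && capacity <= SCHED_CAPACITY_SCALE);
	return util * 1280 >= capacity * 1024;
}

/*
 * Hypothetical helper: the CPU is flagged overutilized only when util
 * reaches the CPU capacity itself.
 */
static bool ou_at_capacity(unsigned long util, unsigned long capacity)
{
	assert(capacity > 0 && capacity <= SCHED_CAPACITY_SCALE);
	return util >= capacity;
}
```

With an assumed little CPU capacity of 400, a utilization of 350 trips
the margin-based check but not the capacity-based one, which is the kind
of transient OU state the higher threshold is meant to avoid.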

>
> Thanks,
> Quentin

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  6:27             ` Vincent Guittot
@ 2024-10-03  8:15               ` Lukasz Luba
  2024-10-03  8:26                 ` Quentin Perret
  2024-10-03  8:52                 ` Vincent Guittot
  2024-10-03  8:21               ` Quentin Perret
  2024-11-19 14:46               ` Christian Loehle
  2 siblings, 2 replies; 62+ messages in thread
From: Lukasz Luba @ 2024-10-03  8:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, rafael.j.wysocki, linux-kernel, qyousef,
	hongyan.xia2, Quentin Perret

Hi Vincent,

On 10/3/24 07:27, Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
>>
>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>> utilization"), the util_est remains set to the value before having to
>>> share the cpu with other tasks which means that the util_est remains
>>> correct even if its util_avg decreases because of sharing the cpu with
>>> other tasks. This has been done to cover the cases that you mention
>>> above where both util_avg and util_est were decreasing when tasks
>>> start to share the CPU bandwidth with others
>>
>> I don't think I agree about the correctness of that util_est value at
>> all. The above patch only makes it arbitrarily out of date in the truly
>> overcommitted case. All the util-based heuristic we have in the
>> scheduler are based around the assumption that the close future will
>> look like the recent past, so using an arbitrarily old util-est is still
>> incorrect. I can understand how this may work OK in RT-app or other
> 
> This fixes a real use case on android device
> 
>> use-cases with perfectly periodic tasks for their entire lifetime and
>> such, but this doesn't work at all in the general case.
>>
>>> And feec() will return -1 for that case because util_est remains high
>>
>> And again, checking that a task fits is broken to start with if we don't
>> know how big the task is. When we have reasons to believe that the util
>> values are no longer correct (and the absence of idle time is a very
>> good reason for that) we just need to give up on them. The fact that we
>> have to resort to using out-of-date data to sort of make that work is
>> just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu
> 
>>
>>> the commit that I mentioned above covers those cases and the task will
>>> not incorrectly fit to another smaller CPU because its util_est is
>>> preserved during the overutilized phase
>>
>> There are other reasons why a task may look like it fits, e.g. two tasks
>> coscheduled on a big CPU get 50% util each, then we migrate one away, the
> 
> 50% of what ? not the cpu capacity. I think you miss one piece of the
> recent pelt behavior here. I fully agree that when the system is
> overcommitted the util base task placement is not correct but I also
> think that feec() can't find a cpu in such case
> 
>> CPU looks half empty. Is it half empty? We've got no way to tell until
> 
> The same here, it's not thanks to util_est
> 
>> we see idle time. The current util_avg and old util_est value are just
>> not helpful, they're both bad signals and we should just discard them.
>>
>> So again I do feel like the best way forward would be to change the
>> nature of the OU threshold to actually ask cpuidle 'when was the last
>> time there was idle time?' (or possibly cache that in the idle task
>> directly). And then based on that we can decide whether we want to enter
>> feec() and do util-based decision, or to kick the push-pull mechanism in
>> your other patches, things like that. That would solve/avoid the problem
>> I mentioned in the previous paragraph and make the OU detection more
>> robust. We could also consider using different thresholds in different
>> places to re-enable load-balancing earlier, and give up on feec() a bit
>> later to avoid messing the entire task placement when we're only
>> transiently OU because of misfit. But eventually, we really need to just
>> give up on util values altogether when we're really overcommitted, it's
>> really an invariant we need to keep.
> 
> For now, I will increase the OU threshold to cpu capacity to reduce
> the false overutilized state because of misfit tasks which is what I
> really care about. The redesign of OU will come in a different series
> as this implies more rework. IIUC your point, we are more interested
> by the prev cpu than the current one
> 

What do you mean by that?
Is it due to OU in e.g. Little cluster?
Is it amplified by the uclamp_max usage?

You're heavily rewriting EAS+EM and I would like to understand
your motivation.

BTW, do you know that anyone who wants to improve EAS/EM
should be able to provide power numbers?

W/o the power numbers the discussion is moot. Many times SW engineers
have wrong assumptions about HW, therefore we have to test and
measure. There are confidential power saving techniques in HW
that can be missed and some ugly workaround created in SW for issues
which don't exist.

That's why we have to discuss the power numbers.

This _subject_ is no different. If EAS is going to help
even in OU state - we need the numbers.

Regards,
Lukasz

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  8:15               ` Lukasz Luba
@ 2024-10-03  8:26                 ` Quentin Perret
  2024-10-03  8:52                 ` Vincent Guittot
  1 sibling, 0 replies; 62+ messages in thread
From: Quentin Perret @ 2024-10-03  8:26 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, rafael.j.wysocki,
	linux-kernel, qyousef, hongyan.xia2

On Thursday 03 Oct 2024 at 09:15:31 (+0100), Lukasz Luba wrote:
> BTW, do you know that if you or anyone wants to improve the EAS/EM
> should be able to provide the power numbers?
> 
> W/o the power numbers the discussion is moot. Many times SW engineers
> have wrong assumptions about HW, therefore we have to test and
> measure. There are confidential power saving techniques in HW
> that can be missed and some ugly workaround created in SW for issues
> which don't exist.
> 
> That's why we have to discuss the power numbers.

And generally speaking +1 to the above, it would be nice to have power
numbers to motivate the series better. The hackbench results are nice to
show the limited overhead, but they obviously don't help evaluating the
patches against what they claim to do (making better energy decisions in
feec() and such).

Thanks,
Quentin

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  8:15               ` Lukasz Luba
  2024-10-03  8:26                 ` Quentin Perret
@ 2024-10-03  8:52                 ` Vincent Guittot
  1 sibling, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-10-03  8:52 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, rafael.j.wysocki, linux-kernel, qyousef,
	hongyan.xia2, Quentin Perret

On Thu, 3 Oct 2024 at 10:14, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Vincent,
>
> On 10/3/24 07:27, Vincent Guittot wrote:
> > On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
> >>
> >> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> >>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> >>> utilization"), the util_est remains set to the value before having to
> >>> share the cpu with other tasks which means that the util_est remains
> >>> correct even if its util_avg decreases because of sharing the cpu with
> >>> other tasks. This has been done to cover the cases that you mention
> >>> above where both util_avg and util_est were decreasing when tasks
> >>> start to share the CPU bandwidth with others
> >>
> >> I don't think I agree about the correctness of that util_est value at
> >> all. The above patch only makes it arbitrarily out of date in the truly
> >> overcommitted case. All the util-based heuristic we have in the
> >> scheduler are based around the assumption that the close future will
> >> look like the recent past, so using an arbitrarily old util-est is still
> >> incorrect. I can understand how this may work OK in RT-app or other
> >
> > This fixes a real use case on android device
> >
> >> use-cases with perfectly periodic tasks for their entire lifetime and
> >> such, but this doesn't work at all in the general case.
> >>
> >>> And feec() will return -1 for that case because util_est remains high
> >>
> >> And again, checking that a task fits is broken to start with if we don't
> >> know how big the task is. When we have reasons to believe that the util
> >> values are no longer correct (and the absence of idle time is a very
> >> good reason for that) we just need to give up on them. The fact that we
> >> have to resort to using out-of-date data to sort of make that work is
> >> just another proof that this is not a good idea in the general case.
> >
> > That's where I disagree, this is not an out-of-date value, this is the
> > last correct one before sharing the cpu
> >
> >>
> >>> the commit that I mentioned above covers those cases and the task will
> >>> not incorrectly fit to another smaller CPU because its util_est is
> >>> preserved during the overutilized phase
> >>
> >> There are other reasons why a task may look like it fits, e.g. two tasks
> >> coscheduled on a big CPU get 50% util each, then we migrate one away, the
> >
> > 50% of what ? not the cpu capacity. I think you miss one piece of the
> > recent pelt behavior here. I fully agree that when the system is
> > overcommitted the util base task placement is not correct but I also
> > think that feec() can't find a cpu in such case
> >
> >> CPU looks half empty. Is it half empty? We've got no way to tell until
> >
> > The same here, it's not thanks to util_est
> >
> >> we see idle time. The current util_avg and old util_est value are just
> >> not helpful, they're both bad signals and we should just discard them.
> >>
> >> So again I do feel like the best way forward would be to change the
> >> nature of the OU threshold to actually ask cpuidle 'when was the last
> >> time there was idle time?' (or possibly cache that in the idle task
> >> directly). And then based on that we can decide whether we want to enter
> >> feec() and do util-based decision, or to kick the push-pull mechanism in
> >> your other patches, things like that. That would solve/avoid the problem
> >> I mentioned in the previous paragraph and make the OU detection more
> >> robust. We could also consider using different thresholds in different
> >> places to re-enable load-balancing earlier, and give up on feec() a bit
> >> later to avoid messing the entire task placement when we're only
> >> transiently OU because of misfit. But eventually, we really need to just
> >> give up on util values altogether when we're really overcommitted, it's
> >> really an invariant we need to keep.
> >
> > For now, I will increase the OU threshold to cpu capacity to reduce
> > the false overutilized state because of misfit tasks which is what I
> > really care about. The redesign of OU will come in a different series
> > as this implies more rework. IIUC your point, we are more interested
> > by the prev cpu than the current one
> >
>
> What do you mean by that?
> Is it due to OU in e.g. Little cluster?
> Is it amplified by the uclamp_max usage?

You need to know if the prev cpu was overcommitted to know if the task
utilization is correct and usable

>
> You're re-writing heavily the EAS+EM and I would like to understand
> your motivation.

I want to cover more cases that are obviously not covered by the
current EAS implementation.

>
> BTW, do you know that if you or anyone wants to improve the EAS/EM
> should be able to provide the power numbers?

Having power numbers with the latest mainline kernel is not always
easy as platforms don't support it. Typically, Pixel 6 doesn't support
v6.11 or even v6.12-rc1 with enough power optimization. And older
custom kernels don't get the latest changes and are often modified with
out-of-tree code, which is out of the scope of the discussion.

>
> W/o the power numbers the discussion is moot. Many times SW engineers
> have wrong assumptions about HW, therefore we have to test and
> measure. There are confidential power saving techniques in HW
> that can be missed and some ugly workaround created in SW for issues
> which don't exist.
>
> That's why we have to discuss the power numbers.
>
> This _subject_ is not different. If EAS is going to help
> even in OU state - we need the numbers.

I don't expect EAS to help during the OU state, but rather to prevent
spreading everything around blindly. I was more concerned with making
sure that EAS doesn't regress perf too much when overutilized, so it can
keep sane task placement whenever possible.

>
> Regards,
> Lukasz

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  6:27             ` Vincent Guittot
  2024-10-03  8:15               ` Lukasz Luba
@ 2024-10-03  8:21               ` Quentin Perret
  2024-10-03  8:57                 ` Vincent Guittot
  2024-11-19 14:46               ` Christian Loehle
  2 siblings, 1 reply; 62+ messages in thread
From: Quentin Perret @ 2024-10-03  8:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Thursday 03 Oct 2024 at 08:27:00 (+0200), Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
> > And again, checking that a task fits is broken to start with if we don't
> > know how big the task is. When we have reasons to believe that the util
> > values are no longer correct (and the absence of idle time is a very
> > good reason for that) we just need to give up on them. The fact that we
> > have to resort to using out-of-date data to sort of make that work is
> > just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu

This value is arbitrarily old, so of course it is out of date. This only
sort of works for tasks that don't change their behaviour. That's true
for some use-cases, yes, but absolutely not in the general case. How
can you know that the last correct value before sharing the CPU is still
valid minutes later? The fact that the system started to be
overcommitted is a good indication that something has changed, so we
really can't tell. Also, how is any of this going to work for newly
created tasks while we're overcommitted for example?

> > > the commit that I mentioned above covers those cases and the task will
> > > not incorrectly fit to another smaller CPU because its util_est is
> > > preserved during the overutilized phase
> >
> > There are other reasons why a task may look like it fits, e.g. two tasks
> > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> 
> 50% of what ?

50% of SCHED_CAPACITY_SCALE (the above sentence mentions a 'big' CPU, and
for simplicity I assumed no 'pressure' of any kind).

> not the cpu capacity. I think you miss one piece of the
> recent pelt behavior here

That could very well be the case, which piece are you thinking of?

> I fully agree that when the system is
> overcommitted the util base task placement is not correct but I also
> think that feec() can't find a cpu in such case

But why are we even entering feec() then? Isn't this just looking for
trouble really? As per the example above, task migrations can cause util
'gaps' on the source CPU which may make it appear like a good candidate
from an energy standpoint, but it's all bogus really. And let's not even
talk about how wrong the EM is going to be when simulating a potential task
migration in the overcommitted case.
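
To put numbers on the 'gap' example above, here is a minimal sketch that
looks at util_avg alone and deliberately ignores util_est. Both helpers
are hypothetical and assume always-running tasks that share the CPU
equally and have reached a steady state.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Hypothetical: steady-state util each of nr_tasks always-running
 * tasks ends up with when they share one CPU equally.
 */
static unsigned long shared_util(unsigned long capacity, unsigned int nr_tasks)
{
	assert(nr_tasks > 0);
	return capacity / nr_tasks;
}

/*
 * Hypothetical: spare capacity as seen by a util_avg-only view of the
 * CPU. It says nothing about what the remaining tasks would actually
 * consume if they were left alone on the CPU.
 */
static unsigned long apparent_spare(unsigned long capacity,
				    unsigned long util_sum)
{
	return util_sum >= capacity ? 0 : capacity - util_sum;
}
```

With two tasks on a big CPU, each sees 512. Right after one is migrated
away the source CPU reports 512 of spare capacity, even though the
remaining task may in fact want the whole CPU.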

> > CPU looks half empty. Is it half empty? We've got no way to tell until
> 
> The same here, it's not thanks to util_est

And again, an out-of-date util est value is not helpful in the general
case. It helps certain use-cases, sure, but please let's not promote it
to a load-bearing construct on top of which we build our entire
scheduling strategy :-)

> > we see idle time. The current util_avg and old util_est value are just
> > not helpful, they're both bad signals and we should just discard them.
> >
> > So again I do feel like the best way forward would be to change the
> > nature of the OU threshold to actually ask cpuidle 'when was the last
> > time there was idle time?' (or possibly cache that in the idle task
> > directly). And then based on that we can decide whether we want to enter
> > feec() and do util-based decision, or to kick the push-pull mechanism in
> > your other patches, things like that. That would solve/avoid the problem
> > I mentioned in the previous paragraph and make the OU detection more
> > robust. We could also consider using different thresholds in different
> > places to re-enable load-balancing earlier, and give up on feec() a bit
> > later to avoid messing the entire task placement when we're only
> > transiently OU because of misfit. But eventually, we really need to just
> > give up on util values altogether when we're really overcommitted, it's
> > really an invariant we need to keep.
> 
> For now, I will increase the OU threshold to cpu capacity to reduce
> the false overutilized state because of misfit tasks which is what I
> really care about.

Cool, and FWIW I am supportive of making this whole part of the code
better -- a transient OU state due to misfit does make a mess of things
and we should indeed be able to do better.

> The redesign of OU will come in a different series
> as this implies more rework.

Ack, this can be made orthogonal to this work I think.

> IIUC your point, we are more interested
> by the prev cpu than the current one

Hmm, not sure I understand that part. What do you mean?

Thanks,
Quentin

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  8:21               ` Quentin Perret
@ 2024-10-03  8:57                 ` Vincent Guittot
  2024-10-03  9:52                   ` Quentin Perret
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-10-03  8:57 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Thu, 3 Oct 2024 at 10:21, Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 03 Oct 2024 at 08:27:00 (+0200), Vincent Guittot wrote:
> > On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
> > > And again, checking that a task fits is broken to start with if we don't
> > > know how big the task is. When we have reasons to believe that the util
> > > values are no longer correct (and the absence of idle time is a very
> > > good reason for that) we just need to give up on them. The fact that we
> > > have to resort to using out-of-date data to sort of make that work is
> > > just another proof that this is not a good idea in the general case.
> >
> > That's where I disagree, this is not an out-of-date value, this is the
> > last correct one before sharing the cpu
>
> This value is arbitrarily old, so of course it is out of date. This only
> sort of works for tasks that don't change their behaviour. That's true
> for some use-cases, yes, but absolutely not in the general case. How
> can you know that the last correct value before sharing the CPU is still
> valid minutes later? The fact that the system started to be
> overcommitted is a good indication that something has changed, so we
> really can't tell. Also, how is any of this going to work for newly
> created tasks while we're overcommitted for example?
>
> > > > the commit that I mentioned above covers those cases and the task will
> > > > not incorrectly fit to another smaller CPU because its util_est is
> > > > preserved during the overutilized phase
> > >
> > > There are other reasons why a task may look like it fits, e.g. two tasks
> > > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> >
> > 50% of what ?
>
> 50% of SCHED_CAPACITY_SCALE (the above sentence mentions a 'big' CPU, and
> for simplicity I assumed no 'pressure' of any kind).

ok, i missed the big cpu

>
> > not the cpu capacity. I think you miss one piece of the
> > recent pelt behavior here
>
> That could very well be the case, which piece are you thinking of?

The current PELT algorithm tracks actual CPU utilization and can go
above the CPU capacity (but not above 1024), so a task's utilization can
become bigger than a little CPU's capacity.
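
A minimal sketch of that point, assuming a little CPU capacity of 400
purely for illustration; the helper names are hypothetical and are not
the scheduler's own functions.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Hypothetical: task utilization never exceeds the 1024 scale. */
static unsigned long task_util_clamped(unsigned long raw_util)
{
	return raw_util > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE
					       : raw_util;
}

/*
 * Hypothetical: a task is a misfit on a CPU when its utilization is
 * larger than what that CPU can deliver.
 */
static bool task_is_misfit(unsigned long task_util,
			   unsigned long cpu_capacity)
{
	assert(cpu_capacity > 0 && cpu_capacity <= SCHED_CAPACITY_SCALE);
	return task_util_clamped(task_util) > cpu_capacity;
}
```

A task whose utilization has grown to 600 is therefore visibly too big
for the assumed little CPU (capacity 400), instead of being capped at
400 and looking like it just fits.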

>
> > I fully agree that when the system is
> > overcommitted the util base task placement is not correct but I also
> > think that feec() can't find a cpu in such case
>
> But why are we even entering feec() then? Isn't this just looking for
> trouble really? As per the example above, task migrations can cause util
> 'gaps' on the source CPU which may make it appear like a good candidate
> from an energy standpoint, but it's all bogus really. And let's not even
> talk about how wrong the EM is going be when simulating a potential task
> migration in the overcommitted case.
>
> > > CPU looks half empty. Is it half empty? We've got no way to tell until
> >
> > The same here, it's not thanks to util_est
>
> And again, an out-of-date util est value is not helpful in the general
> case. It helps certain use-cases, sure, but please let's not promote it
> to a load-bearing construct on top of which we build our entire
> scheduling strategy :-)
>
> > > we see idle time. The current util_avg and old util_est value are just
> > > not helpful, they're both bad signals and we should just discard them.
> > >
> > > So again I do feel like the best way forward would be to change the
> > > nature of the OU threshold to actually ask cpuidle 'when was the last
> > > time there was idle time?' (or possibly cache that in the idle task
> > > directly). And then based on that we can decide whether we want to enter
> > > feec() and do util-based decision, or to kick the push-pull mechanism in
> > > your other patches, things like that. That would solve/avoid the problem
> > > I mentioned in the previous paragraph and make the OU detection more
> > > robust. We could also consider using different thresholds in different
> > > places to re-enable load-balancing earlier, and give up on feec() a bit
> > > later to avoid messing the entire task placement when we're only
> > > transiently OU because of misfit. But eventually, we really need to just
> > > give up on util values altogether when we're really overcommitted, it's
> > > really an invariant we need to keep.
> >
> > For now, I will increase the OU threshold to cpu capacity to reduce
> > the false overutilized state because of misfit tasks which is what I
> > really care about.
>
> Cool, and FWIW I am supportive of making this whole part of the code
> better -- a transient OU state due to misfit does make a mess of things
> and we should indeed be able to do better.
>
> > The redesign of OU will come in a different series
> > as this implies more rework.
>
> Ack, this can be made orthogonal to this work I think.
>
> > IIUC your point, we are more interested
> > by the prev cpu than the current one
>
> Hmm, not sure to understand that part. What do you mean?

As I replied to Lukasz, if you want to discard the utilization of a task
you need to check the previous CPU.

>
> Thanks,
> Quentin

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  8:57                 ` Vincent Guittot
@ 2024-10-03  9:52                   ` Quentin Perret
  2024-10-03 13:26                     ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Quentin Perret @ 2024-10-03  9:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Thursday 03 Oct 2024 at 10:57:55 (+0200), Vincent Guittot wrote:
> The current pelt algorithm tracks actual cpu utilization and can go
> above cpu capacity (but not above 1024) so a task utilization can
> become bigger than a little cpu capacity

Right, the time invariance thing. So yes, I still think that a mix of
co-scheduling and task migrations (which is likely common in the
overcommitted state) will cause some CPUs to appear lightly utilized at
least transiently, hence tricking feec() into thinking it can help when
it really can't.

> As replied to Lukasz, if you want to discard utilization of a task
> you need to check the previous cpu

Please help me out here because I'm still not quite sure what we're
talking about. Could you please expand a bit?

Thanks,
Quentin

* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  9:52                   ` Quentin Perret
@ 2024-10-03 13:26                     ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-10-03 13:26 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Thu, 3 Oct 2024 at 11:52, Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 03 Oct 2024 at 10:57:55 (+0200), Vincent Guittot wrote:
> > The current pelt algorithm tracks actual cpu utilization and can go
> > above cpu capacity (but not above 1024) so a task utilization can
> > become bigger than a little cpu capacity
>
> Right, the time invariance thing. So yes, I still think that a mix of
> co-scheduling and task migrations (which is likely common in the
> overcommitted state) will cause some CPUs to appear lightly utilized at
> least transiently, hence tricking feec() into thinking it can help when
> it really can't.
>
> > As replied to Lukasz, if you want to discard utilization of a task
> > you need to check the previous cpu
>
> Please help me out here because I'm still not quite sure what we're
> talking about. Could you please expand a bit?

If you consider that the utilization of a task is meaningless because of
CPU overcommitment, then you need to check whether the prev cpu of the
waking task was overcommitted the last time the task ran, in addition
to checking the overcommitment of the next cpu.

>
> Thanks,
> Quentin


* Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
  2024-10-03  6:27             ` Vincent Guittot
  2024-10-03  8:15               ` Lukasz Luba
  2024-10-03  8:21               ` Quentin Perret
@ 2024-11-19 14:46               ` Christian Loehle
  2 siblings, 0 replies; 62+ messages in thread
From: Christian Loehle @ 2024-11-19 14:46 UTC (permalink / raw)
  To: Vincent Guittot, Quentin Perret
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On 10/3/24 07:27, Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@google.com> wrote:
>>
>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>> utilization"), the util_est remains set to the value it had before
>>> the task had to share the cpu with other tasks, which means that the
>>> util_est remains correct even if its util_avg decreases because of
>>> sharing the cpu with other tasks. This has been done to cover the cases
>>> that you mention above, where both util_avg and util_est were decreasing
>>> when tasks start to share the CPU bandwidth with others
>>
>> I don't think I agree about the correctness of that util_est value at
>> all. The above patch only makes it arbitrarily out of date in the truly
>> overcommitted case. All the util-based heuristics we have in the
>> scheduler are based around the assumption that the close future will
>> look like the recent past, so using an arbitrarily old util-est is still
>> incorrect. I can understand how this may work OK in RT-app or other
> 
> This fixes a real use case on android device
> 
>> use-cases with perfectly periodic tasks for their entire lifetime and
>> such, but this doesn't work at all in the general case.
>>
>>> And feec() will return -1 for that case because util_est remains high
>>
>> And again, checking that a task fits is broken to start with if we don't
>> know how big the task is. When we have reasons to believe that the util
>> values are no longer correct (and the absence of idle time is a very
>> good reason for that) we just need to give up on them. The fact that we
>> have to resort to using out-of-date data to sort of make that work is
>> just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu
Just adding to this, since we are discussing the correctness of the
util_est value on an OU CPU since
commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization").
I agree that this commit fixed the immediate false util_est drop after
coscheduling two (or more) tasks, but that's one specific case.
If one of two coscheduled tasks starts growing, its util_est can't reflect
that when their compute demand grows above the CPU capacity; that commit
doesn't change this fact. There is no generally sensible way of estimating
such a util_est anyway.
It is even worse if both coscheduled tasks grow, which isn't uncommon
considering they might be related.
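
To illustrate the point about coscheduled tasks, here is a minimal userspace
sketch, not kernel code. It assumes a simplified PELT model with 1 ms periods
and a decay factor y such that y^32 = 0.5; pelt_util() and PELT_Y are
hypothetical names used for this example only.

```c
#include <assert.h>

/* Approximate per-period decay factor: y^32 = 0.5, so y ~= 0.97857. */
#define PELT_Y 0.97857206

/*
 * Simplified model of a util signal for a task that runs run_ms out of
 * every period_ms milliseconds, simulated for total_ms milliseconds and
 * starting from util. This ignores everything the real PELT code does
 * beyond the geometric decay.
 */
static double pelt_util(double util, int run_ms, int period_ms, int total_ms)
{
	assert(period_ms > 0 && run_ms >= 0 && run_ms <= period_ms);

	for (int t = 0; t < total_ms; t++) {
		int running = (t % period_ms) < run_ms;

		/* Old contributions decay every period. */
		util *= PELT_Y;
		/* A period spent running adds a full contribution. */
		if (running)
			util += 1024.0 * (1.0 - PELT_Y);
	}

	return util;
}
```

In this model a task that only gets the CPU every other millisecond settles at
roughly half of 1024, however much more it would like to run, so the signal
cannot reveal a demand above what the task actually received.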

So
"this is the last correct one before sharing the cpu" is true,
"This is not an out-of-date value" isn't true in the general case.

I agree that the OU definition can evolve; basing it on idle time makes
sense, but given the common period of 16ms (frame rate) we might delay
setting OU by quite a lot for the cases where it 'actually is true'.

Regards,
Christian


* [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (3 preceding siblings ...)
  2024-08-30 13:03 ` [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized Vincent Guittot
@ 2024-08-30 13:03 ` Vincent Guittot
  2024-09-09  9:59   ` Christian Loehle
                     ` (2 more replies)
  2024-11-07 10:14 ` [PATCH 0/5] sched/fair: Rework EAS to handle more cases Pierre Gondois
  2024-11-28 17:24 ` Hongyan Xia
  6 siblings, 3 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-08-30 13:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2, Vincent Guittot

EAS is based on wakeup events to efficiently place tasks on the system, but
there are cases where a task will not have wakeup events anymore or at a
far too low pace. For such situations, we can take advantage of the task
being put back in the enqueued list to check if it should be migrated to
another CPU. When the task is the only one running on the CPU, the tick
will check if the task is stuck on this CPU and should migrate to another
one.

Wake up events remain the main way to migrate tasks but we now detect
situations where a task is stuck on a CPU by checking that its utilization
is larger than the max available compute capacity (max cpu capacity or
uclamp max setting).

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   2 +
 2 files changed, 213 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e46af2416159..41fb18ac118b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5455,6 +5455,7 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static void dequeue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se, bool queue);
 
 static bool
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -5463,6 +5464,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_curr(cfs_rq);
 
+	dequeue_pushable_task(cfs_rq, se, false);
+
 	if (flags & DEQUEUE_DELAYED) {
 		SCHED_WARN_ON(!se->sched_delayed);
 	} else {
@@ -5585,6 +5588,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	}
 
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
+
+	dequeue_pushable_task(cfs_rq, se, true);
 }
 
 static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
@@ -5620,6 +5625,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 }
 
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static void enqueue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -5639,9 +5645,16 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(cfs_rq, prev, 0);
+
+		/*
+		 * The previous task might be eligible for pushing it on
+		 * another cpu if it is still active.
+		 */
+		enqueue_pushable_task(cfs_rq, prev);
 	}
 	SCHED_WARN_ON(cfs_rq->curr != prev);
 	cfs_rq->curr = NULL;
+
 }
 
 static void
@@ -8393,6 +8406,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
 			target_stat.capa = capacity_of(cpu);
 			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_running;
+			if ((p->on_rq) && (cpu == prev_cpu))
+				target_stat.nr_running--;
 
 			/* If the target needs a lower OPP, then look up for
 			 * the corresponding OPP and its associated cost.
@@ -8473,6 +8488,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
+{
+	unsigned long max_capa = get_actual_cpu_capacity(cpu);
+	unsigned long util = task_util_est(p);
+
+	max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
+	util = max(util, task_runnable(p));
+
+	/*
+	 * Return true only if the task might not sleep/wakeup because of a low
+	 * compute capacity. Tasks, which wake up regularly, will be handled by
+	 * feec().
+	 */
+	return (util > max_capa);
+}
+
+static int active_load_balance_cpu_stop(void *data);
+
+static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
+{
+	int new_cpu, cpu = cpu_of(rq);
+
+	if (!sched_energy_enabled())
+		return;
+
+	if (WARN_ON(!p))
+		return;
+
+	if (WARN_ON(p != rq->curr))
+		return;
+
+	if (is_migration_disabled(p))
+		return;
+
+	if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
+		return;
+
+	if (!task_misfit_cpu(p, cpu))
+		return;
+
+	new_cpu = find_energy_efficient_cpu(p, cpu);
+
+	if (new_cpu == cpu)
+		return;
+
+	/*
+	 * ->active_balance synchronizes accesses to
+	 * ->active_balance_work.  Once set, it's cleared
+	 * only after active load balance is finished.
+	 */
+	if (!rq->active_balance) {
+		rq->active_balance = 1;
+		rq->push_cpu = new_cpu;
+	} else
+		return;
+
+	raw_spin_rq_unlock(rq);
+	stop_one_cpu_nowait(cpu,
+		active_load_balance_cpu_stop, rq,
+		&rq->active_balance_work);
+	raw_spin_rq_lock(rq);
+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+	return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_tasks(rq))
+		return NULL;
+
+	p = plist_first_entry(&rq->cfs.pushable_tasks,
+			      struct task_struct, pushable_tasks);
+
+	WARN_ON_ONCE(rq->cpu != task_cpu(p));
+	WARN_ON_ONCE(task_current(rq, p));
+	WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+
+	WARN_ON_ONCE(!task_on_rq_queued(p));
+
+	/*
+	 * Remove task from the pushable list as we try only once after
+	 * task has been put back in enqueued list.
+	 */
+	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+	return p;
+}
+
+/*
+ * See if the non running fair tasks on this rq
+ * can be sent to some other CPU that fits better with
+ * their profile.
+ */
+static int push_fair_task(struct rq *rq)
+{
+	struct task_struct *next_task;
+	struct rq *new_rq;
+	int prev_cpu, new_cpu;
+	int ret = 0;
+
+	next_task = pick_next_pushable_fair_task(rq);
+	if (!next_task)
+		return 0;
+
+	if (is_migration_disabled(next_task))
+		return 0;
+
+	if (WARN_ON(next_task == rq->curr))
+		return 0;
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	prev_cpu = rq->cpu;
+
+	new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
+
+	if (new_cpu == prev_cpu)
+		goto out;
+
+	new_rq = cpu_rq(new_cpu);
+
+	if (double_lock_balance(rq, new_rq)) {
+
+		deactivate_task(rq, next_task, 0);
+		set_task_cpu(next_task, new_cpu);
+		activate_task(new_rq, next_task, 0);
+		ret = 1;
+
+		resched_curr(new_rq);
+
+		double_unlock_balance(rq, new_rq);
+	}
+
+out:
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+static void push_fair_tasks(struct rq *rq)
+{
+	/* push_fair_task() returns 1 if it moved a fair task */
+	while (push_fair_task(rq))
+		;
+}
+
+static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
+
+static inline void fair_queue_push_tasks(struct rq *rq)
+{
+	if (!sched_energy_enabled() || !has_pushable_tasks(rq))
+		return;
+
+	queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
+}
+static void dequeue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se, bool queue)
+{
+	struct task_struct *p;
+	struct rq *rq;
+
+	if (sched_energy_enabled() && entity_is_task(se)) {
+		rq = rq_of(cfs_rq);
+		p = container_of(se, struct task_struct, se);
+
+		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+		if (queue)
+			fair_queue_push_tasks(rq);
+	}
+}
+
+static void enqueue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (sched_energy_enabled() && entity_is_task(se)) {
+		struct task_struct *p = container_of(se, struct task_struct, se);
+		struct rq *rq = rq_of(cfs_rq);
+
+		if ((p->nr_cpus_allowed > 1) && task_misfit_cpu(p, rq->cpu)) {
+			plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+			plist_node_init(&p->pushable_tasks, p->prio);
+			plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+		}
+	}
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8642,6 +8848,8 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return sched_balance_newidle(rq, rf) != 0;
 }
 #else
+static inline void dequeue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se, bool queue) {}
+static inline void enqueue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
 #endif /* CONFIG_SMP */
 
@@ -13013,6 +13221,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	check_update_overutilized_status(task_rq(curr));
 
 	task_tick_core(rq, curr);
+
+	check_misfit_cpu(curr, rq);
 }
 
 /*
@@ -13204,6 +13414,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
+	plist_head_init(&cfs_rq->pushable_tasks);
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f5d658c0631..f3327695d4a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -672,6 +672,8 @@ struct cfs_rq {
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
+	struct plist_head	pushable_tasks;
+
 	/* Locally cached copy of our task_group's idle value */
 	int			idle;
 
-- 
2.34.1



* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-08-30 13:03 ` [RFC PATCH 5/5] sched/fair: Add push task callback for EAS Vincent Guittot
@ 2024-09-09  9:59   ` Christian Loehle
  2024-09-09 12:54     ` Vincent Guittot
  2024-09-11 14:03   ` Pierre Gondois
  2024-09-13 16:08   ` Pierre Gondois
  2 siblings, 1 reply; 62+ messages in thread
From: Christian Loehle @ 2024-09-09  9:59 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

On 8/30/24 14:03, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task will not have wakeup events anymore or at a
> far too low pace. For such situation, we can take advantage of the task
> being put back in the enqueued list to check if it should be migrated on
> another CPU. When the task is the only one running on the CPU, the tick
> will check if the task is stuck on this CPU and should migrate on another
> one.
> 
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)

Let me think out loud about this and feel free to object:
If there are other tasks on the rq we don't have that problem; if it is the
only one running, its utilization should be 1024, and misfit should take care
of the upmigration on the way up.
If the task utilization is 1024 it needs to be alone on the rq, so why would
another CPU be more efficient in that case (which presumably is an idle
CPU of the same PD)?
Or is this patch just for UCLAMP_MAX < 1024 cases altogether?


* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-09-09  9:59   ` Christian Loehle
@ 2024-09-09 12:54     ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-09 12:54 UTC (permalink / raw)
  To: Christian Loehle
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Mon, 9 Sept 2024 at 11:59, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 8/30/24 14:03, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task will not have wakeup events anymore or at a
> > far too low pace. For such situation, we can take advantage of the task
> > being put back in the enqueued list to check if it should be migrated on
> > another CPU. When the task is the only one running on the CPU, the tick
> > will check if the task is stuck on this CPU and should migrate on another
> > one.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
>
> Let me think out loud about this and feel free to object:
> If there's other tasks on the rq we don't have that problem, if it is the

You might have been confused by the term utilization in the commit
message, which includes both util_avg and runnable_avg of the task, and by
the max cpu capacity, which is the get_actual_cpu_capacity().

> only one running its utilization should be 1024, misfit should take care
> of the upmigration on the way up.
> If the task utilization is 1024 it needs to be alone on the rq, why would
> another CPU be more efficient in that case (which presumably is an idle
> CPU of the same PD)?
> Or is this patch just for UCLAMP_MAX < 1024 cases altogether?

For a task alone stuck on a CPU, it's for the uclamp_max case although
this might also replace the misfit task behavior in the future.
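
A rough userspace model of the check may help here. It mirrors the logic of
task_misfit_cpu() in the patch, but the struct, the helper names and the
numbers are made up for illustration, and the real helpers (task_util_est(),
task_runnable(), uclamp_eff_value(), get_actual_cpu_capacity()) are replaced
by plain fields and a parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task signals used by the patch. */
struct task_model {
	unsigned long util_est;   /* estimated utilization */
	unsigned long runnable;   /* runnable signal */
	unsigned long uclamp_max; /* effective UCLAMP_MAX, 1024 when uncapped */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Model of the misfit test: the task is considered stuck when the larger
 * of its two signals exceeds what the CPU is allowed to deliver to it,
 * i.e. the smaller of the CPU capacity and the task's UCLAMP_MAX.
 */
static bool task_misfit_cpu_model(const struct task_model *p,
				  unsigned long cpu_capacity)
{
	unsigned long max_capa = min_ul(cpu_capacity, p->uclamp_max);
	unsigned long util = max_ul(p->util_est, p->runnable);

	return util > max_capa;
}
```

In this model an uncapped task on a CPU of capacity 1024 can never exceed
max_capa, since the signals are bounded by 1024; the check only fires when
UCLAMP_MAX is below the task's signals or when the CPU is smaller than them.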


* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-08-30 13:03 ` [RFC PATCH 5/5] sched/fair: Add push task callback for EAS Vincent Guittot
  2024-09-09  9:59   ` Christian Loehle
@ 2024-09-11 14:03   ` Pierre Gondois
  2024-09-12 12:30     ` Vincent Guittot
  2024-09-13 16:08   ` Pierre Gondois
  2 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-09-11 14:03 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

Hello Vincent,

On 8/30/24 15:03, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task will not have wakeup events anymore or at a
> far too low pace. For such situation, we can take advantage of the task
> being put back in the enqueued list to check if it should be migrated on
> another CPU. When the task is the only one running on the CPU, the tick
> will check if the task is stuck on this CPU and should migrate on another
> one.
> 
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
>   kernel/sched/sched.h |   2 +
>   2 files changed, 213 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e46af2416159..41fb18ac118b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[snip]

> +
> +static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
> +{
> +	int new_cpu, cpu = cpu_of(rq);
> +
> +	if (!sched_energy_enabled())
> +		return;
> +
> +	if (WARN_ON(!p))
> +		return;
> +
> +	if (WARN_ON(p != rq->curr))
> +		return;
> +
> +	if (is_migration_disabled(p))
> +		return;
> +
> +	if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
> +		return;

I tried the code on a Pixel6 with the following setup:
- without the above (rq->nr_running > 1) condition
- without the push task mechanism
i.e. tasks without regular wakeups only have the opportunity to
run feec() via the sched_tick. It seemed sufficient to avoid
the problems you mentioned:
- having unbalanced UCLAMP_MAX tasks in a pd, e.g. 1 UCLAMP_MAX task
   per little CPU, except one little CPU with N UCLAMP_MAX tasks
- downgrading UCLAMP_MAX tasks that could run on smaller CPUs
   but have no wakeups and thus don't run feec()

Thus I was wondering if it would not be better to integrate
EAS into the load balancer instead (not my idea, but I don't remember
who suggested that).
Or otherwise, whether just running feec() through the sched_tick path
would not be sufficient (i.e. this patch minus the push mechanism).

> +
> +	if (!task_misfit_cpu(p, cpu))
> +		return;
> +
> +	new_cpu = find_energy_efficient_cpu(p, cpu);
> +
> +	if (new_cpu == cpu)
> +		return;
> +
> +	/*
> +	 * ->active_balance synchronizes accesses to
> +	 * ->active_balance_work.  Once set, it's cleared
> +	 * only after active load balance is finished.
> +	 */
> +	if (!rq->active_balance) {
> +		rq->active_balance = 1;
> +		rq->push_cpu = new_cpu;
> +	} else
> +		return;
> +
> +	raw_spin_rq_unlock(rq);
> +	stop_one_cpu_nowait(cpu,
> +		active_load_balance_cpu_stop, rq,
> +		&rq->active_balance_work);
> +	raw_spin_rq_lock(rq);

I didn't hit any error, but isn't this subject to the issue fixed by the following?
   commit f0498d2a54e7 ("sched: Fix stop_one_cpu_nowait() vs hotplug")


> +}
> +
> +static inline int has_pushable_tasks(struct rq *rq)
> +{
> +	return !plist_head_empty(&rq->cfs.pushable_tasks);
> +}
> +
> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> +{
> +	struct task_struct *p;
> +
> +	if (!has_pushable_tasks(rq))
> +		return NULL;
> +
> +	p = plist_first_entry(&rq->cfs.pushable_tasks,
> +			      struct task_struct, pushable_tasks);
> +
> +	WARN_ON_ONCE(rq->cpu != task_cpu(p));
> +	WARN_ON_ONCE(task_current(rq, p));
> +	WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> +
> +	WARN_ON_ONCE(!task_on_rq_queued(p));
> +
> +	/*
> +	 * Remove task from the pushable list as we try only once after
> +	 * task has been put back in enqueued list.
> +	 */
> +	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> +	return p;
> +}
> +
> +/*
> + * See if the non running fair tasks on this rq
> + * can be sent to some other CPU that fits better with
> + * their profile.
> + */
> +static int push_fair_task(struct rq *rq)
> +{
> +	struct task_struct *next_task;
> +	struct rq *new_rq;
> +	int prev_cpu, new_cpu;
> +	int ret = 0;
> +
> +	next_task = pick_next_pushable_fair_task(rq);
> +	if (!next_task)
> +		return 0;
> +
> +	if (is_migration_disabled(next_task))
> +		return 0;
> +
> +	if (WARN_ON(next_task == rq->curr))
> +		return 0;
> +
> +	/* We might release rq lock */
> +	get_task_struct(next_task);
> +
> +	prev_cpu = rq->cpu;
> +
> +	new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> +
> +	if (new_cpu == prev_cpu)
> +		goto out;
> +
> +	new_rq = cpu_rq(new_cpu);
> +
> +	if (double_lock_balance(rq, new_rq)) {


I think it might be necessary to check the following:
   if (task_cpu(next_task) != rq->cpu) {
     double_unlock_balance(rq, new_rq);
     goto out;
   }

Indeed I've been hitting the following warnings:
- uclamp_rq_dec_id()::SCHED_WARN_ON(!bucket->tasks)
- set_task_cpu()::WARN_ON_ONCE(state == TASK_RUNNING &&
		     p->sched_class == &fair_sched_class &&
		     (p->on_rq && !task_on_rq_migrating(p)))
- update_entity_lag()::SCHED_WARN_ON(!se->on_rq)

and it seemed to be caused by the task not being on the initial rq anymore.
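
The idea behind that check, as a standalone sketch rather than a tested kernel
change: once both rq locks are held, re-read the task's CPU and give up if the
task has moved in the meantime. The type and the function name below are
hypothetical and only model the decision.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, minimal model of a task: only the CPU it is queued on. */
struct task_model {
	int cpu;
};

/*
 * Model of the migration step in push_fair_task(). In the kernel,
 * double_lock_balance() may drop and retake the source rq lock, so the
 * task can be moved by someone else before both locks are held.
 * Returns true if the task was moved to dst_cpu.
 */
static bool push_task_model(struct task_model *p, int src_cpu, int dst_cpu)
{
	/* Re-validate after locking: the task may have left src_cpu. */
	if (p->cpu != src_cpu)
		return false;

	p->cpu = dst_cpu;
	return true;
}
```

A task that is no longer on the source CPU is left untouched, which is the
situation the warnings above seem to point at.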

> +
> +		deactivate_task(rq, next_task, 0);
> +		set_task_cpu(next_task, new_cpu);
> +		activate_task(new_rq, next_task, 0);
> +		ret = 1;
> +
> +		resched_curr(new_rq);
> +
> +		double_unlock_balance(rq, new_rq);
> +	}
> +
> +out:
> +	put_task_struct(next_task);
> +
> +	return ret;
> +}
> +

Regards,
Pierre


* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-09-11 14:03   ` Pierre Gondois
@ 2024-09-12 12:30     ` Vincent Guittot
  2024-09-13  9:09       ` Pierre Gondois
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-09-12 12:30 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

Hello Pierre,

On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 8/30/24 15:03, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task will not have wakeup events anymore or at a
> > far too low pace. For such situation, we can take advantage of the task
> > being put back in the enqueued list to check if it should be migrated on
> > another CPU. When the task is the only one running on the CPU, the tick
> > will check if the task is stuck on this CPU and should migrate on another
> > one.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
> >   kernel/sched/sched.h |   2 +
> >   2 files changed, 213 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e46af2416159..41fb18ac118b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
>
> [snip]
>
> > +
> > +static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
> > +{
> > +     int new_cpu, cpu = cpu_of(rq);
> > +
> > +     if (!sched_energy_enabled())
> > +             return;
> > +
> > +     if (WARN_ON(!p))
> > +             return;
> > +
> > +     if (WARN_ON(p != rq->curr))
> > +             return;
> > +
> > +     if (is_migration_disabled(p))
> > +             return;
> > +
> > +     if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
> > +             return;
>
> I tried the code on a Pixel6 with the following setup:
> - without the above (rq->nr_running > 1) condition
> - without the push task mechanism
> i.e. tasks without regular wakeups only have the opportunity to
> run feec() via the sched_tick. It seemed sufficient to avoid
> the problems you mentioned:
> - having unbalanced UCLAMP_MAX tasks in a pd, e.g. 1 UCLAMP_MAX task
>    per little CPU, except one little CPU with N UCLAMP_MAX tasks
> - downgrading UCLAMP_MAX tasks that could run on smaller CPUs
>    but have no wakeups and thus don't run feec()

The main problem with your test is that you always call feec() for the
running task, so you always have to wake up the migration thread to
migrate the currently running thread, which is quite inefficient. The
push mechanism only takes a task which is not the currently running one,
so we don't need to wake up the migration thread, which is simpler and
more efficient. We check only one task at a time and will not loop on
an unbounded number of tasks after a task switch or a tick.
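
As a rough illustration of the "one attempt per requeue" behaviour, here is a
userspace sketch. It is not the plist-based code from the patch: the
array-backed FIFO, the names and the size limit are assumptions chosen to keep
the example short, and priorities are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PUSHABLE 8

/* Hypothetical model of a per-rq list of tasks that are candidates for a push. */
struct pushable_list {
	int task_id[MAX_PUSHABLE];
	size_t nr;
};

/* Queue a task when it is put back on the rq; a task appears at most once. */
static int pushable_add(struct pushable_list *l, int id)
{
	for (size_t i = 0; i < l->nr; i++) {
		if (l->task_id[i] == id)
			return 0;
	}

	if (l->nr == MAX_PUSHABLE)
		return -1;

	l->task_id[l->nr++] = id;
	return 0;
}

/*
 * Take the first candidate and remove it from the list, so that it is
 * tried only once until it is queued again. Returns -1 when empty.
 */
static int pushable_pick(struct pushable_list *l)
{
	int id;

	if (!l->nr)
		return -1;

	id = l->task_id[0];
	for (size_t i = 1; i < l->nr; i++)
		l->task_id[i - 1] = l->task_id[i];
	l->nr--;

	return id;
}
```

Each pick removes the entry, so the work done after a task switch is bounded
by the number of tasks that were requeued since the last pass.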

>
> Thus I was wondering if it would not be better to integrate the
> EAS to the load balancer instead (not my idea, but don't remember
> who suggested that).

My 1st thought was also to use load balance to pull tasks which were
stuck on the wrong CPU (as mentioned in [1]) but this solution is not
scalable, as we don't want to test all runnable tasks on a cpu and it's
not really easy to know which cpu and which tasks should be checked.

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

> Or otherwise if just running feec() through the sched_tick path
> would not be sufficient (i.e. this patch minus the push mechanism).

As mentioned above, the push mechanism is more efficient than active migration.


>
> > +
> > +     if (!task_misfit_cpu(p, cpu))
> > +             return;
> > +
> > +     new_cpu = find_energy_efficient_cpu(p, cpu);
> > +
> > +     if (new_cpu == cpu)
> > +             return;
> > +
> > +     /*
> > +      * ->active_balance synchronizes accesses to
> > +      * ->active_balance_work.  Once set, it's cleared
> > +      * only after active load balance is finished.
> > +      */
> > +     if (!rq->active_balance) {
> > +             rq->active_balance = 1;
> > +             rq->push_cpu = new_cpu;
> > +     } else
> > +             return;
> > +
> > +     raw_spin_rq_unlock(rq);
> > +     stop_one_cpu_nowait(cpu,
> > +             active_load_balance_cpu_stop, rq,
> > +             &rq->active_balance_work);
> > +     raw_spin_rq_lock(rq);
>
> I didn't hit any error, but isn't it eligible to the following ?
>    commit f0498d2a54e7 ("sched: Fix stop_one_cpu_nowait() vs hotplug")
>

I will recheck, but being called from the tick, for the local cpu and
with a running thread that is not cpu_stopper_thread, should protect us
from the case described in this commit.

>
> > +}
> > +
> > +static inline int has_pushable_tasks(struct rq *rq)
> > +{
> > +     return !plist_head_empty(&rq->cfs.pushable_tasks);
> > +}
> > +
> > +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> > +{
> > +     struct task_struct *p;
> > +
> > +     if (!has_pushable_tasks(rq))
> > +             return NULL;
> > +
> > +     p = plist_first_entry(&rq->cfs.pushable_tasks,
> > +                           struct task_struct, pushable_tasks);
> > +
> > +     WARN_ON_ONCE(rq->cpu != task_cpu(p));
> > +     WARN_ON_ONCE(task_current(rq, p));
> > +     WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> > +
> > +     WARN_ON_ONCE(!task_on_rq_queued(p));
> > +
> > +     /*
> > +      * Remove task from the pushable list as we try only once after
> > +      * task has been put back in enqueued list.
> > +      */
> > +     plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> > +
> > +     return p;
> > +}
> > +
> > +/*
> > + * See if the non running fair tasks on this rq
> > + * can be sent to some other CPU that fits better with
> > + * their profile.
> > + */
> > +static int push_fair_task(struct rq *rq)
> > +{
> > +     struct task_struct *next_task;
> > +     struct rq *new_rq;
> > +     int prev_cpu, new_cpu;
> > +     int ret = 0;
> > +
> > +     next_task = pick_next_pushable_fair_task(rq);
> > +     if (!next_task)
> > +             return 0;
> > +
> > +     if (is_migration_disabled(next_task))
> > +             return 0;
> > +
> > +     if (WARN_ON(next_task == rq->curr))
> > +             return 0;
> > +
> > +     /* We might release rq lock */
> > +     get_task_struct(next_task);
> > +
> > +     prev_cpu = rq->cpu;
> > +
> > +     new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> > +
> > +     if (new_cpu == prev_cpu)
> > +             goto out;
> > +
> > +     new_rq = cpu_rq(new_cpu);
> > +
> > +     if (double_lock_balance(rq, new_rq)) {
>
>
> I think it might be necessary to check the following:
>    if (task_cpu(next_task) != rq->cpu) {
>      double_unlock_balance(rq, new_rq);

Yes good point

>      goto out;
>    }
>
> Indeed I've been hitting the following warnings:
> - uclamp_rq_dec_id():SCHED_WARN_ON(!bucket->tasks)
> - set_task_cpu()::WARN_ON_ONCE(state == TASK_RUNNING &&
>                      p->sched_class == &fair_sched_class &&
>                      (p->on_rq && !task_on_rq_migrating(p)))
> - update_entity_lag()::SCHED_WARN_ON(!se->on_rq)
>
> and it seemed to be caused by the task not being on the initial rq anymore.

Do you have a particular use case to trigger this ? I haven't faced
this in the various stress tests that  I did


>
> > +
> > +             deactivate_task(rq, next_task, 0);
> > +             set_task_cpu(next_task, new_cpu);
> > +             activate_task(new_rq, next_task, 0);
> > +             ret = 1;
> > +
> > +             resched_curr(new_rq);
> > +
> > +             double_unlock_balance(rq, new_rq);
> > +     }
> > +
> > +out:
> > +     put_task_struct(next_task);
> > +
> > +     return ret;
> > +}
> > +
>
> Regards,
> Pierre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-09-12 12:30     ` Vincent Guittot
@ 2024-09-13  9:09       ` Pierre Gondois
  2024-09-24 12:37         ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-09-13  9:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

Hello Vincent,

On 9/12/24 14:30, Vincent Guittot wrote:
> Hello Pierre,
> 
> On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>>
>> On 8/30/24 15:03, Vincent Guittot wrote:
>>> EAS is based on wakeup events to efficiently place tasks on the system, but
>>> there are cases where a task will not have wakeup events anymore or at a
>>> far too low pace. For such situation, we can take advantage of the task
>>> being put back in the enqueued list to check if it should be migrated on
>>> another CPU. When the task is the only one running on the CPU, the tick
>>> will check it the task is stuck on this CPU and should migrate on another
>>> one.
>>>
>>> Wake up events remain the main way to migrate tasks but we now detect
>>> situation where a task is stuck on a CPU by checking that its utilization
>>> is larger than the max available compute capacity (max cpu capacity or
>>> uclamp max setting)
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>    kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
>>>    kernel/sched/sched.h |   2 +
>>>    2 files changed, 213 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e46af2416159..41fb18ac118b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>
>> [snip]
>>
>>> +
>>> +static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
>>> +{
>>> +     int new_cpu, cpu = cpu_of(rq);
>>> +
>>> +     if (!sched_energy_enabled())
>>> +             return;
>>> +
>>> +     if (WARN_ON(!p))
>>> +             return;
>>> +
>>> +     if (WARN_ON(p != rq->curr))
>>> +             return;
>>> +
>>> +     if (is_migration_disabled(p))
>>> +             return;
>>> +
>>> +     if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
>>> +             return;
>>
>> I tried the code on a Pixel6 with the following setup:
>> - without the above (rq->nr_running > 1) condition
>> - without the push task mechanism
>> i.e. tasks without regular wakeups only have the opportunity to
>> run feec() via the sched_tick. It seemed sufficient to avoid
>> the problematic you mentioned:
>> - having unbalanced UCLAMP_MAX tasks in a pd, e.g. 1 UCLAMP_MAX task
>>     per little CPU, except one little CPU with N UCLAMP_MAX tasks
>> - downgrading UCLAMP_MAX tasks that could run on smaller CPUs
>>     but have no wakeups and thus don't run feec()
> 
> The main problem with your test is that you always call feec() for the
> running task so you always have to wake up the migration thread to
> migrate the current running thread which is quite inefficient. The
> push mechanism only takes a task which is not the current running one
> and we don't need to wake up migration thread which is simpler and
> more efficient. We check only one task at a time and will not loop on
> an unbounded number of tasks after a task switch or a tick
> 
>>
>> Thus I was wondering it it would not be better to integrate the
>> EAS to the load balancer instead (not my idea, but don't remember
>> who suggested that).
> 
> My 1st thought was also to use load balance to pull tasks which were
> stuck on the wrong CPU (as mentioned in [1]) but this solution is not
> scalable as we don't want to test all runnable task on a cpu and it's
> not really easy to know which cpu and which tasks should be checked
> 
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> 
>> Or otherwise if just running feec() through the sched_tick path
>> would not be sufficient (i.e. this patch minus the push mechanism).
> 
> As mentioned above, the push mechanism is more efficient than active migration.
> 
> 
>>
>>> +
>>> +     if (!task_misfit_cpu(p, cpu))
>>> +             return;
>>> +
>>> +     new_cpu = find_energy_efficient_cpu(p, cpu);
>>> +
>>> +     if (new_cpu == cpu)
>>> +             return;
>>> +
>>> +     /*
>>> +      * ->active_balance synchronizes accesses to
>>> +      * ->active_balance_work.  Once set, it's cleared
>>> +      * only after active load balance is finished.
>>> +      */
>>> +     if (!rq->active_balance) {
>>> +             rq->active_balance = 1;
>>> +             rq->push_cpu = new_cpu;
>>> +     } else
>>> +             return;
>>> +
>>> +     raw_spin_rq_unlock(rq);
>>> +     stop_one_cpu_nowait(cpu,
>>> +             active_load_balance_cpu_stop, rq,
>>> +             &rq->active_balance_work);
>>> +     raw_spin_rq_lock(rq);
>>
>> I didn't hit any error, but isn't it eligible to the following ?
>>     commit f0498d2a54e7 ("sched: Fix stop_one_cpu_nowait() vs hotplug")
>>
> 
> I will recheck but being called from the tick, for the local cpu and
> with a running thread no being cpu_stopper_thread, should protect us
> from the case describe in this commit
> 
>>
>>> +}
>>> +
>>> +static inline int has_pushable_tasks(struct rq *rq)
>>> +{
>>> +     return !plist_head_empty(&rq->cfs.pushable_tasks);
>>> +}
>>> +
>>> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
>>> +{
>>> +     struct task_struct *p;
>>> +
>>> +     if (!has_pushable_tasks(rq))
>>> +             return NULL;
>>> +
>>> +     p = plist_first_entry(&rq->cfs.pushable_tasks,
>>> +                           struct task_struct, pushable_tasks);
>>> +
>>> +     WARN_ON_ONCE(rq->cpu != task_cpu(p));
>>> +     WARN_ON_ONCE(task_current(rq, p));
>>> +     WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
>>> +
>>> +     WARN_ON_ONCE(!task_on_rq_queued(p));
>>> +
>>> +     /*
>>> +      * Remove task from the pushable list as we try only once after
>>> +      * task has been put back in enqueued list.
>>> +      */
>>> +     plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>>> +
>>> +     return p;
>>> +}
>>> +
>>> +/*
>>> + * See if the non running fair tasks on this rq
>>> + * can be sent to some other CPU that fits better with
>>> + * their profile.
>>> + */
>>> +static int push_fair_task(struct rq *rq)
>>> +{
>>> +     struct task_struct *next_task;
>>> +     struct rq *new_rq;
>>> +     int prev_cpu, new_cpu;
>>> +     int ret = 0;
>>> +
>>> +     next_task = pick_next_pushable_fair_task(rq);
>>> +     if (!next_task)
>>> +             return 0;
>>> +
>>> +     if (is_migration_disabled(next_task))
>>> +             return 0;
>>> +
>>> +     if (WARN_ON(next_task == rq->curr))
>>> +             return 0;
>>> +
>>> +     /* We might release rq lock */
>>> +     get_task_struct(next_task);
>>> +
>>> +     prev_cpu = rq->cpu;
>>> +
>>> +     new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
>>> +
>>> +     if (new_cpu == prev_cpu)
>>> +             goto out;
>>> +
>>> +     new_rq = cpu_rq(new_cpu);
>>> +
>>> +     if (double_lock_balance(rq, new_rq)) {
>>
>>
>> I think it might be necessary to check the following:
>>     if (task_cpu(next_task) != rq->cpu) {
>>       double_unlock_balance(rq, new_rq);
> 
> Yes good point
> 
>>       goto out;
>>     }
>>
>> Indeed I've been hitting the following warnings:
>> - uclamp_rq_dec_id():SCHED_WARN_ON(!bucket->tasks)
>> - set_task_cpu()::WARN_ON_ONCE(state == TASK_RUNNING &&
>>                       p->sched_class == &fair_sched_class &&
>>                       (p->on_rq && !task_on_rq_migrating(p)))
>> - update_entity_lag()::SCHED_WARN_ON(!se->on_rq)
>>
>> and it seemed to be caused by the task not being on the initial rq anymore.
> 
> Do you have a particular use case to trigger this ? I haven't faced
> this in the various stress tests that  I did

It was triggered by this workload, but it was on a Pixel6 device,
so there might be more background activity:
- 8 tasks with: [UCLAMP_MIN:0, UCLAMP_MAX:1, duty_cycle:100%, period:16ms]
- 4 tasks each bound to a little CPU with: [UCLAMP_MIN:0, UCLAMP_MAX:1024, duty_cycle:4%, period:4ms]

Regards,
Pierre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-09-13  9:09       ` Pierre Gondois
@ 2024-09-24 12:37         ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-24 12:37 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Fri, 13 Sept 2024 at 11:10, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 9/12/24 14:30, Vincent Guittot wrote:
> > Hello Pierre,
> >
> > On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>
> >> Hello Vincent,
> >>
> >> On 8/30/24 15:03, Vincent Guittot wrote:
> >>> EAS is based on wakeup events to efficiently place tasks on the system, but
> >>> there are cases where a task will not have wakeup events anymore or at a
> >>> far too low pace. For such situation, we can take advantage of the task
> >>> being put back in the enqueued list to check if it should be migrated on
> >>> another CPU. When the task is the only one running on the CPU, the tick
> >>> will check it the task is stuck on this CPU and should migrate on another
> >>> one.
> >>>
> >>> Wake up events remain the main way to migrate tasks but we now detect
> >>> situation where a task is stuck on a CPU by checking that its utilization
> >>> is larger than the max available compute capacity (max cpu capacity or
> >>> uclamp max setting)
> >>>
> >>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>> ---
> >>>    kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
> >>>    kernel/sched/sched.h |   2 +
> >>>    2 files changed, 213 insertions(+)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index e46af2416159..41fb18ac118b 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>
> >> [snip]
> >>
> >>> +
> >>> +static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
> >>> +{
> >>> +     int new_cpu, cpu = cpu_of(rq);
> >>> +
> >>> +     if (!sched_energy_enabled())
> >>> +             return;
> >>> +
> >>> +     if (WARN_ON(!p))
> >>> +             return;
> >>> +
> >>> +     if (WARN_ON(p != rq->curr))
> >>> +             return;
> >>> +
> >>> +     if (is_migration_disabled(p))
> >>> +             return;
> >>> +
> >>> +     if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
> >>> +             return;
> >>
> >> I tried the code on a Pixel6 with the following setup:
> >> - without the above (rq->nr_running > 1) condition
> >> - without the push task mechanism
> >> i.e. tasks without regular wakeups only have the opportunity to
> >> run feec() via the sched_tick. It seemed sufficient to avoid
> >> the problematic you mentioned:
> >> - having unbalanced UCLAMP_MAX tasks in a pd, e.g. 1 UCLAMP_MAX task
> >>     per little CPU, except one little CPU with N UCLAMP_MAX tasks
> >> - downgrading UCLAMP_MAX tasks that could run on smaller CPUs
> >>     but have no wakeups and thus don't run feec()
> >
> > The main problem with your test is that you always call feec() for the
> > running task so you always have to wake up the migration thread to
> > migrate the current running thread which is quite inefficient. The
> > push mechanism only takes a task which is not the current running one
> > and we don't need to wake up migration thread which is simpler and
> > more efficient. We check only one task at a time and will not loop on
> > an unbounded number of tasks after a task switch or a tick
> >
> >>
> >> Thus I was wondering it it would not be better to integrate the
> >> EAS to the load balancer instead (not my idea, but don't remember
> >> who suggested that).
> >
> > My 1st thought was also to use load balance to pull tasks which were
> > stuck on the wrong CPU (as mentioned in [1]) but this solution is not
> > scalable as we don't want to test all runnable task on a cpu and it's
> > not really easy to know which cpu and which tasks should be checked
> >
> > [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> >
> >> Or otherwise if just running feec() through the sched_tick path
> >> would not be sufficient (i.e. this patch minus the push mechanism).
> >
> > As mentioned above, the push mechanism is more efficient than active migration.
> >
> >
> >>
> >>> +
> >>> +     if (!task_misfit_cpu(p, cpu))
> >>> +             return;
> >>> +
> >>> +     new_cpu = find_energy_efficient_cpu(p, cpu);
> >>> +
> >>> +     if (new_cpu == cpu)
> >>> +             return;
> >>> +
> >>> +     /*
> >>> +      * ->active_balance synchronizes accesses to
> >>> +      * ->active_balance_work.  Once set, it's cleared
> >>> +      * only after active load balance is finished.
> >>> +      */
> >>> +     if (!rq->active_balance) {
> >>> +             rq->active_balance = 1;
> >>> +             rq->push_cpu = new_cpu;
> >>> +     } else
> >>> +             return;
> >>> +
> >>> +     raw_spin_rq_unlock(rq);
> >>> +     stop_one_cpu_nowait(cpu,
> >>> +             active_load_balance_cpu_stop, rq,
> >>> +             &rq->active_balance_work);
> >>> +     raw_spin_rq_lock(rq);
> >>
> >> I didn't hit any error, but isn't it eligible to the following ?
> >>     commit f0498d2a54e7 ("sched: Fix stop_one_cpu_nowait() vs hotplug")
> >>
> >
> > I will recheck but being called from the tick, for the local cpu and
> > with a running thread no being cpu_stopper_thread, should protect us
> > from the case describe in this commit
> >
> >>
> >>> +}
> >>> +
> >>> +static inline int has_pushable_tasks(struct rq *rq)
> >>> +{
> >>> +     return !plist_head_empty(&rq->cfs.pushable_tasks);
> >>> +}
> >>> +
> >>> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
> >>> +{
> >>> +     struct task_struct *p;
> >>> +
> >>> +     if (!has_pushable_tasks(rq))
> >>> +             return NULL;
> >>> +
> >>> +     p = plist_first_entry(&rq->cfs.pushable_tasks,
> >>> +                           struct task_struct, pushable_tasks);
> >>> +
> >>> +     WARN_ON_ONCE(rq->cpu != task_cpu(p));
> >>> +     WARN_ON_ONCE(task_current(rq, p));
> >>> +     WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> >>> +
> >>> +     WARN_ON_ONCE(!task_on_rq_queued(p));
> >>> +
> >>> +     /*
> >>> +      * Remove task from the pushable list as we try only once after
> >>> +      * task has been put back in enqueued list.
> >>> +      */
> >>> +     plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> >>> +
> >>> +     return p;
> >>> +}
> >>> +
> >>> +/*
> >>> + * See if the non running fair tasks on this rq
> >>> + * can be sent to some other CPU that fits better with
> >>> + * their profile.
> >>> + */
> >>> +static int push_fair_task(struct rq *rq)
> >>> +{
> >>> +     struct task_struct *next_task;
> >>> +     struct rq *new_rq;
> >>> +     int prev_cpu, new_cpu;
> >>> +     int ret = 0;
> >>> +
> >>> +     next_task = pick_next_pushable_fair_task(rq);
> >>> +     if (!next_task)
> >>> +             return 0;
> >>> +
> >>> +     if (is_migration_disabled(next_task))
> >>> +             return 0;
> >>> +
> >>> +     if (WARN_ON(next_task == rq->curr))
> >>> +             return 0;
> >>> +
> >>> +     /* We might release rq lock */
> >>> +     get_task_struct(next_task);
> >>> +
> >>> +     prev_cpu = rq->cpu;
> >>> +
> >>> +     new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
> >>> +
> >>> +     if (new_cpu == prev_cpu)
> >>> +             goto out;
> >>> +
> >>> +     new_rq = cpu_rq(new_cpu);
> >>> +
> >>> +     if (double_lock_balance(rq, new_rq)) {
> >>
> >>
> >> I think it might be necessary to check the following:
> >>     if (task_cpu(next_task) != rq->cpu) {
> >>       double_unlock_balance(rq, new_rq);
> >
> > Yes good point
> >
> >>       goto out;
> >>     }
> >>
> >> Indeed I've been hitting the following warnings:
> >> - uclamp_rq_dec_id():SCHED_WARN_ON(!bucket->tasks)
> >> - set_task_cpu()::WARN_ON_ONCE(state == TASK_RUNNING &&
> >>                       p->sched_class == &fair_sched_class &&
> >>                       (p->on_rq && !task_on_rq_migrating(p)))
> >> - update_entity_lag()::SCHED_WARN_ON(!se->on_rq)
> >>
> >> and it seemed to be caused by the task not being on the initial rq anymore.
> >
> > Do you have a particular use case to trigger this ? I haven't faced
> > this in the various stress tests that  I did
>
> It was triggered by this workload, but it was on a Pixel6 device,
> so there might be more background activity:
> - 8 tasks with: [UCLAMP_MIN:0, UCLAMP_MAX:1, duty_cycle:100%, period:16ms]
> - 4 tasks each bound to a little CPU with: [UCLAMP_MIN:0, UCLAMP_MAX:1024, duty_cycle:4%, period:4ms]

Thanks I'm going to have a closer look

>
> Regards,
> Pierre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-08-30 13:03 ` [RFC PATCH 5/5] sched/fair: Add push task callback for EAS Vincent Guittot
  2024-09-09  9:59   ` Christian Loehle
  2024-09-11 14:03   ` Pierre Gondois
@ 2024-09-13 16:08   ` Pierre Gondois
  2024-09-24 13:00     ` Vincent Guittot
  2 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-09-13 16:08 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

Hello Vincent,

On 8/30/24 15:03, Vincent Guittot wrote:
> EAS is based on wakeup events to efficiently place tasks on the system, but
> there are cases where a task will not have wakeup events anymore or at a
> far too low pace. For such situation, we can take advantage of the task
> being put back in the enqueued list to check if it should be migrated on
> another CPU. When the task is the only one running on the CPU, the tick
> will check it the task is stuck on this CPU and should migrate on another
> one.
> 
> Wake up events remain the main way to migrate tasks but we now detect
> situation where a task is stuck on a CPU by checking that its utilization
> is larger than the max available compute capacity (max cpu capacity or
> uclamp max setting)
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
>   kernel/sched/sched.h |   2 +
>   2 files changed, 213 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e46af2416159..41fb18ac118b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[...]

> +
> +static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
> +{
> +	int new_cpu, cpu = cpu_of(rq);
> +
> +	if (!sched_energy_enabled())
> +		return;
> +
> +	if (WARN_ON(!p))
> +		return;
> +
> +	if (WARN_ON(p != rq->curr))
> +		return;
> +
> +	if (is_migration_disabled(p))
> +		return;
> +
> +	if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))

If the goal is to detect tasks that should be migrated to bigger CPUs,
couldn't the check be changed from:
-  (p->nr_cpus_allowed == 1)
to
- (p->max_allowed_capacity == arch_scale_cpu_capacity(cpu))
to avoid the case where a task is bound to the little cluster for instance ?

Similar question for update_misfit_status(), doesn't:
- (arch_scale_cpu_capacity(cpu) == p->max_allowed_capacity)
include this case:
- (p->nr_cpus_allowed == 1)


> +		return;
> +
> +	if (!task_misfit_cpu(p, cpu))
> +		return;

task_misfit_cpu() intends to check whether the task will have an opportunity
to run feec() through wakeups/push-pull.

Shouldn't we also check whether the task fits the CPU with the 20% margin,
using task_fits_cpu()? This would allow the task to be migrated
faster than by the load balancer.
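
For reference, the 20% margin can be modelled in user space as below.
This is only a simplified sketch: it mirrors the fits_capacity()
comparison in kernel/sched/fair.c as I understand it, and it ignores
the uclamp handling that task_fits_cpu() also performs. The helper
name fits_with_margin() is hypothetical.

```c
#include <stdbool.h>

/*
 * A utilization "fits" a capacity when util * 1.25 < capacity,
 * i.e. when util stays below roughly 80% of the capacity.
 * Integer arithmetic only, as in the scheduler.
 */
static bool fits_with_margin(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}
```

With an illustrative little CPU capacity of 160, a util of 127 still
fits while a util of 128 does not, so such a task would be flagged
before its utilization reaches the full CPU capacity.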


> +
> +	new_cpu = find_energy_efficient_cpu(p, cpu);
> +
> +	if (new_cpu == cpu)
> +		return;
> +
> +	/*
> +	 * ->active_balance synchronizes accesses to
> +	 * ->active_balance_work.  Once set, it's cleared
> +	 * only after active load balance is finished.
> +	 */
> +	if (!rq->active_balance) {
> +		rq->active_balance = 1;
> +		rq->push_cpu = new_cpu;
> +	} else
> +		return;
> +
> +	raw_spin_rq_unlock(rq);
> +	stop_one_cpu_nowait(cpu,
> +		active_load_balance_cpu_stop, rq,
> +		&rq->active_balance_work);
> +	raw_spin_rq_lock(rq);
> +}
> +

Regards,
Pierre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
  2024-09-13 16:08   ` Pierre Gondois
@ 2024-09-24 13:00     ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-09-24 13:00 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

On Fri, 13 Sept 2024 at 18:08, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
>
> On 8/30/24 15:03, Vincent Guittot wrote:
> > EAS is based on wakeup events to efficiently place tasks on the system, but
> > there are cases where a task will not have wakeup events anymore or at a
> > far too low pace. For such situation, we can take advantage of the task
> > being put back in the enqueued list to check if it should be migrated on
> > another CPU. When the task is the only one running on the CPU, the tick
> > will check it the task is stuck on this CPU and should migrate on another
> > one.
> >
> > Wake up events remain the main way to migrate tasks but we now detect
> > situation where a task is stuck on a CPU by checking that its utilization
> > is larger than the max available compute capacity (max cpu capacity or
> > uclamp max setting)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
> >   kernel/sched/sched.h |   2 +
> >   2 files changed, 213 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e46af2416159..41fb18ac118b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
>
> [...]
>
> > +
> > +static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
> > +{
> > +     int new_cpu, cpu = cpu_of(rq);
> > +
> > +     if (!sched_energy_enabled())
> > +             return;
> > +
> > +     if (WARN_ON(!p))
> > +             return;
> > +
> > +     if (WARN_ON(p != rq->curr))
> > +             return;
> > +
> > +     if (is_migration_disabled(p))
> > +             return;
> > +
> > +     if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
>
> If the goal is to detect tasks that should be migrated to bigger CPUs,

We don't want the tick to try to do active migration for the running
task if the task can't run on any CPU other than this one. This is not
related to the migration to bigger CPUs, which is done by
update_misfit_status().

> couldn't the check be changed from:
> -  (p->nr_cpus_allowed == 1)
> to
> - (p->max_allowed_capacity == arch_scale_cpu_capacity(cpu))
> to avoid the case where a task is bound to the little cluster for instance ?

I was about to say yes, but the condition
(arch_scale_cpu_capacity(cpu) == p->max_allowed_capacity) is too broad:
it also prevents migrating to a smaller CPU, which is one case that we
want to handle here. That being said, I have an internal patch that
includes the check done by update_misfit_status() for the push callback
mechanism, but I didn't add it in this version to avoid putting too
many changes in the same series.
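
To make the difference between the two conditions concrete, here is a
small user-space model (not kernel code). The struct and helper names
are hypothetical and the capacity values are only illustrative; the
two compared expressions are the ones discussed above.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the task fields discussed above. */
struct model_task {
	int nr_cpus_allowed;                /* CPUs in the affinity mask */
	unsigned long max_allowed_capacity; /* capacity of the biggest allowed CPU */
};

/* Skip condition used in the patch: the task cannot run anywhere else. */
static bool skip_if_pinned(const struct model_task *p)
{
	return p->nr_cpus_allowed == 1;
}

/* Proposed alternative: the task already runs on its biggest allowed CPU. */
static bool skip_if_on_max_capacity(const struct model_task *p,
				    unsigned long cpu_capacity)
{
	return p->max_allowed_capacity == cpu_capacity;
}
```

For a task allowed on 8 CPUs with max_allowed_capacity 1024 that
currently runs on a capacity-1024 CPU, skip_if_pinned() is false but
skip_if_on_max_capacity() is true, so the tick would never try to move
it to a smaller CPU. For a task pinned to a single CPU, both helpers
return true.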

>
> Similar question for update_misfit_status(), doesn't:
> - (arch_scale_cpu_capacity(cpu) == p->max_allowed_capacity)
> include this case:
> - (p->nr_cpus_allowed == 1)

For update_misfit_status, you are right

>
>
> > +             return;
> > +
> > +     if (!task_misfit_cpu(p, cpu))
> > +             return;
>
> task_misfit_cpu() intends to check whether the task will have an opportunity
> to run feec() through wakeups/push-pull.
>
> Shouldn't we also check whether the task fits the CPU with the 20% margin,
> using task_fits_cpu()? This would allow the task to be migrated
> faster than by the load balancer.

As mentioned above, I have a patch that I didn't send that adds the
update_misfit_status() condition in the push call. I agree that it should
speed up some migrations of misfit tasks to a bigger CPU without waking up
an idle CPU to do the load balance and pull the task onto a potentially
third CPU.

>
>
> > +
> > +     new_cpu = find_energy_efficient_cpu(p, cpu);
> > +
> > +     if (new_cpu == cpu)
> > +             return;
> > +
> > +     /*
> > +      * ->active_balance synchronizes accesses to
> > +      * ->active_balance_work.  Once set, it's cleared
> > +      * only after active load balance is finished.
> > +      */
> > +     if (!rq->active_balance) {
> > +             rq->active_balance = 1;
> > +             rq->push_cpu = new_cpu;
> > +     } else
> > +             return;
> > +
> > +     raw_spin_rq_unlock(rq);
> > +     stop_one_cpu_nowait(cpu,
> > +             active_load_balance_cpu_stop, rq,
> > +             &rq->active_balance_work);
> > +     raw_spin_rq_lock(rq);
> > +}
> > +
>
> Regards,
> Pierre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
  2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (4 preceding siblings ...)
  2024-08-30 13:03 ` [RFC PATCH 5/5] sched/fair: Add push task callback for EAS Vincent Guittot
@ 2024-11-07 10:14 ` Pierre Gondois
  2024-11-08  9:27   ` Vincent Guittot
  2024-11-28 17:24 ` Hongyan Xia
  6 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-11-07 10:14 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef, hongyan.xia2

Hello Vincent,
Related to feec(), but not to this patchset, I think there might be a
concurrency issue while running feec().

feec() doesn't have any locking mechanism. This means that multiple CPUs
might run the function at the same time.
If:
- 2 tasks with approximately the same utilization wake up at the same time
- some space on an energy efficient CPU is available
feec() will likely select the same target for the 2 tasks.

Once feec() has determined a target for a task, util signals are updated in
enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
is ~20us (on a Pixel6). This is not much, but it still allows some other
CPUs to run feec() with util signals that will be wrong in no time.

Note that it is also possible for one CPU to run feec() for 2 different tasks,
decide to migrate the 2 tasks to another target CPU, and then start enqueueing
the tasks. Meaning one single CPU will run feec() using util signals it knows
are wrong.

The issue is problematic as it creates some instability. Once a
'parallel selection' is done, the following scenarios can happen:
- the system goes overutilized, and EAS is disabled
- a frequency spike happens to handle the unexpected load.
   Then the perf. domain becomes less energy efficient compared to other
   perf. domains, and tasks are migrated out of this perf. domain

I made the following prototype to avoid 'parallel selections'. The goal here
is to tag CPUs that are under pending migration.
A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
CPUs should therefore not consider this CPU as a valid candidate.
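
The tagging idea can be illustrated with the following user-space
model. This is not the prototype itself, only a sketch of the
principle under stated assumptions: one flag per CPU, claimed with a
compare-and-exchange at the end of the selection and released once
the enqueue is done. All the model_* names and MODEL_NR_CPUS are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Set while a task selected for this CPU has not been enqueued yet. */
static atomic_bool eas_pending_enqueue[MODEL_NR_CPUS];

/* End of selection: claim @cpu; fails if another selection got there first. */
static bool model_claim_cpu(int cpu)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&eas_pending_enqueue[cpu],
					      &expected, true);
}

/* Enqueue done: util signals now account for the task, drop the tag. */
static void model_release_cpu(int cpu)
{
	atomic_store(&eas_pending_enqueue[cpu], false);
}

/* Return the first candidate that is not under pending enqueue, or -1. */
static int model_select_cpu(const int *candidates, int nr)
{
	for (int i = 0; i < nr; i++)
		if (model_claim_cpu(candidates[i]))
			return candidates[i];
	return -1;
}
```

In this model, two selections racing over the same candidate list end
up on different CPUs, since the second one sees the first CPU as
tagged and falls through to the next candidate.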

The implementation is a bit raw, but it gives some good results. I used
rt-app workloads, trying not to have tasks waking up at the same time
during the whole test:
Workload1:
N tasks with a period of 16ms and a util of 4/8. Each task starts with a
4ms delay. Each workload lasts 20s and is run over 5 iterations.

Workload2:
N tasks with a period of (8 + n) ms and a util of 4/8, i.e. the first task
has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
20s and is run over 5 iterations.

The following are presented:
- the measured energy consumed, according to the Pixel6 energy meters
- the estimated energy consumed: lisa uses the util signals along with
   the CPU frequencies and the Energy Model to compute an estimate.
- the amount of time spent in the overutilized state, as a percentage.
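
The 'ratio' column in the tables below appears to be the relative
change of the 'with' value against the 'without' value, in percent.
This is inferred from the numbers rather than stated explicitly, and
the helper name is hypothetical:

```c
/* Relative change of 'with' against 'without', in percent. */
static double ratio_percent(double without, double with)
{
	return (with - without) / without * 100.0;
}
```

For example, ratio_percent(3220.970324, 3312.097508) gives about 2.83,
matching the first row of the Workload1 measured energy table; a
negative value means a lower value with the prototype applied.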

------

Workload1:

Measured energy:
+------+-------+--------------+--------------+------------+
| util | count | without      | with         | ratio      |
+------+-------+--------------+--------------+------------+
| 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
| 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
| 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
| 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
| 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
| 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
+------+-------+--------------+--------------+------------+
Std:
+------+-------+-------------+-------------+
| util | count | without     | with        |
+------+-------+-------------+-------------+
| 4    | 8     | 165.563394  | 48.903514   |
| 4    | 12    | 518.609612  | 81.170952   |
| 4    | 16    | 329.729882  | 192.739245  |
| 8    | 8     | 105.144497  | 336.796522  |
| 8    | 12    | 384.615323  | 339.86986   |
| 8    | 16    | 1252.735561 | 2563.268952 |
+------+-------+-------------+-------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
| 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
| 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
| 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
| 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
| 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+-----------+
| util | count | without  | with      |
+------+-------+----------+-----------+
| 4    | 8     | 1.3896e9 | 5.4265e8  |
| 4    | 12    | 4.7511e9 | 5.1521e8  |
| 4    | 16    | 3.5486e9 | 1.2625e9  |
| 8    | 8     | 3.0033e8 | 2.3168e9  |
| 8    | 12    | 8.7739e9 | 3.0743e9  |
| 8    | 16    | 6.7982e9 | 2.2393e10 |
+------+-------+----------+-----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
| 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
| 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
| 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
| 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
| 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+-----------+-----------+
| util | count | without   | with      |
+------+-------+-----------+-----------+
| 4    | 8     | 0.232077  | 0.016531  |
| 4    | 12    | 0.338637  | 0.040252  |
| 4    | 16    | 0.729743  | 6.368214  |
| 8    | 8     | 1.702964  | 1.722589  |
| 8    | 12    | 34.436278 | 17.314564 |
| 8    | 16    | 14.540217 | 33.77831  |
+------+-------+-----------+-----------+

------

Workload2:

Measured energy:
+------+-------+--------------+--------------+-----------+
| util | count | without      | with         | ratio     |
+------+-------+--------------+--------------+-----------+
| 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
| 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
| 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
| 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
| 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
| 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
+------+-------+--------------+--------------+-----------+
Std:
+------+-------+------------+------------+
| util | count | without    | with       |
+------+-------+------------+------------+
| 4    | 8     | 87.450628  | 76.955783  |
| 4    | 12    | 106.062839 | 116.882891 |
| 4    | 16    | 182.525881 | 172.819307 |
| 8    | 8     | 874.292359 | 162.790237 |
| 8    | 12    | 151.830636 | 339.286741 |
| 8    | 16    | 904.751446 | 154.419644 |
+------+-------+------------+------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
| 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
| 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
| 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
| 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
| 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without  | with     |
+------+-------+----------+----------+
| 4    | 8     | 3.5449e8 | 8.2454e8 |
| 4    | 12    | 9.4248e8 | 1.1364e9 |
| 4    | 16    | 8.3240e8 | 1.2084e9 |
| 8    | 8     | 9.0364e9 | 5.0381e8 |
| 8    | 12    | 9.9112e8 | 3.0836e9 |
| 8    | 16    | 4.9429e8 | 1.9533e9 |
+------+-------+----------+----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+----------+------------+
| util | count | without   | with     | ratio      |
+------+-------+-----------+----------+------------+
| 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
| 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
| 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
| 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
| 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
| 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
+------+-------+-----------+----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without  | with     |
+------+-------+----------+----------+
| 4    | 8     | 0.212919 | 0.036856 |
| 4    | 12    | 0.069696 | 0.060257 |
| 4    | 16    | 0.63995  | 0.542028 |
| 8    | 8     | 2.158079 | 0.211775 |
| 8    | 12    | 0.089159 | 0.187436 |
| 8    | 16    | 0.798565 | 1.669003 |
+------+-------+----------+----------+

------

Analysis:

- [1]
Without the patch, 2 tasks end up on one little CPU. This consumes
less energy than using the medium/big CPU according to the energy model,
but EAS should not be capable of doing such task placement as the little
CPU becomes overutilized.
Without the patch, the system is overutilized a lot more than with the patch.

-
Looking at the overutilized ratio, being overutilized 0.5% of the time or
0.05% of the time might seem close, but it means that EAS ended up
doing a bad task placement multiple, independent times.

-
The overutilized ratio should be checked along the energy results as it
shows how much EAS was involved in the task placement.

-
Overall, the energy consumed is less. The quantity of energy saved varies
with the workload.

------

On another note, will there be a v2 of the present patchset
(sched/fair: Rework EAS to handle more cases)?

Regards,
Pierre

------


diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb343136ddd0..812d5bf88875 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1592,6 +1592,12 @@ struct task_struct {
         struct user_event_mm            *user_event_mm;
  #endif
  
+       /*
+        * Keep track of the CPU feec() migrated this task to.
+        * There is a per-cpu 'eas_pending_enqueue' value to reset.
+        */
+       int eas_target_cpu;
+
         /*
          * New fields for task_struct should be added above here, so that
          * they are included in the randomized portion of task_struct.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c157d4860a3b..34911eb059cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
         se->sched_delayed = 0;
  }
  
+DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
+
  /*
   * The enqueue_task method is called before nr_running is
   * increased. Here we update the fair scheduling stats and
@@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                 check_update_overutilized_status(rq);
  
  enqueue_throttle:
+       if (p->eas_target_cpu != -1) {
+               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
+               p->eas_target_cpu = -1;
+       }
+
         assert_list_leaf_cfs_rq(rq);
  
         hrtick_update(rq);
@@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
                         if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                                 continue;
  
+                       /* Skip this CPU as its util signal will be invalid soon. */
+                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
+                           cpu != prev_cpu)
+                               continue;
+
                         util = cpu_util(cpu, p, cpu, 0);
                         cpu_cap = capacity_of(cpu);
  
@@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
             ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
                 target = best_energy_cpu;
  
+       /*
+        * 'Lock' the target CPU if there is a migration. Prevent other feec()
+        * calls from using the same target CPU until util signals are updated.
+        */
+       if (prev_cpu != target) {
+               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
+                       p->eas_target_cpu = target;
+               else
+                       target = prev_cpu;
+       }
+
         return target;
  
  unlock:

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
  2024-11-07 10:14 ` [PATCH 0/5] sched/fair: Rework EAS to handle more cases Pierre Gondois
@ 2024-11-08  9:27   ` Vincent Guittot
  2024-11-08 13:10     ` Pierre Gondois
  0 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2024-11-08  9:27 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2

Hi Pierre,

On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Vincent,
> Related to feec(), but not to this patchset, I think there might be a
> concurrency issue while running feec().

yes, this is a known limitation

>
> Feec() doesn't have any locking mechanism. This means that multiple CPUs
> might run the function at the same time.

this is done on purpose as we don't want to lock and slow down the wakeup path

> If:
> - 2 tasks with approximately the same utilization wake up at the same time
> - some space on an energy efficient CPU is available
> feec() will likely select the same target for the 2 tasks.

yes

>
> Once feec() determined a target for a task, util signals are updated in
> enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
> is ~20us (on a Pixel6). This is not much, but this still allows some other

20us is quite long. Is this the worst case, on a little core at the lowest freq?

> CPUs to run feec() using util signals that will be wrong in no time.
>
> Note that it is also possible for one CPU to run feec() for 2 different tasks,
> decide to migrate the 2 tasks to another target CPU, and then start enqueueing
> the tasks. Meaning one single CPU will run feec() using util signals it knows
> are wrong.

Isn't this case serialized, because CPU selection for the next task will
happen after enqueuing the 1st one?

>
> The issue is problematic as it creates some instability. Once a
> 'parallel selection' is done, the following scenarios can happen:
> - the system goes overutilized, and EAS is disabled
> - a frequency spike happens to handle the unexpected load.
>    Then the perf. domain becomes less energy efficient compared to other
>    perf. domains, and tasks are migrated out of this perf. domain
>
> I made the following prototype to avoid 'parallel selections'. The goal here
> is to tag CPUs that are under pending migration.
> A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
> CPUs should therefore not consider this CPU as valid candidate.
>
> The implementation is a bit raw, but it gives some good results. Using rt-app
> workloads, and trying not to have tasks waking up at the same timing during
> the whole test:
> Workload1:
> N tasks with a period of 16ms and a util of 4/8. Each task starts with a
> 4ms delay. Each workload lasts 20s and is run over 5 iterations.
>
> Workload2:
> N tasks with a period of (8 +n)ms and a util of 4/8. I.e. the first task
> has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
> 20s and is run over 5 iterations.
>
> Are presented:
> - the measured energy consumed, according to the Pixel6 energy meters
> - the estimated energy consumed, lisa uses the util signals along with
>    the CPU frequencies and the Energy Model to do an estimation.
> - the amount of time spent in the overutilized state, in percentage.
>
> ------
>
> Workload1:
>
> Measured energy:
> +------+-------+--------------+--------------+------------+
> | util | count | without      | with         | ratio      |
> +------+-------+--------------+--------------+------------+
> | 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
> | 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
> | 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
> | 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
> | 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
> | 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
> +------+-------+--------------+--------------+------------+
> Std:
> +------+-------+-------------+-------------+
> | util | count | without     | with        |
> +------+-------+-------------+-------------+
> | 4    | 8     | 165.563394  | 48.903514   |
> | 4    | 12    | 518.609612  | 81.170952   |
> | 4    | 16    | 329.729882  | 192.739245  |
> | 8    | 8     | 105.144497  | 336.796522  |
> | 8    | 12    | 384.615323  | 339.86986   |
> | 8    | 16    | 1252.735561 | 2563.268952 |
> +------+-------+-------------+-------------+
>
> Estimated energy:
> +------+-------+-----------+-----------+------------+
> | util | count | without   | with      | ratio      |
> +------+-------+-----------+-----------+------------+
> | 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
> | 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
> | 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
> | 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
> | 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
> | 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
> +------+-------+-----------+-----------+------------+
> Std:
> +------+-------+----------+-----------+
> | util | count | without  | with      |
> +------+-------+----------+-----------+
> | 4    | 8     | 1.3896e9 | 5.4265e8  |
> | 4    | 12    | 4.7511e9 | 5.1521e8  |
> | 4    | 16    | 3.5486e9 | 1.2625e9  |
> | 8    | 8     | 3.0033e8 | 2.3168e9  |
> | 8    | 12    | 8.7739e9 | 3.0743e9  |
> | 8    | 16    | 6.7982e9 | 2.2393e10 |
> +------+-------+----------+-----------+
>
> Overutilized ratio (in % of the 20s test):
> +------+-------+-----------+-----------+------------+
> | util | count | without   | with      | ratio      |
> +------+-------+-----------+-----------+------------+
> | 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
> | 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
> | 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
> | 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
> | 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
> | 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
> +------+-------+-----------+-----------+------------+
> Std:
> +------+-------+-----------+-----------+
> | util | count | without   | with      |
> +------+-------+-----------+-----------+
> | 4    | 8     | 0.232077  | 0.016531  |
> | 4    | 12    | 0.338637  | 0.040252  |
> | 4    | 16    | 0.729743  | 6.368214  |
> | 8    | 8     | 1.702964  | 1.722589  |
> | 8    | 12    | 34.436278 | 17.314564 |
> | 8    | 16    | 14.540217 | 33.77831  |
> +------+-------+-----------+-----------+
>
> ------
>
> Workload2:
>
> Measured energy:
> +------+-------+--------------+--------------+-----------+
> | util | count | without      | with         | ratio     |
> +------+-------+--------------+--------------+-----------+
> | 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
> | 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
> | 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
> | 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
> | 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
> | 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
> +------+-------+--------------+--------------+-----------+
> Std:
> +------+-------+------------+------------+
> | util | count | without    | with       |
> +------+-------+------------+------------+
> | 4    | 8     | 87.450628  | 76.955783  |
> | 4    | 12    | 106.062839 | 116.882891 |
> | 4    | 16    | 182.525881 | 172.819307 |
> | 8    | 8     | 874.292359 | 162.790237 |
> | 8    | 12    | 151.830636 | 339.286741 |
> | 8    | 16    | 904.751446 | 154.419644 |
> +------+-------+------------+------------+
>
> Estimated energy:
> +------+-------+-----------+-----------+------------+
> | util | count | without   | with      | ratio      |
> +------+-------+-----------+-----------+------------+
> | 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
> | 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
> | 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
> | 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
> | 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
> | 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
> +------+-------+-----------+-----------+------------+
> Std:
> +------+-------+----------+----------+
> | util | count | without  | with     |
> +------+-------+----------+----------+
> | 4    | 8     | 3.5449e8 | 8.2454e8 |
> | 4    | 12    | 9.4248e8 | 1.1364e9 |
> | 4    | 16    | 8.3240e8 | 1.2084e9 |
> | 8    | 8     | 9.0364e9 | 5.0381e8 |
> | 8    | 12    | 9.9112e8 | 3.0836e9 |
> | 8    | 16    | 4.9429e8 | 1.9533e9 |
> +------+-------+----------+----------+
>
> Overutilized ratio (in % of the 20s test):
> +------+-------+-----------+----------+------------+
> | util | count | without   | with     | ratio      |
> +------+-------+-----------+----------+------------+
> | 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
> | 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
> | 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
> | 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
> | 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
> | 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
> +------+-------+-----------+----------+------------+
> Std:
> +------+-------+----------+----------+
> | util | count | without  | with     |
> +------+-------+----------+----------+
> | 4    | 8     | 0.212919 | 0.036856 |
> | 4    | 12    | 0.069696 | 0.060257 |
> | 4    | 16    | 0.63995  | 0.542028 |
> | 8    | 8     | 2.158079 | 0.211775 |
> | 8    | 12    | 0.089159 | 0.187436 |
> | 8    | 16    | 0.798565 | 1.669003 |
> +------+-------+----------+----------+
>
> ------
>
> Analysis:
>
> - [1]
> Without the patch, 2 tasks end up on one little CPU. This consumes
> less energy than using the medium/big CPU according to the energy model,
> but EAS should not be capable of doing such task placement as the little
> CPU becomes overutilized.
> Without the patch, the system is overutilized a lot more than with the patch.
>
> -
> Looking at the overutilized ratio, being overutilized 0.5% of the time or
> 0.05% of the time might seem close, but it means that EAS ended up
> doing a bad task placement multiple, independent times.
>
> -
> The overutilized ratio should be checked along the energy results as it
> shows how much EAS was involved in the task placement.
>
> -
> Overall, the energy consumed is less. The quantity of energy saved varies
> with the workload.
>
> ------
>
> On another note, will there be a v2 of the present patchset
> (sched/fair: Rework EAS to handle more cases)?

yes, I have been sidetracked by other stuff since LPC and haven't
been able to finalize testing on v2, but it's ongoing

>
> Regards,
> Pierre
>
> ------
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bb343136ddd0..812d5bf88875 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1592,6 +1592,12 @@ struct task_struct {
>          struct user_event_mm            *user_event_mm;
>   #endif
>
> +       /*
> +        * Keep track of the CPU feec() migrated this task to.
> +        * There is a per-cpu 'eas_pending_enqueue' value to reset.
> +        */
> +       int eas_target_cpu;
> +
>          /*
>           * New fields for task_struct should be added above here, so that
>           * they are included in the randomized portion of task_struct.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c157d4860a3b..34911eb059cf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
>          se->sched_delayed = 0;
>   }
>
> +DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
> +
>   /*
>    * The enqueue_task method is called before nr_running is
>    * increased. Here we update the fair scheduling stats and
> @@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                  check_update_overutilized_status(rq);
>
>   enqueue_throttle:
> +       if (p->eas_target_cpu != -1) {
> +               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
> +               p->eas_target_cpu = -1;
> +       }
> +
>          assert_list_leaf_cfs_rq(rq);
>
>          hrtick_update(rq);
> @@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>                          if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>                                  continue;
>
> +                       /* Skip this CPU as its util signal will be invalid soon. */
> +                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
> +                           cpu != prev_cpu)
> +                               continue;
> +
>                          util = cpu_util(cpu, p, cpu, 0);
>                          cpu_cap = capacity_of(cpu);
>
> @@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>              ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
>                  target = best_energy_cpu;
>
> +       /*
> +        * 'Lock' the target CPU if there is a migration. Prevent other feec()
> +        * calls from using the same target CPU until util signals are updated.
> +        */
> +       if (prev_cpu != target) {
> +               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
> +                       p->eas_target_cpu = target;
> +               else
> +                       target = prev_cpu;
> +       }
> +
>          return target;
>
>   unlock:

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
  2024-11-08  9:27   ` Vincent Guittot
@ 2024-11-08 13:10     ` Pierre Gondois
  2024-11-11 19:08       ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Pierre Gondois @ 2024-11-08 13:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, Christian Loehle



On 11/8/24 10:27, Vincent Guittot wrote:
> Hi Pierre,
> 
> On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Hello Vincent,
>> Related to feec(), but not to this patchset, I think there might be a
>> concurrency issue while running feec().
> 
> yes, this is a known limitation
> 
>>
>> Feec() doesn't have any locking mechanism. This means that multiple CPUs
>> might run the function at the same time.
> 
> this is done on purpose as we don't want to lock and slow down the wakeup path

Yes right, this is understandable. However, there could be a way to bail out of
feec() when such a case is detected, without actually waiting for a lock (cf. the
prototype).
We already bail out of feec() when the utilization of a CPU without a task is
higher than with the task in the energy computation.

> 
>> If:
>> - 2 tasks with approximately the same utilization wake up at the same time
>> - some space on an energy efficient CPU is available
>> feec() will likely select the same target for the 2 tasks.
> 
> yes
> 
>>
>> Once feec() determined a target for a task, util signals are updated in
>> enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
>> is ~20us (on a Pixel6). This is not much, but this still allows some other
> 
> 20us is quite long. Is this the worst case, on a little core at the lowest freq?

I only kept the occurrences where feec() ends up with a target != prev_cpu.
In these cases, enqueuing is done on the target CPU (cf. __ttwu_queue_wakelist),
which might take more time.

In the other case, the delta is indeed lower (~10us).

> 
>> CPUs to run feec() using util signals that will be wrong in no time.
>>
>> Note that it is also possible for one CPU to run feec() for 2 different tasks,
>> decide to migrate the 2 tasks to another target CPU, and then start enqueueing
>> the tasks. Meaning one single CPU will run feec() using util signals it knows
>> are wrong.
> 
> Isn't this case serialized, because CPU selection for the next task will
> happen after enqueuing the 1st one?

I'm not sure I understand the question, but if the enqueue is done on the
target CPU, the running CPU might call feec() in the meantime.

> 
>>
>> The issue is problematic as it creates some instability. Once a
>> 'parallel selection' is done, the following scenarios can happen:
>> - the system goes overutilized, and EAS is disabled
>> - a frequency spike happens to handle the unexpected load.
>>     Then the perf. domain becomes less energy efficient compared to other
>>     perf. domains, and tasks are migrated out of this perf. domain
>>
>> I made the following prototype to avoid 'parallel selections'. The goal here
>> is to tag CPUs that are under pending migration.
>> A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
>> CPUs should therefore not consider this CPU as valid candidate.
>>
>> The implementation is a bit raw, but it gives some good results. Using rt-app
>> workloads, and trying not to have tasks waking up at the same timing during
>> the whole test:
>> Workload1:
>> N tasks with a period of 16ms and a util of 4/8. Each task starts with a
>> 4ms delay. Each workload lasts 20s and is run over 5 iterations.
>>
>> Workload2:
>> N tasks with a period of (8 +n)ms and a util of 4/8. I.e. the first task
>> has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
>> 20s and is run over 5 iterations.
>>
>> Are presented:
>> - the measured energy consumed, according to the Pixel6 energy meters
>> - the estimated energy consumed, lisa uses the util signals along with
>>     the CPU frequencies and the Energy Model to do an estimation.
>> - the amount of time spent in the overutilized state, in percentage.
>>
>> ------
>>
>> Workload1:
>>
>> Measured energy:
>> +------+-------+--------------+--------------+------------+
>> | util | count | without      | with         | ratio      |
>> +------+-------+--------------+--------------+------------+
>> | 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
>> | 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
>> | 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
>> | 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
>> | 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
>> | 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
>> +------+-------+--------------+--------------+------------+
>> Std:
>> +------+-------+-------------+-------------+
>> | util | count | without     | with        |
>> +------+-------+-------------+-------------+
>> | 4    | 8     | 165.563394  | 48.903514   |
>> | 4    | 12    | 518.609612  | 81.170952   |
>> | 4    | 16    | 329.729882  | 192.739245  |
>> | 8    | 8     | 105.144497  | 336.796522  |
>> | 8    | 12    | 384.615323  | 339.86986   |
>> | 8    | 16    | 1252.735561 | 2563.268952 |
>> +------+-------+-------------+-------------+
>>
>> Estimated energy:
>> +------+-------+-----------+-----------+------------+
>> | util | count | without   | with      | ratio      |
>> +------+-------+-----------+-----------+------------+
>> | 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
>> | 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
>> | 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
>> | 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
>> | 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
>> | 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
>> +------+-------+-----------+-----------+------------+
>> Std:
>> +------+-------+----------+-----------+
>> | util | count | without  | with      |
>> +------+-------+----------+-----------+
>> | 4    | 8     | 1.3896e9 | 5.4265e8  |
>> | 4    | 12    | 4.7511e9 | 5.1521e8  |
>> | 4    | 16    | 3.5486e9 | 1.2625e9  |
>> | 8    | 8     | 3.0033e8 | 2.3168e9  |
>> | 8    | 12    | 8.7739e9 | 3.0743e9  |
>> | 8    | 16    | 6.7982e9 | 2.2393e10 |
>> +------+-------+----------+-----------+
>>
>> Overutilized ratio (in % of the 20s test):
>> +------+-------+-----------+-----------+------------+
>> | util | count | without   | with      | ratio      |
>> +------+-------+-----------+-----------+------------+
>> | 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
>> | 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
>> | 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
>> | 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
>> | 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
>> | 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
>> +------+-------+-----------+-----------+------------+
>> Std:
>> +------+-------+-----------+-----------+
>> | util | count | without   | with      |
>> +------+-------+-----------+-----------+
>> | 4    | 8     | 0.232077  | 0.016531  |
>> | 4    | 12    | 0.338637  | 0.040252  |
>> | 4    | 16    | 0.729743  | 6.368214  |
>> | 8    | 8     | 1.702964  | 1.722589  |
>> | 8    | 12    | 34.436278 | 17.314564 |
>> | 8    | 16    | 14.540217 | 33.77831  |
>> +------+-------+-----------+-----------+
>>
>> ------
>>
>> Workload2:
>>
>> Measured energy:
>> +------+-------+--------------+--------------+-----------+
>> | util | count | without      | with         | ratio     |
>> +------+-------+--------------+--------------+-----------+
>> | 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
>> | 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
>> | 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
>> | 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
>> | 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
>> | 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
>> +------+-------+--------------+--------------+-----------+
>> Std:
>> +------+-------+------------+------------+
>> | util | count | without    | with       |
>> +------+-------+------------+------------+
>> | 4    | 8     | 87.450628  | 76.955783  |
>> | 4    | 12    | 106.062839 | 116.882891 |
>> | 4    | 16    | 182.525881 | 172.819307 |
>> | 8    | 8     | 874.292359 | 162.790237 |
>> | 8    | 12    | 151.830636 | 339.286741 |
>> | 8    | 16    | 904.751446 | 154.419644 |
>> +------+-------+------------+------------+
>>
>> Estimated energy:
>> +------+-------+-----------+-----------+------------+
>> | util | count | without   | with      | ratio      |
>> +------+-------+-----------+-----------+------------+
>> | 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
>> | 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
>> | 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
>> | 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
>> | 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
>> | 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
>> +------+-------+-----------+-----------+------------+
>> Std:
>> +------+-------+----------+----------+
>> | util | count | without  | with     |
>> +------+-------+----------+----------+
>> | 4    | 8     | 3.5449e8 | 8.2454e8 |
>> | 4    | 12    | 9.4248e8 | 1.1364e9 |
>> | 4    | 16    | 8.3240e8 | 1.2084e9 |
>> | 8    | 8     | 9.0364e9 | 5.0381e8 |
>> | 8    | 12    | 9.9112e8 | 3.0836e9 |
>> | 8    | 16    | 4.9429e8 | 1.9533e9 |
>> +------+-------+----------+----------+
>>
>> Overutilized ratio (in % of the 20s test):
>> +------+-------+-----------+----------+------------+
>> | util | count | without   | with     | ratio      |
>> +------+-------+-----------+----------+------------+
>> | 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
>> | 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
>> | 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
>> | 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
>> | 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
>> | 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
>> +------+-------+-----------+----------+------------+
>> Std:
>> +------+-------+----------+----------+
>> | util | count | without  | with     |
>> +------+-------+----------+----------+
>> | 4    | 8     | 0.212919 | 0.036856 |
>> | 4    | 12    | 0.069696 | 0.060257 |
>> | 4    | 16    | 0.63995  | 0.542028 |
>> | 8    | 8     | 2.158079 | 0.211775 |
>> | 8    | 12    | 0.089159 | 0.187436 |
>> | 8    | 16    | 0.798565 | 1.669003 |
>> +------+-------+----------+----------+
>>
>> ------
>>
>> Analysis:
>>
>> - [1]
>> Without the patch, 2 tasks end up on one little CPU. This consumes
>> less energy than using the medium/big CPU according to the energy model,
>> but EAS should not be capable of doing such task placement as the little
>> CPU becomes overutilized.
>> Without the patch, the system is overutilized a lot more than with the patch.
>>
>> -
>> Looking at the overutilized ratio, being overutilized 0.5% of the time or
>> 0.05% of the time might seem close, but it means that EAS ended up
>> doing a bad task placement multiple, independent times.
>>
>> -
>> The overutilized ratio should be checked alongside the energy results as it
>> shows how much EAS was involved in the task placement.
>>
>> -
>> Overall, the energy consumed is less. The quantity of energy saved varies
>> with the workload.
>>
>> ------
>>
>> On another note, I wanted to ask if there would be a v2 of this present
>> patchset (sched/fair: Rework EAS to handle more cases),
> 
> yes, I have been side tracked by other stuff since LPC and haven't
> been able to finalize test on v2 but it's ongoing

Ok thanks!

> 
>>
>> Regards,
>> Pierre
>>
>> ------
>>
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index bb343136ddd0..812d5bf88875 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1592,6 +1592,12 @@ struct task_struct {
>>           struct user_event_mm            *user_event_mm;
>>    #endif
>>
>> +       /*
>> +        * Keep track of the CPU feec() migrated this task to.
>> +        * There is a per-cpu 'eas_pending_enqueue' value to reset.
>> +        */
>> +       int eas_target_cpu;
>> +
>>           /*
>>            * New fields for task_struct should be added above here, so that
>>            * they are included in the randomized portion of task_struct.
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c157d4860a3b..34911eb059cf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
>>           se->sched_delayed = 0;
>>    }
>>
>> +DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
>> +
>>    /*
>>     * The enqueue_task method is called before nr_running is
>>     * increased. Here we update the fair scheduling stats and
>> @@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>                   check_update_overutilized_status(rq);
>>
>>    enqueue_throttle:
>> +       if (p->eas_target_cpu != -1) {
>> +               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
>> +               p->eas_target_cpu = -1;
>> +       }
>> +
>>           assert_list_leaf_cfs_rq(rq);
>>
>>           hrtick_update(rq);
>> @@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>                           if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>>                                   continue;
>>
>> +                       /* Skip this CPU as its util signal will be invalid soon. */
>> +                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
>> +                           cpu != prev_cpu)
>> +                               continue;
>> +
>>                           util = cpu_util(cpu, p, cpu, 0);
>>                           cpu_cap = capacity_of(cpu);
>>
>> @@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>               ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
>>                   target = best_energy_cpu;
>>
>> +       /*
>> +        * 'Lock' the target CPU if there is a migration. Prevent other feec()
>> +        * calls from using the same target CPU until util signals are updated.
>> +        */
>> +       if (prev_cpu != target) {
>> +               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
>> +                       p->eas_target_cpu = target;
>> +               else
>> +                       target = prev_cpu;
>> +       }
>> +
>>           return target;
>>
>>    unlock:


* Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
  2024-11-08 13:10     ` Pierre Gondois
@ 2024-11-11 19:08       ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-11-11 19:08 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef, hongyan.xia2, Christian Loehle

On Fri, 8 Nov 2024 at 14:10, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
>
>
> On 11/8/24 10:27, Vincent Guittot wrote:
> > Hi Pierre,
> >
> > On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@arm.com> wrote:
> >>
> >> Hello Vincent,
> >> Related to feec(), but not to this patchset, I think there might be a
> >> concurrency issue while running feec().
> >
> > yes, this is a known limitation
> >
> >>
> >> Feec() doesn't have any locking mechanism. This means that multiple CPUs
> >> might run the function at the same time.
> >
> > this is done on purpose as we don't want to lock and slow down the wakeup path
>
> Yes right, this is understandable. However there could be a way to bail out of
> feec() when such case is detected without actually waiting for a lock (cf. the
> prototype).
> We already bail out of feec() when the utilization of a CPU without a task is
> higher than with the task in the energy computation.
>
> >
> >> If:
> >> - 2 tasks with approximately the same utilization wake up at the same time
> >> - some space on an energy efficient CPU is available
> >> feec() will likely select the same target for the 2 tasks.
> >
> > yes
> >
> >>
> >> Once feec() determined a target for a task, util signals are updated in
> >> enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
> >> is ~20us (on a Pixel6). This is not much, but this still allows some other
> >
> > 20us is quite long. Is this the worst case, on a little core at the lowest freq?
>
> I only kept the occurrences where feec() ends up with a target != prev_cpu.
> In these cases, enqueuing is done on the target CPU (cf. __ttwu_queue_wakelist),
> which might take more time.
>
> In the other case, the delta is effectively lower (~10us).
>
> >
> >> CPUs to run feec() with util signals that will be wrong in no time.
> >>
> >> Note that it is also possible for one CPU to run feec() for 2 different tasks,
> >> decide to migrate the 2 tasks to another target CPU, and then start enqueueing
> >> the tasks. Meaning one single CPU will run feec() using util signals it knows
> >> are wrong.
> >
> > isn't this case serialized because cpu selection for next task will
> > happen after enqueuing the 1st one
>
> I'm not sure I understand the question, but if the enqueue is done on the
> target CPU, the running CPU might call feec() in the meantime.

When CPUs share an LLC, the local CPU enqueues the task on the target
itself, unless the target is idle, which is the case in your example above.
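To make the two enqueue paths concrete, here is a rough userspace model of the decision. It is only a sketch under assumptions: struct cpu_state_sketch and queue_on_target_wakelist() are hypothetical names, the rule that CPUs not sharing an LLC always go through the wake list is my reading of the wakeup path rather than something stated in this thread, and the real kernel check has more conditions (CPU active, affinity, etc.).

```c
#include <stdbool.h>

/* Hypothetical, simplified per-CPU view used only by this sketch. */
struct cpu_state_sketch {
	int llc_id;     /* CPUs with the same id share a last-level cache */
	int nr_running; /* 0 means the CPU is idle */
};

/*
 * Return true when the wakeup is handed to the target CPU, which then
 * does the enqueue itself (the slower path measured above), and false
 * when the waking CPU enqueues the task on the target directly.
 */
static bool queue_on_target_wakelist(const struct cpu_state_sketch *cpus,
				     int this_cpu, int target)
{
	/* Assumption: no shared LLC means the target always does the enqueue. */
	if (cpus[this_cpu].llc_id != cpus[target].llc_id)
		return true;

	/* Waking onto the local CPU never needs the remote path. */
	if (target == this_cpu)
		return false;

	/* Shared LLC: only an idle target does the enqueue itself. */
	return cpus[target].nr_running == 0;
}
```

With a shared LLC, a wakeup towards an idle target takes the remote path, so another feec() call can run on the waking CPU before the first task's utilization is accounted on the target.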

>
> >
> >>
> >> The issue is problematic as it creates some instability. Once a
> >> 'parallel selection' is done, the following scenarios can happen:
> >> - the system goes overutilized, and EAS is disabled
> >> - a frequency spike happens to handle the unexpected load.
> >>     Then the perf. domain becomes less energy efficient compared to other
> >>     perf. domains, and tasks are migrated out of this perf. domain
> >>
> >> I made the following prototype to avoid 'parallel selections'. The goal here
> >> is to tag CPUs that are under pending migration.
> >> A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
> >> CPUs should therefore not consider this CPU as valid candidate.
> >>
> >> The implementation is a bit raw, but it gives some good results. Using rt-app
> >> workloads, and trying not to have tasks waking up at the same timing during
> >> the whole test:
> >> Workload1:
> >> N tasks with a period of 16ms and a util of 4/8. Each task starts with a
> >> 4ms delay. Each workload lasts 20s and is run over 5 iterations.
> >>
> >> Workload2:
> >> N tasks with a period of (8 +n)ms and a util of 4/8. I.e. the first task
> >> has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
> >> 20s and is run over 5 iterations.
> >>
> >> Are presented:
> >> - the measured energy consumed, according to the Pixel6 energy meters
> >> - the estimated energy consumed, lisa uses the util signals along with
> >>     the CPU frequencies and the Energy Model to do an estimation.
> >> - the amount of time spent in the overutilized state, in percentage.
> >>
> >> ------
> >>
> >> Workload1:
> >>
> >> Measured energy:
> >> +------+-------+--------------+--------------+------------+
> >> | util | count | without      | with         | ratio      |
> >> +------+-------+--------------+--------------+------------+
> >> | 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
> >> | 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
> >> | 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
> >> | 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
> >> | 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
> >> | 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
> >> +------+-------+--------------+--------------+------------+
> >> Std:
> >> +------+-------+-------------+-------------+
> >> | util | count | without     | with        |
> >> +------+-------+-------------+-------------+
> >> | 4    | 8     | 165.563394  | 48.903514   |
> >> | 4    | 12    | 518.609612  | 81.170952   |
> >> | 4    | 16    | 329.729882  | 192.739245  |
> >> | 8    | 8     | 105.144497  | 336.796522  |
> >> | 8    | 12    | 384.615323  | 339.86986   |
> >> | 8    | 16    | 1252.735561 | 2563.268952 |
> >> +------+-------+-------------+-------------+
> >>
> >> Estimated energy:
> >> +------+-------+-----------+-----------+------------+
> >> | util | count | without   | with      | ratio      |
> >> +------+-------+-----------+-----------+------------+
> >> | 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
> >> | 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
> >> | 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
> >> | 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
> >> | 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
> >> | 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
> >> +------+-------+-----------+-----------+------------+
> >> Std:
> >> +------+-------+----------+-----------+
> >> | util | count | without  | with      |
> >> +------+-------+----------+-----------+
> >> | 4    | 8     | 1.3896e9 | 5.4265e8  |
> >> | 4    | 12    | 4.7511e9 | 5.1521e8  |
> >> | 4    | 16    | 3.5486e9 | 1.2625e9  |
> >> | 8    | 8     | 3.0033e8 | 2.3168e9  |
> >> | 8    | 12    | 8.7739e9 | 3.0743e9  |
> >> | 8    | 16    | 6.7982e9 | 2.2393e10 |
> >> +------+-------+----------+-----------+
> >>
> >> Overutilized ratio (in % of the 20s test):
> >> +------+-------+-----------+-----------+------------+
> >> | util | count | without   | with      | ratio      |
> >> +------+-------+-----------+-----------+------------+
> >> | 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
> >> | 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
> >> | 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
> >> | 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
> >> | 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
> >> | 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
> >> +------+-------+-----------+-----------+------------+
> >> Std:
> >> +------+-------+-----------+-----------+
> >> | util | count | without   | with      |
> >> +------+-------+-----------+-----------+
> >> | 4    | 8     | 0.232077  | 0.016531  |
> >> | 4    | 12    | 0.338637  | 0.040252  |
> >> | 4    | 16    | 0.729743  | 6.368214  |
> >> | 8    | 8     | 1.702964  | 1.722589  |
> >> | 8    | 12    | 34.436278 | 17.314564 |
> >> | 8    | 16    | 14.540217 | 33.77831  |
> >> +------+-------+-----------+-----------+
> >>
> >> ------
> >>
> >> Workload2:
> >>
> >> Measured energy:
> >> +------+-------+--------------+--------------+-----------+
> >> | util | count | without      | with         | ratio     |
> >> +------+-------+--------------+--------------+-----------+
> >> | 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
> >> | 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
> >> | 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
> >> | 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
> >> | 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
> >> | 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
> >> +------+-------+--------------+--------------+-----------+
> >> Std:
> >> +------+-------+------------+------------+
> >> | util | count | without    | with       |
> >> +------+-------+------------+------------+
> >> | 4    | 8     | 87.450628  | 76.955783  |
> >> | 4    | 12    | 106.062839 | 116.882891 |
> >> | 4    | 16    | 182.525881 | 172.819307 |
> >> | 8    | 8     | 874.292359 | 162.790237 |
> >> | 8    | 12    | 151.830636 | 339.286741 |
> >> | 8    | 16    | 904.751446 | 154.419644 |
> >> +------+-------+------------+------------+
> >>
> >> Estimated energy:
> >> +------+-------+-----------+-----------+------------+
> >> | util | count | without   | with      | ratio      |
> >> +------+-------+-----------+-----------+------------+
> >> | 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
> >> | 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
> >> | 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
> >> | 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
> >> | 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
> >> | 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
> >> +------+-------+-----------+-----------+------------+
> >> Std:
> >> +------+-------+----------+----------+
> >> | util | count | without  | with     |
> >> +------+-------+----------+----------+
> >> | 4    | 8     | 3.5449e8 | 8.2454e8 |
> >> | 4    | 12    | 9.4248e8 | 1.1364e9 |
> >> | 4    | 16    | 8.3240e8 | 1.2084e9 |
> >> | 8    | 8     | 9.0364e9 | 5.0381e8 |
> >> | 8    | 12    | 9.9112e8 | 3.0836e9 |
> >> | 8    | 16    | 4.9429e8 | 1.9533e9 |
> >> +------+-------+----------+----------+
> >>
> >> Overutilized ratio (in % of the 20s test):
> >> +------+-------+-----------+----------+------------+
> >> | util | count | without   | with     | ratio      |
> >> +------+-------+-----------+----------+------------+
> >> | 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
> >> | 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
> >> | 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
> >> | 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
> >> | 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
> >> | 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
> >> +------+-------+-----------+----------+------------+
> >> Std:
> >> +------+-------+----------+----------+
> >> | util | count | without  | with     |
> >> +------+-------+----------+----------+
> >> | 4    | 8     | 0.212919 | 0.036856 |
> >> | 4    | 12    | 0.069696 | 0.060257 |
> >> | 4    | 16    | 0.63995  | 0.542028 |
> >> | 8    | 8     | 2.158079 | 0.211775 |
> >> | 8    | 12    | 0.089159 | 0.187436 |
> >> | 8    | 16    | 0.798565 | 1.669003 |
> >> +------+-------+----------+----------+
> >>
> >> ------
> >>
> >> Analysis:
> >>
> >> - [1]
> >> Without the patch, 2 tasks end up on one little CPU. This consumes
> >> less energy than using the medium/big CPU according to the energy model,
> >> but EAS should not be capable of doing such task placement as the little
> >> CPU becomes overutilized.
> >> Without the patch, the system is overutilized a lot more than with the patch.
> >>
> >> -
> >> Looking at the overutilized ratio, being overutilized 0.5% of the time or
> >> 0.05% of the time might seem close, but it means that EAS ended up
> >> doing a bad task placement multiple, independent times.
> >>
> >> -
> >> The overutilized ratio should be checked alongside the energy results as it
> >> shows how much EAS was involved in the task placement.
> >>
> >> -
> >> Overall, the energy consumed is less. The quantity of energy saved varies
> >> with the workload.
> >>
> >> ------
> >>
> >> On another note, I wanted to ask if there would be a v2 of this present
> >> patchset (sched/fair: Rework EAS to handle more cases),
> >
> > yes, I have been side tracked by other stuff since LPC and haven't
> > been able to finalize test on v2 but it's ongoing
>
> Ok thanks!
>
> >
> >>
> >> Regards,
> >> Pierre
> >>
> >> ------
> >>
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index bb343136ddd0..812d5bf88875 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -1592,6 +1592,12 @@ struct task_struct {
> >>           struct user_event_mm            *user_event_mm;
> >>    #endif
> >>
> >> +       /*
> >> +        * Keep track of the CPU feec() migrated this task to.
> >> +        * There is a per-cpu 'eas_pending_enqueue' value to reset.
> >> +        */
> >> +       int eas_target_cpu;
> >> +
> >>           /*
> >>            * New fields for task_struct should be added above here, so that
> >>            * they are included in the randomized portion of task_struct.
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index c157d4860a3b..34911eb059cf 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
> >>           se->sched_delayed = 0;
> >>    }
> >>
> >> +DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
> >> +
> >>    /*
> >>     * The enqueue_task method is called before nr_running is
> >>     * increased. Here we update the fair scheduling stats and
> >> @@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>                   check_update_overutilized_status(rq);
> >>
> >>    enqueue_throttle:
> >> +       if (p->eas_target_cpu != -1) {
> >> +               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
> >> +               p->eas_target_cpu = -1;
> >> +       }
> >> +
> >>           assert_list_leaf_cfs_rq(rq);
> >>
> >>           hrtick_update(rq);
> >> @@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >>                           if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> >>                                   continue;
> >>
> >> +                       /* Skip this CPU as its util signal will be invalid soon. */
> >> +                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
> >> +                           cpu != prev_cpu)
> >> +                               continue;
> >> +
> >>                           util = cpu_util(cpu, p, cpu, 0);
> >>                           cpu_cap = capacity_of(cpu);
> >>
> >> @@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >>               ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
> >>                   target = best_energy_cpu;
> >>
> >> +       /*
> >> +        * 'Lock' the target CPU if there is a migration. Prevent other feec()
> >> +        * calls from using the same target CPU until util signals are updated.
> >> +        */
> >> +       if (prev_cpu != target) {
> >> +               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
> >> +                       p->eas_target_cpu = target;
> >> +               else
> >> +                       target = prev_cpu;
> >> +       }
> >> +
> >>           return target;
> >>
> >>    unlock:
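
For anyone who wants to experiment with the tagging scheme from the prototype outside the kernel, a minimal userspace sketch follows. It assumes C11 atomics in place of the kernel's atomic_t and per-CPU variables; NR_CPUS_SKETCH, struct task_sketch and the three helper names are hypothetical and only mirror the three hunks of the diff above (candidate filter, claim at the end of feec(), release at enqueue).

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 8   /* hypothetical CPU count for this sketch */
#define NO_TARGET      (-1)

/* Non-zero while a wakeup has selected this CPU but not yet enqueued. */
static atomic_int pending_enqueue[NR_CPUS_SKETCH];

struct task_sketch {
	int eas_target_cpu; /* CPU claimed for this task, or NO_TARGET */
};

/* Candidate filter: skip CPUs with a pending claim, except prev_cpu. */
static bool cpu_is_candidate(int cpu, int prev_cpu)
{
	return cpu == prev_cpu || atomic_load(&pending_enqueue[cpu]) == 0;
}

/*
 * End of the selection: try to claim the target without waiting.
 * If another wakeup already claimed it, fall back to prev_cpu.
 */
static int claim_target(struct task_sketch *p, int prev_cpu, int target)
{
	int expected = 0;

	if (target == prev_cpu)
		return target;

	if (atomic_compare_exchange_strong_explicit(&pending_enqueue[target],
						    &expected, 1,
						    memory_order_acquire,
						    memory_order_relaxed)) {
		p->eas_target_cpu = target;
		return target;
	}

	return prev_cpu;
}

/* Enqueue time: the util signals are up to date again, drop the claim. */
static void release_target(struct task_sketch *p)
{
	if (p->eas_target_cpu != NO_TARGET) {
		/* Release ordering is a choice of this sketch, not of the prototype. */
		atomic_store_explicit(&pending_enqueue[p->eas_target_cpu], 0,
				      memory_order_release);
		p->eas_target_cpu = NO_TARGET;
	}
}
```

The point of the compare-and-exchange is that the second waker never spins: it either gets the CPU or immediately keeps prev_cpu, so the wakeup path is not slowed down by a lock.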


* Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
  2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
                   ` (5 preceding siblings ...)
  2024-11-07 10:14 ` [PATCH 0/5] sched/fair: Rework EAS to handle more cases Pierre Gondois
@ 2024-11-28 17:24 ` Hongyan Xia
  2024-11-30 10:50   ` Vincent Guittot
  6 siblings, 1 reply; 62+ messages in thread
From: Hongyan Xia @ 2024-11-28 17:24 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, lukasz.luba,
	rafael.j.wysocki, linux-kernel
  Cc: qyousef

Hi Vincent,

On 30/08/2024 14:03, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp as an example. This
> series tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
> 
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent the periodic load balancer from selecting a group_misfit_task CPU
> because group_overloaded has higher priority.
> 
> 
> Patch 2 creates a new EM interface that will be used by Patch 3
> 
> 
> Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD whereas
> others might be a better choice. feec() looks for the CPU with the highest
> spare capacity in a PD assuming that it will be the best CPU from an energy
> efficiency PoV because it will require the smallest increase of OPP.
> This is often but not always true: this policy filters out some other CPUs
> which would be as efficient because they use the same OPP but with fewer
> running tasks, as an example.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result having the same energy cost. In
> such cases, we can use other metrics to select the best CPU with the same
> energy cost. Patch 3 reworks feec() to look first for the lowest cost in a PD
> and then for the most performant CPU among the CPUs.
> 
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
> sidenote: delayed dequeue has been disabled for all tests.
> 
> 9 iterations of perf bench sched pipe -T -l 80000
>                  ops/sec  stdev
> tip/sched/core  13490    (+/- 1.7%)
> + patches 1-3   14095    (+/- 1.7%)  +4.5%
> 
> 
> When overutilized, the scheduler stops looking for an energy efficient CPU
> and falls back to the default performance mode. Although this is the best
> choice when a system is fully overutilized, it also breaks the energy
> efficiency when one CPU becomes overutilized for a short time because of
> kworker and/or background activity as an example.
> Patch 4 calls feec() every time instead of skipping it when overutilized,
> and falls back to the default performance mode only when feec() can't find a
> suitable CPU. The main advantage is that the task placement remains more
> stable especially when there is a short and transient overutilized state.
> The drawback is that the overhead can be significant for some CPU intensive
> use cases.
> 
> The overhead of patch 4 has been stressed with hackbench on dragonboard rb5
> 
>                                 tip/sched/core        + patches 1-4
> 			       Time    stdev         Time    stdev
> hackbench -l 5120 -g 1         0.724   +/-1.3%       0.765   +/-3.0% (-5.7%)
> hackbench -l 1280 -g 4         0.740   +/-1.1%       0.768   +/-1.8% (-3.8%)
> hackbench -l 640  -g 8         0.792   +/-1.3%       0.812   +/-1.6% (-2.6%)
> hackbench -l 320  -g 16        0.847   +/-1.4%       0.852   +/-1.8% (-0.6%)
> 
> hackbench -p -l 5120 -g 1      0.878   +/-1.9%       1.115   +/-3.0% (-27%)
> hackbench -p -l 1280 -g 4      0.789   +/-2.6%       0.862   +/-5.0% (-9.2%)
> hackbench -p -l 640  -g 8      0.732   +/-1.9%       0.801   +/-4.3% (-9.4%)
> hackbench -p -l 320  -g 16     0.710   +/-4.7%       0.767   +/-4.9% (-8.1%)
> 
> hackbench -T -l 5120 -g 1      0.756   +/-3.9%       0.772   +/-1.63 (-2.0%)
> hackbench -T -l 1280 -g 4      0.725   +/-1.4%       0.737   +/-2.0% (-1.3%)
> hackbench -T -l 640  -g 8      0.767   +/-1.5%       0.809   +/-2.6% (-5.5%)
> hackbench -T -l 320  -g 16     0.812   +/-1.2%       0.823   +/-2.2% (-1.4%)
> 
> hackbench -T -p -l 5120 -g 1   0.941   +/-2.5%       1.190   +/-1.6% (-26%)
> hackbench -T -p -l 1280 -g 4   0.869   +/-2.5%       0.931   +/-4.9% (-7.2%)
> hackbench -T -p -l 640  -g 8   0.819   +/-2.4%       0.895   +/-4.6% (-9.3%)
> hackbench -T -p -l 320  -g 16  0.763   +/-2.6%       0.863   +/-5.0% (-13%)
> 
> Side note: Both new feec() and current feec() give similar overheads with
> patch 4.
> 
> Although the highest reachable CPU throughput is not the only target of EAS,
> the overhead can be significant in some cases as shown in hackbench results
> above. That being said I still think it's worth the benefit for the stability
> of task placement and a better control of the power.
> 
> 
> Patch 5 solves another problem with tasks being stuck on a CPU forever
> because they don't sleep anymore and as a result never wake up and call
> feec(). Such a task can be detected by comparing util_avg or runnable_avg
> with the compute capacity of the CPU. Once detected, we can call feec() to
> check if there is a better CPU for the stuck task. The call can be done in
> 2 places:
> - When the task is put back in the runnable list after its running slice
>    with the balance callback mechanism similarly to the rt/dl push callback.
> - During cfs tick when there is only 1 running task stuck on the CPU in
>    which case the balance callback can't be used.
> 
> This push callback doesn't replace the current misfit task mechanism which
> is already implemented but this could be considered as a follow up series.
> 
> 
> This push callback mechanism with the new feec() algorithm ensures that
> tasks always get a chance to migrate to the most suitable CPU and don't
> stay stuck on a CPU which is no longer the most suitable one. As examples:
> - A task waking on a big CPU with a uclamp max preventing it from sleeping and
>    waking up can migrate to a smaller CPU once it's more power efficient.
> - The tasks are spread on CPUs in the PD when they target the same OPP.
> 
> This series implements some of the topics discussed at OSPM [1]. Other
> topics will be part of another series.
> 
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> 
> Vincent Guittot (5):
>    sched/fair: Filter false overloaded_group case for EAS
>    energy model: Add a get previous state function
>    sched/fair: Rework feec() to use cost instead of spare capacity
>    sched/fair: Use EAS also when overutilized
>    sched/fair: Add push task callback for EAS
> 
>   include/linux/energy_model.h |  18 +
>   kernel/sched/fair.c          | 693 +++++++++++++++++++++++------------
>   kernel/sched/sched.h         |   2 +
>   3 files changed, 488 insertions(+), 225 deletions(-)
> 

On second look, I do wonder if this series should be split into 
individual patches or mini-series. Some of the ideas, like 
overloaded_groups or calling EAS at more locations rather than just 
wake-up events, might be easier to review and merge if they are independent.


* Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
  2024-11-28 17:24 ` Hongyan Xia
@ 2024-11-30 10:50   ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2024-11-30 10:50 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, linux-kernel,
	qyousef

Hi Hongyan,

On Thu, 28 Nov 2024 at 18:24, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> Hi Vincent,
>
> On 30/08/2024 14:03, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > become more and more visible with features like uclamp as an example. This
> > series tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > whereas it is capped to a lower compute capacity. This wrong classification
> > can prevent the periodic load balancer from selecting a group_misfit_task CPU
> > because group_overloaded has higher priority.
> >
> >
> > Patch 2 creates a new EM interface that will be used by Patch 3
> >
> >
> > Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD whereas
> > others might be a better choice. feec() looks for the CPU with the highest
> > spare capacity in a PD assuming that it will be the best CPU from an energy
> > efficiency PoV because it will require the smallest increase of OPP.
> > This is often but not always true: this policy filters out some other CPUs
> > which would be as efficient because they use the same OPP but with fewer
> > running tasks, as an example.
> > In fact, we only care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result having the same energy cost. In
> > such cases, we can use other metrics to select the best CPU with the same
> > energy cost. Patch 3 reworks feec() to look first for the lowest cost in a PD
> > and then for the most performant CPU among the CPUs.
> >
> > perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> > of the new feec() vs current implementation.
> > sidenote: delayed dequeue has been disabled for all tests.
> >
> > 9 iterations of perf bench sched pipe -T -l 80000
> >                  ops/sec  stdev
> > tip/sched/core  13490    (+/- 1.7%)
> > + patches 1-3   14095    (+/- 1.7%)  +4.5%
> >
> >
> > When overutilized, the scheduler stops looking for an energy efficient CPU
> > and falls back to the default performance mode. Although this is the best
> > choice when a system is fully overutilized, it also breaks the energy
> > efficiency when one CPU becomes overutilized for a short time because of
> > kworker and/or background activity as an example.
> > Patch 4 calls feec() every time instead of skipping it when overutilized,
> > and falls back to the default performance mode only when feec() can't find a
> > suitable CPU. The main advantage is that the task placement remains more
> > stable especially when there is a short and transient overutilized state.
> > The drawback is that the overhead can be significant for some CPU intensive
> > use cases.
> >
> > The overhead of patch 4 has been stressed with hackbench on dragonboard rb5
> >
> >                                 tip/sched/core        + patches 1-4
> >                              Time    stdev         Time    stdev
> > hackbench -l 5120 -g 1         0.724   +/-1.3%       0.765   +/-3.0% (-5.7%)
> > hackbench -l 1280 -g 4         0.740   +/-1.1%       0.768   +/-1.8% (-3.8%)
> > hackbench -l 640  -g 8         0.792   +/-1.3%       0.812   +/-1.6% (-2.6%)
> > hackbench -l 320  -g 16        0.847   +/-1.4%       0.852   +/-1.8% (-0.6%)
> >
> > hackbench -p -l 5120 -g 1      0.878   +/-1.9%       1.115   +/-3.0% (-27%)
> > hackbench -p -l 1280 -g 4      0.789   +/-2.6%       0.862   +/-5.0% (-9.2%)
> > hackbench -p -l 640  -g 8      0.732   +/-1.9%       0.801   +/-4.3% (-9.4%)
> > hackbench -p -l 320  -g 16     0.710   +/-4.7%       0.767   +/-4.9% (-8.1%)
> >
> > hackbench -T -l 5120 -g 1      0.756   +/-3.9%       0.772   +/-1.63 (-2.0%)
> > hackbench -T -l 1280 -g 4      0.725   +/-1.4%       0.737   +/-2.0% (-1.3%)
> > hackbench -T -l 640  -g 8      0.767   +/-1.5%       0.809   +/-2.6% (-5.5%)
> > hackbench -T -l 320  -g 16     0.812   +/-1.2%       0.823   +/-2.2% (-1.4%)
> >
> > hackbench -T -p -l 5120 -g 1   0.941   +/-2.5%       1.190   +/-1.6% (-26%)
> > hackbench -T -p -l 1280 -g 4   0.869   +/-2.5%       0.931   +/-4.9% (-7.2%)
> > hackbench -T -p -l 640  -g 8   0.819   +/-2.4%       0.895   +/-4.6% (-9.3%)
> > hackbench -T -p -l 320  -g 16  0.763   +/-2.6%       0.863   +/-5.0% (-13%)
> >
> > Side note: Both new feec() and current feec() give similar overheads with
> > patch 4.
> >
> > Although the highest reachable CPU throughput is not the only target of EAS,
> > the overhead can be significant in some cases as shown in hackbench results
> > above. That being said I still think it's worth the benefit for the stability
> > of task placement and a better control of the power.
> >
> >
> > Patch 5 solves another problem with tasks being stuck on a CPU forever
> > because they don't sleep anymore and as a result never wake up and call
> > feec(). Such a task can be detected by comparing util_avg or runnable_avg
> > with the compute capacity of the CPU. Once detected, we can call feec() to
> > check if there is a better CPU for the stuck task. The call can be done in
> > 2 places:
> > - When the task is put back in the runnable list after its running slice
> >    with the balance callback mechanism similarly to the rt/dl push callback.
> > - During cfs tick when there is only 1 running task stuck on the CPU in
> >    which case the balance callback can't be used.
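
The detection described above can be illustrated outside the kernel. This is a minimal sketch, not the code from patch 5: the struct and function names are hypothetical, and the margin borrows the 1280/1024 ratio of the fits_capacity() macro in kernel/sched/fair.c as an assumption about how "comparing with the compute capacity" might be done:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a task's PELT signals. */
struct task_signals {
	unsigned long util_avg;
	unsigned long runnable_avg;
};

/*
 * Assumption: keep the same ~20% headroom as fits_capacity() in fair.c,
 * i.e. the signal fits when signal * 1280 < capacity * 1024.
 */
static bool signal_fits(unsigned long signal, unsigned long capacity)
{
	return signal * 1280 < capacity * 1024;
}

/*
 * A task is treated as potentially stuck when the larger of util_avg
 * and runnable_avg no longer fits the capacity of its current CPU.
 */
static bool task_looks_stuck(const struct task_signals *t,
			     unsigned long cpu_capacity)
{
	unsigned long sig = t->util_avg > t->runnable_avg ?
			    t->util_avg : t->runnable_avg;

	return !signal_fits(sig, cpu_capacity);
}
```

A task flagged this way would then be a candidate for a feec() call from one of the two places listed above.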
> >
> > This push callback doesn't replace the current misfit task mechanism which
> > is already implemented but this could be considered as a follow-up series.
> >
> >
> > This push callback mechanism with the new feec() algorithm ensures that
> > tasks always get a chance to migrate to the most suitable CPU and don't
> > stay stuck on a CPU which is no longer the most suitable one. As examples:
> > - A task waking on a big CPU with a uclamp max preventing it from sleeping
> >    and waking up can migrate to a smaller CPU once it's more power efficient.
> > - The tasks are spread on CPUs in the PD when they target the same OPP.
> >
> > This series implements some of the topics discussed at OSPM [1]. Other
> > topics will be part of another series.
> >
> > [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> >
> > Vincent Guittot (5):
> >    sched/fair: Filter false overloaded_group case for EAS
> >    energy model: Add a get previous state function
> >    sched/fair: Rework feec() to use cost instead of spare capacity
> >    sched/fair: Use EAS also when overutilized
> >    sched/fair: Add push task callback for EAS
> >
> >   include/linux/energy_model.h |  18 +
> >   kernel/sched/fair.c          | 693 +++++++++++++++++++++++------------
> >   kernel/sched/sched.h         |   2 +
> >   3 files changed, 488 insertions(+), 225 deletions(-)
> >
>
> On second look, I do wonder if this series should be split into
> individual patches or mini-series. Some of the ideas, like
> overloaded_groups or calling EAS at more locations rather than just
> wake-up events, might be easier to review and merge if they are independent.

The series is almost ready. I was waiting for v6.12 support on
a device like the Pixel 6 to run some benchmarks, but it is not yet
publicly available, so I might send the series without such
figures. I also wanted to test it with delayed dequeue enabled this
time, unlike the previous version:
https://lore.kernel.org/lkml/20241129161756.3081386-1-vincent.guittot@linaro.org/


end of thread, other threads:[~2024-12-05 16:23 UTC | newest]

Thread overview: 62+ messages
2024-08-30 13:03 [PATCH 0/5] sched/fair: Rework EAS to handle more cases Vincent Guittot
2024-08-30 13:03 ` [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
2024-09-02  9:01   ` Hongyan Xia
2024-09-06  6:51     ` Vincent Guittot
2024-09-13 13:21   ` Pierre Gondois
2024-08-30 13:03 ` [PATCH 2/5] energy model: Add a get previous state function Vincent Guittot
2024-09-05  9:21   ` Lukasz Luba
2024-09-06  6:55     ` Vincent Guittot
2024-08-30 13:03 ` [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
2024-09-02  9:11   ` kernel test robot
2024-09-02 11:03   ` Hongyan Xia
2024-09-06  7:08     ` Vincent Guittot
2024-09-06 15:32       ` Hongyan Xia
2024-09-12 12:12         ` Vincent Guittot
2024-09-04 15:07   ` Pierre Gondois
2024-09-06  7:08     ` Vincent Guittot
2024-09-11 14:02   ` Pierre Gondois
2024-09-11 16:51     ` Pierre Gondois
2024-09-12 12:22     ` Vincent Guittot
2024-12-05 16:23       ` Pierre Gondois
2024-08-30 13:03 ` [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized Vincent Guittot
2024-09-17 20:24   ` Christian Loehle
2024-09-19  8:25     ` Pierre Gondois
2024-09-25 13:28       ` Vincent Guittot
2024-10-07  7:03         ` Pierre Gondois
2024-10-09  8:53           ` Vincent Guittot
2024-10-11 12:52             ` Pierre Gondois
2024-10-15 12:47               ` Vincent Guittot
2024-10-31 15:21                 ` Pierre Gondois
2024-09-25 13:07     ` Vincent Guittot
2024-09-20 16:17   ` Quentin Perret
2024-09-25 13:27     ` Vincent Guittot
2024-09-26  9:10       ` Quentin Perret
2024-10-01 16:20         ` Vincent Guittot
2024-10-01 17:50           ` Quentin Perret
2024-10-02  7:11             ` Lukasz Luba
2024-10-02  7:55               ` Quentin Perret
2024-10-02  9:54                 ` Lukasz Luba
2024-10-03  6:27             ` Vincent Guittot
2024-10-03  8:15               ` Lukasz Luba
2024-10-03  8:26                 ` Quentin Perret
2024-10-03  8:52                 ` Vincent Guittot
2024-10-03  8:21               ` Quentin Perret
2024-10-03  8:57                 ` Vincent Guittot
2024-10-03  9:52                   ` Quentin Perret
2024-10-03 13:26                     ` Vincent Guittot
2024-11-19 14:46               ` Christian Loehle
2024-08-30 13:03 ` [RFC PATCH 5/5] sched/fair: Add push task callback for EAS Vincent Guittot
2024-09-09  9:59   ` Christian Loehle
2024-09-09 12:54     ` Vincent Guittot
2024-09-11 14:03   ` Pierre Gondois
2024-09-12 12:30     ` Vincent Guittot
2024-09-13  9:09       ` Pierre Gondois
2024-09-24 12:37         ` Vincent Guittot
2024-09-13 16:08   ` Pierre Gondois
2024-09-24 13:00     ` Vincent Guittot
2024-11-07 10:14 ` [PATCH 0/5] sched/fair: Rework EAS to handle more cases Pierre Gondois
2024-11-08  9:27   ` Vincent Guittot
2024-11-08 13:10     ` Pierre Gondois
2024-11-11 19:08       ` Vincent Guittot
2024-11-28 17:24 ` Hongyan Xia
2024-11-30 10:50   ` Vincent Guittot
