* [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases
@ 2025-03-02 21:05 Vincent Guittot
From: Vincent Guittot @ 2025-03-02 21:05 UTC
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, lukasz.luba, rafael.j.wysocki, pierre.gondois,
	linux-kernel
  Cc: qyousef, hongyan.xia2, christian.loehle, luis.machado, qperret,
	Vincent Guittot

The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features such as uclamp. This series
tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Patch 1 fixes the case where a CPU is wrongly classified as overloaded
when it is actually capped to a lower compute capacity. This wrong
classification can prevent the periodic load balancer from selecting a
group_misfit_task CPU because group_overloaded has higher priority.

Patch 2 creates a new EM interface that will be used by Patch 3.

Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD when
other CPUs might be a better choice. feec() looks for the CPU with the
highest spare capacity in a PD, assuming that it will be the best CPU from
an energy efficiency PoV because it will require the smallest increase of
OPP. This is often but not always true: this policy filters out other CPUs
which would be as efficient because they use the same OPP but have, for
example, fewer running tasks.
In fact, we only care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and, as a result, have the same energy cost. In
such cases, we can use other metrics to select the best CPU among those
with the same energy cost. Patch 3 reworks feec() to look first for the
lowest cost in a PD and then for the most performant CPU among the
candidates. For now, this only tries to evenly spread the number of
runnable tasks across CPUs, but this can be improved with other metrics,
like the sched slice duration, in a follow-up series.
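
As a rough illustration of the selection order described for Patch 3
(lowest cost first, then the fewest runnable tasks), here is a minimal
userspace sketch. It is not the code from the patch: struct cpu_candidate,
its fields and pick_candidate() are hypothetical names introduced only for
this example, and the cost is assumed to be already computed for each
candidate.

```c
#include <stddef.h>

/* Hypothetical candidate description; NOT a kernel structure. */
struct cpu_candidate {
	int cpu;                 /* CPU id */
	unsigned long cost;      /* assumed: cost of the OPP needed once the task is placed */
	unsigned int nr_running; /* runnable tasks already on this CPU */
};

/*
 * Return the index of the preferred candidate, or -1 if there is none.
 * The lowest cost wins; among equal costs, the CPU with fewer runnable
 * tasks wins; remaining ties keep the earliest candidate.
 */
static int pick_candidate(const struct cpu_candidate *c, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (best < 0 ||
		    c[i].cost < c[best].cost ||
		    (c[i].cost == c[best].cost &&
		     c[i].nr_running < c[best].nr_running))
			best = (int)i;
	}

	return best;
}
```

With candidates of cost 120, 100, 100 and 100 holding 3, 2, 1 and 1
runnable tasks respectively, the sketch picks the third candidate: it
shares the lowest cost and has the fewest runnable tasks.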

perf bench sched pipe on a Dragonboard RB5 has been used to compare the
overhead of the new feec() vs the current implementation.

9 iterations of perf bench sched pipe -T -l 80000
                ops/sec  stdev 
tip/sched/core  16634    (+/- 0.5%)
+ patches 1-3   17434    (+/- 1.2%)  +4.8%


Patch 4 removes the now unused em_cpu_energy().

Patch 5 solves another problem with tasks being stuck on a CPU forever
because they no longer sleep and, as a result, never wake up and call
feec(). Such tasks can be detected by comparing util_avg or runnable_avg
with the compute capacity of the CPU. Once detected, we can call feec() to
check if there is a better CPU for the stuck task. The call can be done in
2 places:
- When the task is put back in the runnable list after its running slice,
  with the balance callback mechanism, similarly to the rt/dl push callback.
- During the cfs tick when there is only 1 running task stuck on the CPU, in
  which case the balance callback can't be used.
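
The detection described above can be sketched in userspace as follows.
This is only an illustration under assumptions: struct task_signals and
task_looks_stuck() are hypothetical names, and clamping the capacity by
uclamp_max as well as the strict "greater than" comparison are guesses
about how the comparison could be done, not the condition used by
task_stuck_on_cpu() in the patch.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the signals named above; NOT kernel types. */
struct task_signals {
	unsigned long util_avg;     /* utilization of the task */
	unsigned long runnable_avg; /* runnable time of the task */
	unsigned long uclamp_max;   /* assumed upper clamp on utilization */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Assumed heuristic: the task may be stuck when the larger of its two
 * signals exceeds what the CPU can provide to it, i.e. the smaller of
 * the CPU capacity and the task's uclamp_max.
 */
static bool task_looks_stuck(const struct task_signals *t,
			     unsigned long cpu_capacity)
{
	unsigned long max_capa = min_ul(cpu_capacity, t->uclamp_max);
	unsigned long util = max_ul(t->util_avg, t->runnable_avg);

	return util > max_capa;
}
```

A task flagged this way is only a candidate for calling feec() again;
tasks that still sleep and wake up regularly go through feec() at wakeup
anyway.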

This push callback mechanism, with the new feec() algorithm, ensures that
tasks always get a chance to migrate to the most suitable CPU and don't
stay stuck on a CPU which is no longer the most suitable one. As examples:
- A task waking on a big CPU with a uclamp max preventing it from sleeping
  and waking up can migrate to a smaller CPU once that is more power
  efficient.
- The tasks are spread across the CPUs of the PD when they target the same
  OPP.

Patch 6 adds the task misfit migration case to the cfs tick and push
callback mechanism to prevent waking up an idle CPU unnecessarily.

Patch 7 removes the need to test uclamp_min in cpu_overutilized to
trigger the active migration of a task to another CPU.

Compared to v4:
- Fixed check_pushable_task for !SMP

Compared to v3:
- Fixed the empty functions

Compared to v2:
- Renamed the push and tick functions to ease understanding what they do.
  Both are kept in the same patch as they solve the same problem.
- Created some helper functions
- Fixed some typos and comments
- The task_stuck_on_cpu() condition remains unchanged. Pierre suggested
  taking into account the min capacity of the CPU, but this is not directly
  available right now. It can trigger feec() when uclamp_max is very low
  compared to the min capacity of the CPU, but feec() should keep
  returning the same CPU. This can be handled in a follow-on patch.

Compared to v1:
- The call to feec() even when overutilized has been removed
  from this series and will be addressed in a separate series. Only the
  case of uclamp_min has been kept as it is now handled by the push
  callback and tick mechanism.
- The push mechanism has been cleaned up, fixed and simplified.

This series implements some of the topics discussed at OSPM [1]. Other
topics will be part of another series.

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

Vincent Guittot (7):
  sched/fair: Filter false overloaded_group case for EAS
  energy model: Add a get previous state function
  sched/fair: Rework feec() to use cost instead of spare capacity
  energy model: Remove unused em_cpu_energy()
  sched/fair: Add push task mechanism for EAS
  sched/fair: Add misfit case to push task mecanism for EAS
  sched/fair: Update overutilized detection

 include/linux/energy_model.h | 111 ++----
 kernel/sched/fair.c          | 721 ++++++++++++++++++++++++-----------
 kernel/sched/sched.h         |   2 +
 3 files changed, 518 insertions(+), 316 deletions(-)

-- 
2.43.0



Thread overview: 31+ messages
2025-03-02 21:05 [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Vincent Guittot
2025-03-02 21:05 ` [PATCH 1/7 v5] sched/fair: Filter false overloaded_group case for EAS Vincent Guittot
2025-03-04  4:38   ` K Prateek Nayak
2025-03-05  8:13     ` Vincent Guittot
2025-03-02 21:05 ` [PATCH 2/7 v5] energy model: Add a get previous state function Vincent Guittot
2025-03-02 21:05 ` [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity Vincent Guittot
2025-03-12 14:08   ` Pierre Gondois
2025-03-14 16:24     ` Vincent Guittot
2025-03-16 20:21       ` Pierre Gondois
2025-03-25 11:09   ` Pierre Gondois
2025-03-02 21:05 ` [PATCH 4/7 v5] energy model: Remove unused em_cpu_energy() Vincent Guittot
2025-03-02 21:05 ` [PATCH 5/7 v5] sched/fair: Add push task mechanism for EAS Vincent Guittot
2025-03-07 12:51   ` kernel test robot
2025-03-10 12:47   ` kernel test robot
2025-03-10 18:20   ` Shrikanth Hegde
2025-03-11 16:27     ` Vincent Guittot
2025-03-19 15:26   ` Valentin Schneider
2025-03-24 16:34   ` Christian Loehle
2025-03-25 11:16   ` Christian Loehle
2025-04-15 13:52     ` Vincent Guittot
2025-04-16 13:52       ` Christian Loehle
2025-04-15  2:31   ` Xuewen Yan
2025-04-15 13:51     ` Vincent Guittot
2025-04-16  2:03       ` Xuewen Yan
2025-03-02 21:05 ` [PATCH 6/7 v5] sched/fair: Add misfit case to push task mecanism " Vincent Guittot
2025-03-24 16:06   ` Christian Loehle
2025-03-02 21:05 ` [PATCH 7/7 v5] sched/fair: Update overutilized detection Vincent Guittot
2025-03-24 16:41 ` [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases Christian Loehle
2025-04-03 12:36 ` Christian Loehle
2025-04-15 13:49   ` Vincent Guittot
2025-04-16 10:51     ` Christian Loehle
