* [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
@ 2026-05-12 12:46 Qais Yousef
  2026-05-12 20:52 ` John Stultz
  2026-05-15 18:35 ` Tim Chen
  0 siblings, 2 replies; 6+ messages in thread
From: Qais Yousef @ 2026-05-12 12:46 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
	John Stultz, Dietmar Eggemann, Tim Chen, Chen, Yu C,
	Thomas Gleixner, linux-kernel, linux-pm, Qais Yousef

util_est_update() reads task_util() at dequeue, but task_util() is only
updated inside dequeue_entities(). To read an accurate util_avg at
dequeue, make sure to do the read after load_avg has been updated in
dequeue_entities().

util_est for a periodic task before

                                periodic-3114 util_est.enqueued running
   ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
 95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
   │              ▛                                                                                │
 51┤             ▐▘                                                                                │
  7┤      ▖▗▗  ▗▄▐                                                                                 │
   └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
  0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87

and after

                                 periodic-2977 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
 82.0┤             ▟                                                                               │
     │             ▌                                                                               │
 44.5┤             ▌                                                                               │
  7.0┤      ▗   ▗▖ ▌                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86

Note how the signal was noisier before, peaking at 183 vs 157 now.

Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---

This is split from [1] series where I stumbled upon this problem. AFAICS it
needs backporting all the way to 6.12 LTS.

[1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/

 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..96ba97e5f4ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	int ret;
+
 	if (task_is_throttled(p)) {
 		dequeue_throttled_task(p, flags);
 		return true;
@@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);
 
+	ret = dequeue_entities(rq, &p->se, flags);
 	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
-	if (dequeue_entities(rq, &p->se, flags) < 0)
+	if (ret < 0)
 		return false;
 
 	/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
  2026-05-12 12:46 [PATCH] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef
@ 2026-05-12 20:52 ` John Stultz
  2026-05-13 12:46   ` Vincent Guittot
  2026-05-15 18:35 ` Tim Chen
  1 sibling, 1 reply; 6+ messages in thread
From: John Stultz @ 2026-05-12 20:52 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> util_est_update() reads task_util() at dequeue, but task_util() is only
> updated inside dequeue_entities(). To read an accurate util_avg at
> dequeue, make sure to do the read after load_avg has been updated in
> dequeue_entities().
>
> util_est for a periodic task before
>
>                                 periodic-3114 util_est.enqueued running
>    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
>  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
>    │              ▛                                                                                │
>  51┤             ▐▘                                                                                │
>   7┤      ▖▗▗  ▗▄▐                                                                                 │
>    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
>   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
>
> and after
>
>                                  periodic-2977 util_est.enqueued running
>      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
>  82.0┤             ▟                                                                               │
>      │             ▌                                                                               │
>  44.5┤             ▌                                                                               │
>   7.0┤      ▗   ▗▖ ▌                                                                               │
>      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
>     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
>
> Note how the signal was noisier before, peaking at 183 vs 157 now.
>
> Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>
> This is split from [1] series where I stumbled upon this problem. AFAICS it
> needs backporting all the way to 6.12 LTS.
>
> [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
>
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..96ba97e5f4ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>   */
>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
> +       int ret;
> +
>         if (task_is_throttled(p)) {
>                 dequeue_throttled_task(p, flags);
>                 return true;
> @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>         if (!p->se.sched_delayed)
>                 util_est_dequeue(&rq->cfs, p);
>
> +       ret = dequeue_entities(rq, &p->se, flags);
>         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> -       if (dequeue_entities(rq, &p->se, flags) < 0)

Hrm, so the Sashiko tool raised a reasonable concern on the earlier
version of this:
  https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12

Specifically, we shouldn't reference p after dequeue_entities() or we
risk racing with it being woken up, running, and maybe exiting on
another cpu.
And this moves the util_est_update() call to after dequeue finishes.

Maybe there's some way to have util_est_update() compensate for the
unfinished accounting that will be done in dequeue_entities()?

thanks
-john


* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
  2026-05-12 20:52 ` John Stultz
@ 2026-05-13 12:46   ` Vincent Guittot
  2026-05-15  1:51     ` Qais Yousef
  0 siblings, 1 reply; 6+ messages in thread
From: Vincent Guittot @ 2026-05-13 12:46 UTC (permalink / raw)
  To: John Stultz
  Cc: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > util_est_update() reads task_util() at dequeue, but task_util() is only
> > updated inside dequeue_entities(). To read an accurate util_avg at
> > dequeue, make sure to do the read after load_avg has been updated in
> > dequeue_entities().
> >
> > util_est for a periodic task before
> >
> >                                 periodic-3114 util_est.enqueued running
> >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> >    │              ▛                                                                                │
> >  51┤             ▐▘                                                                                │
> >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> >
> > and after
> >
> >                                  periodic-2977 util_est.enqueued running
> >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> >  82.0┤             ▟                                                                               │
> >      │             ▌                                                                               │
> >  44.5┤             ▌                                                                               │
> >   7.0┤      ▗   ▗▖ ▌                                                                               │
> >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> >
> > Note how the signal was noisier before, peaking at 183 vs 157 now.
> >
> > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> >
> > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > needs backporting all the way to 6.12 LTS.
> >
> > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> >
> >  kernel/sched/fair.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 728965851842..96ba97e5f4ae 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >   */
> >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +       int ret;
> > +
> >         if (task_is_throttled(p)) {
> >                 dequeue_throttled_task(p, flags);
> >                 return true;
> > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >         if (!p->se.sched_delayed)
> >                 util_est_dequeue(&rq->cfs, p);
> >
> > +       ret = dequeue_entities(rq, &p->se, flags);
> >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> 
> Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> version of this:
>   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12

Even without sashiko, this is what the comment says just below in the function ;-)

Below is a different way to fix it which also covers cases when we have both
DEQUEUE_SLEEP and DEQUEUE_DELAYED

---
 kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
 1 file changed, 93 insertions(+), 96 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c83fbe4e88c1..0976adc12594 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	trace_pelt_cfs_tp(cfs_rq);
 }

+#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
+
+static inline void util_est_update(struct sched_entity *se)
+{
+	unsigned int ewma, dequeued, last_ewma_diff;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/* Get current estimate of utilization */
+	ewma = READ_ONCE(se->avg.util_est);
+
+	/*
+	 * If the PELT values haven't changed since enqueue time,
+	 * skip the util_est update.
+	 */
+	if (ewma & UTIL_AVG_UNCHANGED)
+		return;
+
+	/* Get utilization at dequeue */
+	dequeued = READ_ONCE(se->avg.util_avg);
+
+	/*
+	 * Reset EWMA on utilization increases, the moving average is used only
+	 * to smooth utilization decreases.
+	 */
+	if (ewma <= dequeued) {
+		ewma = dequeued;
+		goto done;
+	}
+
+	/*
+	 * Skip update of task's estimated utilization when its members are
+	 * already ~1% close to its last activation value.
+	 */
+	last_ewma_diff = ewma - dequeued;
+	if (last_ewma_diff < UTIL_EST_MARGIN)
+		goto done;
+
+	/*
+	 * To avoid underestimate of task utilization, skip updates of EWMA if
+	 * we cannot grant that thread got all CPU time it wanted.
+	 */
+	if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
+		goto done;
+
+
+	/*
+	 * Update Task's estimated utilization
+	 *
+	 * When *p completes an activation we can consolidate another sample
+	 * of the task size. This is done by using this value to update the
+	 * Exponential Weighted Moving Average (EWMA):
+	 *
+	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
+	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
+	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
+	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
+	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
+	 *
+	 * Where 'w' is the weight of new samples, which is configured to be
+	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
+	 */
+	ewma <<= UTIL_EST_WEIGHT_SHIFT;
+	ewma  -= last_ewma_diff;
+	ewma >>= UTIL_EST_WEIGHT_SHIFT;
+done:
+	ewma |= UTIL_AVG_UNCHANGED;
+	WRITE_ONCE(se->avg.util_est, ewma);
+
+	trace_sched_util_est_se_tp(se);
+}
+
 /*
  * Optional action to be done while updating the load average
  */
-#define UPDATE_TG	0x1
-#define SKIP_AGE_LOAD	0x2
-#define DO_ATTACH	0x4
-#define DO_DETACH	0x8
+#define UPDATE_TG	0x01
+#define SKIP_AGE_LOAD	0x02
+#define DO_ATTACH	0x04
+#define DO_DETACH	0x08
+#define UPDATE_UTIL_EST	0x10

 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		if (flags & UPDATE_TG)
 			update_tg_load_avg(cfs_rq);
 	}
+
+	if (flags & UPDATE_UTIL_EST)
+		util_est_update(se);
 }

 /*
@@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
 	return READ_ONCE(p->se.avg.util_avg);
 }

-static inline unsigned long task_runnable(struct task_struct *p)
-{
-	return READ_ONCE(p->se.avg.runnable_avg);
-}
-
 static inline unsigned long _task_util_est(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
@@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
 	trace_sched_util_est_cfs_tp(cfs_rq);
 }

-#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
-
-static inline void util_est_update(struct cfs_rq *cfs_rq,
-				   struct task_struct *p,
-				   bool task_sleep)
-{
-	unsigned int ewma, dequeued, last_ewma_diff;
-
-	if (!sched_feat(UTIL_EST))
-		return;
-
-	/*
-	 * Skip update of task's estimated utilization when the task has not
-	 * yet completed an activation, e.g. being migrated.
-	 */
-	if (!task_sleep)
-		return;
-
-	/* Get current estimate of utilization */
-	ewma = READ_ONCE(p->se.avg.util_est);
-
-	/*
-	 * If the PELT values haven't changed since enqueue time,
-	 * skip the util_est update.
-	 */
-	if (ewma & UTIL_AVG_UNCHANGED)
-		return;
-
-	/* Get utilization at dequeue */
-	dequeued = task_util(p);
-
-	/*
-	 * Reset EWMA on utilization increases, the moving average is used only
-	 * to smooth utilization decreases.
-	 */
-	if (ewma <= dequeued) {
-		ewma = dequeued;
-		goto done;
-	}
-
-	/*
-	 * Skip update of task's estimated utilization when its members are
-	 * already ~1% close to its last activation value.
-	 */
-	last_ewma_diff = ewma - dequeued;
-	if (last_ewma_diff < UTIL_EST_MARGIN)
-		goto done;
-
-	/*
-	 * To avoid underestimate of task utilization, skip updates of EWMA if
-	 * we cannot grant that thread got all CPU time it wanted.
-	 */
-	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
-		goto done;
-
-
-	/*
-	 * Update Task's estimated utilization
-	 *
-	 * When *p completes an activation we can consolidate another sample
-	 * of the task size. This is done by using this value to update the
-	 * Exponential Weighted Moving Average (EWMA):
-	 *
-	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
-	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
-	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
-	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
-	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
-	 *
-	 * Where 'w' is the weight of new samples, which is configured to be
-	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
-	 */
-	ewma <<= UTIL_EST_WEIGHT_SHIFT;
-	ewma  -= last_ewma_diff;
-	ewma >>= UTIL_EST_WEIGHT_SHIFT;
-done:
-	ewma |= UTIL_AVG_UNCHANGED;
-	WRITE_ONCE(p->se.avg.util_est, ewma);
-
-	trace_sched_util_est_se_tp(&p->se);
-}
-
 static inline unsigned long get_actual_cpu_capacity(int cpu)
 {
 	unsigned long capacity = arch_scale_cpu_capacity(cpu);
@@ -5828,7 +5818,7 @@ static bool
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	bool sleep = flags & DEQUEUE_SLEEP;
-	int action = UPDATE_TG;
+	int action = 0;

 	update_curr(cfs_rq);
 	clear_buddies(cfs_rq, se);
@@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
-			update_load_avg(cfs_rq, se, 0);
+			if (entity_is_task(se))
+				action |= UPDATE_UTIL_EST;
+			update_load_avg(cfs_rq, se, action);
 			update_entity_lag(cfs_rq, se);
 			set_delayed(se);
 			return false;
 		}
 	}

-	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
-		action |= DO_DETACH;
+	action = UPDATE_TG;
+	if (entity_is_task(se)) {
+		if (task_on_rq_migrating(task_of(se)))
+			action |= DO_DETACH;
+
+		if (sleep && !(flags & DEQUEUE_DELAYED))
+			action |= UPDATE_UTIL_EST;
+	}

 	/*
 	 * When dequeuing a sched_entity, we must:
@@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);

-	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
 	if (dequeue_entities(rq, &p->se, flags) < 0)
 		return false;

--
2.43.0




> 
> Specifically, we shouldn't reference p after dequeue_entities() or we
> risk racing with it being woken up, running, and maybe exiting on
> another cpu.
> And this moves the util_est_update() call to after dequeue finishes.
> 
> Maybe there's some way to have util_est_update() compensate for the
> unfinished accounting that will be done in dequeue_entities()?
> 
> thanks
> -john


* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
  2026-05-13 12:46   ` Vincent Guittot
@ 2026-05-15  1:51     ` Qais Yousef
  2026-05-15 15:23       ` Vincent Guittot
  0 siblings, 1 reply; 6+ messages in thread
From: Qais Yousef @ 2026-05-15  1:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On 05/13/26 14:46, Vincent Guittot wrote:
> Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > util_est_update() reads task_util() at dequeue, but task_util() is only
> > > updated inside dequeue_entities(). To read an accurate util_avg at
> > > dequeue, make sure to do the read after load_avg has been updated in
> > > dequeue_entities().
> > >
> > > util_est for a periodic task before
> > >
> > >                                 periodic-3114 util_est.enqueued running
> > >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> > >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> > >    │              ▛                                                                                │
> > >  51┤             ▐▘                                                                                │
> > >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> > >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> > >
> > > and after
> > >
> > >                                  periodic-2977 util_est.enqueued running
> > >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> > >  82.0┤             ▟                                                                               │
> > >      │             ▌                                                                               │
> > >  44.5┤             ▌                                                                               │
> > >   7.0┤      ▗   ▗▖ ▌                                                                               │
> > >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> > >
> > > Note how the signal was noisier before, peaking at 183 vs 157 now.
> > >
> > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > ---
> > >
> > > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > > needs backporting all the way to 6.12 LTS.
> > >
> > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > >
> > >  kernel/sched/fair.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 728965851842..96ba97e5f4ae 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > >   */
> > >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  {
> > > +       int ret;
> > > +
> > >         if (task_is_throttled(p)) {
> > >                 dequeue_throttled_task(p, flags);
> > >                 return true;
> > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >         if (!p->se.sched_delayed)
> > >                 util_est_dequeue(&rq->cfs, p);
> > >
> > > +       ret = dequeue_entities(rq, &p->se, flags);
> > >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> > 
> > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > version of this:
> >   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> 
> Even without sashiko, this is what the comment says just below in the function ;-)

Yeah I read it but it didn't really register properly :-)

> 
> Below is a different way to fix it which also covers cases when we have both
> DEQUEUE_SLEEP and DEQUEUE_DELAYED

Works for me. It looks even more stable now:

util_avg

                                    periodic-2737 util_avg running
   ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
140┤               ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄              │
105┤              ▄██████████████████████████████████████████████████████████████████              │
 70┤             ▗▛▘                                                                               │
   │             ▐▘                                                                                │
 35┤             █                                                                                 │
  0┤       ▖   ▗▄▌                                                                                 │
   └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
  0.00      0.65       1.30      1.95       2.60      3.25       3.90      4.55       5.20     5.85

and util_est

                                 periodic-2737 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
140.0┤               ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘             │
106.8┤              ▝▘                                                                             │
 73.5┤             ▗▘                                                                              │
     │             ▗                                                                               │
 40.2┤             ▖                                                                               │
  7.0┤       ▖   ▄▗▖                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.55      5.20     5.85

Do you want to send a proper patch? Feel free to add my
Reviewed-and-tested-by.

Thanks!

> 
> ---
>  kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
>  1 file changed, 93 insertions(+), 96 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c83fbe4e88c1..0976adc12594 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	trace_pelt_cfs_tp(cfs_rq);
>  }
> 
> +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> +
> +static inline void util_est_update(struct sched_entity *se)
> +{
> +	unsigned int ewma, dequeued, last_ewma_diff;
> +
> +	if (!sched_feat(UTIL_EST))
> +		return;
> +
> +	/* Get current estimate of utilization */
> +	ewma = READ_ONCE(se->avg.util_est);
> +
> +	/*
> +	 * If the PELT values haven't changed since enqueue time,
> +	 * skip the util_est update.
> +	 */
> +	if (ewma & UTIL_AVG_UNCHANGED)
> +		return;
> +
> +	/* Get utilization at dequeue */
> +	dequeued = READ_ONCE(se->avg.util_avg);
> +
> +	/*
> +	 * Reset EWMA on utilization increases, the moving average is used only
> +	 * to smooth utilization decreases.
> +	 */
> +	if (ewma <= dequeued) {
> +		ewma = dequeued;
> +		goto done;
> +	}
> +
> +	/*
> +	 * Skip update of task's estimated utilization when its members are
> +	 * already ~1% close to its last activation value.
> +	 */
> +	last_ewma_diff = ewma - dequeued;
> +	if (last_ewma_diff < UTIL_EST_MARGIN)
> +		goto done;
> +
> +	/*
> +	 * To avoid underestimate of task utilization, skip updates of EWMA if
> +	 * we cannot grant that thread got all CPU time it wanted.
> +	 */
> +	if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> +		goto done;
> +
> +
> +	/*
> +	 * Update Task's estimated utilization
> +	 *
> +	 * When *p completes an activation we can consolidate another sample
> +	 * of the task size. This is done by using this value to update the
> +	 * Exponential Weighted Moving Average (EWMA):
> +	 *
> +	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> +	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> +	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> +	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> +	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> +	 *
> +	 * Where 'w' is the weight of new samples, which is configured to be
> +	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> +	 */
> +	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> +	ewma  -= last_ewma_diff;
> +	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> +done:
> +	ewma |= UTIL_AVG_UNCHANGED;
> +	WRITE_ONCE(se->avg.util_est, ewma);
> +
> +	trace_sched_util_est_se_tp(se);
> +}
> +
>  /*
>   * Optional action to be done while updating the load average
>   */
> -#define UPDATE_TG	0x1
> -#define SKIP_AGE_LOAD	0x2
> -#define DO_ATTACH	0x4
> -#define DO_DETACH	0x8
> +#define UPDATE_TG	0x01
> +#define SKIP_AGE_LOAD	0x02
> +#define DO_ATTACH	0x04
> +#define DO_DETACH	0x08
> +#define UPDATE_UTIL_EST	0x10
> 
>  /* Update task and its cfs_rq load average */
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  		if (flags & UPDATE_TG)
>  			update_tg_load_avg(cfs_rq);
>  	}
> +
> +	if (flags & UPDATE_UTIL_EST)
> +		util_est_update(se);
>  }
> 
>  /*
> @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
>  	return READ_ONCE(p->se.avg.util_avg);
>  }
> 
> -static inline unsigned long task_runnable(struct task_struct *p)
> -{
> -	return READ_ONCE(p->se.avg.runnable_avg);
> -}
> -
>  static inline unsigned long _task_util_est(struct task_struct *p)
>  {
>  	return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
>  	trace_sched_util_est_cfs_tp(cfs_rq);
>  }
> 
> -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> -
> -static inline void util_est_update(struct cfs_rq *cfs_rq,
> -				   struct task_struct *p,
> -				   bool task_sleep)
> -{
> -	unsigned int ewma, dequeued, last_ewma_diff;
> -
> -	if (!sched_feat(UTIL_EST))
> -		return;
> -
> -	/*
> -	 * Skip update of task's estimated utilization when the task has not
> -	 * yet completed an activation, e.g. being migrated.
> -	 */
> -	if (!task_sleep)
> -		return;
> -
> -	/* Get current estimate of utilization */
> -	ewma = READ_ONCE(p->se.avg.util_est);
> -
> -	/*
> -	 * If the PELT values haven't changed since enqueue time,
> -	 * skip the util_est update.
> -	 */
> -	if (ewma & UTIL_AVG_UNCHANGED)
> -		return;
> -
> -	/* Get utilization at dequeue */
> -	dequeued = task_util(p);
> -
> -	/*
> -	 * Reset EWMA on utilization increases, the moving average is used only
> -	 * to smooth utilization decreases.
> -	 */
> -	if (ewma <= dequeued) {
> -		ewma = dequeued;
> -		goto done;
> -	}
> -
> -	/*
> -	 * Skip update of task's estimated utilization when its members are
> -	 * already ~1% close to its last activation value.
> -	 */
> -	last_ewma_diff = ewma - dequeued;
> -	if (last_ewma_diff < UTIL_EST_MARGIN)
> -		goto done;
> -
> -	/*
> -	 * To avoid underestimate of task utilization, skip updates of EWMA if
> -	 * we cannot grant that thread got all CPU time it wanted.
> -	 */
> -	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> -		goto done;
> -
> -
> -	/*
> -	 * Update Task's estimated utilization
> -	 *
> -	 * When *p completes an activation we can consolidate another sample
> -	 * of the task size. This is done by using this value to update the
> -	 * Exponential Weighted Moving Average (EWMA):
> -	 *
> -	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> -	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> -	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> -	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> -	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> -	 *
> -	 * Where 'w' is the weight of new samples, which is configured to be
> -	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> -	 */
> -	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> -	ewma  -= last_ewma_diff;
> -	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> -done:
> -	ewma |= UTIL_AVG_UNCHANGED;
> -	WRITE_ONCE(p->se.avg.util_est, ewma);
> -
> -	trace_sched_util_est_se_tp(&p->se);
> -}
> -
>  static inline unsigned long get_actual_cpu_capacity(int cpu)
>  {
>  	unsigned long capacity = arch_scale_cpu_capacity(cpu);
> @@ -5828,7 +5818,7 @@ static bool
>  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	bool sleep = flags & DEQUEUE_SLEEP;
> -	int action = UPDATE_TG;
> +	int action = 0;
> 
>  	update_curr(cfs_rq);
>  	clear_buddies(cfs_rq, se);
> @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 
>  		if (sched_feat(DELAY_DEQUEUE) && delay &&
>  		    !entity_eligible(cfs_rq, se)) {
> -			update_load_avg(cfs_rq, se, 0);
> +			if (entity_is_task(se))
> +				action |= UPDATE_UTIL_EST;
> +			update_load_avg(cfs_rq, se, action);
>  			update_entity_lag(cfs_rq, se);
>  			set_delayed(se);
>  			return false;
>  		}
>  	}
> 
> -	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> -		action |= DO_DETACH;
> +	action = UPDATE_TG;
> +	if (entity_is_task(se)) {
> +		if (task_on_rq_migrating(task_of(se)))
> +			action |= DO_DETACH;
> +
> +		if (sleep && !(flags & DEQUEUE_DELAYED))
> +			action |= UPDATE_UTIL_EST;
> +	}
> 
>  	/*
>  	 * When dequeuing a sched_entity, we must:
> @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	if (!p->se.sched_delayed)
>  		util_est_dequeue(&rq->cfs, p);
> 
> -	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
>  	if (dequeue_entities(rq, &p->se, flags) < 0)
>  		return false;
> 
> --
> 2.43.0
> 
> 
> 
> 
> > 
> > Specifically, we shouldn't reference p after dequeue_entities() or we
> > risk racing with it being woken up, running, and maybe exiting on
> > another cpu.
> > And this moves the util_est_update() call to after dequeue finishes.
> > 
> > Maybe there's some way to have util_est_update() compensate for the
> > unfinished accounting that will be done in dequeue_entities()?
> > 
> > thanks
> > -john

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
  2026-05-15  1:51     ` Qais Yousef
@ 2026-05-15 15:23       ` Vincent Guittot
  0 siblings, 0 replies; 6+ messages in thread
From: Vincent Guittot @ 2026-05-15 15:23 UTC (permalink / raw)
  To: Qais Yousef
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On Fri, 15 May 2026 at 03:51, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 05/13/26 14:46, Vincent Guittot wrote:
> > Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > > >
> > > > update_util_est() reads task_util() at dequeue which is updated in
> > > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> > > > to do the read after load_avg is updated in dequeue_entities().
> > > >
> > > > util_est for a periodic task before
> > > >
> > > >                                 periodic-3114 util_est.enqueued running
> > > >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > > > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> > > >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> > > >    │              ▛                                                                                │
> > > >  51┤             ▐▘                                                                                │
> > > >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> > > >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > > >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> > > >
> > > > and after
> > > >
> > > >                                  periodic-2977 util_est.enqueued running
> > > >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > > > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> > > >  82.0┤             ▟                                                                               │
> > > >      │             ▌                                                                               │
> > > >  44.5┤             ▌                                                                               │
> > > >   7.0┤      ▗   ▗▖ ▌                                                                               │
> > > >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > > >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> > > >
> > > > Note how the signal is noisier and can peak to 183 vs 157 now.
> > > >
> > > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > > ---
> > > >
> > > > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > > > needs backporting all the way to 6.12 LTS.
> > > >
> > > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > > >
> > > >  kernel/sched/fair.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 728965851842..96ba97e5f4ae 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > > >   */
> > > >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  {
> > > > +       int ret;
> > > > +
> > > >         if (task_is_throttled(p)) {
> > > >                 dequeue_throttled_task(p, flags);
> > > >                 return true;
> > > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >         if (!p->se.sched_delayed)
> > > >                 util_est_dequeue(&rq->cfs, p);
> > > >
> > > > +       ret = dequeue_entities(rq, &p->se, flags);
> > > >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> > >
> > > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > > version of this:
> > >   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> >
> > Even without sashiko, this is what the comment says just below in the function ;-)
>
> Yeah I read it but it didn't really register properly :-)
>
> >
> > Below is a different way to fix it which also covers cases when we have both
> > DEQUEUE_SLEEP and DEQUEUE_DELAYED
>
> Works for me. It looks even more stable now
>
> util_avg
>
>                                     periodic-2737 util_avg running
>    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 140┤               ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄              │
> 105┤              ▄██████████████████████████████████████████████████████████████████              │
>  70┤             ▗▛▘                                                                               │
>    │             ▐▘                                                                                │
>  35┤             █                                                                                 │
>   0┤       ▖   ▗▄▌                                                                                 │
>    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
>   0.00      0.65       1.30      1.95       2.60      3.25       3.90      4.55       5.20     5.85
>
> and util_est
>
>                                  periodic-2737 util_est.enqueued running
>      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 140.0┤               ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘             │
> 106.8┤              ▝▘                                                                             │
>  73.5┤             ▗▘                                                                              │
>      │             ▗                                                                               │
>  40.2┤             ▖                                                                               │
>   7.0┤       ▖   ▄▗▖                                                                               │
>      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
>     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.55      5.20     5.85
>
> Do you want to send a proper patch? Feel free to stick my
> reviewed-and-tested-by.

Yes, I'm going to send a proper patch with your Reported-by, Reviewed-by and Tested-by tags.

Thanks

>
> Thanks!
>
> >
> > ---
> >  kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
> >  1 file changed, 93 insertions(+), 96 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c83fbe4e88c1..0976adc12594 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >       trace_pelt_cfs_tp(cfs_rq);
> >  }
> >
> > +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > +
> > +static inline void util_est_update(struct sched_entity *se)
> > +{
> > +     unsigned int ewma, dequeued, last_ewma_diff;
> > +
> > +     if (!sched_feat(UTIL_EST))
> > +             return;
> > +
> > +     /* Get current estimate of utilization */
> > +     ewma = READ_ONCE(se->avg.util_est);
> > +
> > +     /*
> > +      * If the PELT values haven't changed since enqueue time,
> > +      * skip the util_est update.
> > +      */
> > +     if (ewma & UTIL_AVG_UNCHANGED)
> > +             return;
> > +
> > +     /* Get utilization at dequeue */
> > +     dequeued = READ_ONCE(se->avg.util_avg);
> > +
> > +     /*
> > +      * Reset EWMA on utilization increases, the moving average is used only
> > +      * to smooth utilization decreases.
> > +      */
> > +     if (ewma <= dequeued) {
> > +             ewma = dequeued;
> > +             goto done;
> > +     }
> > +
> > +     /*
> > +      * Skip update of task's estimated utilization when its members are
> > +      * already ~1% close to its last activation value.
> > +      */
> > +     last_ewma_diff = ewma - dequeued;
> > +     if (last_ewma_diff < UTIL_EST_MARGIN)
> > +             goto done;
> > +
> > +     /*
> > +      * To avoid underestimate of task utilization, skip updates of EWMA if
> > +      * we cannot grant that thread got all CPU time it wanted.
> > +      */
> > +     if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> > +             goto done;
> > +
> > +
> > +     /*
> > +      * Update Task's estimated utilization
> > +      *
> > +      * When *p completes an activation we can consolidate another sample
> > +      * of the task size. This is done by using this value to update the
> > +      * Exponential Weighted Moving Average (EWMA):
> > +      *
> > +      *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> > +      *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> > +      *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> > +      *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> > +      *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> > +      *
> > +      * Where 'w' is the weight of new samples, which is configured to be
> > +      * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > +      */
> > +     ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > +     ewma  -= last_ewma_diff;
> > +     ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > +done:
> > +     ewma |= UTIL_AVG_UNCHANGED;
> > +     WRITE_ONCE(se->avg.util_est, ewma);
> > +
> > +     trace_sched_util_est_se_tp(se);
> > +}
> > +
> >  /*
> >   * Optional action to be done while updating the load average
> >   */
> > -#define UPDATE_TG    0x1
> > -#define SKIP_AGE_LOAD        0x2
> > -#define DO_ATTACH    0x4
> > -#define DO_DETACH    0x8
> > +#define UPDATE_TG    0x01
> > +#define SKIP_AGE_LOAD        0x02
> > +#define DO_ATTACH    0x04
> > +#define DO_DETACH    0x08
> > +#define UPDATE_UTIL_EST      0x10
> >
> >  /* Update task and its cfs_rq load average */
> >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >               if (flags & UPDATE_TG)
> >                       update_tg_load_avg(cfs_rq);
> >       }
> > +
> > +     if (flags & UPDATE_UTIL_EST)
> > +             util_est_update(se);
> >  }
> >
> >  /*
> > @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
> >       return READ_ONCE(p->se.avg.util_avg);
> >  }
> >
> > -static inline unsigned long task_runnable(struct task_struct *p)
> > -{
> > -     return READ_ONCE(p->se.avg.runnable_avg);
> > -}
> > -
> >  static inline unsigned long _task_util_est(struct task_struct *p)
> >  {
> >       return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> > @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> >       trace_sched_util_est_cfs_tp(cfs_rq);
> >  }
> >
> > -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > -
> > -static inline void util_est_update(struct cfs_rq *cfs_rq,
> > -                                struct task_struct *p,
> > -                                bool task_sleep)
> > -{
> > -     unsigned int ewma, dequeued, last_ewma_diff;
> > -
> > -     if (!sched_feat(UTIL_EST))
> > -             return;
> > -
> > -     /*
> > -      * Skip update of task's estimated utilization when the task has not
> > -      * yet completed an activation, e.g. being migrated.
> > -      */
> > -     if (!task_sleep)
> > -             return;
> > -
> > -     /* Get current estimate of utilization */
> > -     ewma = READ_ONCE(p->se.avg.util_est);
> > -
> > -     /*
> > -      * If the PELT values haven't changed since enqueue time,
> > -      * skip the util_est update.
> > -      */
> > -     if (ewma & UTIL_AVG_UNCHANGED)
> > -             return;
> > -
> > -     /* Get utilization at dequeue */
> > -     dequeued = task_util(p);
> > -
> > -     /*
> > -      * Reset EWMA on utilization increases, the moving average is used only
> > -      * to smooth utilization decreases.
> > -      */
> > -     if (ewma <= dequeued) {
> > -             ewma = dequeued;
> > -             goto done;
> > -     }
> > -
> > -     /*
> > -      * Skip update of task's estimated utilization when its members are
> > -      * already ~1% close to its last activation value.
> > -      */
> > -     last_ewma_diff = ewma - dequeued;
> > -     if (last_ewma_diff < UTIL_EST_MARGIN)
> > -             goto done;
> > -
> > -     /*
> > -      * To avoid underestimate of task utilization, skip updates of EWMA if
> > -      * we cannot grant that thread got all CPU time it wanted.
> > -      */
> > -     if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> > -             goto done;
> > -
> > -
> > -     /*
> > -      * Update Task's estimated utilization
> > -      *
> > -      * When *p completes an activation we can consolidate another sample
> > -      * of the task size. This is done by using this value to update the
> > -      * Exponential Weighted Moving Average (EWMA):
> > -      *
> > -      *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> > -      *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> > -      *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> > -      *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> > -      *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> > -      *
> > -      * Where 'w' is the weight of new samples, which is configured to be
> > -      * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > -      */
> > -     ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > -     ewma  -= last_ewma_diff;
> > -     ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > -done:
> > -     ewma |= UTIL_AVG_UNCHANGED;
> > -     WRITE_ONCE(p->se.avg.util_est, ewma);
> > -
> > -     trace_sched_util_est_se_tp(&p->se);
> > -}
> > -
> >  static inline unsigned long get_actual_cpu_capacity(int cpu)
> >  {
> >       unsigned long capacity = arch_scale_cpu_capacity(cpu);
> > @@ -5828,7 +5818,7 @@ static bool
> >  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> >       bool sleep = flags & DEQUEUE_SLEEP;
> > -     int action = UPDATE_TG;
> > +     int action = 0;
> >
> >       update_curr(cfs_rq);
> >       clear_buddies(cfs_rq, se);
> > @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >
> >               if (sched_feat(DELAY_DEQUEUE) && delay &&
> >                   !entity_eligible(cfs_rq, se)) {
> > -                     update_load_avg(cfs_rq, se, 0);
> > +                     if (entity_is_task(se))
> > +                             action |= UPDATE_UTIL_EST;
> > +                     update_load_avg(cfs_rq, se, action);
> >                       update_entity_lag(cfs_rq, se);
> >                       set_delayed(se);
> >                       return false;
> >               }
> >       }
> >
> > -     if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> > -             action |= DO_DETACH;
> > +     action = UPDATE_TG;
> > +     if (entity_is_task(se)) {
> > +             if (task_on_rq_migrating(task_of(se)))
> > +                     action |= DO_DETACH;
> > +
> > +             if (sleep && !(flags & DEQUEUE_DELAYED))
> > +                     action |= UPDATE_UTIL_EST;
> > +     }
> >
> >       /*
> >        * When dequeuing a sched_entity, we must:
> > @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >       if (!p->se.sched_delayed)
> >               util_est_dequeue(&rq->cfs, p);
> >
> > -     util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> >       if (dequeue_entities(rq, &p->se, flags) < 0)
> >               return false;
> >
> > --
> > 2.43.0
> >
> >
> >
> >
> > >
> > > Specifically, we shouldn't reference p after dequeue_entities() or we
> > > risk racing with it being woken up, running, and maybe exiting on
> > > another cpu.
> > > And this moves the util_est_update() call to after dequeue finishes.
> > >
> > > Maybe there's some way to have util_est_update() compensate for the
> > > unfinished accounting that will be done in dequeue_entities()?
> > >
> > > thanks
> > > -john

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
  2026-05-12 12:46 [PATCH] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef
  2026-05-12 20:52 ` John Stultz
@ 2026-05-15 18:35 ` Tim Chen
  1 sibling, 0 replies; 6+ messages in thread
From: Tim Chen @ 2026-05-15 18:35 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
	John Stultz, Dietmar Eggemann, Chen, Yu C, Thomas Gleixner,
	linux-kernel, linux-pm

On Tue, 2026-05-12 at 13:46 +0100, Qais Yousef wrote:
> update_util_est() reads task_util() at dequeue which is updated in
> dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> to do the read after load_avg is updated in dequeue_entities().
> 
> util_est for a periodic task before
> 
>                                 periodic-3114 util_est.enqueued running
>    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
>  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
>    │              ▛                                                                                │
>  51┤             ▐▘                                                                                │
>   7┤      ▖▗▗  ▗▄▐                                                                                 │
>    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
>   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> 
> and after
> 
>                                  periodic-2977 util_est.enqueued running
>      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
>  82.0┤             ▟                                                                               │
>      │             ▌                                                                               │
>  44.5┤             ▌                                                                               │
>   7.0┤      ▗   ▗▖ ▌                                                                               │
>      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
>     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> 
> Note how the signal is noisier and can peak to 183 vs 157 now.
> 
> Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> 
> This is split from [1] series where I stumbled upon this problem. AFAICS it
> needs backporting all the way to 6.12 LTS.
> 
> [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> 
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..96ba97e5f4ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>   */
>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	int ret;
> +
>  	if (task_is_throttled(p)) {
>  		dequeue_throttled_task(p, flags);
>  		return true;
> @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	if (!p->se.sched_delayed)
>  		util_est_dequeue(&rq->cfs, p);
>  
> +	ret = dequeue_entities(rq, &p->se, flags);
>  	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);

I thought that util_est_update() was intentionally called before
dequeue_entities(), to bring the utilization of task p up to date
right before the dequeue. dequeue_entities() is then called with
an up-to-date utilization estimate for p.

Perhaps util_est_update() should be moved before
util_est_dequeue() so the updated utilization of p
is subtracted from the rq utilization.

@@ -8002,10 +8002,10 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                return true;
        }
 
+       util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
        if (!p->se.sched_delayed)
                util_est_dequeue(&rq->cfs, p);
 
-       util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
        if (dequeue_entities(rq, &p->se, flags) < 0)
                return false;


Tim

> -	if (dequeue_entities(rq, &p->se, flags) < 0)
> +	if (ret < 0)
>  		return false;
>  
>  	/*

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-05-15 18:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-12 12:46 [PATCH] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef
2026-05-12 20:52 ` John Stultz
2026-05-13 12:46   ` Vincent Guittot
2026-05-15  1:51     ` Qais Yousef
2026-05-15 15:23       ` Vincent Guittot
2026-05-15 18:35 ` Tim Chen
