* [PATCH] sched/fair: Call util_est_update() after dequeue_entities()
From: Qais Yousef @ 2026-05-12 12:46 UTC
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot
Cc: Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
John Stultz, Dietmar Eggemann, Tim Chen, Chen, Yu C,
Thomas Gleixner, linux-kernel, linux-pm, Qais Yousef
util_est_update() reads task_util() at dequeue, which is only updated
in dequeue_entities(). To read an accurate util_avg at dequeue, make
sure to do the read after load_avg has been updated in
dequeue_entities().
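To make the ordering concrete, here is a rough sketch of the current
dequeue path (illustrative only, heavily simplified from the real
code):

    dequeue_task_fair(rq, p, flags)
        util_est_dequeue(&rq->cfs, p);
        util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
            /* reads p->se.avg.util_avg, which is still stale here */
        dequeue_entities(rq, &p->se, flags);
            /* update_load_avg() only refreshes util_avg here */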
util_est for a periodic task before
periodic-3114 util_est.enqueued running
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
│ ▛ │
51┤ ▐▘ │
7┤ ▖▗▗ ▗▄▐ │
└┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
and after
periodic-2977 util_est.enqueued running
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
82.0┤ ▟ │
│ ▌ │
44.5┤ ▌ │
7.0┤ ▗ ▗▖ ▌ │
└┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
Note how the signal is noisier before the fix, peaking at 183, vs 157 after.
Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
This is split from the series at [1], where I stumbled upon this problem.
AFAICS it needs backporting all the way to 6.12 LTS.
[1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..96ba97e5f4ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
+ int ret;
+
if (task_is_throttled(p)) {
dequeue_throttled_task(p, flags);
return true;
@@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
+ ret = dequeue_entities(rq, &p->se, flags);
util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
- if (dequeue_entities(rq, &p->se, flags) < 0)
+ if (ret < 0)
return false;
/*
--
2.34.1
* Re: [PATCH] sched/fair: Call util_est_update() after dequeue_entities()
From: John Stultz @ 2026-05-12 20:52 UTC
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm
On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> util_est_update() reads task_util() at dequeue, which is only updated
> in dequeue_entities(). To read an accurate util_avg at dequeue, make
> sure to do the read after load_avg has been updated in
> dequeue_entities().
>
> util_est for a periodic task before
>
> periodic-3114 util_est.enqueued running
> ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
> 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
> 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
> │ ▛ │
> 51┤ ▐▘ │
> 7┤ ▖▗▗ ▗▄▐ │
> └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
>
> and after
>
> periodic-2977 util_est.enqueued running
> ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
> 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
> 82.0┤ ▟ │
> │ ▌ │
> 44.5┤ ▌ │
> 7.0┤ ▗ ▗▖ ▌ │
> └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
>
> Note how the signal is noisier before the fix, peaking at 183, vs 157 after.
>
> Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>
> This is split from the series at [1], where I stumbled upon this problem.
> AFAICS it needs backporting all the way to 6.12 LTS.
>
> [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
>
> kernel/sched/fair.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..96ba97e5f4ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> */
> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> + int ret;
> +
> if (task_is_throttled(p)) {
> dequeue_throttled_task(p, flags);
> return true;
> @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!p->se.sched_delayed)
> util_est_dequeue(&rq->cfs, p);
>
> + ret = dequeue_entities(rq, &p->se, flags);
> util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> - if (dequeue_entities(rq, &p->se, flags) < 0)
Hrm, so the Sashiko tool raised a reasonable concern on the earlier
version of this:
https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
Specifically, we shouldn't reference p after dequeue_entities() or we
risk racing with it being woken up, running, and maybe exiting on
another cpu.
And this moves the util_est_update() call to after dequeue finishes.
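Roughly the interleaving this opens up (a hypothetical sketch, not an
observed trace):

    CPU0 (dequeue path)                 CPU1
    dequeue_entities(rq, &p->se, ...)
      /* p fully dequeued */
                                        try_to_wake_up(p)
                                        p runs, exits, and its
                                        task_struct may be freed
    util_est_update(&rq->cfs, p, ...)
      /* dereferences a stale p */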
Maybe there's some way to have util_est_update() compensate for the
unfinished accounting that will be done in dequeue_entities()?
thanks
-john
* Re: [PATCH] sched/fair: Call util_est_update() after dequeue_entities()
From: Vincent Guittot @ 2026-05-13 12:46 UTC
To: John Stultz
Cc: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm
Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > util_est_update() reads task_util() at dequeue, which is only updated
> > in dequeue_entities(). To read an accurate util_avg at dequeue, make
> > sure to do the read after load_avg has been updated in
> > dequeue_entities().
> >
> > util_est for a periodic task before
> >
> > periodic-3114 util_est.enqueued running
> > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
> > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
> > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
> > │ ▛ │
> > 51┤ ▐▘ │
> > 7┤ ▖▗▗ ▗▄▐ │
> > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
> >
> > and after
> >
> > periodic-2977 util_est.enqueued running
> > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
> > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
> > 82.0┤ ▟ │
> > │ ▌ │
> > 44.5┤ ▌ │
> > 7.0┤ ▗ ▗▖ ▌ │
> > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
> >
> > Note how the signal is noisier before the fix, peaking at 183, vs 157 after.
> >
> > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> >
> > This is split from the series at [1], where I stumbled upon this problem.
> > AFAICS it needs backporting all the way to 6.12 LTS.
> >
> > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> >
> > kernel/sched/fair.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 728965851842..96ba97e5f4ae 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > */
> > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > {
> > + int ret;
> > +
> > if (task_is_throttled(p)) {
> > dequeue_throttled_task(p, flags);
> > return true;
> > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!p->se.sched_delayed)
> > util_est_dequeue(&rq->cfs, p);
> >
> > + ret = dequeue_entities(rq, &p->se, flags);
> > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > - if (dequeue_entities(rq, &p->se, flags) < 0)
>
> Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> version of this:
> https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
Even without sashiko, this is what the comment says just below in the function ;-)
Below is a different way to fix it which also covers cases when we have both
DEQUEUE_SLEEP and DEQUEUE_DELAYED
---
kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
1 file changed, 93 insertions(+), 96 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c83fbe4e88c1..0976adc12594 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
trace_pelt_cfs_tp(cfs_rq);
}
+#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
+
+static inline void util_est_update(struct sched_entity *se)
+{
+ unsigned int ewma, dequeued, last_ewma_diff;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Get current estimate of utilization */
+ ewma = READ_ONCE(se->avg.util_est);
+
+ /*
+ * If the PELT values haven't changed since enqueue time,
+ * skip the util_est update.
+ */
+ if (ewma & UTIL_AVG_UNCHANGED)
+ return;
+
+ /* Get utilization at dequeue */
+ dequeued = READ_ONCE(se->avg.util_avg);
+
+ /*
+ * Reset EWMA on utilization increases, the moving average is used only
+ * to smooth utilization decreases.
+ */
+ if (ewma <= dequeued) {
+ ewma = dequeued;
+ goto done;
+ }
+
+ /*
+ * Skip update of task's estimated utilization when its members are
+ * already ~1% close to its last activation value.
+ */
+ last_ewma_diff = ewma - dequeued;
+ if (last_ewma_diff < UTIL_EST_MARGIN)
+ goto done;
+
+ /*
+ * To avoid underestimate of task utilization, skip updates of EWMA if
+ * we cannot grant that thread got all CPU time it wanted.
+ */
+ if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
+ goto done;
+
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * of the task size. This is done by using this value to update the
+ * Exponential Weighted Moving Average (EWMA):
+ *
+ * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
+ * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+ * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
+ * = w * ( -last_ewma_diff ) + ewma(t-1)
+ * = w * (-last_ewma_diff + ewma(t-1) / w)
+ *
+ * Where 'w' is the weight of new samples, which is configured to be
+ * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
+ */
+ ewma <<= UTIL_EST_WEIGHT_SHIFT;
+ ewma -= last_ewma_diff;
+ ewma >>= UTIL_EST_WEIGHT_SHIFT;
+done:
+ ewma |= UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(se->avg.util_est, ewma);
+
+ trace_sched_util_est_se_tp(se);
+}
+
/*
* Optional action to be done while updating the load average
*/
-#define UPDATE_TG 0x1
-#define SKIP_AGE_LOAD 0x2
-#define DO_ATTACH 0x4
-#define DO_DETACH 0x8
+#define UPDATE_TG 0x01
+#define SKIP_AGE_LOAD 0x02
+#define DO_ATTACH 0x04
+#define DO_DETACH 0x08
+#define UPDATE_UTIL_EST 0x10
/* Update task and its cfs_rq load average */
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
if (flags & UPDATE_TG)
update_tg_load_avg(cfs_rq);
}
+
+ if (flags & UPDATE_UTIL_EST)
+ util_est_update(se);
}
/*
@@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
return READ_ONCE(p->se.avg.util_avg);
}
-static inline unsigned long task_runnable(struct task_struct *p)
-{
- return READ_ONCE(p->se.avg.runnable_avg);
-}
-
static inline unsigned long _task_util_est(struct task_struct *p)
{
return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
@@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
trace_sched_util_est_cfs_tp(cfs_rq);
}
-#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
-
-static inline void util_est_update(struct cfs_rq *cfs_rq,
- struct task_struct *p,
- bool task_sleep)
-{
- unsigned int ewma, dequeued, last_ewma_diff;
-
- if (!sched_feat(UTIL_EST))
- return;
-
- /*
- * Skip update of task's estimated utilization when the task has not
- * yet completed an activation, e.g. being migrated.
- */
- if (!task_sleep)
- return;
-
- /* Get current estimate of utilization */
- ewma = READ_ONCE(p->se.avg.util_est);
-
- /*
- * If the PELT values haven't changed since enqueue time,
- * skip the util_est update.
- */
- if (ewma & UTIL_AVG_UNCHANGED)
- return;
-
- /* Get utilization at dequeue */
- dequeued = task_util(p);
-
- /*
- * Reset EWMA on utilization increases, the moving average is used only
- * to smooth utilization decreases.
- */
- if (ewma <= dequeued) {
- ewma = dequeued;
- goto done;
- }
-
- /*
- * Skip update of task's estimated utilization when its members are
- * already ~1% close to its last activation value.
- */
- last_ewma_diff = ewma - dequeued;
- if (last_ewma_diff < UTIL_EST_MARGIN)
- goto done;
-
- /*
- * To avoid underestimate of task utilization, skip updates of EWMA if
- * we cannot grant that thread got all CPU time it wanted.
- */
- if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
- goto done;
-
-
- /*
- * Update Task's estimated utilization
- *
- * When *p completes an activation we can consolidate another sample
- * of the task size. This is done by using this value to update the
- * Exponential Weighted Moving Average (EWMA):
- *
- * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
- * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
- * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
- * = w * ( -last_ewma_diff ) + ewma(t-1)
- * = w * (-last_ewma_diff + ewma(t-1) / w)
- *
- * Where 'w' is the weight of new samples, which is configured to be
- * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
- */
- ewma <<= UTIL_EST_WEIGHT_SHIFT;
- ewma -= last_ewma_diff;
- ewma >>= UTIL_EST_WEIGHT_SHIFT;
-done:
- ewma |= UTIL_AVG_UNCHANGED;
- WRITE_ONCE(p->se.avg.util_est, ewma);
-
- trace_sched_util_est_se_tp(&p->se);
-}
-
static inline unsigned long get_actual_cpu_capacity(int cpu)
{
unsigned long capacity = arch_scale_cpu_capacity(cpu);
@@ -5828,7 +5818,7 @@ static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool sleep = flags & DEQUEUE_SLEEP;
- int action = UPDATE_TG;
+ int action = 0;
update_curr(cfs_rq);
clear_buddies(cfs_rq, se);
@@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (sched_feat(DELAY_DEQUEUE) && delay &&
!entity_eligible(cfs_rq, se)) {
- update_load_avg(cfs_rq, se, 0);
+ if (entity_is_task(se))
+ action |= UPDATE_UTIL_EST;
+ update_load_avg(cfs_rq, se, action);
update_entity_lag(cfs_rq, se);
set_delayed(se);
return false;
}
}
- if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
- action |= DO_DETACH;
+ action = UPDATE_TG;
+ if (entity_is_task(se)) {
+ if (task_on_rq_migrating(task_of(se)))
+ action |= DO_DETACH;
+
+ if (sleep && !(flags & DEQUEUE_DELAYED))
+ action |= UPDATE_UTIL_EST;
+ }
/*
* When dequeuing a sched_entity, we must:
@@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
- util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
if (dequeue_entities(rq, &p->se, flags) < 0)
return false;
--
2.43.0
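For reference, a worked instance of the EWMA step above with made-up
numbers, using the stated w = 1/4 (i.e. a shift of 2):

    /* ewma(t-1) = 120, dequeued = 100 */
    last_ewma_diff = 120 - 100;     /* 20 */
    ewma <<= UTIL_EST_WEIGHT_SHIFT; /* 480 */
    ewma -= last_ewma_diff;         /* 460 */
    ewma >>= UTIL_EST_WEIGHT_SHIFT; /* 115 */

which matches ewma(t) = 0.25 * 100 + 0.75 * 120 = 115.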
>
> Specifically, we shouldn't reference p after dequeue_entities() or we
> risk racing with it being woken up, running, and maybe exiting on
> another cpu.
> And this moves the util_est_update() call to after dequeue finishes.
>
> Maybe there's some way to have util_est_update() compensate for the
> unfinished accounting that will be done in dequeue_entities()?
>
> thanks
> -john
* Re: [PATCH] sched/fair: Call util_est_update() after dequeue_entities()
From: Qais Yousef @ 2026-05-15 1:51 UTC
To: Vincent Guittot
Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm
On 05/13/26 14:46, Vincent Guittot wrote:
> Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > util_est_update() reads task_util() at dequeue, which is only
> > > updated in dequeue_entities(). To read an accurate util_avg at
> > > dequeue, make sure to do the read after load_avg has been updated
> > > in dequeue_entities().
> > >
> > > util_est for a periodic task before
> > >
> > > periodic-3114 util_est.enqueued running
> > > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
> > > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
> > > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
> > > │ ▛ │
> > > 51┤ ▐▘ │
> > > 7┤ ▖▗▗ ▗▄▐ │
> > > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
> > >
> > > and after
> > >
> > > periodic-2977 util_est.enqueued running
> > > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
> > > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
> > > 82.0┤ ▟ │
> > > │ ▌ │
> > > 44.5┤ ▌ │
> > > 7.0┤ ▗ ▗▖ ▌ │
> > > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
> > >
> > > Note how the signal is noisier before the fix, peaking at 183, vs 157 after.
> > >
> > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > ---
> > >
> > > This is split from the series at [1], where I stumbled upon this problem.
> > > AFAICS it needs backporting all the way to 6.12 LTS.
> > >
> > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > >
> > > kernel/sched/fair.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 728965851842..96ba97e5f4ae 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > > */
> > > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > {
> > > + int ret;
> > > +
> > > if (task_is_throttled(p)) {
> > > dequeue_throttled_task(p, flags);
> > > return true;
> > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > if (!p->se.sched_delayed)
> > > util_est_dequeue(&rq->cfs, p);
> > >
> > > + ret = dequeue_entities(rq, &p->se, flags);
> > > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > - if (dequeue_entities(rq, &p->se, flags) < 0)
> >
> > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > version of this:
> > https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
>
> Even without sashiko, this is what the comment says just below in the function ;-)
Yeah I read it but it didn't really register properly :-)
>
> Below is a different way to fix it which also covers cases when we have both
> DEQUEUE_SLEEP and DEQUEUE_DELAYED
Works for me. It looks even more stable now:
util_avg
periodic-2737 util_avg running
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
140┤ ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄ │
105┤ ▄██████████████████████████████████████████████████████████████████ │
70┤ ▗▛▘ │
│ ▐▘ │
35┤ █ │
0┤ ▖ ▗▄▌ │
└┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85
and util_est
periodic-2737 util_est.enqueued running
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
140.0┤ ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘ │
106.8┤ ▝▘ │
73.5┤ ▗▘ │
│ ▗ │
40.2┤ ▖ │
7.0┤ ▖ ▄▗▖ │
└┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85
Do you want to send a proper patch? Feel free to stick my
reviewed-and-tested-by.
Thanks!
>
> ---
> kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
> 1 file changed, 93 insertions(+), 96 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c83fbe4e88c1..0976adc12594 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> trace_pelt_cfs_tp(cfs_rq);
> }
>
> +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> +
> +static inline void util_est_update(struct sched_entity *se)
> +{
> + unsigned int ewma, dequeued, last_ewma_diff;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /* Get current estimate of utilization */
> + ewma = READ_ONCE(se->avg.util_est);
> +
> + /*
> + * If the PELT values haven't changed since enqueue time,
> + * skip the util_est update.
> + */
> + if (ewma & UTIL_AVG_UNCHANGED)
> + return;
> +
> + /* Get utilization at dequeue */
> + dequeued = READ_ONCE(se->avg.util_avg);
> +
> + /*
> + * Reset EWMA on utilization increases, the moving average is used only
> + * to smooth utilization decreases.
> + */
> + if (ewma <= dequeued) {
> + ewma = dequeued;
> + goto done;
> + }
> +
> + /*
> + * Skip update of task's estimated utilization when its members are
> + * already ~1% close to its last activation value.
> + */
> + last_ewma_diff = ewma - dequeued;
> + if (last_ewma_diff < UTIL_EST_MARGIN)
> + goto done;
> +
> + /*
> + * To avoid underestimate of task utilization, skip updates of EWMA if
> + * we cannot grant that thread got all CPU time it wanted.
> + */
> + if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> + goto done;
> +
> +
> + /*
> + * Update Task's estimated utilization
> + *
> + * When *p completes an activation we can consolidate another sample
> + * of the task size. This is done by using this value to update the
> + * Exponential Weighted Moving Average (EWMA):
> + *
> + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> + * = w * ( -last_ewma_diff ) + ewma(t-1)
> + * = w * (-last_ewma_diff + ewma(t-1) / w)
> + *
> + * Where 'w' is the weight of new samples, which is configured to be
> + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> + */
> + ewma <<= UTIL_EST_WEIGHT_SHIFT;
> + ewma -= last_ewma_diff;
> + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> +done:
> + ewma |= UTIL_AVG_UNCHANGED;
> + WRITE_ONCE(se->avg.util_est, ewma);
> +
> + trace_sched_util_est_se_tp(se);
> +}
> +
> /*
> * Optional action to be done while updating the load average
> */
> -#define UPDATE_TG 0x1
> -#define SKIP_AGE_LOAD 0x2
> -#define DO_ATTACH 0x4
> -#define DO_DETACH 0x8
> +#define UPDATE_TG 0x01
> +#define SKIP_AGE_LOAD 0x02
> +#define DO_ATTACH 0x04
> +#define DO_DETACH 0x08
> +#define UPDATE_UTIL_EST 0x10
>
> /* Update task and its cfs_rq load average */
> static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> if (flags & UPDATE_TG)
> update_tg_load_avg(cfs_rq);
> }
> +
> + if (flags & UPDATE_UTIL_EST)
> + util_est_update(se);
> }
>
> /*
> @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
> return READ_ONCE(p->se.avg.util_avg);
> }
>
> -static inline unsigned long task_runnable(struct task_struct *p)
> -{
> - return READ_ONCE(p->se.avg.runnable_avg);
> -}
> -
> static inline unsigned long _task_util_est(struct task_struct *p)
> {
> return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> trace_sched_util_est_cfs_tp(cfs_rq);
> }
>
> -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> -
> -static inline void util_est_update(struct cfs_rq *cfs_rq,
> - struct task_struct *p,
> - bool task_sleep)
> -{
> - unsigned int ewma, dequeued, last_ewma_diff;
> -
> - if (!sched_feat(UTIL_EST))
> - return;
> -
> - /*
> - * Skip update of task's estimated utilization when the task has not
> - * yet completed an activation, e.g. being migrated.
> - */
> - if (!task_sleep)
> - return;
> -
> - /* Get current estimate of utilization */
> - ewma = READ_ONCE(p->se.avg.util_est);
> -
> - /*
> - * If the PELT values haven't changed since enqueue time,
> - * skip the util_est update.
> - */
> - if (ewma & UTIL_AVG_UNCHANGED)
> - return;
> -
> - /* Get utilization at dequeue */
> - dequeued = task_util(p);
> -
> - /*
> - * Reset EWMA on utilization increases, the moving average is used only
> - * to smooth utilization decreases.
> - */
> - if (ewma <= dequeued) {
> - ewma = dequeued;
> - goto done;
> - }
> -
> - /*
> - * Skip update of task's estimated utilization when its members are
> - * already ~1% close to its last activation value.
> - */
> - last_ewma_diff = ewma - dequeued;
> - if (last_ewma_diff < UTIL_EST_MARGIN)
> - goto done;
> -
> - /*
> - * To avoid underestimate of task utilization, skip updates of EWMA if
> - * we cannot grant that thread got all CPU time it wanted.
> - */
> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> - goto done;
> -
> -
> - /*
> - * Update Task's estimated utilization
> - *
> - * When *p completes an activation we can consolidate another sample
> - * of the task size. This is done by using this value to update the
> - * Exponential Weighted Moving Average (EWMA):
> - *
> - * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> - * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> - * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> - * = w * ( -last_ewma_diff ) + ewma(t-1)
> - * = w * (-last_ewma_diff + ewma(t-1) / w)
> - *
> - * Where 'w' is the weight of new samples, which is configured to be
> - * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> - */
> - ewma <<= UTIL_EST_WEIGHT_SHIFT;
> - ewma -= last_ewma_diff;
> - ewma >>= UTIL_EST_WEIGHT_SHIFT;
> -done:
> - ewma |= UTIL_AVG_UNCHANGED;
> - WRITE_ONCE(p->se.avg.util_est, ewma);
> -
> - trace_sched_util_est_se_tp(&p->se);
> -}
> -
> static inline unsigned long get_actual_cpu_capacity(int cpu)
> {
> unsigned long capacity = arch_scale_cpu_capacity(cpu);
> @@ -5828,7 +5818,7 @@ static bool
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> bool sleep = flags & DEQUEUE_SLEEP;
> - int action = UPDATE_TG;
> + int action = 0;
>
> update_curr(cfs_rq);
> clear_buddies(cfs_rq, se);
> @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> if (sched_feat(DELAY_DEQUEUE) && delay &&
> !entity_eligible(cfs_rq, se)) {
> - update_load_avg(cfs_rq, se, 0);
> + if (entity_is_task(se))
> + action |= UPDATE_UTIL_EST;
> + update_load_avg(cfs_rq, se, action);
> update_entity_lag(cfs_rq, se);
> set_delayed(se);
> return false;
> }
> }
>
> - if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> - action |= DO_DETACH;
> + action = UPDATE_TG;
> + if (entity_is_task(se)) {
> + if (task_on_rq_migrating(task_of(se)))
> + action |= DO_DETACH;
> +
> + if (sleep && !(flags & DEQUEUE_DELAYED))
> + action |= UPDATE_UTIL_EST;
> + }
>
> /*
> * When dequeuing a sched_entity, we must:
> @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!p->se.sched_delayed)
> util_est_dequeue(&rq->cfs, p);
>
> - util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> if (dequeue_entities(rq, &p->se, flags) < 0)
> return false;
>
> --
> 2.43.0
>
>
>
>
> >
> > Specifically, we shouldn't reference p after dequeue_entities() or we
> > risk racing with it being woken up, running, and maybe exiting on
> > another cpu.
> > > And this moves the util_est_update() call to after dequeue finishes.
> >
> > Maybe there's some way to have util_est_update() compensate for the
> > unfinished accounting that will be done in dequeue_entities()?
> >
> > thanks
> > -john
* Re: [PATCH] sched/fair: Call util_est_update() after dequeue_entities()
From: Vincent Guittot @ 2026-05-15 15:23 UTC
To: Qais Yousef
Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm
On Fri, 15 May 2026 at 03:51, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 05/13/26 14:46, Vincent Guittot wrote:
> > Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > > >
> > > > util_est_update() reads task_util() at dequeue, which is only
> > > > updated in dequeue_entities(). To read an accurate util_avg at
> > > > dequeue, make sure to do the read after load_avg has been updated
> > > > in dequeue_entities().
> > > >
> > > > util_est for a periodic task before
> > > >
> > > > periodic-3114 util_est.enqueued running
> > > > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
> > > > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
> > > > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
> > > > │ ▛ │
> > > > 51┤ ▐▘ │
> > > > 7┤ ▖▗▗ ▗▄▐ │
> > > > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > > > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
> > > >
> > > > and after
> > > >
> > > > periodic-2977 util_est.enqueued running
> > > > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
> > > > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
> > > > 82.0┤ ▟ │
> > > > │ ▌ │
> > > > 44.5┤ ▌ │
> > > > 7.0┤ ▗ ▗▖ ▌ │
> > > > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > > > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
> > > >
> > > > Note how the signal is noisier before the fix, peaking at 183, vs 157 after.
> > > >
> > > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > > ---
> > > >
> > > > This is split from the series at [1], where I stumbled upon this problem.
> > > > AFAICS it needs backporting all the way to 6.12 LTS.
> > > >
> > > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > > >
> > > > kernel/sched/fair.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 728965851842..96ba97e5f4ae 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > > > */
> > > > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > {
> > > > + int ret;
> > > > +
> > > > if (task_is_throttled(p)) {
> > > > dequeue_throttled_task(p, flags);
> > > > return true;
> > > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > if (!p->se.sched_delayed)
> > > > util_est_dequeue(&rq->cfs, p);
> > > >
> > > > + ret = dequeue_entities(rq, &p->se, flags);
> > > > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > > - if (dequeue_entities(rq, &p->se, flags) < 0)
> > >
> > > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > > version of this:
> > > https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> >
> > Even without sashiko, this is what the comment says just below in the function ;-)
>
> Yeah I read it but it didn't really register properly :-)
>
> >
> > Below is a different way to fix it which also covers cases when we have both
> > DEQUEUE_SLEEP and DEQUEUE_DELAYED
>
> Works for me. It looks even more stable now:
>
> util_avg
>
> periodic-2737 util_avg running
> ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 140┤ ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄ │
> 105┤ ▄██████████████████████████████████████████████████████████████████ │
> 70┤ ▗▛▘ │
> │ ▐▘ │
> 35┤ █ │
> 0┤ ▖ ▗▄▌ │
> └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85
>
> and util_est
>
> periodic-2737 util_est.enqueued running
> ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 140.0┤ ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘ │
> 106.8┤ ▝▘ │
> 73.5┤ ▗▘ │
> │ ▗ │
> 40.2┤ ▖ │
> 7.0┤ ▖ ▄▗▖ │
> └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85
>
> Do you want to send a proper patch? Feel free to stick my
> reviewed-and-tested-by.
Yes, I'm going to send a proper patch with your Reported-by, Reviewed-by and Tested-by.
Thanks
>
> Thanks!
>
> >
> > ---
> > kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
> > 1 file changed, 93 insertions(+), 96 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c83fbe4e88c1..0976adc12594 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > trace_pelt_cfs_tp(cfs_rq);
> > }
> >
> > +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > +
> > +static inline void util_est_update(struct sched_entity *se)
> > +{
> > + unsigned int ewma, dequeued, last_ewma_diff;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /* Get current estimate of utilization */
> > + ewma = READ_ONCE(se->avg.util_est);
> > +
> > + /*
> > + * If the PELT values haven't changed since enqueue time,
> > + * skip the util_est update.
> > + */
> > + if (ewma & UTIL_AVG_UNCHANGED)
> > + return;
> > +
> > + /* Get utilization at dequeue */
> > + dequeued = READ_ONCE(se->avg.util_avg);
> > +
> > + /*
> > + * Reset EWMA on utilization increases, the moving average is used only
> > + * to smooth utilization decreases.
> > + */
> > + if (ewma <= dequeued) {
> > + ewma = dequeued;
> > + goto done;
> > + }
> > +
> > + /*
> > + * Skip update of task's estimated utilization when its members are
> > + * already ~1% close to its last activation value.
> > + */
> > + last_ewma_diff = ewma - dequeued;
> > + if (last_ewma_diff < UTIL_EST_MARGIN)
> > + goto done;
> > +
> > + /*
> > + * To avoid underestimate of task utilization, skip updates of EWMA if
> > + * we cannot grant that thread got all CPU time it wanted.
> > + */
> > + if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> > + goto done;
> > +
> > +
> > + /*
> > + * Update Task's estimated utilization
> > + *
> > + * When *p completes an activation we can consolidate another sample
> > + * of the task size. This is done by using this value to update the
> > + * Exponential Weighted Moving Average (EWMA):
> > + *
> > + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> > + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> > + * = w * ( -last_ewma_diff ) + ewma(t-1)
> > + * = w * (-last_ewma_diff + ewma(t-1) / w)
> > + *
> > + * Where 'w' is the weight of new samples, which is configured to be
> > + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > + */
> > + ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > + ewma -= last_ewma_diff;
> > + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > +done:
> > + ewma |= UTIL_AVG_UNCHANGED;
> > + WRITE_ONCE(se->avg.util_est, ewma);
> > +
> > + trace_sched_util_est_se_tp(se);
> > +}
> > +
> > /*
> > * Optional action to be done while updating the load average
> > */
> > -#define UPDATE_TG 0x1
> > -#define SKIP_AGE_LOAD 0x2
> > -#define DO_ATTACH 0x4
> > -#define DO_DETACH 0x8
> > +#define UPDATE_TG 0x01
> > +#define SKIP_AGE_LOAD 0x02
> > +#define DO_ATTACH 0x04
> > +#define DO_DETACH 0x08
> > +#define UPDATE_UTIL_EST 0x10
> >
> > /* Update task and its cfs_rq load average */
> > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > if (flags & UPDATE_TG)
> > update_tg_load_avg(cfs_rq);
> > }
> > +
> > + if (flags & UPDATE_UTIL_EST)
> > + util_est_update(se);
> > }
> >
> > /*
> > @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
> > return READ_ONCE(p->se.avg.util_avg);
> > }
> >
> > -static inline unsigned long task_runnable(struct task_struct *p)
> > -{
> > - return READ_ONCE(p->se.avg.runnable_avg);
> > -}
> > -
> > static inline unsigned long _task_util_est(struct task_struct *p)
> > {
> > return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> > @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> > trace_sched_util_est_cfs_tp(cfs_rq);
> > }
> >
> > -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > -
> > -static inline void util_est_update(struct cfs_rq *cfs_rq,
> > - struct task_struct *p,
> > - bool task_sleep)
> > -{
> > - unsigned int ewma, dequeued, last_ewma_diff;
> > -
> > - if (!sched_feat(UTIL_EST))
> > - return;
> > -
> > - /*
> > - * Skip update of task's estimated utilization when the task has not
> > - * yet completed an activation, e.g. being migrated.
> > - */
> > - if (!task_sleep)
> > - return;
> > -
> > - /* Get current estimate of utilization */
> > - ewma = READ_ONCE(p->se.avg.util_est);
> > -
> > - /*
> > - * If the PELT values haven't changed since enqueue time,
> > - * skip the util_est update.
> > - */
> > - if (ewma & UTIL_AVG_UNCHANGED)
> > - return;
> > -
> > - /* Get utilization at dequeue */
> > - dequeued = task_util(p);
> > -
> > - /*
> > - * Reset EWMA on utilization increases, the moving average is used only
> > - * to smooth utilization decreases.
> > - */
> > - if (ewma <= dequeued) {
> > - ewma = dequeued;
> > - goto done;
> > - }
> > -
> > - /*
> > - * Skip update of task's estimated utilization when its members are
> > - * already ~1% close to its last activation value.
> > - */
> > - last_ewma_diff = ewma - dequeued;
> > - if (last_ewma_diff < UTIL_EST_MARGIN)
> > - goto done;
> > -
> > - /*
> > - * To avoid underestimate of task utilization, skip updates of EWMA if
> > - * we cannot grant that thread got all CPU time it wanted.
> > - */
> > - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> > - goto done;
> > -
> > -
> > - /*
> > - * Update Task's estimated utilization
> > - *
> > - * When *p completes an activation we can consolidate another sample
> > - * of the task size. This is done by using this value to update the
> > - * Exponential Weighted Moving Average (EWMA):
> > - *
> > - * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> > - * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > - * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> > - * = w * ( -last_ewma_diff ) + ewma(t-1)
> > - * = w * (-last_ewma_diff + ewma(t-1) / w)
> > - *
> > - * Where 'w' is the weight of new samples, which is configured to be
> > - * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > - */
> > - ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > - ewma -= last_ewma_diff;
> > - ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > -done:
> > - ewma |= UTIL_AVG_UNCHANGED;
> > - WRITE_ONCE(p->se.avg.util_est, ewma);
> > -
> > - trace_sched_util_est_se_tp(&p->se);
> > -}
> > -
> > static inline unsigned long get_actual_cpu_capacity(int cpu)
> > {
> > unsigned long capacity = arch_scale_cpu_capacity(cpu);
> > @@ -5828,7 +5818,7 @@ static bool
> > dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > {
> > bool sleep = flags & DEQUEUE_SLEEP;
> > - int action = UPDATE_TG;
> > + int action = 0;
> >
> > update_curr(cfs_rq);
> > clear_buddies(cfs_rq, se);
> > @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >
> > if (sched_feat(DELAY_DEQUEUE) && delay &&
> > !entity_eligible(cfs_rq, se)) {
> > - update_load_avg(cfs_rq, se, 0);
> > + if (entity_is_task(se))
> > + action |= UPDATE_UTIL_EST;
> > + update_load_avg(cfs_rq, se, action);
> > update_entity_lag(cfs_rq, se);
> > set_delayed(se);
> > return false;
> > }
> > }
> >
> > - if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> > - action |= DO_DETACH;
> > + action = UPDATE_TG;
> > + if (entity_is_task(se)) {
> > + if (task_on_rq_migrating(task_of(se)))
> > + action |= DO_DETACH;
> > +
> > + if (sleep && !(flags & DEQUEUE_DELAYED))
> > + action |= UPDATE_UTIL_EST;
> > + }
> >
> > /*
> > * When dequeuing a sched_entity, we must:
> > @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!p->se.sched_delayed)
> > util_est_dequeue(&rq->cfs, p);
> >
> > - util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > if (dequeue_entities(rq, &p->se, flags) < 0)
> > return false;
> >
> > --
> > 2.43.0
> >
> >
> >
> >
> > >
> > > Specifically, we shouldn't reference p after dequeue_entities() or we
> > > risk racing with it being woken up, running, and maybe exiting on
> > > another cpu.
> > > And this moves the util_est_update() call to after dequeue finishes.
> > >
> > > Maybe there's some way to have util_est_update() compensate for the
> > > unfinished accounting that will be done in dequeue_entities()?
> > >
> > > thanks
> > > -john
* Re: [PATCH] sched/fair: Call util_est_update() after dequeue_entities()
From: Tim Chen @ 2026-05-15 18:35 UTC
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot
Cc: Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
John Stultz, Dietmar Eggemann, Chen, Yu C, Thomas Gleixner,
linux-kernel, linux-pm
On Tue, 2026-05-12 at 13:46 +0100, Qais Yousef wrote:
> util_est_update() reads task_util() at dequeue, which is only updated
> in dequeue_entities(). To read an accurate util_avg at dequeue, make
> sure to do the read after load_avg has been updated in
> dequeue_entities().
>
> util_est for a periodic task before
>
> periodic-3114 util_est.enqueued running
> ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
> 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
> 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
> │ ▛ │
> 51┤ ▐▘ │
> 7┤ ▖▗▗ ▗▄▐ │
> └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
>
> and after
>
> periodic-2977 util_est.enqueued running
> ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
> 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
> 82.0┤ ▟ │
> │ ▌ │
> 44.5┤ ▌ │
> 7.0┤ ▗ ▗▖ ▌ │
> └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
>
> Note how the signal is noisier before the fix, peaking at 183, vs 157 after.
>
> Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>
> This is split from the series at [1], where I stumbled upon this problem.
> AFAICS it needs backporting all the way to 6.12 LTS.
>
> [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
>
> kernel/sched/fair.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..96ba97e5f4ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> */
> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> + int ret;
> +
> if (task_is_throttled(p)) {
> dequeue_throttled_task(p, flags);
> return true;
> @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!p->se.sched_delayed)
> util_est_dequeue(&rq->cfs, p);
>
> + ret = dequeue_entities(rq, &p->se, flags);
> util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
I thought that util_est_update() was intentionally called before
dequeue_entities() to bring the utilization of task p up to date right
before the dequeue. Then dequeue_entities() is called with an
up-to-date utilization estimate of p.
Perhaps util_est_update() should be moved before
util_est_dequeue() so the updated utilization of p
is subtracted from the rq utilization.
@@ -8002,10 +8002,10 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
return true;
}
+ util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
- util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
if (dequeue_entities(rq, &p->se, flags) < 0)
return false;
Tim
> - if (dequeue_entities(rq, &p->se, flags) < 0)
> + if (ret < 0)
> return false;
>
> /*