* [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
@ 2026-05-12 12:46 Qais Yousef
2026-05-12 20:52 ` John Stultz
2026-05-15 18:35 ` Tim Chen
0 siblings, 2 replies; 6+ messages in thread
From: Qais Yousef @ 2026-05-12 12:46 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot
Cc: Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
John Stultz, Dietmar Eggemann, Tim Chen, Chen, Yu C,
Thomas Gleixner, linux-kernel, linux-pm, Qais Yousef
util_est_update() reads task_util() at dequeue, but the task's util_avg
is only updated inside dequeue_entities(). To sample an accurate
util_avg at dequeue, make sure the read happens after load_avg has been
updated in dequeue_entities().
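
Roughly, the current ordering is (simplified call sketch, eliding the
throttled and sched_delayed special cases):

	dequeue_task_fair()
	  util_est_dequeue(&rq->cfs, p)
	  util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP)  /* samples p->se.avg.util_avg */
	  dequeue_entities(rq, &p->se, flags)
	    dequeue_entity()
	      update_load_avg()                                /* util_avg refreshed only here */

so util_est_update() feeds the EWMA a util_avg that has not yet been
updated for the time since the last PELT update.
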
util_est for a periodic task before
periodic-3114 util_est.enqueued running
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
│ ▛ │
51┤ ▐▘ │
7┤ ▖▗▗ ▗▄▐ │
└┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
and after
periodic-2977 util_est.enqueued running
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
82.0┤ ▟ │
│ ▌ │
44.5┤ ▌ │
7.0┤ ▗ ▗▖ ▌ │
└┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
Note how the signal is noisier and can peak to 183 vs 157 now.
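
To see why a stale sample makes the estimate noisier, here is a small
stand-alone user-space model of the util_est EWMA update (same w = 1/4
weight and reset-on-increase behaviour as the kernel code; the margin
and UTIL_AVG_UNCHANGED checks are left out, and the per-activation
sample values are made up purely for illustration):

	#include <stdio.h>

	#define UTIL_EST_WEIGHT_SHIFT	2	/* w = 1/4 */

	/* util_est EWMA model: reset on increase, smooth on decrease. */
	static unsigned int ewma_update(unsigned int ewma, unsigned int dequeued)
	{
		unsigned int last_ewma_diff;

		if (ewma <= dequeued)
			return dequeued;

		last_ewma_diff = ewma - dequeued;
		ewma <<= UTIL_EST_WEIGHT_SHIFT;
		ewma -= last_ewma_diff;
		ewma >>= UTIL_EST_WEIGHT_SHIFT;
		return ewma;
	}

	int main(void)
	{
		/* made-up util_avg samples seen at dequeue for a periodic task */
		unsigned int fresh[] = { 150, 148, 152, 149, 151 };	/* read after the PELT update */
		unsigned int stale[] = { 180, 110, 175, 105, 170 };	/* read one update too early */
		unsigned int e_fresh = 0, e_stale = 0;

		for (int i = 0; i < 5; i++) {
			e_fresh = ewma_update(e_fresh, fresh[i]);
			e_stale = ewma_update(e_stale, stale[i]);
			printf("activation %d: fresh ewma=%u stale ewma=%u\n",
			       i, e_fresh, e_stale);
		}
		return 0;
	}

The "fresh" estimate settles around the task's real size, while the
"stale" one keeps getting reset upwards and then pulled back down,
which matches the noisier shape of the first plot above.
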
Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
This is split from [1] series where I stumbled upon this problem. AFAICS it
needs backporting all the way to 6.12 LTS.
[1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..96ba97e5f4ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
+ int ret;
+
if (task_is_throttled(p)) {
dequeue_throttled_task(p, flags);
return true;
@@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
+ ret = dequeue_entities(rq, &p->se, flags);
util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
- if (dequeue_entities(rq, &p->se, flags) < 0)
+ if (ret < 0)
return false;
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities() 2026-05-12 12:46 [PATCH] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef @ 2026-05-12 20:52 ` John Stultz 2026-05-13 12:46 ` Vincent Guittot 2026-05-15 18:35 ` Tim Chen 1 sibling, 1 reply; 6+ messages in thread From: John Stultz @ 2026-05-12 20:52 UTC (permalink / raw) To: Qais Yousef Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann, Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote: > > update_util_est() reads task_util() at dequeue which is updated in > dequeue_entities(). To read the accurate util_avg at dequeue, make sure > to do the read after load_avg is updated in dequeue_entities(). > > util_est for a periodic task before > > periodic-3114 util_est.enqueued running > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐ > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │ > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │ > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │ > │ ▛ │ > 51┤ ▐▘ │ > 7┤ ▖▗▗ ▗▄▐ │ > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘ > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87 > > and after > > periodic-2977 util_est.enqueued running > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │ > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │ > 82.0┤ ▟ │ > │ ▌ │ > 44.5┤ ▌ │ > 7.0┤ ▗ ▗▖ ▌ │ > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘ > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86 > > Note how the signal is noisier and can peak to 183 vs 157 now. > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race") > Signed-off-by: Qais Yousef <qyousef@layalina.io> > --- > > This is split from [1] series where I stumbled upon this problem. AFAICS it > needs backporting all the way to 6.12 LTS. > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/ > > kernel/sched/fair.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 728965851842..96ba97e5f4ae 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) > */ > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > { > + int ret; > + > if (task_is_throttled(p)) { > dequeue_throttled_task(p, flags); > return true; > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > if (!p->se.sched_delayed) > util_est_dequeue(&rq->cfs, p); > > + ret = dequeue_entities(rq, &p->se, flags); > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); > - if (dequeue_entities(rq, &p->se, flags) < 0) Hrm, so the Sashiko tool raised a reasonable concern on the earlier version of this: https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12 Specifically, we shouldn't reference p after dequeue_entities() or we risk racing with it being woken up, running, and maybe exiting on another cpu. 
And this moves the util_est_update() call to after dequeue finishes.

Maybe there's some way to have util_est_update() compensate for the
unfinished accounting that will be done in dequeue_entities()?

thanks
-john

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities() 2026-05-12 20:52 ` John Stultz @ 2026-05-13 12:46 ` Vincent Guittot 2026-05-15 1:51 ` Qais Yousef 0 siblings, 1 reply; 6+ messages in thread From: Vincent Guittot @ 2026-05-13 12:46 UTC (permalink / raw) To: John Stultz Cc: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann, Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit : > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote: > > > > update_util_est() reads task_util() at dequeue which is updated in > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure > > to do the read after load_avg is updated in dequeue_entities(). > > > > util_est for a periodic task before > > > > periodic-3114 util_est.enqueued running > > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐ > > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │ > > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │ > > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │ > > │ ▛ │ > > 51┤ ▐▘ │ > > 7┤ ▖▗▗ ▗▄▐ │ > > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘ > > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87 > > > > and after > > > > periodic-2977 util_est.enqueued running > > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ > > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │ > > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │ > > 82.0┤ ▟ │ > > │ ▌ │ > > 44.5┤ ▌ │ > > 7.0┤ ▗ ▗▖ ▌ │ > > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘ > > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86 > > > > Note how the signal is noisier and can peak to 183 vs 157 now. > > > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race") > > Signed-off-by: Qais Yousef <qyousef@layalina.io> > > --- > > > > This is split from [1] series where I stumbled upon this problem. AFAICS it > > needs backporting all the way to 6.12 LTS. 
> > > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/ > > > > kernel/sched/fair.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 728965851842..96ba97e5f4ae 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) > > */ > > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > { > > + int ret; > > + > > if (task_is_throttled(p)) { > > dequeue_throttled_task(p, flags); > > return true; > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > if (!p->se.sched_delayed) > > util_est_dequeue(&rq->cfs, p); > > > > + ret = dequeue_entities(rq, &p->se, flags); > > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); > > - if (dequeue_entities(rq, &p->se, flags) < 0) > > Hrm, so the Sashiko tool raised a reasonable concern on the earlier > version of this: > https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12 Even without sashiko, this is what the comment says just below in the function ;-) Below is a different way to fix it which also covers cases when we have both DEQUEUE_SLEEP and DEQUEUE_DELAYED --- kernel/sched/fair.c | 189 ++++++++++++++++++++++---------------------- 1 file changed, 93 insertions(+), 96 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c83fbe4e88c1..0976adc12594 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s trace_pelt_cfs_tp(cfs_rq); } +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100) + +static inline void util_est_update(struct sched_entity *se) +{ + unsigned int ewma, dequeued, last_ewma_diff; + + if (!sched_feat(UTIL_EST)) + return; + + /* Get current estimate of utilization */ + ewma = READ_ONCE(se->avg.util_est); + + /* + * If the PELT values haven't changed since enqueue time, + * skip the util_est update. + */ + if (ewma & UTIL_AVG_UNCHANGED) + return; + + /* Get utilization at dequeue */ + dequeued = READ_ONCE(se->avg.util_avg); + + /* + * Reset EWMA on utilization increases, the moving average is used only + * to smooth utilization decreases. + */ + if (ewma <= dequeued) { + ewma = dequeued; + goto done; + } + + /* + * Skip update of task's estimated utilization when its members are + * already ~1% close to its last activation value. + */ + last_ewma_diff = ewma - dequeued; + if (last_ewma_diff < UTIL_EST_MARGIN) + goto done; + + /* + * To avoid underestimate of task utilization, skip updates of EWMA if + * we cannot grant that thread got all CPU time it wanted. + */ + if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg)) + goto done; + + + /* + * Update Task's estimated utilization + * + * When *p completes an activation we can consolidate another sample + * of the task size. 
This is done by using this value to update the + * Exponential Weighted Moving Average (EWMA): + * + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1) + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1) + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1) + * = w * ( -last_ewma_diff ) + ewma(t-1) + * = w * (-last_ewma_diff + ewma(t-1) / w) + * + * Where 'w' is the weight of new samples, which is configured to be + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT) + */ + ewma <<= UTIL_EST_WEIGHT_SHIFT; + ewma -= last_ewma_diff; + ewma >>= UTIL_EST_WEIGHT_SHIFT; +done: + ewma |= UTIL_AVG_UNCHANGED; + WRITE_ONCE(se->avg.util_est, ewma); + + trace_sched_util_est_se_tp(se); +} + /* * Optional action to be done while updating the load average */ -#define UPDATE_TG 0x1 -#define SKIP_AGE_LOAD 0x2 -#define DO_ATTACH 0x4 -#define DO_DETACH 0x8 +#define UPDATE_TG 0x01 +#define SKIP_AGE_LOAD 0x02 +#define DO_ATTACH 0x04 +#define DO_DETACH 0x08 +#define UPDATE_UTIL_EST 0x10 /* Update task and its cfs_rq load average */ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s if (flags & UPDATE_TG) update_tg_load_avg(cfs_rq); } + + if (flags & UPDATE_UTIL_EST) + util_est_update(se); } /* @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p) return READ_ONCE(p->se.avg.util_avg); } -static inline unsigned long task_runnable(struct task_struct *p) -{ - return READ_ONCE(p->se.avg.runnable_avg); -} - static inline unsigned long _task_util_est(struct task_struct *p) { return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED; @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq, trace_sched_util_est_cfs_tp(cfs_rq); } -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100) - -static inline void util_est_update(struct cfs_rq *cfs_rq, - struct task_struct *p, - bool task_sleep) -{ - unsigned int ewma, dequeued, last_ewma_diff; - - if (!sched_feat(UTIL_EST)) - return; - - /* - * Skip update of task's estimated utilization when the task has not - * yet completed an activation, e.g. being migrated. - */ - if (!task_sleep) - return; - - /* Get current estimate of utilization */ - ewma = READ_ONCE(p->se.avg.util_est); - - /* - * If the PELT values haven't changed since enqueue time, - * skip the util_est update. - */ - if (ewma & UTIL_AVG_UNCHANGED) - return; - - /* Get utilization at dequeue */ - dequeued = task_util(p); - - /* - * Reset EWMA on utilization increases, the moving average is used only - * to smooth utilization decreases. - */ - if (ewma <= dequeued) { - ewma = dequeued; - goto done; - } - - /* - * Skip update of task's estimated utilization when its members are - * already ~1% close to its last activation value. - */ - last_ewma_diff = ewma - dequeued; - if (last_ewma_diff < UTIL_EST_MARGIN) - goto done; - - /* - * To avoid underestimate of task utilization, skip updates of EWMA if - * we cannot grant that thread got all CPU time it wanted. - */ - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p)) - goto done; - - - /* - * Update Task's estimated utilization - * - * When *p completes an activation we can consolidate another sample - * of the task size. 
This is done by using this value to update the - * Exponential Weighted Moving Average (EWMA): - * - * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1) - * = w * task_util(p) + ewma(t-1) - w * ewma(t-1) - * = w * (task_util(p) - ewma(t-1)) + ewma(t-1) - * = w * ( -last_ewma_diff ) + ewma(t-1) - * = w * (-last_ewma_diff + ewma(t-1) / w) - * - * Where 'w' is the weight of new samples, which is configured to be - * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT) - */ - ewma <<= UTIL_EST_WEIGHT_SHIFT; - ewma -= last_ewma_diff; - ewma >>= UTIL_EST_WEIGHT_SHIFT; -done: - ewma |= UTIL_AVG_UNCHANGED; - WRITE_ONCE(p->se.avg.util_est, ewma); - - trace_sched_util_est_se_tp(&p->se); -} - static inline unsigned long get_actual_cpu_capacity(int cpu) { unsigned long capacity = arch_scale_cpu_capacity(cpu); @@ -5828,7 +5818,7 @@ static bool dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { bool sleep = flags & DEQUEUE_SLEEP; - int action = UPDATE_TG; + int action = 0; update_curr(cfs_rq); clear_buddies(cfs_rq, se); @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) if (sched_feat(DELAY_DEQUEUE) && delay && !entity_eligible(cfs_rq, se)) { - update_load_avg(cfs_rq, se, 0); + if (entity_is_task(se)) + action |= UPDATE_UTIL_EST; + update_load_avg(cfs_rq, se, action); update_entity_lag(cfs_rq, se); set_delayed(se); return false; } } - if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) - action |= DO_DETACH; + action = UPDATE_TG; + if (entity_is_task(se)) { + if (task_on_rq_migrating(task_of(se))) + action |= DO_DETACH; + + if (sleep && !(flags & DEQUEUE_DELAYED)) + action |= UPDATE_UTIL_EST; + } /* * When dequeuing a sched_entity, we must: @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (!p->se.sched_delayed) util_est_dequeue(&rq->cfs, p); - util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); if (dequeue_entities(rq, &p->se, flags) < 0) return false; -- 2.43.0 > > Specifically, we shouldn't reference p after dequeue_entities() or we > risk racing with it being woken up, running, and maybe exiting on > another cpu. > And this moves the util_est_updat() call to after dequeue finishes. > > Maybe there's some way to have util_est_update() compensate for the > unfinished accounting that will be done in dequeue_entities()? > > thanks > -john ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities() 2026-05-13 12:46 ` Vincent Guittot @ 2026-05-15 1:51 ` Qais Yousef 2026-05-15 15:23 ` Vincent Guittot 0 siblings, 1 reply; 6+ messages in thread From: Qais Yousef @ 2026-05-15 1:51 UTC (permalink / raw) To: Vincent Guittot Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann, Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm On 05/13/26 14:46, Vincent Guittot wrote: > Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit : > > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote: > > > > > > update_util_est() reads task_util() at dequeue which is updated in > > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure > > > to do the read after load_avg is updated in dequeue_entities(). > > > > > > util_est for a periodic task before > > > > > > periodic-3114 util_est.enqueued running > > > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐ > > > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │ > > > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │ > > > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │ > > > │ ▛ │ > > > 51┤ ▐▘ │ > > > 7┤ ▖▗▗ ▗▄▐ │ > > > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘ > > > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87 > > > > > > and after > > > > > > periodic-2977 util_est.enqueued running > > > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ > > > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │ > > > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │ > > > 82.0┤ ▟ │ > > > │ ▌ │ > > > 44.5┤ ▌ │ > > > 7.0┤ ▗ ▗▖ ▌ │ > > > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘ > > > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86 > > > > > > Note how the signal is noisier and can peak to 183 vs 157 now. > > > > > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race") > > > Signed-off-by: Qais Yousef <qyousef@layalina.io> > > > --- > > > > > > This is split from [1] series where I stumbled upon this problem. AFAICS it > > > needs backporting all the way to 6.12 LTS. 
> > > > > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/ > > > > > > kernel/sched/fair.c | 5 ++++- > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 728965851842..96ba97e5f4ae 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) > > > */ > > > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > > { > > > + int ret; > > > + > > > if (task_is_throttled(p)) { > > > dequeue_throttled_task(p, flags); > > > return true; > > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > > if (!p->se.sched_delayed) > > > util_est_dequeue(&rq->cfs, p); > > > > > > + ret = dequeue_entities(rq, &p->se, flags); > > > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); > > > - if (dequeue_entities(rq, &p->se, flags) < 0) > > > > Hrm, so the Sashiko tool raised a reasonable concern on the earlier > > version of this: > > https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12 > > Even without sashiko, this is what the comment says just below in the function ;-) Yeah I read it but it didn't really register properly :-) > > Below is a different way to fix it which also covers cases when we have both > DEQUEUE_SLEEP and DEQUEUE_DELAYED Works for me. It looks even more stable now util_avg periodic-2737 util_avg running ┌───────────────────────────────────────────────────────────────────────────────────────────────┐ 140┤ ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄ │ 105┤ ▄██████████████████████████████████████████████████████████████████ │ 70┤ ▗▛▘ │ │ ▐▘ │ 35┤ █ │ 0┤ ▖ ▗▄▌ │ └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘ 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85 and util_eset periodic-2737 util_est.enqueued running ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ 140.0┤ ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘ │ 106.8┤ ▝▘ │ 73.5┤ ▗▘ │ │ ▗ │ 40.2┤ ▖ │ 7.0┤ ▖ ▄▗▖ │ └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘ 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85 Do you want to send a proper patch? Feel free to stick my reviewed-and-tested-by. Thanks! > > --- > kernel/sched/fair.c | 189 ++++++++++++++++++++++---------------------- > 1 file changed, 93 insertions(+), 96 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c83fbe4e88c1..0976adc12594 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > trace_pelt_cfs_tp(cfs_rq); > } > > +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100) > + > +static inline void util_est_update(struct sched_entity *se) > +{ > + unsigned int ewma, dequeued, last_ewma_diff; > + > + if (!sched_feat(UTIL_EST)) > + return; > + > + /* Get current estimate of utilization */ > + ewma = READ_ONCE(se->avg.util_est); > + > + /* > + * If the PELT values haven't changed since enqueue time, > + * skip the util_est update. 
> + */ > + if (ewma & UTIL_AVG_UNCHANGED) > + return; > + > + /* Get utilization at dequeue */ > + dequeued = READ_ONCE(se->avg.util_avg); > + > + /* > + * Reset EWMA on utilization increases, the moving average is used only > + * to smooth utilization decreases. > + */ > + if (ewma <= dequeued) { > + ewma = dequeued; > + goto done; > + } > + > + /* > + * Skip update of task's estimated utilization when its members are > + * already ~1% close to its last activation value. > + */ > + last_ewma_diff = ewma - dequeued; > + if (last_ewma_diff < UTIL_EST_MARGIN) > + goto done; > + > + /* > + * To avoid underestimate of task utilization, skip updates of EWMA if > + * we cannot grant that thread got all CPU time it wanted. > + */ > + if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg)) > + goto done; > + > + > + /* > + * Update Task's estimated utilization > + * > + * When *p completes an activation we can consolidate another sample > + * of the task size. This is done by using this value to update the > + * Exponential Weighted Moving Average (EWMA): > + * > + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1) > + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1) > + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1) > + * = w * ( -last_ewma_diff ) + ewma(t-1) > + * = w * (-last_ewma_diff + ewma(t-1) / w) > + * > + * Where 'w' is the weight of new samples, which is configured to be > + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT) > + */ > + ewma <<= UTIL_EST_WEIGHT_SHIFT; > + ewma -= last_ewma_diff; > + ewma >>= UTIL_EST_WEIGHT_SHIFT; > +done: > + ewma |= UTIL_AVG_UNCHANGED; > + WRITE_ONCE(se->avg.util_est, ewma); > + > + trace_sched_util_est_se_tp(se); > +} > + > /* > * Optional action to be done while updating the load average > */ > -#define UPDATE_TG 0x1 > -#define SKIP_AGE_LOAD 0x2 > -#define DO_ATTACH 0x4 > -#define DO_DETACH 0x8 > +#define UPDATE_TG 0x01 > +#define SKIP_AGE_LOAD 0x02 > +#define DO_ATTACH 0x04 > +#define DO_DETACH 0x08 > +#define UPDATE_UTIL_EST 0x10 > > /* Update task and its cfs_rq load average */ > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > if (flags & UPDATE_TG) > update_tg_load_avg(cfs_rq); > } > + > + if (flags & UPDATE_UTIL_EST) > + util_est_update(se); > } > > /* > @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p) > return READ_ONCE(p->se.avg.util_avg); > } > > -static inline unsigned long task_runnable(struct task_struct *p) > -{ > - return READ_ONCE(p->se.avg.runnable_avg); > -} > - > static inline unsigned long _task_util_est(struct task_struct *p) > { > return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED; > @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq, > trace_sched_util_est_cfs_tp(cfs_rq); > } > > -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100) > - > -static inline void util_est_update(struct cfs_rq *cfs_rq, > - struct task_struct *p, > - bool task_sleep) > -{ > - unsigned int ewma, dequeued, last_ewma_diff; > - > - if (!sched_feat(UTIL_EST)) > - return; > - > - /* > - * Skip update of task's estimated utilization when the task has not > - * yet completed an activation, e.g. being migrated. 
> - */ > - if (!task_sleep) > - return; > - > - /* Get current estimate of utilization */ > - ewma = READ_ONCE(p->se.avg.util_est); > - > - /* > - * If the PELT values haven't changed since enqueue time, > - * skip the util_est update. > - */ > - if (ewma & UTIL_AVG_UNCHANGED) > - return; > - > - /* Get utilization at dequeue */ > - dequeued = task_util(p); > - > - /* > - * Reset EWMA on utilization increases, the moving average is used only > - * to smooth utilization decreases. > - */ > - if (ewma <= dequeued) { > - ewma = dequeued; > - goto done; > - } > - > - /* > - * Skip update of task's estimated utilization when its members are > - * already ~1% close to its last activation value. > - */ > - last_ewma_diff = ewma - dequeued; > - if (last_ewma_diff < UTIL_EST_MARGIN) > - goto done; > - > - /* > - * To avoid underestimate of task utilization, skip updates of EWMA if > - * we cannot grant that thread got all CPU time it wanted. > - */ > - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p)) > - goto done; > - > - > - /* > - * Update Task's estimated utilization > - * > - * When *p completes an activation we can consolidate another sample > - * of the task size. This is done by using this value to update the > - * Exponential Weighted Moving Average (EWMA): > - * > - * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1) > - * = w * task_util(p) + ewma(t-1) - w * ewma(t-1) > - * = w * (task_util(p) - ewma(t-1)) + ewma(t-1) > - * = w * ( -last_ewma_diff ) + ewma(t-1) > - * = w * (-last_ewma_diff + ewma(t-1) / w) > - * > - * Where 'w' is the weight of new samples, which is configured to be > - * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT) > - */ > - ewma <<= UTIL_EST_WEIGHT_SHIFT; > - ewma -= last_ewma_diff; > - ewma >>= UTIL_EST_WEIGHT_SHIFT; > -done: > - ewma |= UTIL_AVG_UNCHANGED; > - WRITE_ONCE(p->se.avg.util_est, ewma); > - > - trace_sched_util_est_se_tp(&p->se); > -} > - > static inline unsigned long get_actual_cpu_capacity(int cpu) > { > unsigned long capacity = arch_scale_cpu_capacity(cpu); > @@ -5828,7 +5818,7 @@ static bool > dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > { > bool sleep = flags & DEQUEUE_SLEEP; > - int action = UPDATE_TG; > + int action = 0; > > update_curr(cfs_rq); > clear_buddies(cfs_rq, se); > @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > > if (sched_feat(DELAY_DEQUEUE) && delay && > !entity_eligible(cfs_rq, se)) { > - update_load_avg(cfs_rq, se, 0); > + if (entity_is_task(se)) > + action |= UPDATE_UTIL_EST; > + update_load_avg(cfs_rq, se, action); > update_entity_lag(cfs_rq, se); > set_delayed(se); > return false; > } > } > > - if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) > - action |= DO_DETACH; > + action = UPDATE_TG; > + if (entity_is_task(se)) { > + if (task_on_rq_migrating(task_of(se))) > + action |= DO_DETACH; > + > + if (sleep && !(flags & DEQUEUE_DELAYED)) > + action |= UPDATE_UTIL_EST; > + } > > /* > * When dequeuing a sched_entity, we must: > @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > if (!p->se.sched_delayed) > util_est_dequeue(&rq->cfs, p); > > - util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); > if (dequeue_entities(rq, &p->se, flags) < 0) > return false; > > -- > 2.43.0 > > > > > > > > Specifically, we shouldn't reference p after dequeue_entities() or we > > risk racing with it being woken up, running, and maybe exiting on > > another cpu. 
> > And this moves the util_est_updat() call to after dequeue finishes. > > > > Maybe there's some way to have util_est_update() compensate for the > > unfinished accounting that will be done in dequeue_entities()? > > > > thanks > > -john ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities() 2026-05-15 1:51 ` Qais Yousef @ 2026-05-15 15:23 ` Vincent Guittot 0 siblings, 0 replies; 6+ messages in thread From: Vincent Guittot @ 2026-05-15 15:23 UTC (permalink / raw) To: Qais Yousef Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann, Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm On Fri, 15 May 2026 at 03:51, Qais Yousef <qyousef@layalina.io> wrote: > > On 05/13/26 14:46, Vincent Guittot wrote: > > Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit : > > > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > update_util_est() reads task_util() at dequeue which is updated in > > > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure > > > > to do the read after load_avg is updated in dequeue_entities(). > > > > > > > > util_est for a periodic task before > > > > > > > > periodic-3114 util_est.enqueued running > > > > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐ > > > > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │ > > > > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │ > > > > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │ > > > > │ ▛ │ > > > > 51┤ ▐▘ │ > > > > 7┤ ▖▗▗ ▗▄▐ │ > > > > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘ > > > > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87 > > > > > > > > and after > > > > > > > > periodic-2977 util_est.enqueued running > > > > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ > > > > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │ > > > > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │ > > > > 82.0┤ ▟ │ > > > > │ ▌ │ > > > > 44.5┤ ▌ │ > > > > 7.0┤ ▗ ▗▖ ▌ │ > > > > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘ > > > > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86 > > > > > > > > Note how the signal is noisier and can peak to 183 vs 157 now. > > > > > > > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race") > > > > Signed-off-by: Qais Yousef <qyousef@layalina.io> > > > > --- > > > > > > > > This is split from [1] series where I stumbled upon this problem. AFAICS it > > > > needs backporting all the way to 6.12 LTS. 
> > > > > > > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/ > > > > > > > > kernel/sched/fair.c | 5 ++++- > > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index 728965851842..96ba97e5f4ae 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) > > > > */ > > > > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > > > { > > > > + int ret; > > > > + > > > > if (task_is_throttled(p)) { > > > > dequeue_throttled_task(p, flags); > > > > return true; > > > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > > > if (!p->se.sched_delayed) > > > > util_est_dequeue(&rq->cfs, p); > > > > > > > > + ret = dequeue_entities(rq, &p->se, flags); > > > > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); > > > > - if (dequeue_entities(rq, &p->se, flags) < 0) > > > > > > Hrm, so the Sashiko tool raised a reasonable concern on the earlier > > > version of this: > > > https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12 > > > > Even without sashiko, this is what the comment says just below in the function ;-) > > Yeah I read it but it didn't really register properly :-) > > > > > Below is a different way to fix it which also covers cases when we have both > > DEQUEUE_SLEEP and DEQUEUE_DELAYED > > Works for me. It looks even more stable now > > util_avg > > periodic-2737 util_avg running > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐ > 140┤ ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄ │ > 105┤ ▄██████████████████████████████████████████████████████████████████ │ > 70┤ ▗▛▘ │ > │ ▐▘ │ > 35┤ █ │ > 0┤ ▖ ▗▄▌ │ > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘ > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85 > > and util_eset > > periodic-2737 util_est.enqueued running > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ > 140.0┤ ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘ │ > 106.8┤ ▝▘ │ > 73.5┤ ▗▘ │ > │ ▗ │ > 40.2┤ ▖ │ > 7.0┤ ▖ ▄▗▖ │ > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘ > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85 > > Do you want to send a proper patch? Feel free to stick my > reviewed-and-tested-by. Yes I'm going to send a proper patch with reported, reviewed and tested by You Thanks > > Thanks! 
> > > > > --- > > kernel/sched/fair.c | 189 ++++++++++++++++++++++---------------------- > > 1 file changed, 93 insertions(+), 96 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index c83fbe4e88c1..0976adc12594 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > trace_pelt_cfs_tp(cfs_rq); > > } > > > > +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100) > > + > > +static inline void util_est_update(struct sched_entity *se) > > +{ > > + unsigned int ewma, dequeued, last_ewma_diff; > > + > > + if (!sched_feat(UTIL_EST)) > > + return; > > + > > + /* Get current estimate of utilization */ > > + ewma = READ_ONCE(se->avg.util_est); > > + > > + /* > > + * If the PELT values haven't changed since enqueue time, > > + * skip the util_est update. > > + */ > > + if (ewma & UTIL_AVG_UNCHANGED) > > + return; > > + > > + /* Get utilization at dequeue */ > > + dequeued = READ_ONCE(se->avg.util_avg); > > + > > + /* > > + * Reset EWMA on utilization increases, the moving average is used only > > + * to smooth utilization decreases. > > + */ > > + if (ewma <= dequeued) { > > + ewma = dequeued; > > + goto done; > > + } > > + > > + /* > > + * Skip update of task's estimated utilization when its members are > > + * already ~1% close to its last activation value. > > + */ > > + last_ewma_diff = ewma - dequeued; > > + if (last_ewma_diff < UTIL_EST_MARGIN) > > + goto done; > > + > > + /* > > + * To avoid underestimate of task utilization, skip updates of EWMA if > > + * we cannot grant that thread got all CPU time it wanted. > > + */ > > + if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg)) > > + goto done; > > + > > + > > + /* > > + * Update Task's estimated utilization > > + * > > + * When *p completes an activation we can consolidate another sample > > + * of the task size. 
This is done by using this value to update the > > + * Exponential Weighted Moving Average (EWMA): > > + * > > + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1) > > + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1) > > + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1) > > + * = w * ( -last_ewma_diff ) + ewma(t-1) > > + * = w * (-last_ewma_diff + ewma(t-1) / w) > > + * > > + * Where 'w' is the weight of new samples, which is configured to be > > + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT) > > + */ > > + ewma <<= UTIL_EST_WEIGHT_SHIFT; > > + ewma -= last_ewma_diff; > > + ewma >>= UTIL_EST_WEIGHT_SHIFT; > > +done: > > + ewma |= UTIL_AVG_UNCHANGED; > > + WRITE_ONCE(se->avg.util_est, ewma); > > + > > + trace_sched_util_est_se_tp(se); > > +} > > + > > /* > > * Optional action to be done while updating the load average > > */ > > -#define UPDATE_TG 0x1 > > -#define SKIP_AGE_LOAD 0x2 > > -#define DO_ATTACH 0x4 > > -#define DO_DETACH 0x8 > > +#define UPDATE_TG 0x01 > > +#define SKIP_AGE_LOAD 0x02 > > +#define DO_ATTACH 0x04 > > +#define DO_DETACH 0x08 > > +#define UPDATE_UTIL_EST 0x10 > > > > /* Update task and its cfs_rq load average */ > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > > @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > if (flags & UPDATE_TG) > > update_tg_load_avg(cfs_rq); > > } > > + > > + if (flags & UPDATE_UTIL_EST) > > + util_est_update(se); > > } > > > > /* > > @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p) > > return READ_ONCE(p->se.avg.util_avg); > > } > > > > -static inline unsigned long task_runnable(struct task_struct *p) > > -{ > > - return READ_ONCE(p->se.avg.runnable_avg); > > -} > > - > > static inline unsigned long _task_util_est(struct task_struct *p) > > { > > return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED; > > @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq, > > trace_sched_util_est_cfs_tp(cfs_rq); > > } > > > > -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100) > > - > > -static inline void util_est_update(struct cfs_rq *cfs_rq, > > - struct task_struct *p, > > - bool task_sleep) > > -{ > > - unsigned int ewma, dequeued, last_ewma_diff; > > - > > - if (!sched_feat(UTIL_EST)) > > - return; > > - > > - /* > > - * Skip update of task's estimated utilization when the task has not > > - * yet completed an activation, e.g. being migrated. > > - */ > > - if (!task_sleep) > > - return; > > - > > - /* Get current estimate of utilization */ > > - ewma = READ_ONCE(p->se.avg.util_est); > > - > > - /* > > - * If the PELT values haven't changed since enqueue time, > > - * skip the util_est update. > > - */ > > - if (ewma & UTIL_AVG_UNCHANGED) > > - return; > > - > > - /* Get utilization at dequeue */ > > - dequeued = task_util(p); > > - > > - /* > > - * Reset EWMA on utilization increases, the moving average is used only > > - * to smooth utilization decreases. > > - */ > > - if (ewma <= dequeued) { > > - ewma = dequeued; > > - goto done; > > - } > > - > > - /* > > - * Skip update of task's estimated utilization when its members are > > - * already ~1% close to its last activation value. > > - */ > > - last_ewma_diff = ewma - dequeued; > > - if (last_ewma_diff < UTIL_EST_MARGIN) > > - goto done; > > - > > - /* > > - * To avoid underestimate of task utilization, skip updates of EWMA if > > - * we cannot grant that thread got all CPU time it wanted. 
> > - */ > > - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p)) > > - goto done; > > - > > - > > - /* > > - * Update Task's estimated utilization > > - * > > - * When *p completes an activation we can consolidate another sample > > - * of the task size. This is done by using this value to update the > > - * Exponential Weighted Moving Average (EWMA): > > - * > > - * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1) > > - * = w * task_util(p) + ewma(t-1) - w * ewma(t-1) > > - * = w * (task_util(p) - ewma(t-1)) + ewma(t-1) > > - * = w * ( -last_ewma_diff ) + ewma(t-1) > > - * = w * (-last_ewma_diff + ewma(t-1) / w) > > - * > > - * Where 'w' is the weight of new samples, which is configured to be > > - * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT) > > - */ > > - ewma <<= UTIL_EST_WEIGHT_SHIFT; > > - ewma -= last_ewma_diff; > > - ewma >>= UTIL_EST_WEIGHT_SHIFT; > > -done: > > - ewma |= UTIL_AVG_UNCHANGED; > > - WRITE_ONCE(p->se.avg.util_est, ewma); > > - > > - trace_sched_util_est_se_tp(&p->se); > > -} > > - > > static inline unsigned long get_actual_cpu_capacity(int cpu) > > { > > unsigned long capacity = arch_scale_cpu_capacity(cpu); > > @@ -5828,7 +5818,7 @@ static bool > > dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > > { > > bool sleep = flags & DEQUEUE_SLEEP; > > - int action = UPDATE_TG; > > + int action = 0; > > > > update_curr(cfs_rq); > > clear_buddies(cfs_rq, se); > > @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > > > > if (sched_feat(DELAY_DEQUEUE) && delay && > > !entity_eligible(cfs_rq, se)) { > > - update_load_avg(cfs_rq, se, 0); > > + if (entity_is_task(se)) > > + action |= UPDATE_UTIL_EST; > > + update_load_avg(cfs_rq, se, action); > > update_entity_lag(cfs_rq, se); > > set_delayed(se); > > return false; > > } > > } > > > > - if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) > > - action |= DO_DETACH; > > + action = UPDATE_TG; > > + if (entity_is_task(se)) { > > + if (task_on_rq_migrating(task_of(se))) > > + action |= DO_DETACH; > > + > > + if (sleep && !(flags & DEQUEUE_DELAYED)) > > + action |= UPDATE_UTIL_EST; > > + } > > > > /* > > * When dequeuing a sched_entity, we must: > > @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > if (!p->se.sched_delayed) > > util_est_dequeue(&rq->cfs, p); > > > > - util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); > > if (dequeue_entities(rq, &p->se, flags) < 0) > > return false; > > > > -- > > 2.43.0 > > > > > > > > > > > > > > Specifically, we shouldn't reference p after dequeue_entities() or we > > > risk racing with it being woken up, running, and maybe exiting on > > > another cpu. > > > And this moves the util_est_updat() call to after dequeue finishes. > > > > > > Maybe there's some way to have util_est_update() compensate for the > > > unfinished accounting that will be done in dequeue_entities()? > > > > > > thanks > > > -john ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities() 2026-05-12 12:46 [PATCH] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef 2026-05-12 20:52 ` John Stultz @ 2026-05-15 18:35 ` Tim Chen 1 sibling, 0 replies; 6+ messages in thread From: Tim Chen @ 2026-05-15 18:35 UTC (permalink / raw) To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot Cc: Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm On Tue, 2026-05-12 at 13:46 +0100, Qais Yousef wrote: > update_util_est() reads task_util() at dequeue which is updated in > dequeue_entities(). To read the accurate util_avg at dequeue, make sure > to do the read after load_avg is updated in dequeue_entities(). > > util_est for a periodic task before > > periodic-3114 util_est.enqueued running > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐ > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │ > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │ > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │ > │ ▛ │ > 51┤ ▐▘ │ > 7┤ ▖▗▗ ▗▄▐ │ > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘ > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87 > > and after > > periodic-2977 util_est.enqueued running > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │ > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │ > 82.0┤ ▟ │ > │ ▌ │ > 44.5┤ ▌ │ > 7.0┤ ▗ ▗▖ ▌ │ > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘ > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86 > > Note how the signal is noisier and can peak to 183 vs 157 now. > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race") > Signed-off-by: Qais Yousef <qyousef@layalina.io> > --- > > This is split from [1] series where I stumbled upon this problem. AFAICS it > needs backporting all the way to 6.12 LTS. > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/ > > kernel/sched/fair.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 728965851842..96ba97e5f4ae 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) > */ > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > { > + int ret; > + > if (task_is_throttled(p)) { > dequeue_throttled_task(p, flags); > return true; > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > if (!p->se.sched_delayed) > util_est_dequeue(&rq->cfs, p); > > + ret = dequeue_entities(rq, &p->se, flags); > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); I thought that util_est_update() was called intentionally before dequeue_entities to update the utilization of task p up to this time right before the dequeue. Then dequeue_entities() is called later with up to date task utilization estimate of p. Perhaps util_est_update() should be moved before util_est_dequeue() so the updated utilization of p is subtracted from the rq utilization. 
@@ -8002,10 +8002,10 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		return true;
 	}
 
+	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);
 
-	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
 	if (dequeue_entities(rq, &p->se, flags) < 0)
 		return false;
 
Tim

> -	if (dequeue_entities(rq, &p->se, flags) < 0)
> +	if (ret < 0)
>  		return false;
> 
>  	/*

^ permalink raw reply	[flat|nested] 6+ messages in thread