Linux Power Management development
From: Qais Yousef <qyousef@layalina.io>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: John Stultz <jstultz@google.com>, Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	"Chen, Yu C" <yu.c.chen@intel.com>,
	Thomas Gleixner <tglx@kernel.org>,
	linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Subject: Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
Date: Fri, 15 May 2026 02:51:05 +0100
Message-ID: <20260515015105.2dmiy7wqpzdg25wt@airbuntu>
In-Reply-To: <agRyoe1wHyZ-vMk9@vingu-cube>

On 05/13/26 14:46, Vincent Guittot wrote:
> On Tuesday, May 12, 2026 at 13:52:09 (-0700), John Stultz wrote:
> > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > update_util_est() reads task_util() at dequeue which is updated in
> > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> > > to do the read after load_avg is updated in dequeue_entities().
> > >
> > > util_est for a periodic task before
> > >
> > >                                 periodic-3114 util_est.enqueued running
> > >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> > >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> > >    │              ▛                                                                                │
> > >  51┤             ▐▘                                                                                │
> > >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> > >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> > >
> > > and after
> > >
> > >                                  periodic-2977 util_est.enqueued running
> > >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> > >  82.0┤             ▟                                                                               │
> > >      │             ▌                                                                               │
> > >  44.5┤             ▌                                                                               │
> > >   7.0┤      ▗   ▗▖ ▌                                                                               │
> > >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> > >
> > > Note how the signal was noisier before and could peak at 183, vs 157 now.
> > >
> > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > ---
> > >
> > > This is split out from the series in [1], where I stumbled upon this
> > > problem. AFAICS it needs backporting all the way to the 6.12 LTS.
> > >
> > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > >
> > >  kernel/sched/fair.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 728965851842..96ba97e5f4ae 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > >   */
> > >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  {
> > > +       int ret;
> > > +
> > >         if (task_is_throttled(p)) {
> > >                 dequeue_throttled_task(p, flags);
> > >                 return true;
> > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >         if (!p->se.sched_delayed)
> > >                 util_est_dequeue(&rq->cfs, p);
> > >
> > > +       ret = dequeue_entities(rq, &p->se, flags);
> > >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> > 
> > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > version of this:
> >   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> 
> Even without Sashiko, this is what the comment says just below in the function ;-)

Yeah, I read it, but it didn't really register properly :-)
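
For the record, the hazard with my ordering is roughly the following
(a simplified interleaving sketch, not the actual code paths):

	CPU0 (dequeue)				CPU1 (wakeup)
	--------------				-------------
	dequeue_entities(rq, &p->se, flags);
	  /* p is now visible as dequeued */
						try_to_wake_up(p, ...);
						/* p runs, may exit; its
						   task_struct can be freed */
	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
	  /* dereferences p after it may be gone */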

> 
> Below is a different way to fix it, which also covers the case where we have
> both DEQUEUE_SLEEP and DEQUEUE_DELAYED:

Works for me. It looks even more stable now:

util_avg

                                    periodic-2737 util_avg running
   ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
140┤               ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄              │
105┤              ▄██████████████████████████████████████████████████████████████████              │
 70┤             ▗▛▘                                                                               │
   │             ▐▘                                                                                │
 35┤             █                                                                                 │
  0┤       ▖   ▗▄▌                                                                                 │
   └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
  0.00      0.65       1.30      1.95       2.60      3.25       3.90      4.55       5.20     5.85

and util_est

                                 periodic-2737 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
140.0┤               ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘             │
106.8┤              ▝▘                                                                             │
 73.5┤             ▗▘                                                                              │
     │             ▗                                                                               │
 40.2┤             ▖                                                                               │
  7.0┤       ▖   ▄▗▖                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.55      5.20     5.85
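
The flat util_est line makes sense too: sampling util_avg after the PELT
update means successive dequeue samples of a periodic task land within
~1% of the EWMA, so the UTIL_EST_MARGIN check skips the update. Rough
arithmetic with made-up numbers, following the formula in your patch
(UTIL_EST_WEIGHT_SHIFT = 2, so w = 1/4):

	UTIL_EST_MARGIN = SCHED_CAPACITY_SCALE / 100 = 1024 / 100 = 10

	/* steady state: sample close to the EWMA -> update skipped */
	ewma = 140, dequeued = 135
	last_ewma_diff = 140 - 135 = 5 < 10	-> ewma stays 140

	/* utilization drop: EWMA decays by a quarter of the diff */
	ewma = 160, dequeued = 120, last_ewma_diff = 40
	ewma = ((160 << 2) - 40) >> 2 = 600 >> 2 = 150
	     = 3/4 * 160 + 1/4 * 120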

Do you want to send a proper patch? Feel free to add my
Reviewed-and-tested-by.

Thanks!

> 
> ---
>  kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
>  1 file changed, 93 insertions(+), 96 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c83fbe4e88c1..0976adc12594 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	trace_pelt_cfs_tp(cfs_rq);
>  }
> 
> +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> +
> +static inline void util_est_update(struct sched_entity *se)
> +{
> +	unsigned int ewma, dequeued, last_ewma_diff;
> +
> +	if (!sched_feat(UTIL_EST))
> +		return;
> +
> +	/* Get current estimate of utilization */
> +	ewma = READ_ONCE(se->avg.util_est);
> +
> +	/*
> +	 * If the PELT values haven't changed since enqueue time,
> +	 * skip the util_est update.
> +	 */
> +	if (ewma & UTIL_AVG_UNCHANGED)
> +		return;
> +
> +	/* Get utilization at dequeue */
> +	dequeued = READ_ONCE(se->avg.util_avg);
> +
> +	/*
> +	 * Reset EWMA on utilization increases, the moving average is used only
> +	 * to smooth utilization decreases.
> +	 */
> +	if (ewma <= dequeued) {
> +		ewma = dequeued;
> +		goto done;
> +	}
> +
> +	/*
> +	 * Skip update of task's estimated utilization when its members are
> +	 * already ~1% close to its last activation value.
> +	 */
> +	last_ewma_diff = ewma - dequeued;
> +	if (last_ewma_diff < UTIL_EST_MARGIN)
> +		goto done;
> +
> +	/*
> +	 * To avoid underestimate of task utilization, skip updates of EWMA if
> +	 * we cannot grant that thread got all CPU time it wanted.
> +	 */
> +	if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> +		goto done;
> +
> +
> +	/*
> +	 * Update Task's estimated utilization
> +	 *
> +	 * When *p completes an activation we can consolidate another sample
> +	 * of the task size. This is done by using this value to update the
> +	 * Exponential Weighted Moving Average (EWMA):
> +	 *
> +	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> +	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> +	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> +	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> +	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> +	 *
> +	 * Where 'w' is the weight of new samples, which is configured to be
> +	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> +	 */
> +	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> +	ewma  -= last_ewma_diff;
> +	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> +done:
> +	ewma |= UTIL_AVG_UNCHANGED;
> +	WRITE_ONCE(se->avg.util_est, ewma);
> +
> +	trace_sched_util_est_se_tp(se);
> +}
> +
>  /*
>   * Optional action to be done while updating the load average
>   */
> -#define UPDATE_TG	0x1
> -#define SKIP_AGE_LOAD	0x2
> -#define DO_ATTACH	0x4
> -#define DO_DETACH	0x8
> +#define UPDATE_TG	0x01
> +#define SKIP_AGE_LOAD	0x02
> +#define DO_ATTACH	0x04
> +#define DO_DETACH	0x08
> +#define UPDATE_UTIL_EST	0x10
> 
>  /* Update task and its cfs_rq load average */
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  		if (flags & UPDATE_TG)
>  			update_tg_load_avg(cfs_rq);
>  	}
> +
> +	if (flags & UPDATE_UTIL_EST)
> +		util_est_update(se);
>  }
> 
>  /*
> @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
>  	return READ_ONCE(p->se.avg.util_avg);
>  }
> 
> -static inline unsigned long task_runnable(struct task_struct *p)
> -{
> -	return READ_ONCE(p->se.avg.runnable_avg);
> -}
> -
>  static inline unsigned long _task_util_est(struct task_struct *p)
>  {
>  	return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
>  	trace_sched_util_est_cfs_tp(cfs_rq);
>  }
> 
> -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> -
> -static inline void util_est_update(struct cfs_rq *cfs_rq,
> -				   struct task_struct *p,
> -				   bool task_sleep)
> -{
> -	unsigned int ewma, dequeued, last_ewma_diff;
> -
> -	if (!sched_feat(UTIL_EST))
> -		return;
> -
> -	/*
> -	 * Skip update of task's estimated utilization when the task has not
> -	 * yet completed an activation, e.g. being migrated.
> -	 */
> -	if (!task_sleep)
> -		return;
> -
> -	/* Get current estimate of utilization */
> -	ewma = READ_ONCE(p->se.avg.util_est);
> -
> -	/*
> -	 * If the PELT values haven't changed since enqueue time,
> -	 * skip the util_est update.
> -	 */
> -	if (ewma & UTIL_AVG_UNCHANGED)
> -		return;
> -
> -	/* Get utilization at dequeue */
> -	dequeued = task_util(p);
> -
> -	/*
> -	 * Reset EWMA on utilization increases, the moving average is used only
> -	 * to smooth utilization decreases.
> -	 */
> -	if (ewma <= dequeued) {
> -		ewma = dequeued;
> -		goto done;
> -	}
> -
> -	/*
> -	 * Skip update of task's estimated utilization when its members are
> -	 * already ~1% close to its last activation value.
> -	 */
> -	last_ewma_diff = ewma - dequeued;
> -	if (last_ewma_diff < UTIL_EST_MARGIN)
> -		goto done;
> -
> -	/*
> -	 * To avoid underestimate of task utilization, skip updates of EWMA if
> -	 * we cannot grant that thread got all CPU time it wanted.
> -	 */
> -	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> -		goto done;
> -
> -
> -	/*
> -	 * Update Task's estimated utilization
> -	 *
> -	 * When *p completes an activation we can consolidate another sample
> -	 * of the task size. This is done by using this value to update the
> -	 * Exponential Weighted Moving Average (EWMA):
> -	 *
> -	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> -	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> -	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> -	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> -	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> -	 *
> -	 * Where 'w' is the weight of new samples, which is configured to be
> -	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> -	 */
> -	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> -	ewma  -= last_ewma_diff;
> -	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> -done:
> -	ewma |= UTIL_AVG_UNCHANGED;
> -	WRITE_ONCE(p->se.avg.util_est, ewma);
> -
> -	trace_sched_util_est_se_tp(&p->se);
> -}
> -
>  static inline unsigned long get_actual_cpu_capacity(int cpu)
>  {
>  	unsigned long capacity = arch_scale_cpu_capacity(cpu);
> @@ -5828,7 +5818,7 @@ static bool
>  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	bool sleep = flags & DEQUEUE_SLEEP;
> -	int action = UPDATE_TG;
> +	int action = 0;
> 
>  	update_curr(cfs_rq);
>  	clear_buddies(cfs_rq, se);
> @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 
>  		if (sched_feat(DELAY_DEQUEUE) && delay &&
>  		    !entity_eligible(cfs_rq, se)) {
> -			update_load_avg(cfs_rq, se, 0);
> +			if (entity_is_task(se))
> +				action |= UPDATE_UTIL_EST;
> +			update_load_avg(cfs_rq, se, action);
>  			update_entity_lag(cfs_rq, se);
>  			set_delayed(se);
>  			return false;
>  		}
>  	}
> 
> -	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> -		action |= DO_DETACH;
> +	action = UPDATE_TG;
> +	if (entity_is_task(se)) {
> +		if (task_on_rq_migrating(task_of(se)))
> +			action |= DO_DETACH;
> +
> +		if (sleep && !(flags & DEQUEUE_DELAYED))
> +			action |= UPDATE_UTIL_EST;
> +	}
> 
>  	/*
>  	 * When dequeuing a sched_entity, we must:
> @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	if (!p->se.sched_delayed)
>  		util_est_dequeue(&rq->cfs, p);
> 
> -	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
>  	if (dequeue_entities(rq, &p->se, flags) < 0)
>  		return false;
> 
> --
> 2.43.0
> 
> > 
> > Specifically, we shouldn't reference p after dequeue_entities() or we
> > risk racing with it being woken up, running, and maybe exiting on
> > another CPU.
> > And this moves the util_est_update() call to after dequeue finishes.
> > 
> > Maybe there's some way to have util_est_update() compensate for the
> > unfinished accounting that will be done in dequeue_entities()?
> > 
> > thanks
> > -john

Thread overview (6 messages):
2026-05-12 12:46 [PATCH] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef
2026-05-12 20:52 ` John Stultz
2026-05-13 12:46   ` Vincent Guittot
2026-05-15  1:51     ` Qais Yousef [this message]
2026-05-15 15:23       ` Vincent Guittot
2026-05-15 18:35 ` Tim Chen
