From: Qais Yousef <qyousef@layalina.io>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: John Stultz <jstultz@google.com>, Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	"Chen, Yu C" <yu.c.chen@intel.com>,
	Thomas Gleixner <tglx@kernel.org>,
	linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Subject: Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()
Date: Fri, 15 May 2026 02:51:05 +0100	[thread overview]
Message-ID: <20260515015105.2dmiy7wqpzdg25wt@airbuntu> (raw)
In-Reply-To: <agRyoe1wHyZ-vMk9@vingu-cube>

On 05/13/26 14:46, Vincent Guittot wrote:
> Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > update_util_est() reads task_util() at dequeue, but that value is only
> > > updated in dequeue_entities(). To read an accurate util_avg at dequeue,
> > > make sure the read happens after load_avg has been updated in
> > > dequeue_entities().
> > >
> > > util_est for a periodic task before
> > >
> > >                                 periodic-3114 util_est.enqueued running
> > >    ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
> > > 139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
> > >  95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
> > >    │              ▛                                                                                │
> > >  51┤             ▐▘                                                                                │
> > >   7┤      ▖▗▗  ▗▄▐                                                                                 │
> > >    └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > >   0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87
> > >
> > > and after
> > >
> > >                                  periodic-2977 util_est.enqueued running
> > >      ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > 157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
> > > 119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
> > >  82.0┤             ▟                                                                               │
> > >      │             ▌                                                                               │
> > >  44.5┤             ▌                                                                               │
> > >   7.0┤      ▗   ▗▖ ▌                                                                               │
> > >      └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > >     0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86
> > >
> > > Note how the signal was noisier before and could peak at 183, vs 157 now.
> > >
> > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > > ---
> > >
> > > This is split out of the series at [1], where I stumbled upon this problem.
> > > AFAICS it needs backporting all the way to 6.12 LTS.
> > >
> > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@layalina.io/
> > >
> > >  kernel/sched/fair.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 728965851842..96ba97e5f4ae 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > >   */
> > >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  {
> > > +       int ret;
> > > +
> > >         if (task_is_throttled(p)) {
> > >                 dequeue_throttled_task(p, flags);
> > >                 return true;
> > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >         if (!p->se.sched_delayed)
> > >                 util_est_dequeue(&rq->cfs, p);
> > >
> > > +       ret = dequeue_entities(rq, &p->se, flags);
> > >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> > 
> > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > version of this:
> >   https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> 
> Even without sashiko, this is what the comment says just below in the function ;-)

Yeah I read it but it didn't really register properly :-)

> 
> Below is a different way to fix it which also covers cases when we have both
> DEQUEUE_SLEEP and DEQUEUE_DELAYED

Works for me. It looks even more stable now:

util_avg

                                    periodic-2737 util_avg running
   ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
140┤               ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄              │
105┤              ▄██████████████████████████████████████████████████████████████████              │
 70┤             ▗▛▘                                                                               │
   │             ▐▘                                                                                │
 35┤             █                                                                                 │
  0┤       ▖   ▗▄▌                                                                                 │
   └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
  0.00      0.65       1.30      1.95       2.60      3.25       3.90      4.55       5.20     5.85

and util_est

                                 periodic-2737 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
140.0┤               ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘             │
106.8┤              ▝▘                                                                             │
 73.5┤             ▗▘                                                                              │
     │             ▗                                                                               │
 40.2┤             ▖                                                                               │
  7.0┤       ▖   ▄▗▖                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.55      5.20     5.85

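FWIW the smoothing math itself is identical in both versions; the difference
is only that the 'dequeued' sample is now read after util_avg has been
updated for the last activation, so the EWMA is no longer fed a stale value.
For anyone following along, here is a user-space restatement of the
fixed-point EWMA step (a sketch: it assumes UTIL_EST_WEIGHT_SHIFT == 2,
i.e. w = 1/4, and leaves out the UTIL_AVG_UNCHANGED and runnable-margin
checks):

/*
 * Sketch: the EWMA step from util_est_update(), lifted to user space.
 * Assumes UTIL_EST_WEIGHT_SHIFT == 2 (w = 1/4); UTIL_AVG_UNCHANGED and
 * the runnable-margin check are left out for brevity.
 */
#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT  2

static unsigned int ewma_step(unsigned int ewma, unsigned int dequeued)
{
        unsigned int last_ewma_diff;

        /* Utilization increases reset the estimate outright. */
        if (ewma <= dequeued)
                return dequeued;

        /* Decreases are smoothed: ewma(t) = (3 * ewma(t-1) + dequeued) / 4 */
        last_ewma_diff = ewma - dequeued;
        ewma <<= UTIL_EST_WEIGHT_SHIFT;
        ewma  -= last_ewma_diff;
        ewma >>= UTIL_EST_WEIGHT_SHIFT;

        return ewma;
}

int main(void)
{
        /* old estimate 157, fresh sample 140 -> (157 * 4 - 17) / 4 = 152 */
        printf("%u\n", ewma_step(157, 140));
        return 0;
}
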
Do you want to send a proper patch? Feel free to stick my
Reviewed-and-tested-by on it.
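In case it helps anyone reproduce the plots: they come from a simple periodic
run/sleep task. A minimal sketch of that kind of loop is below; the run and
sleep durations are illustrative, not the exact ones behind the traces.

/* Sketch of a periodic run/sleep workload; the periods are illustrative. */
#include <time.h>

static void busy_for_ns(long ns)
{
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                 (now.tv_nsec - start.tv_nsec) < ns);
}

int main(void)
{
        struct timespec sleep_ts = { .tv_sec = 0, .tv_nsec = 28 * 1000000L };

        for (;;) {
                busy_for_ns(4 * 1000000L);      /* run for ~4ms */
                nanosleep(&sleep_ts, NULL);     /* sleep ~28ms  */
        }
        return 0;
}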

Thanks!

> 
> ---
>  kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
>  1 file changed, 93 insertions(+), 96 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c83fbe4e88c1..0976adc12594 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	trace_pelt_cfs_tp(cfs_rq);
>  }
> 
> +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> +
> +static inline void util_est_update(struct sched_entity *se)
> +{
> +	unsigned int ewma, dequeued, last_ewma_diff;
> +
> +	if (!sched_feat(UTIL_EST))
> +		return;
> +
> +	/* Get current estimate of utilization */
> +	ewma = READ_ONCE(se->avg.util_est);
> +
> +	/*
> +	 * If the PELT values haven't changed since enqueue time,
> +	 * skip the util_est update.
> +	 */
> +	if (ewma & UTIL_AVG_UNCHANGED)
> +		return;
> +
> +	/* Get utilization at dequeue */
> +	dequeued = READ_ONCE(se->avg.util_avg);
> +
> +	/*
> +	 * Reset EWMA on utilization increases, the moving average is used only
> +	 * to smooth utilization decreases.
> +	 */
> +	if (ewma <= dequeued) {
> +		ewma = dequeued;
> +		goto done;
> +	}
> +
> +	/*
> +	 * Skip update of task's estimated utilization when its members are
> +	 * already ~1% close to its last activation value.
> +	 */
> +	last_ewma_diff = ewma - dequeued;
> +	if (last_ewma_diff < UTIL_EST_MARGIN)
> +		goto done;
> +
> +	/*
> +	 * To avoid underestimate of task utilization, skip updates of EWMA if
> +	 * we cannot grant that thread got all CPU time it wanted.
> +	 */
> +	if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> +		goto done;
> +
> +
> +	/*
> +	 * Update Task's estimated utilization
> +	 *
> +	 * When *p completes an activation we can consolidate another sample
> +	 * of the task size. This is done by using this value to update the
> +	 * Exponential Weighted Moving Average (EWMA):
> +	 *
> +	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> +	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> +	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> +	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> +	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> +	 *
> +	 * Where 'w' is the weight of new samples, which is configured to be
> +	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> +	 */
> +	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> +	ewma  -= last_ewma_diff;
> +	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> +done:
> +	ewma |= UTIL_AVG_UNCHANGED;
> +	WRITE_ONCE(se->avg.util_est, ewma);
> +
> +	trace_sched_util_est_se_tp(se);
> +}
> +
>  /*
>   * Optional action to be done while updating the load average
>   */
> -#define UPDATE_TG	0x1
> -#define SKIP_AGE_LOAD	0x2
> -#define DO_ATTACH	0x4
> -#define DO_DETACH	0x8
> +#define UPDATE_TG	0x01
> +#define SKIP_AGE_LOAD	0x02
> +#define DO_ATTACH	0x04
> +#define DO_DETACH	0x08
> +#define UPDATE_UTIL_EST	0x10
> 
>  /* Update task and its cfs_rq load average */
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  		if (flags & UPDATE_TG)
>  			update_tg_load_avg(cfs_rq);
>  	}
> +
> +	if (flags & UPDATE_UTIL_EST)
> +		util_est_update(se);
>  }
> 
>  /*
> @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
>  	return READ_ONCE(p->se.avg.util_avg);
>  }
> 
> -static inline unsigned long task_runnable(struct task_struct *p)
> -{
> -	return READ_ONCE(p->se.avg.runnable_avg);
> -}
> -
>  static inline unsigned long _task_util_est(struct task_struct *p)
>  {
>  	return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
>  	trace_sched_util_est_cfs_tp(cfs_rq);
>  }
> 
> -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> -
> -static inline void util_est_update(struct cfs_rq *cfs_rq,
> -				   struct task_struct *p,
> -				   bool task_sleep)
> -{
> -	unsigned int ewma, dequeued, last_ewma_diff;
> -
> -	if (!sched_feat(UTIL_EST))
> -		return;
> -
> -	/*
> -	 * Skip update of task's estimated utilization when the task has not
> -	 * yet completed an activation, e.g. being migrated.
> -	 */
> -	if (!task_sleep)
> -		return;
> -
> -	/* Get current estimate of utilization */
> -	ewma = READ_ONCE(p->se.avg.util_est);
> -
> -	/*
> -	 * If the PELT values haven't changed since enqueue time,
> -	 * skip the util_est update.
> -	 */
> -	if (ewma & UTIL_AVG_UNCHANGED)
> -		return;
> -
> -	/* Get utilization at dequeue */
> -	dequeued = task_util(p);
> -
> -	/*
> -	 * Reset EWMA on utilization increases, the moving average is used only
> -	 * to smooth utilization decreases.
> -	 */
> -	if (ewma <= dequeued) {
> -		ewma = dequeued;
> -		goto done;
> -	}
> -
> -	/*
> -	 * Skip update of task's estimated utilization when its members are
> -	 * already ~1% close to its last activation value.
> -	 */
> -	last_ewma_diff = ewma - dequeued;
> -	if (last_ewma_diff < UTIL_EST_MARGIN)
> -		goto done;
> -
> -	/*
> -	 * To avoid underestimate of task utilization, skip updates of EWMA if
> -	 * we cannot grant that thread got all CPU time it wanted.
> -	 */
> -	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> -		goto done;
> -
> -
> -	/*
> -	 * Update Task's estimated utilization
> -	 *
> -	 * When *p completes an activation we can consolidate another sample
> -	 * of the task size. This is done by using this value to update the
> -	 * Exponential Weighted Moving Average (EWMA):
> -	 *
> -	 *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> -	 *          = w *  task_util(p) +         ewma(t-1)  - w * ewma(t-1)
> -	 *          = w * (task_util(p) -         ewma(t-1)) +     ewma(t-1)
> -	 *          = w * (      -last_ewma_diff           ) +     ewma(t-1)
> -	 *          = w * (-last_ewma_diff +  ewma(t-1) / w)
> -	 *
> -	 * Where 'w' is the weight of new samples, which is configured to be
> -	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> -	 */
> -	ewma <<= UTIL_EST_WEIGHT_SHIFT;
> -	ewma  -= last_ewma_diff;
> -	ewma >>= UTIL_EST_WEIGHT_SHIFT;
> -done:
> -	ewma |= UTIL_AVG_UNCHANGED;
> -	WRITE_ONCE(p->se.avg.util_est, ewma);
> -
> -	trace_sched_util_est_se_tp(&p->se);
> -}
> -
>  static inline unsigned long get_actual_cpu_capacity(int cpu)
>  {
>  	unsigned long capacity = arch_scale_cpu_capacity(cpu);
> @@ -5828,7 +5818,7 @@ static bool
>  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	bool sleep = flags & DEQUEUE_SLEEP;
> -	int action = UPDATE_TG;
> +	int action = 0;
> 
>  	update_curr(cfs_rq);
>  	clear_buddies(cfs_rq, se);
> @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 
>  		if (sched_feat(DELAY_DEQUEUE) && delay &&
>  		    !entity_eligible(cfs_rq, se)) {
> -			update_load_avg(cfs_rq, se, 0);
> +			if (entity_is_task(se))
> +				action |= UPDATE_UTIL_EST;
> +			update_load_avg(cfs_rq, se, action);
>  			update_entity_lag(cfs_rq, se);
>  			set_delayed(se);
>  			return false;
>  		}
>  	}
> 
> -	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> -		action |= DO_DETACH;
> +	action = UPDATE_TG;
> +	if (entity_is_task(se)) {
> +		if (task_on_rq_migrating(task_of(se)))
> +			action |= DO_DETACH;
> +
> +		if (sleep && !(flags & DEQUEUE_DELAYED))
> +			action |= UPDATE_UTIL_EST;
> +	}
> 
>  	/*
>  	 * When dequeuing a sched_entity, we must:
> @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	if (!p->se.sched_delayed)
>  		util_est_dequeue(&rq->cfs, p);
> 
> -	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
>  	if (dequeue_entities(rq, &p->se, flags) < 0)
>  		return false;
> 
> --
> 2.43.0
> 
> 
> 
> 
> > 
> > Specifically, we shouldn't reference p after dequeue_entities() or we
> > risk racing with it being woken up, running, and maybe exiting on
> > another cpu.
> > And this moves the util_est_update() call to after dequeue finishes.
> > 
> > Maybe there's some way to have util_est_update() compensate for the
> > unfinished accounting that will be done in dequeue_entities()?
> > 
> > thanks
> > -john


Thread overview: 4+ messages
2026-05-12 12:46 [PATCH] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef
2026-05-12 20:52 ` John Stultz
2026-05-13 12:46   ` Vincent Guittot
2026-05-15  1:51     ` Qais Yousef [this message]
