All of lore.kernel.org
 help / color / mirror / Atom feed
From: Omar Sandoval <osandov@osandov.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>, Chris Mason <clm@meta.com>,
	Pat Cody <pat@patcody.io>,
	mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, linux-kernel@vger.kernel.org,
	patcody@meta.com, kernel-team@meta.com,
	Breno Leitao <leitao@debian.org>
Subject: Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
Date: Fri, 18 Apr 2025 16:49:08 -0700	[thread overview]
Message-ID: <aALk9DVfjTTHGdvA@telecaster> (raw)
In-Reply-To: <20250418154438.GH17910@noisy.programming.kicks-ass.net>

On Fri, Apr 18, 2025 at 05:44:38PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 10:19:42AM -0400, Rik van Riel wrote:
> 
> > The most common warning by far, hitting
> > about 90% of the time we hit anything
> > in avg_vruntime_validate is the
> > WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)
> > 
> > The most common code path to getting there,
> > covering about 85% of the cases:
> > 
> > avg_vruntime_validate
> > avg_vruntime_sub
> > __dequeue_entity
> > set_next_entity
> > pick_task_fair
> > pick_next_task_fair
> > __pick_next_task
> > pick_next_task
> > __schedule
> > schedule
> 
> (I'm assuming zero_vruntime patch here, the stock kernel has more
> problems vs min_vruntime getting stuck in the future etc..)
> 
> So I have a theory about this. Key is that you're running a PREEMPT-NONE
> kernel.
> 
> There is one important site the overflow patch does not cover:
> avg_vruntime_update().
> 
> However, due to PREEMPT_NONE, it is possible (Chris mentioned direct
> reclaim and OOM) to have very long (in-kernel) runtimes without
> scheduling.
> 
> (I suppose this should be visible by tracing sched_switch)
> 
> Were this to happen, then avg_vruntime_add() gets to deal with 'key *
> weight' for a relatively large key. But per the overflow checks there,
> this isn't hitting (much).
> 
> But then we try and update zero_vruntime, and that is doing 'delta *
> cfs_rq->avg_load', and avg_load is a larger number and might just cause
> overflow. We don't have a check there.
> 
> If that overflows then avg_vruntime is buggered, but none of the
> individual tree entities hit overflow, exactly as you observe.
> 
> I've modified the zero_vruntime patch a little (new one below) to update
> zero_vruntime in update_curr(). Additionally, I have every tick re-align
> it with the tree avg (the update and the tree can drift due to numerical
> funnies).
> 
> This should ensure these very long in-kernel runtimes are broken up in
> at most tick sized chunks, and by keeping zero_vruntime more or less
> around the actual 0-lag point, the key values are minimized.
> 
> I should probably go play with a kernel module that spins for a few
> seconds with preemption disabled, see if I can make it go BOOM :-) But
> I've not yet done so.
> 
> FWIW, you can add something like:
> 
> Index: linux-2.6/kernel/sched/fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/fair.c
> +++ linux-2.6/kernel/sched/fair.c
> @@ -652,6 +652,9 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
>  static inline
>  void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
>  {
> +	s64 tmp;
> +	WARN_ON_ONCE(__builtin_mul_overflow(cfs_rq->avg_load, delta, &tmp);
> +
>  	/*
>  	 * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load
>  	 */
> 
> 
> To the overflow patch, to see if this mult goes splat.
> 
> 
> New zero_vruntime patch here:

Hey, Peter, thanks for your help with this. I have a question/potential
bug below.

> ---
>  kernel/sched/debug.c |    8 ++--
>  kernel/sched/fair.c  |   92 +++++++--------------------------------------------
>  kernel/sched/sched.h |    2 -
>  3 files changed, 19 insertions(+), 83 deletions(-)
> 
> Index: linux-2.6/kernel/sched/debug.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/debug.c
> +++ linux-2.6/kernel/sched/debug.c
> @@ -807,7 +807,7 @@ static void print_rq(struct seq_file *m,
>  
>  void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  {
> -	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> +	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
>  	struct sched_entity *last, *first, *root;
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long flags;
> @@ -830,15 +830,15 @@ void print_cfs_rq(struct seq_file *m, in
>  	last = __pick_last_entity(cfs_rq);
>  	if (last)
>  		right_vruntime = last->vruntime;
> -	min_vruntime = cfs_rq->min_vruntime;
> +	zero_vruntime = cfs_rq->zero_vruntime;
>  	raw_spin_rq_unlock_irqrestore(rq, flags);
>  
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
>  			SPLIT_NS(left_deadline));
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
>  			SPLIT_NS(left_vruntime));
> -	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
> -			SPLIT_NS(min_vruntime));
> +	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
> +			SPLIT_NS(zero_vruntime));
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
>  			SPLIT_NS(avg_vruntime(cfs_rq)));
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
> Index: linux-2.6/kernel/sched/fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/fair.c
> +++ linux-2.6/kernel/sched/fair.c
> @@ -526,24 +526,6 @@ void account_cfs_rq_runtime(struct cfs_r
>   * Scheduling class tree data structure manipulation methods:
>   */
>  
> -static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime)
> -{
> -	s64 delta = (s64)(vruntime - max_vruntime);
> -	if (delta > 0)
> -		max_vruntime = vruntime;
> -
> -	return max_vruntime;
> -}
> -
> -static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime)
> -{
> -	s64 delta = (s64)(vruntime - min_vruntime);
> -	if (delta < 0)
> -		min_vruntime = vruntime;
> -
> -	return min_vruntime;
> -}
> -
>  static inline bool entity_before(const struct sched_entity *a,
>  				 const struct sched_entity *b)
>  {
> @@ -556,7 +538,7 @@ static inline bool entity_before(const s
>  
>  static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	return (s64)(se->vruntime - cfs_rq->min_vruntime);
> +	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
>  }
>  
>  #define __node_2_se(node) \
> @@ -608,13 +590,13 @@ static inline s64 entity_key(struct cfs_
>   *
>   * Which we track using:
>   *
> - *                    v0 := cfs_rq->min_vruntime
> + *                    v0 := cfs_rq->zero_vruntime
>   * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
>   *              \Sum w_i := cfs_rq->avg_load
>   *
> - * Since min_vruntime is a monotonic increasing variable that closely tracks
> - * the per-task service, these deltas: (v_i - v), will be in the order of the
> - * maximal (virtual) lag induced in the system due to quantisation.
> + * Since zero_vruntime closely tracks the per-task service, these
> + * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
> + * induced in the system due to quantisation.
>   *
>   * Also, we use scale_load_down() to reduce the size.
>   *
> @@ -673,7 +655,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>  		avg = div_s64(avg, load);
>  	}
>  
> -	return cfs_rq->min_vruntime + avg;
> +	return cfs_rq->zero_vruntime + avg;
>  }
>  
>  /*
> @@ -734,7 +716,7 @@ static int vruntime_eligible(struct cfs_
>  		load += weight;
>  	}
>  
> -	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
> +	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
>  }
>  
>  int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -742,42 +724,21 @@ int entity_eligible(struct cfs_rq *cfs_r
>  	return vruntime_eligible(cfs_rq, se->vruntime);
>  }
>  
> -static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
> +static void __update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
>  {
> -	u64 min_vruntime = cfs_rq->min_vruntime;
> -	/*
> -	 * open coded max_vruntime() to allow updating avg_vruntime
> -	 */
> -	s64 delta = (s64)(vruntime - min_vruntime);
> -	if (delta > 0) {
> -		avg_vruntime_update(cfs_rq, delta);
> -		min_vruntime = vruntime;
> -	}
> -	return min_vruntime;
> +	if (!delta)
> +		return;
> +
> +	avg_vruntime_update(cfs_rq, delta);
> +	cfs_rq->zero_vruntime += delta;
>  }
>  
> -static void update_min_vruntime(struct cfs_rq *cfs_rq)
> +static void update_zero_vruntime(struct cfs_rq *cfs_rq)
>  {
> -	struct sched_entity *se = __pick_root_entity(cfs_rq);
> -	struct sched_entity *curr = cfs_rq->curr;
> -	u64 vruntime = cfs_rq->min_vruntime;
> +	u64 vruntime = avg_vruntime(cfs_rq);
> +	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
>  
> -	if (curr) {
> -		if (curr->on_rq)
> -			vruntime = curr->vruntime;
> -		else
> -			curr = NULL;
> -	}
> -
> -	if (se) {
> -		if (!curr)
> -			vruntime = se->min_vruntime;
> -		else
> -			vruntime = min_vruntime(vruntime, se->min_vruntime);
> -	}
> -
> -	/* ensure we never gain time by being placed backwards. */
> -	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
> +	__update_zero_vruntime(cfs_rq, delta);
>  }
>  
>  static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> @@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
>  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	avg_vruntime_add(cfs_rq, se);
> +	update_zero_vruntime(cfs_rq);

Won't this double-count cfs_rq->curr in the avg_vruntime() calculation
in update_zero_vruntime()? When cfs_rq->curr->on_rq, put_prev_entity()
calls this with se == cfs_rq->curr _before_ setting cfs_rq->curr to
NULL. So the avg_vruntime_add() call will add entity_key(cfs_rq->curr)
to cfs_rq->avg_vruntime and se_weight(cfs_rq->curr) to cfs_rq->avg_load.
Then update_zero_vruntime() calls avg_vruntime(), which still sees
curr->on_rq and will add curr's key and weight again. This throws
zero_vruntime off, maybe by enough to bust zero_vruntime and/or
avg_vruntime?

Should the call to update_zero_vruntime() go before avg_vruntime_add()?

Thanks,
Omar

  reply	other threads:[~2025-04-18 23:49 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-20 20:53 [PATCH] sched/fair: Add null pointer check to pick_next_entity() Pat Cody
2025-03-20 22:42 ` Christian Loehle
2025-03-21 17:52   ` Pat Cody
2025-03-24 11:56 ` Peter Zijlstra
2025-03-25 15:12   ` Pat Cody
2025-03-25 18:59     ` Peter Zijlstra
2025-03-26 19:26       ` Pat Cody
2025-04-02 14:59   ` Rik van Riel
2025-04-02 18:07     ` Peter Zijlstra
2025-04-09 14:29       ` Rik van Riel
2025-04-09 15:27         ` Peter Zijlstra
2025-04-11 14:51           ` Rik van Riel
2025-04-14  9:08             ` Peter Zijlstra
2025-04-14 15:38               ` Chris Mason
2025-04-15 10:07                 ` Peter Zijlstra
2025-04-16  7:59                   ` Peter Zijlstra
     [not found]   ` <7B2CFC16-1ADE-4565-B555-7525A50494C2@surriel.com>
     [not found]     ` <20250402082221.GT5880@noisy.programming.kicks-ass.net>
2025-04-14 19:57       ` Rik van Riel
2025-04-15  8:02         ` Peter Zijlstra
2025-04-16 12:44           ` Peter Zijlstra
2025-04-16 14:19             ` Rik van Riel
2025-04-16 15:27               ` Chris Mason
2025-04-18 15:44               ` Peter Zijlstra
2025-04-18 23:49                 ` Omar Sandoval [this message]
2025-04-22  0:06                   ` Omar Sandoval
2025-04-22 14:13                     ` Peter Zijlstra
2025-04-22 15:14                       ` Peter Zijlstra
2025-04-25  8:53                         ` Omar Sandoval
2025-04-22 13:36                   ` Peter Zijlstra
2025-04-19  2:53                 ` Mike Galbraith
     [not found] <20250320173438.3562449-2-patcody@meta.com>
2025-03-24 15:56 ` Steven Rostedt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aALk9DVfjTTHGdvA@telecaster \
    --to=osandov@osandov.com \
    --cc=bsegall@google.com \
    --cc=clm@meta.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-team@meta.com \
    --cc=leitao@debian.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=pat@patcody.io \
    --cc=patcody@meta.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.