public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Omar Sandoval <osandov@osandov.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>, Chris Mason <clm@meta.com>,
	Pat Cody <pat@patcody.io>,
	mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, linux-kernel@vger.kernel.org,
	patcody@meta.com, kernel-team@meta.com,
	Breno Leitao <leitao@debian.org>
Subject: Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
Date: Mon, 21 Apr 2025 17:06:45 -0700	[thread overview]
Message-ID: <aAbdlTISuaJnc5AG@telecaster> (raw)
In-Reply-To: <aALk9DVfjTTHGdvA@telecaster>

On Fri, Apr 18, 2025 at 04:49:08PM -0700, Omar Sandoval wrote:
> On Fri, Apr 18, 2025 at 05:44:38PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 16, 2025 at 10:19:42AM -0400, Rik van Riel wrote:
> > 
> > > The most common warning by far, hitting
> > > about 90% of the time we hit anything
> > > in avg_vruntime_validate is the
> > > WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)
> > > 
> > > The most common code path to getting there,
> > > covering about 85% of the cases:
> > > 
> > > avg_vruntime_validate
> > > avg_vruntime_sub
> > > __dequeue_entity
> > > set_next_entity
> > > pick_task_fair
> > > pick_next_task_fair
> > > __pick_next_task
> > > pick_next_task
> > > __schedule
> > > schedule
> > 
> > (I'm assuming zero_vruntime patch here, the stock kernel has more
> > problems vs min_vruntime getting stuck in the future etc..)
> > 
> > So I have a theory about this. Key is that you're running a PREEMPT-NONE
> > kernel.
> > 
> > There is one important site the overflow patch does not cover:
> > avg_vruntime_update().
> > 
> > However, due to PREEMPT_NONE, it is possible (Chris mentioned direct
> > reclaim and OOM) to have very long (in-kernel) runtimes without
> > scheduling.
> > 
> > (I suppose this should be visible by tracing sched_switch)
> > 
> > Were this to happen, then avg_vruntime_add() gets to deal with 'key *
> > weight' for a relatively large key. But per the overflow checks there,
> > this isn't hitting (much).
> > 
> > But then we try and update zero_vruntime, and that is doing 'delta *
> > cfs_rq->avg_load', and avg_load is a larger number and might just cause
> > overflow. We don't have a check there.
> > 
> > If that overflows then avg_vruntime is buggered, but none of the
> > individual tree entities hit overflow, exactly as you observe.
> > 
> > I've modified the zero_vruntime patch a little (new one below) to update
> > zero_vruntime in update_curr(). Additionally, I have every tick re-align
> > it with the tree avg (the update and the tree can drift due to numerical
> > funnies).
> > 
> > This should ensure these very long in-kernel runtimes are broken up in
> > at most tick sized chunks, and by keeping zero_vruntime more or less
> > around the actual 0-lag point, the key values are minimized.
> > 
> > I should probably go play with a kernel module that spins for a few
> > seconds with preemption disabled, see if I can make it go BOOM :-) But
> > I've not yet done so.

[snip]

> >  static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> > @@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
> >  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >  	avg_vruntime_add(cfs_rq, se);
> > +	update_zero_vruntime(cfs_rq);
> 
> Won't this double-count cfs_rq->curr in the avg_vruntime() calculation
> in update_zero_vruntime()? When cfs_rq->curr->on_rq, put_prev_entity()
> calls this with se == cfs_rq->curr _before_ setting cfs_rq->curr to
> NULL. So the avg_vruntime_add() call will add entity_key(cfs_rq->curr)
> to cfs_rq->avg_vruntime and se_weight(cfs_rq->curr) to cfs_rq->avg_load.
> Then update_zero_vruntime() calls avg_vruntime(), which still sees
> curr->on_rq and will add curr's key and weight again. This throws
> zero_vruntime off, maybe by enough to bust zero_vruntime and/or
> avg_vruntime?
> 
> Should the call to update_zero_vruntime() go before avg_vruntime_add()?
> 
> Thanks,
> Omar

Hey, Peter,

We haven't been able to test your latest patch, but I dug through some
core dumps from crashes with your initial zero_vruntime patch. It looks
like on just about all of them, the entity vruntimes are way too spread
out, so we would get overflows regardless of what we picked as
zero_vruntime.

As a representative example, we have a cfs_rq with 3 entities with the
follow vruntimes and (scaled down) weights:

vruntime           weight
39052385155836636  2      (curr)
43658311782076206  2
42824722322062111  4886

The difference between the minimum and maximum is 4605926626239570,
which is 53 bits. The total load is 4890. Even if you picked
zero_vruntime to be equidistant from the minimum and maximum, the
(vruntime - zero_vruntime) * load calculation in entity_eligible() is
doomed to overflow.

That range in vruntime seems too absurd to be due to only to running too
long without preemption. We're only seeing these crashes on internal
node cgroups (i.e., cgroups whose children are cgroups, not tasks). This
all leads me to suspect reweight_entity().

Specifically, this line in reweight_entity():

	se->vlag = div_s64(se->vlag * se->load.weight, weight);

seems like it could create a very large vlag, which could cause
place_entity() to adjust vruntime by a large value. (place_entity() has
a similarly suspect adjustment on se->vlag, only update_entity_lag()
clamps it).

I'll try to reproduce something like this, but do you have any thoughts
on this theory in the meantime?

Thanks,
Omar

  reply	other threads:[~2025-04-22  0:06 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-20 20:53 [PATCH] sched/fair: Add null pointer check to pick_next_entity() Pat Cody
2025-03-20 22:42 ` Christian Loehle
2025-03-21 17:52   ` Pat Cody
2025-03-24 11:56 ` Peter Zijlstra
2025-03-25 15:12   ` Pat Cody
2025-03-25 18:59     ` Peter Zijlstra
2025-03-26 19:26       ` Pat Cody
2025-04-02 14:59   ` Rik van Riel
2025-04-02 18:07     ` Peter Zijlstra
2025-04-09 14:29       ` Rik van Riel
2025-04-09 15:27         ` Peter Zijlstra
2025-04-11 14:51           ` Rik van Riel
2025-04-14  9:08             ` Peter Zijlstra
2025-04-14 15:38               ` Chris Mason
2025-04-15 10:07                 ` Peter Zijlstra
2025-04-16  7:59                   ` Peter Zijlstra
     [not found]   ` <7B2CFC16-1ADE-4565-B555-7525A50494C2@surriel.com>
     [not found]     ` <20250402082221.GT5880@noisy.programming.kicks-ass.net>
2025-04-14 19:57       ` Rik van Riel
2025-04-15  8:02         ` Peter Zijlstra
2025-04-16 12:44           ` Peter Zijlstra
2025-04-16 14:19             ` Rik van Riel
2025-04-16 15:27               ` Chris Mason
2025-04-18 15:44               ` Peter Zijlstra
2025-04-18 23:49                 ` Omar Sandoval
2025-04-22  0:06                   ` Omar Sandoval [this message]
2025-04-22 14:13                     ` Peter Zijlstra
2025-04-22 15:14                       ` Peter Zijlstra
2025-04-25  8:53                         ` Omar Sandoval
2025-04-22 13:36                   ` Peter Zijlstra
2025-04-19  2:53                 ` Mike Galbraith
     [not found] <20250320173438.3562449-2-patcody@meta.com>
2025-03-24 15:56 ` Steven Rostedt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aAbdlTISuaJnc5AG@telecaster \
    --to=osandov@osandov.com \
    --cc=bsegall@google.com \
    --cc=clm@meta.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-team@meta.com \
    --cc=leitao@debian.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=pat@patcody.io \
    --cc=patcody@meta.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox