[PATCH] sched/fair: Add null pointer check to pick_next

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] sched/fair: Add null pointer check to pick_next_entity()
@ 2025-03-20 20:53 Pat Cody
  2025-03-20 22:42 ` Christian Loehle
  2025-03-24 11:56 ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Pat Cody @ 2025-03-20 20:53 UTC (permalink / raw)
  To: mingo
  Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, riel, patcody,
	kernel-team, stable

pick_eevdf() can return null, resulting in a null pointer dereference
crash in pick_next_entity()

The other call site of pick_eevdf() can already handle a null pointer,
and pick_next_entity() can already return null as well. Add an extra
check to handle the null return here.

Cc: stable@vger.kernel.org
Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue")
Signed-off-by: Pat Cody <pat@patcody.io>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a553181dc764..f2157298cbce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5560,6 +5560,8 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	}
 
 	struct sched_entity *se = pick_eevdf(cfs_rq);
+	if (!se)
+		return NULL;
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		/*
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-03-20 20:53 [PATCH] sched/fair: Add null pointer check to pick_next_entity() Pat Cody
@ 2025-03-20 22:42 ` Christian Loehle
  2025-03-21 17:52   ` Pat Cody
  2025-03-24 11:56 ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Loehle @ 2025-03-20 22:42 UTC (permalink / raw)
  To: Pat Cody, mingo
  Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, riel, patcody,
	kernel-team, stable

On 3/20/25 20:53, Pat Cody wrote:
> pick_eevdf() can return null, resulting in a null pointer dereference
> crash in pick_next_entity()
> 
> The other call site of pick_eevdf() can already handle a null pointer,
> and pick_next_entity() can already return null as well. Add an extra
> check to handle the null return here.
> 
> Cc: stable@vger.kernel.org
> Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue")
> Signed-off-by: Pat Cody <pat@patcody.io>

Did this happen on mainline? Any chance it's reproducible?

> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a553181dc764..f2157298cbce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5560,6 +5560,8 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>  	}
>  
>  	struct sched_entity *se = pick_eevdf(cfs_rq);
> +	if (!se)
> +		return NULL;
>  	if (se->sched_delayed) {
>  		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
>  		/*


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-03-20 22:42 ` Christian Loehle
@ 2025-03-21 17:52   ` Pat Cody
  0 siblings, 0 replies; 30+ messages in thread
From: Pat Cody @ 2025-03-21 17:52 UTC (permalink / raw)
  To: Christian Loehle
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, riel, patcody,
	kernel-team, stable

On Thu, Mar 20, 2025 at 10:42:44PM +0000, Christian Loehle wrote:
> Did this happen on mainline? Any chance it's reproducible?

We have the following back-ported on top of 6.13:

sched/fair: Adhere to place_entity() constraints (not upstreamed, see https://lore.kernel.org/all/20250207111141.GD7145@noisy.programming.kicks-ass.net/)
a430d99e3490 sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
2a77e4be12cb sched/fair: Untangle NEXT_BUDDY and pick_next_task()

We don't have a repro, but the crash in pick_task_fair happens more
often than MCE crashes with our limited deployment of 6.13. 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-03-20 20:53 [PATCH] sched/fair: Add null pointer check to pick_next_entity() Pat Cody
  2025-03-20 22:42 ` Christian Loehle
@ 2025-03-24 11:56 ` Peter Zijlstra
  2025-03-25 15:12   ` Pat Cody
                     ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-03-24 11:56 UTC (permalink / raw)
  To: Pat Cody
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, riel, patcody,
	kernel-team, stable

On Thu, Mar 20, 2025 at 01:53:10PM -0700, Pat Cody wrote:
> pick_eevdf() can return null, resulting in a null pointer dereference
> crash in pick_next_entity()

If it returns NULL while nr_queued, something is really badly wrong.

Your check will hide this badness.

Does something like:

  https://lkml.kernel.org/r/20250128143949.GD7145@noisy.programming.kicks-ass.net

help?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-03-24 11:56 ` Peter Zijlstra
@ 2025-03-25 15:12   ` Pat Cody
  2025-03-25 18:59     ` Peter Zijlstra
  2025-04-02 14:59   ` Rik van Riel
       [not found]   ` <7B2CFC16-1ADE-4565-B555-7525A50494C2@surriel.com>
  2 siblings, 1 reply; 30+ messages in thread
From: Pat Cody @ 2025-03-25 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, riel, patcody,
	kernel-team, stable

On Mon, Mar 24, 2025 at 12:56:13PM +0100, Peter Zijlstra wrote:
> Does something like:
> 
>   https://lkml.kernel.org/r/20250128143949.GD7145@noisy.programming.kicks-ass.net
> 
> help?

To clarify- are you asking about if we've tried reverting 4423af84b297?
We have not tried that yet.

Or if we've included "sched/fair: Adhere to place_entity() constraints",
which we have already done- https://lore.kernel.org/all/20250207-tunneling-tested-koel-c59d33@leitao/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-03-25 15:12   ` Pat Cody
@ 2025-03-25 18:59     ` Peter Zijlstra
  2025-03-26 19:26       ` Pat Cody
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-03-25 18:59 UTC (permalink / raw)
  To: Pat Cody
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, riel, patcody,
	kernel-team, stable

On Tue, Mar 25, 2025 at 08:12:30AM -0700, Pat Cody wrote:
> On Mon, Mar 24, 2025 at 12:56:13PM +0100, Peter Zijlstra wrote:
> > Does something like:
> > 
> >   https://lkml.kernel.org/r/20250128143949.GD7145@noisy.programming.kicks-ass.net
> > 
> > help?
> 
> To clarify- are you asking about if we've tried reverting 4423af84b297?
> We have not tried that yet.
> 
> Or if we've included "sched/fair: Adhere to place_entity() constraints",
> which we have already done- https://lore.kernel.org/all/20250207-tunneling-tested-koel-c59d33@leitao/

This; it seems it got lost. I'll try and get it queued up.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-03-25 18:59     ` Peter Zijlstra
@ 2025-03-26 19:26       ` Pat Cody
  0 siblings, 0 replies; 30+ messages in thread
From: Pat Cody @ 2025-03-26 19:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, riel, patcody,
	kernel-team, stable

On Tue, Mar 25, 2025 at 07:59:07PM +0100, Peter Zijlstra wrote:
> > To clarify- are you asking about if we've tried reverting 4423af84b297?
> > We have not tried that yet.
> > 
> > Or if we've included "sched/fair: Adhere to place_entity() constraints",
> > which we have already done- https://lore.kernel.org/all/20250207-tunneling-tested-koel-c59d33@leitao/
> 
> This; it seems it got lost. I'll try and get it queued up.

Given that we've already included the patch in what we're running, and
we're seeing this null pointer exception, any suggestions for how to
proceed?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-03-24 11:56 ` Peter Zijlstra
  2025-03-25 15:12   ` Pat Cody
@ 2025-04-02 14:59   ` Rik van Riel
  2025-04-02 18:07     ` Peter Zijlstra
       [not found]   ` <7B2CFC16-1ADE-4565-B555-7525A50494C2@surriel.com>
  2 siblings, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2025-04-02 14:59 UTC (permalink / raw)
  To: Peter Zijlstra, Pat Cody
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, patcody, kernel-team,
	stable, Breno Leitao

On Mon, 2025-03-24 at 12:56 +0100, Peter Zijlstra wrote:
> On Thu, Mar 20, 2025 at 01:53:10PM -0700, Pat Cody wrote:
> > pick_eevdf() can return null, resulting in a null pointer
> > dereference
> > crash in pick_next_entity()
> 
> If it returns NULL while nr_queued, something is really badly wrong.
> 
> Your check will hide this badness.

Looking at the numbers, I suspect vruntime_eligible()
is simply not allowing us to run the left-most entity
in the rb tree.

At the root level we are seeing these numbers:

*(struct cfs_rq *)0xffff8882b3b80000 = {
	.load = (struct load_weight){
		.weight = (unsigned long)4750106,
		.inv_weight = (u32)0,
	},
	.nr_running = (unsigned int)3,
	.h_nr_running = (unsigned int)3,
	.idle_nr_running = (unsigned int)0,
	.idle_h_nr_running = (unsigned int)0,
	.h_nr_delayed = (unsigned int)0,
	.avg_vruntime = (s64)-2206158374744070955,
	.avg_load = (u64)4637,
	.min_vruntime = (u64)12547674988423219,

Meanwhile, the cfs_rq->curr entity has a weight of 
4699124, a vruntime of 12071905127234526, and a
vlag of -2826239998

The left node entity in the cfs_rq has a weight
of 107666, a vruntime of 16048555717648580,
and a vlag of -1338888

I cannot for the life of me figure out how the
avg_vruntime number is so out of whack from what
the vruntime numbers of the sched entities on the
runqueue look like.

The avg_vruntime code is confusing me. On the
one hand the vruntime number is multiplied by
the sched entity weight when adding to or
subtracting to avg_vruntime, but on the other
hand vruntime_eligible scales the comparison
by the cfs_rq->avg_load number.

What even protects the load number in vruntime_eligible
from going negative in certain cases, when the current
entity's entity_key is a negative value?

The latter is probably not the bug we're seeing now, but
I don't understand how that is supposed to behave.

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-02 14:59   ` Rik van Riel
@ 2025-04-02 18:07     ` Peter Zijlstra
  2025-04-09 14:29       ` Rik van Riel
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-02 18:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Pat Cody, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, patcody,
	kernel-team, stable, Breno Leitao

On Wed, Apr 02, 2025 at 10:59:09AM -0400, Rik van Riel wrote:
> On Mon, 2025-03-24 at 12:56 +0100, Peter Zijlstra wrote:
> > On Thu, Mar 20, 2025 at 01:53:10PM -0700, Pat Cody wrote:
> > > pick_eevdf() can return null, resulting in a null pointer
> > > dereference
> > > crash in pick_next_entity()
> > 
> > If it returns NULL while nr_queued, something is really badly wrong.
> > 
> > Your check will hide this badness.
> 
> Looking at the numbers, I suspect vruntime_eligible()
> is simply not allowing us to run the left-most entity
> in the rb tree.
> 
> At the root level we are seeing these numbers:
> 
> *(struct cfs_rq *)0xffff8882b3b80000 = {
> 	.load = (struct load_weight){
> 		.weight = (unsigned long)4750106,
> 		.inv_weight = (u32)0,
> 	},
> 	.nr_running = (unsigned int)3,
> 	.h_nr_running = (unsigned int)3,
> 	.idle_nr_running = (unsigned int)0,
> 	.idle_h_nr_running = (unsigned int)0,
> 	.h_nr_delayed = (unsigned int)0,
> 	.avg_vruntime = (s64)-2206158374744070955,
> 	.avg_load = (u64)4637,
> 	.min_vruntime = (u64)12547674988423219,
> 
> Meanwhile, the cfs_rq->curr entity has a weight of 
> 4699124, a vruntime of 12071905127234526, and a
> vlag of -2826239998
> 
> The left node entity in the cfs_rq has a weight
> of 107666, a vruntime of 16048555717648580,
> and a vlag of -1338888

The thing that stands out is that min_vruntime is a lot smaller than the
leftmost vruntime. This in turn leads to keys being large.

This is undesirable, as it can lead to overflow.

> I cannot for the life of me figure out how the
> avg_vruntime number is so out of whack from what
> the vruntime numbers of the sched entities on the
> runqueue look like.
> 
> The avg_vruntime code is confusing me. On the
> one hand the vruntime number is multiplied by
> the sched entity weight when adding to or
> subtracting to avg_vruntime, but on the other
> hand vruntime_eligible scales the comparison
> by the cfs_rq->avg_load number.
> 
> What even protects the load number in vruntime_eligible
> from going negative in certain cases, when the current
> entity's entity_key is a negative value?
> 
> The latter is probably not the bug we're seeing now, but
> I don't understand how that is supposed to behave.

So there is this giant comment right above avg_vruntime_add() that tries
and explain things.

Basically, from the constraint that the sum of lag is zero, you can
infer that the 0-lag point is the weighted average of the individual
vruntime, which is what we're trying to compute:

        \Sum w_i * v_i
  avg = --------------
           \Sum w_i

Now, since vruntime takes the whole u64 (worse, it wraps), this
multiplication term in the numerator is not something we can compute;
instead we do the min_vruntime (v0 henceforth) thing like:

  v_i = (v_i - v0) + v0

(edit -- this does two things:
 - it keeps the key 'small';
 - it creates a relative 0-point in the modular space)

If you do that subtitution and work it all out, you end up with:

        \Sum w_i * (v_i - v0)
  avg = --------------------- + v0
              \Sum w_i

Since you cannot very well track a ratio like that (and not suffer
terrible numerical problems) we simpy track the numerator and
denominator individually and only perform the division when strictly
needed.

Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator
lives in cfs_rq->avg_load.

The one extra 'funny' is that these numbers track the entities in the
tree, and current is typically outside of the tree, so avg_vruntime()
adds current when needed before doing the division.

(vruntime_eligible() elides the division by cross-wise multiplication)

Anyway... we got s64 to track avg_vruntime, that sum over weighted keys,
if you have a ton of entities and large keys and large weights (your
case) you can overflow and things go to shit.

Anyway, seeing how your min_vruntime is weird, let me ask you to try the
below; it removes the old min_vruntime and instead tracks zero vruntime
as the 'current' avg_vruntime. We don't need the monotinicity filter,
all we really need is something 'near' all the other vruntimes in order
to compute this relative key so we can preserve order across the wrap.

This *should* get us near minimal sized keys. If you can still
reproduce, you should probably add something like that patch I send you
privately earlier, that checks the overflows.

The below builds and boots (provided SCHED_CORE=n).

---
 kernel/sched/debug.c |  8 ++---
 kernel/sched/fair.c  | 92 ++++++++--------------------------------------------
 kernel/sched/sched.h |  2 +-
 3 files changed, 19 insertions(+), 83 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..5ca512d50e3a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -807,7 +807,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -830,15 +830,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	last = __pick_last_entity(cfs_rq);
 	if (last)
 		right_vruntime = last->vruntime;
-	min_vruntime = cfs_rq->min_vruntime;
+	zero_vruntime = cfs_rq->zero_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
 			SPLIT_NS(left_deadline));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
 			SPLIT_NS(left_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
-			SPLIT_NS(min_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
+			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..17e43980f5a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -526,24 +526,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
  * Scheduling class tree data structure manipulation methods:
  */
 
-static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime)
-{
-	s64 delta = (s64)(vruntime - max_vruntime);
-	if (delta > 0)
-		max_vruntime = vruntime;
-
-	return max_vruntime;
-}
-
-static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime)
-{
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta < 0)
-		min_vruntime = vruntime;
-
-	return min_vruntime;
-}
-
 static inline bool entity_before(const struct sched_entity *a,
 				 const struct sched_entity *b)
 {
@@ -556,7 +538,7 @@ static inline bool entity_before(const struct sched_entity *a,
 
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
 }
 
 #define __node_2_se(node) \
@@ -608,13 +590,13 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * Which we track using:
  *
- *                    v0 := cfs_rq->min_vruntime
+ *                    v0 := cfs_rq->zero_vruntime
  * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
  *              \Sum w_i := cfs_rq->avg_load
  *
- * Since min_vruntime is a monotonic increasing variable that closely tracks
- * the per-task service, these deltas: (v_i - v), will be in the order of the
- * maximal (virtual) lag induced in the system due to quantisation.
+ * Since zero_vruntime closely tracks the per-task service, these 
+ * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
+ * induced in the system due to quantisation.
  *
  * Also, we use scale_load_down() to reduce the size.
  *
@@ -673,7 +655,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	return cfs_rq->zero_vruntime + avg;
 }
 
 /*
@@ -734,7 +716,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -742,42 +724,14 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
-static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+static void update_zero_vruntime(struct cfs_rq *cfs_rq)
 {
-	u64 min_vruntime = cfs_rq->min_vruntime;
-	/*
-	 * open coded max_vruntime() to allow updating avg_vruntime
-	 */
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta > 0) {
-		avg_vruntime_update(cfs_rq, delta);
-		min_vruntime = vruntime;
-	}
-	return min_vruntime;
-}
+	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
 
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se = __pick_root_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 vruntime = cfs_rq->min_vruntime;
+	avg_vruntime_update(cfs_rq, delta);
 
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
-	}
-
-	if (se) {
-		if (!curr)
-			vruntime = se->min_vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->min_vruntime);
-	}
-
-	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
+	cfs_rq->zero_vruntime = vruntime;
 }
 
 static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
@@ -850,6 +804,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -1239,7 +1194,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
-	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
 		struct task_struct *p = task_of(curr);
@@ -3825,15 +3779,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		place_entity(cfs_rq, se, 0);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
-
-		/*
-		 * The entity's vruntime has been adjusted, so let's check
-		 * whether the rq-wide min_vruntime needs updated too. Since
-		 * the calculations above require stable min_vruntime rather
-		 * than up-to-date one, we do the update at the end of the
-		 * reweight process.
-		 */
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -5511,15 +5456,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_cfs_group(se);
 
-	/*
-	 * Now advance min_vruntime if @se was the entity holding it back,
-	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-	 * put back on, and if we advance min_vruntime, we'll be placed back
-	 * further than we started -- i.e. we'll be penalized.
-	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
-		update_min_vruntime(cfs_rq);
-
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
@@ -13312,7 +13248,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+	cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34ea70..41b312f17f22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -652,7 +652,7 @@ struct cfs_rq {
 	s64			avg_vruntime;
 	u64			avg_load;
 
-	u64			min_vruntime;
+	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
 	u64			min_vruntime_fi;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-02 18:07     ` Peter Zijlstra
@ 2025-04-09 14:29       ` Rik van Riel
  2025-04-09 15:27         ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2025-04-09 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pat Cody, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, patcody,
	kernel-team, stable, Breno Leitao

On Wed, 2025-04-02 at 20:07 +0200, Peter Zijlstra wrote:
> 
> Anyway, seeing how your min_vruntime is weird, let me ask you to try
> the
> below; it removes the old min_vruntime and instead tracks zero
> vruntime
> as the 'current' avg_vruntime. We don't need the monotinicity filter,
> all we really need is something 'near' all the other vruntimes in
> order
> to compute this relative key so we can preserve order across the
> wrap.
> 
> This *should* get us near minimal sized keys. If you can still
> reproduce, you should probably add something like that patch I send
> you
> privately earlier, that checks the overflows.

Our trouble workload still makes the scheduler crash
with this patch.

I'll go put the debugging patch on our kernel.

Should I try to get debugging data with this patch
part of the mix, or with the debugging patch just
on top of what's in 6.13 already?

Digging through our kernel crash history, this
particular crash seems to go back at least to
6.11. They just happen much more frequently on
6.13 for some (as of yet unknown) reason.

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-09 14:29       ` Rik van Riel
@ 2025-04-09 15:27         ` Peter Zijlstra
  2025-04-11 14:51           ` Rik van Riel
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-09 15:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Pat Cody, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, patcody,
	kernel-team, stable, Breno Leitao

On Wed, Apr 09, 2025 at 10:29:43AM -0400, Rik van Riel wrote:
> On Wed, 2025-04-02 at 20:07 +0200, Peter Zijlstra wrote:
> > 
> > Anyway, seeing how your min_vruntime is weird, let me ask you to try
> > the
> > below; it removes the old min_vruntime and instead tracks zero
> > vruntime
> > as the 'current' avg_vruntime. We don't need the monotinicity filter,
> > all we really need is something 'near' all the other vruntimes in
> > order
> > to compute this relative key so we can preserve order across the
> > wrap.
> > 
> > This *should* get us near minimal sized keys. If you can still
> > reproduce, you should probably add something like that patch I send
> > you
> > privately earlier, that checks the overflows.
> 
> Our trouble workload still makes the scheduler crash
> with this patch.
> 
> I'll go put the debugging patch on our kernel.
> 
> Should I try to get debugging data with this patch
> part of the mix, or with the debugging patch just
> on top of what's in 6.13 already?

Whatever is more convenient I suppose.

If you can dump the full tree that would be useful. Typically the
se::{vruntime,weight} and cfs_rq::{zero_vruntime,avg_vruntime,avg_load}
such that we can do full manual validation of the numbers.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-09 15:27         ` Peter Zijlstra
@ 2025-04-11 14:51           ` Rik van Riel
  2025-04-14  9:08             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2025-04-11 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pat Cody, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, patcody,
	kernel-team, stable, Breno Leitao

On Wed, 9 Apr 2025 17:27:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Apr 09, 2025 at 10:29:43AM -0400, Rik van Riel wrote:
> > Our trouble workload still makes the scheduler crash
> > with this patch.
> > 
> > I'll go put the debugging patch on our kernel.
> > 
> > Should I try to get debugging data with this patch
> > part of the mix, or with the debugging patch just
> > on top of what's in 6.13 already?  
> 
> Whatever is more convenient I suppose.
> 
> If you can dump the full tree that would be useful. Typically the
> se::{vruntime,weight} and cfs_rq::{zero_vruntime,avg_vruntime,avg_load}
> such that we can do full manual validation of the numbers.

Here is a dump of the scheduler tree of the crashing CPU.

Unfortunately the CPU crashed in pick_next_entity, and not in your
debugging code. I'll add two more calls to avg_vruntime_validate(),
one from avg_vruntime_update(), and one rfom __update_min_vruntime()
when we skip the call to avg_vruntime_update(). The line numbers in
the backtrace could be a clue.

I have edited the cgroup names to make things more readable, but everything
else is untouched.

One thing that stands out to me is how the vruntime of each of the
entities on the CPU's cfs_rq are really large negative numbers.

vruntime = 18429030910682621789 equals 0xffc111f8d9ee675d

I do not know how those se->vruntime numbers got to that point,
but they are a suggestive cause of the overflow.

I'll go comb through the se->vruntime updating code to see how those
large numbers could end up as the vruntime for these sched entities.


nr_running = 3
min_vruntime = 107772371139014
avg_vruntime = -1277161882867784752
avg_load = 786
tasks_timeline = [
  {
    cgroup /A
    weight = 10230 => 9
    rq = {
      nr_running = 0
      min_vruntime = 458975898004
      avg_vruntime = 0
      avg_load = 0
      tasks_timeline = [
      ]
    }
  },
  {
    cgroup /B
    vruntime = 18445226958208703357
    weight = 319394 => 311
    rq = {
      nr_running = 2
      min_vruntime = 27468255210769
      avg_vruntime = 0
      avg_load = 93
      tasks_timeline = [
        {
          cgroup /B/a
          vruntime = 27468255210769
          weight = 51569 => 50
          rq = {
            nr_running = 1
            min_vruntime = 820449693961
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              {
                task = 3653382 (fc0)
                vruntime = 820449693961
                weight = 15360 => 15
              },
            ]
          }
        },
        {
          cgroup /B/b
          vruntime = 27468255210769
          weight = 44057 => 43
          rq = {
            nr_running = 1
            min_vruntime = 563178567930
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              {
                task = 3706454 (fc0)
                vruntime = 563178567930
                weight = 15360 => 15
              },
            ]
          }
        },
      ]
    }
  },
  {
    cgroup /C
    vruntime = 18445539757376619550
    weight = 477855 => 466
    rq = {
      nr_running = 0
      min_vruntime = 17163581720739
      avg_vruntime = 0
      avg_load = 0
      tasks_timeline = [
      ]
    }
  },
]


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-11 14:51           ` Rik van Riel
@ 2025-04-14  9:08             ` Peter Zijlstra
  2025-04-14 15:38               ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-14  9:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Pat Cody, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, patcody,
	kernel-team, stable, Breno Leitao

On Fri, Apr 11, 2025 at 10:51:34AM -0400, Rik van Riel wrote:
> On Wed, 9 Apr 2025 17:27:03 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Apr 09, 2025 at 10:29:43AM -0400, Rik van Riel wrote:
> > > Our trouble workload still makes the scheduler crash
> > > with this patch.
> > > 
> > > I'll go put the debugging patch on our kernel.
> > > 
> > > Should I try to get debugging data with this patch
> > > part of the mix, or with the debugging patch just
> > > on top of what's in 6.13 already?  
> > 
> > Whatever is more convenient I suppose.
> > 
> > If you can dump the full tree that would be useful. Typically the
> > se::{vruntime,weight} and cfs_rq::{zero_vruntime,avg_vruntime,avg_load}
> > such that we can do full manual validation of the numbers.
> 
> Here is a dump of the scheduler tree of the crashing CPU.
> 
> Unfortunately the CPU crashed in pick_next_entity, and not in your
> debugging code. I'll add two more calls to avg_vruntime_validate(),
> one from avg_vruntime_update(), and one rfom __update_min_vruntime()
> when we skip the call to avg_vruntime_update(). The line numbers in
> the backtrace could be a clue.
> 
> I have edited the cgroup names to make things more readable, but everything
> else is untouched.

Hmm, I didn't think you guys used the cgroup stuff.

Anyway, given cgroups, which group pick is the one that went boom? Also,
what is curr (for that cgroup).

curr lives outside of the tree, but is included in the eligibility
consideration (when still on_rq and all that).

> nr_running = 3
> min_vruntime = 107772371139014
> avg_vruntime = -1277161882867784752
> avg_load = 786
> tasks_timeline = [
>   {
>     cgroup /A
>     weight = 10230 => 9

No vruntime, I'll assume !on_rq, but that makes avg_load above not match
:/ So something is off here.

>     rq = {
>       nr_running = 0
>       min_vruntime = 458975898004
>       avg_vruntime = 0
>       avg_load = 0
>       tasks_timeline = [
>       ]
>     }
>   },
>   {
>     cgroup /B
>     vruntime = 18445226958208703357
>     weight = 319394 => 311
>     rq = {
>       nr_running = 2
>       min_vruntime = 27468255210769
>       avg_vruntime = 0
>       avg_load = 93
>       tasks_timeline = [
>         {
>           cgroup /B/a
>           vruntime = 27468255210769
>           weight = 51569 => 50
>           rq = {
>             nr_running = 1
>             min_vruntime = 820449693961
>             avg_vruntime = 0
>             avg_load = 15
>             tasks_timeline = [
>               {
>                 task = 3653382 (fc0)
>                 vruntime = 820449693961
>                 weight = 15360 => 15
>               },
>             ]
>           }
>         },
>         {
>           cgroup /B/b
>           vruntime = 27468255210769
>           weight = 44057 => 43
>           rq = {
>             nr_running = 1
>             min_vruntime = 563178567930
>             avg_vruntime = 0
>             avg_load = 15
>             tasks_timeline = [
>               {
>                 task = 3706454 (fc0)
>                 vruntime = 563178567930
>                 weight = 15360 => 15
>               },
>             ]
>           }
>         },
>       ]
>     }
>   },
>   {
>     cgroup /C
>     vruntime = 18445539757376619550
>     weight = 477855 => 466
>     rq = {
>       nr_running = 0
>       min_vruntime = 17163581720739
>       avg_vruntime = 0
>       avg_load = 0
>       tasks_timeline = [
>       ]
>     }
>   },
> ]

So given the above, I've created the below files, and that gives:

$ ./vruntime < root.txt
  k: 0 w: 311 k*w: 0
  k: 312799167916193 w: 466 k*w: 145764412248945938
  v': 107772371139014 = v: 18445226958208703357 + d: 1624887871987273
  V': -1116773464285165183 = V: 145764412248945938 - d: 1624887871987273 * W: 777
min_vruntime: 107772371139014
avg_vruntime: -1116773464285165183
avg_load: 777

> One thing that stands out to me is how the vruntime of each of the
> entities on the CPU's cfs_rq are really large negative numbers.
> 
> vruntime = 18429030910682621789 equals 0xffc111f8d9ee675d
> 
> I do not know how those se->vruntime numbers got to that point,
> but they are a suggestive cause of the overflow.
> 
> I'll go comb through the se->vruntime updating code to see how those
> large numbers could end up as the vruntime for these sched entities.

As you can see from the output here, the large negative is the result
of min_vruntime being significantly ahead of the actual entities.

This can happen due to that monotonicity filter the thing has -- it
doesn't want to go backwards. Whereas the 0-lag point can move
backwards, seeing how it is the weighted average, and inserting a task
with positive lag will insert a task left of middle, moving the middle
left.

The zero_vruntime patch I gave earlier should avoid this particular
issue.


$ ./vruntime < B.txt
  k: 0 w: 50 k*w: 0
  k: 0 w: 43 k*w: 0
  v': 27468255210769 = v: 27468255210769 + d: 0
  V': 0 = V: 0 - d: 0 * W: 93
min_vruntime: 27468255210769
avg_vruntime: 0
avg_load: 93


C, B/a and B/b are not really interesting, they're single entries where
min_vruntime == vruntime and boring.

---8<---(root.txt)---8<---
entity 18445226958208703357     319394
entity 18445539757376619550     477855
group   107772371139014


---8<---(B.txt)---8<---
entity  27468255210769  51569
entity  27468255210769  44057
group   27468255210769


---8<---(vruntime.c)---8<---

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

int main (int argc, char **argv)
{
        unsigned long V, W;
        unsigned long V0;
        bool init = false;

        for (;;) {
                unsigned long vruntime, weight;
                char type[32];

                int r = scanf("%s\t%lu\t%lu\n", &type, &vruntime, &weight);
                if (r == EOF)
                        break;

                if (!strcmp(type, "entity")) {
                        if (!init) {
                                V = W = 0;
                                V0 = vruntime;
                                init = true;
                        }

                        unsigned long k = vruntime - V0;
                        unsigned long w = weight / 1024;

                        V += k * w;
                        W += w;

                        printf("  k: %ld w: %lu k*w: %ld\n", k, w, k*w);
                }

                if (!strcmp(type, "group")) {
                        unsigned long d = vruntime - V0;

                        printf("  v': %lu = v: %lu + d: %lu\n", V0 + d, V0, d);
                        printf("  V': %ld = V: %ld - d: %ld * W: %lu\n",
                                        V - d * W, V, d, W);

                        V0 += d;
                        V -= d * W;
                }
        }

        printf("min_vruntime: %lu\n", V0);
        printf("avg_vruntime: %ld\n", V);
        printf("avg_load: %lu\n", W);

        return 0;
}


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-14  9:08             ` Peter Zijlstra
@ 2025-04-14 15:38               ` Chris Mason
  2025-04-15 10:07                 ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Mason @ 2025-04-14 15:38 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Pat Cody, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, patcody,
	kernel-team, stable, Breno Leitao



On 4/14/25 5:08 AM, Peter Zijlstra wrote:

[ math and such ]


> The zero_vruntime patch I gave earlier should avoid this particular
> issue.

Here's a crash with the zero runtime patch.  I'm trying to reproduce
this outside of prod so we can crank up the iteration speed a bit.

-chris

nr_running = 2
zero_vruntime = 19194347104893960
avg_vruntime = 6044054790
avg_load = 2
curr = {
  cgroup urgent
  vruntime = 24498183812106172
  weight = 3561684 => 3478
  rq = {
    nr_running = 0
    zero_vruntime = 505000008215941
    avg_vruntime = 0
    avg_load = 0
    curr = {
      cgroup urgent/-610604968056586240
      vruntime = 505000008302509
      weight = 455902 => 445
      rq = {
        nr_running = 0
        zero_vruntime = 12234709899
        avg_vruntime = 0
        avg_load = 0
        curr = {
          task = 279047 (fc0)
          vruntime = 12237278090
          weight = 15360 => 15
        }
        tasks_timeline = [
        ]
      }
    }
    tasks_timeline = [
    ]
  }
}
tasks_timeline = [
  {
    cgroup optional
    vruntime = 19194350126921355
    weight = 1168 => 2
    rq = {
      nr_running = 2
      zero_vruntime = 440280059357029
      avg_vruntime = 476
      avg_load = 688
      tasks_timeline = [
        {
          cgroup optional/-610613050111295488
          vruntime = 440280059333960
          weight = 291271 => 284
          rq = {
            nr_running = 5
            zero_vruntime = 65179829005
            avg_vruntime = 0
            avg_load = 75
            tasks_timeline = [
              {
                task = 261672 (fc0)
                vruntime = 65189926507
                weight = 15360 => 15
              },
              {
                task = 261332 (fc0)
                vruntime = 65189480962
                weight = 15360 => 15
              },
              {
                task = 261329 (enc1:0:vp9_fbv)
                vruntime = 65165843516
                weight = 15360 => 15
              },
              {
                task = 261334 (dec0:0:hevc_fbv)
                vruntime = 65174065035
                weight = 15360 => 15
              },
              {
                task = 261868 (fc0)
                vruntime = 65179829005
                weight = 15360 => 15
              },
            ]
          }
        },
        {
          cgroup optional/-610609318858457088
          vruntime = 440280059373247
          weight = 413911 => 404
          rq = {
            nr_running = 1
            zero_vruntime = 22819875784
            avg_vruntime = 0
            avg_load = 15
            tasks_timeline = [
              {
                task = 273291 (fc0)
                vruntime = 22819875784
                weight = 15360 => 15
              },
            ]
          }
        },
      ]
    }
  },
]


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-14 15:38               ` Chris Mason
@ 2025-04-15 10:07                 ` Peter Zijlstra
  2025-04-16  7:59                   ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-15 10:07 UTC (permalink / raw)
  To: Chris Mason
  Cc: Rik van Riel, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, stable, Breno Leitao

On Mon, Apr 14, 2025 at 11:38:15AM -0400, Chris Mason wrote:
> 
> 
> On 4/14/25 5:08 AM, Peter Zijlstra wrote:
> 
> [ math and such ]
> 
> 
> > The zero_vruntime patch I gave earlier should avoid this particular
> > issue.
> 
> Here's a crash with the zero runtime patch. 

And indeed it doesn't have these massive (negative) avg_vruntime values.

> I'm trying to reproduce
> this outside of prod so we can crank up the iteration speed a bit.

Thanks.

Could you add which pick went boom for the next dump?



I am however, slightly confused by this output format.

It looks like it dumps the cfs_rq the first time it encounters it,
either through curr or through the tree.

So if I read this correct the root is something like:

> nr_running = 2
> zero_vruntime = 19194347104893960
> avg_vruntime = 6044054790
> avg_load = 2
> curr = {
>   cgroup urgent
>   vruntime = 24498183812106172
>   weight = 3561684 => 3478
> }
> tasks_timeline = [
>   {
>     cgroup optional
>     vruntime = 19194350126921355
>     weight = 1168 => 2
>   },
> ]

group  19194347104893960
curr   24498183812106172 3561684
entity 19194350126921355 1168

But if I run those numbers, I get avg_load == 1, seeing how 1168/1024 =
1. But the thing says it should be 2.

Similarly, my avg_vruntime is exactly half of what it says.

avg_vruntime: 3022027395
avg_load: 1

(seeing how 19194350126921355-19194347104893960 = 3022027395)

Anyway, with curr being significantly to the right of that, the 0-lag
point is well right of where optional sits. So this pick should be fine,
and result in 'optional' getting selected (curr is no longer eligible).

All the urgent/* groups have nr_running == 0, so are not interesting,
we'll not pick there.

NOTE: I'm inferring curr is on_rq, because nr_running == 2 and the tree
only has 1 entity in it. 

NOTE: if we ignore curr, then optional sits at exactly the 0-lag point,
with either sets of numbers and so should be eligible.


This then leaves us the optional/* groups.

>     cgroup optional
>     rq = {
>       nr_running = 2
>       zero_vruntime = 440280059357029
>       avg_vruntime = 476
>       avg_load = 688
>       tasks_timeline = [
>         {
>           cgroup optional/-610613050111295488
>           vruntime = 440280059333960
>           weight = 291271 => 284
>         },
>         {
>           cgroup optional/-610609318858457088
>           vruntime = 440280059373247
>           weight = 413911 => 404
>         },

group 440280059357029
entity 440280059333960 291271
entity 440280059373247 413911

Which gives:

avg_vruntime: 476
avg_load: 688

And that matches.

Next we have:

>           cgroup optional/-610613050111295488
>           rq = {
>             nr_running = 5
>             zero_vruntime = 65179829005
>             avg_vruntime = 0
>             avg_load = 75
>             tasks_timeline = [
>               {
>                 task = 261672 (fc0)
>                 vruntime = 65189926507
>                 weight = 15360 => 15
>               },
>               {
>                 task = 261332 (fc0)
>                 vruntime = 65189480962
>                 weight = 15360 => 15
>               },
>               {
>                 task = 261329 (enc1:0:vp9_fbv)
>                 vruntime = 65165843516
>                 weight = 15360 => 15
>               },
>               {
>                 task = 261334 (dec0:0:hevc_fbv)
>                 vruntime = 65174065035
>                 weight = 15360 => 15
>               },
>               {
>                 task = 261868 (fc0)
>                 vruntime = 65179829005
>                 weight = 15360 => 15
>               },
>             ]
>           }


avg_vruntime: 0
avg_load: 75

This again matches, leaving the bottom 3 tasks eligible.

And finally:

>           cgroup optional/-610609318858457088
>           rq = {
>             nr_running = 1
>             zero_vruntime = 22819875784
>             avg_vruntime = 0
>             avg_load = 15
>             tasks_timeline = [
>               {
>                 task = 273291 (fc0)
>                 vruntime = 22819875784
>                 weight = 15360 => 15
>               },
>             ]
>           }

Rather boring indeed, but the numbers appear correct.


So I'm not immediately seeing where it would go boom, but seeing how the
root group is the one with dodgy numbers, I would suspect that -- but
I'm not immediately seeing how... :-(

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-15 10:07                 ` Peter Zijlstra
@ 2025-04-16  7:59                   ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-16  7:59 UTC (permalink / raw)
  To: Chris Mason
  Cc: Rik van Riel, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, stable, Breno Leitao

On Tue, Apr 15, 2025 at 12:07:05PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 14, 2025 at 11:38:15AM -0400, Chris Mason wrote:
> > 
> > 
> > On 4/14/25 5:08 AM, Peter Zijlstra wrote:
> > 
> > [ math and such ]
> > 
> > 
> > > The zero_vruntime patch I gave earlier should avoid this particular
> > > issue.
> > 
> > Here's a crash with the zero runtime patch. 
> 
> And indeed it doesn't have these massive (negative) avg_vruntime values.
> 
> > I'm trying to reproduce
> > this outside of prod so we can crank up the iteration speed a bit.
> 
> Thanks.
> 
> Could you add which pick went boom for the next dump?
> 
> 
> 
> I am however, slightly confused by this output format.
> 
> It looks like it dumps the cfs_rq the first time it encounters it,
> either through curr or through the tree.
> 
> So if I read this correct the root is something like:
> 
> > nr_running = 2
> > zero_vruntime = 19194347104893960
> > avg_vruntime = 6044054790
> > avg_load = 2
> > curr = {
> >   cgroup urgent
> >   vruntime = 24498183812106172
> >   weight = 3561684 => 3478
> > }
> > tasks_timeline = [
> >   {
> >     cgroup optional
> >     vruntime = 19194350126921355
> >     weight = 1168 => 2
> >   },
> > ]
> 
> group  19194347104893960
> curr   24498183812106172 3561684
> entity 19194350126921355 1168
> 
> But if I run those numbers, I get avg_load == 1, seeing how 1168/1024 =
> 1. But the thing says it should be 2.

N/m, late last night I remembered we have a max(2, ..) in there. So
yeah, your numbers seem right.



^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <7B2CFC16-1ADE-4565-B555-7525A50494C2@surriel.com>]

[parent not found: <20250402082221.GT5880@noisy.programming.kicks-ass.net>]

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
       [not found]     ` <20250402082221.GT5880@noisy.programming.kicks-ass.net>
@ 2025-04-14 19:57       ` Rik van Riel
  2025-04-15  8:02         ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2025-04-14 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, Breno Leitao

On Wed, 2025-04-02 at 10:22 +0200, Peter Zijlstra wrote:
> 
> Please confirm what the reason for overflow is.
> 
Running a large enough sample size has its benefits.

We have hit 3 out of the 4 warnings below.

The only one we did not hit is the cfs_rq->avg_load != avg_load
warning.

Most of the time we seem to hit the warnings from the
code that removes tasks from the runqueue, but we are
occasionally seeing it when adding tasks to the runqueue,
as well.


> +++ b/kernel/sched/fair.c
> @@ -620,12 +620,36 @@ static inline s64 entity_key(struct cfs_
>   *
>   * As measured, the max (key * weight) value was ~44 bits for a
> kernel build.
>   */
> +
> +static void avg_vruntime_validate(struct cfs_rq *cfs_rq)
> +{
> +	unsigned long load = 0;
> +	s64 vruntime = 0;
> +
> +	for (struct rb_node *node = rb_first_cached(&cfs_rq-
> >tasks_timeline);
> +	     node; node = rb_next(node)) {
> +		struct sched_entity *se = __node_2_se(node);
> +		unsigned long weight = scale_load_down(se-
> >load.weight);
> +		s64 key = entity_key(cfs_rq, se);
> +		/* vruntime += key * weight; */
> +		WARN_ON_ONCE(__builtin_mul_overflow(key, weight,
> &key));
> +		WARN_ON_ONCE(__builtin_add_overflow(vruntime, key,
> &vruntime));
> +		load += weight;
> +	}
> +
> +	WARN_ON_ONCE(cfs_rq->avg_load != load);
> +	WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime);
> +}
> +
>  static void
>  avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	unsigned long weight = scale_load_down(se->load.weight);
>  	s64 key = entity_key(cfs_rq, se);
>  
> +	/* not yet added to tree */
> +	avg_vruntime_validate(cfs_rq);
> +
>  	cfs_rq->avg_vruntime += key * weight;
>  	cfs_rq->avg_load += weight;
>  }
> @@ -638,6 +662,9 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
>  
>  	cfs_rq->avg_vruntime -= key * weight;
>  	cfs_rq->avg_load -= weight;
> +
> +	/* already removed from tree */
> +	avg_vruntime_validate(cfs_rq);
>  }
>  
>  static inline
> 

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-14 19:57       ` Rik van Riel
@ 2025-04-15  8:02         ` Peter Zijlstra
  2025-04-16 12:44           ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-15  8:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chris Mason, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, Breno Leitao

On Mon, Apr 14, 2025 at 03:57:42PM -0400, Rik van Riel wrote:
> On Wed, 2025-04-02 at 10:22 +0200, Peter Zijlstra wrote:
> > 
> > Please confirm what the reason for overflow is.
> > 
> Running a large enough sample size has its benefits.
> 
> We have hit 3 out of the 4 warnings below.
> 
> The only one we did not hit is the cfs_rq->avg_load != avg_load
> warning.

Fair enough, that one really isn't hard.

> Most of the time we seem to hit the warnings from the
> code that removes tasks from the runqueue, 

*blink*..

> but we are
> occasionally seeing it when adding tasks to the runqueue,
> as well.

OK, let me try and get my head around that.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-15  8:02         ` Peter Zijlstra
@ 2025-04-16 12:44           ` Peter Zijlstra
  2025-04-16 14:19             ` Rik van Riel
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-16 12:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chris Mason, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, Breno Leitao

On Tue, Apr 15, 2025 at 10:02:35AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 14, 2025 at 03:57:42PM -0400, Rik van Riel wrote:
> > On Wed, 2025-04-02 at 10:22 +0200, Peter Zijlstra wrote:
> > > 
> > > Please confirm what the reason for overflow is.
> > > 
> > Running a large enough sample size has its benefits.
> > 
> > We have hit 3 out of the 4 warnings below.
> > 
> > The only one we did not hit is the cfs_rq->avg_load != avg_load
> > warning.
> 
> Fair enough, that one really isn't hard.
> 
> > Most of the time we seem to hit the warnings from the
> > code that removes tasks from the runqueue, 
> 
> *blink*..

Which warning is getting hit on removal? The avg_vruntime mismatch?

Also, which removal path? schedule()'s block path, or migration like?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-16 12:44           ` Peter Zijlstra
@ 2025-04-16 14:19             ` Rik van Riel
  2025-04-16 15:27               ` Chris Mason
  2025-04-18 15:44               ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Rik van Riel @ 2025-04-16 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, Breno Leitao

On Wed, 2025-04-16 at 14:44 +0200, Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 10:02:35AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 14, 2025 at 03:57:42PM -0400, Rik van Riel wrote:
> > > On Wed, 2025-04-02 at 10:22 +0200, Peter Zijlstra wrote:
> > > > 
> > > > Please confirm what the reason for overflow is.
> > > > 
> > > Running a large enough sample size has its benefits.
> > > 
> > > We have hit 3 out of the 4 warnings below.
> > > 
> > > The only one we did not hit is the cfs_rq->avg_load != avg_load
> > > warning.
> > 
> > Fair enough, that one really isn't hard.
> > 
> > > Most of the time we seem to hit the warnings from the
> > > code that removes tasks from the runqueue, 
> > 
> > *blink*..
> 
> Which warning is getting hit on removal? The avg_vruntime mismatch?
> 
> Also, which removal path? schedule()'s block path, or migration like?

The most common warning by far, hitting
about 90% of the time we hit anything
in avg_vruntime_validate is the
WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)

The most common code path to getting there,
covering about 85% of the cases:

avg_vruntime_validate
avg_vruntime_sub
__dequeue_entity
set_next_entity
pick_task_fair
pick_next_task_fair
__pick_next_task
pick_next_task
__schedule
schedule


-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-16 14:19             ` Rik van Riel
@ 2025-04-16 15:27               ` Chris Mason
  2025-04-18 15:44               ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Chris Mason @ 2025-04-16 15:27 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra
  Cc: Pat Cody, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel, patcody,
	kernel-team, Breno Leitao



On 4/16/25 10:19 AM, Rik van Riel wrote:
> On Wed, 2025-04-16 at 14:44 +0200, Peter Zijlstra wrote:
>> On Tue, Apr 15, 2025 at 10:02:35AM +0200, Peter Zijlstra wrote:
>>> On Mon, Apr 14, 2025 at 03:57:42PM -0400, Rik van Riel wrote:
>>>> On Wed, 2025-04-02 at 10:22 +0200, Peter Zijlstra wrote:
>>>>>
>>>>> Please confirm what the reason for overflow is.
>>>>>
>>>> Running a large enough sample size has its benefits.
>>>>
>>>> We have hit 3 out of the 4 warnings below.
>>>>
>>>> The only one we did not hit is the cfs_rq->avg_load != avg_load
>>>> warning.
>>>
>>> Fair enough, that one really isn't hard.
>>>
>>>> Most of the time we seem to hit the warnings from the
>>>> code that removes tasks from the runqueue, 
>>>
>>> *blink*..
>>
>> Which warning is getting hit on removal? The avg_vruntime mismatch?
>>
>> Also, which removal path? schedule()'s block path, or migration like?
> 
> The most common warning by far, hitting
> about 90% of the time we hit anything
> in avg_vruntime_validate is the
> WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)
> 
> The most common code path to getting there,
> covering about 85% of the cases:
> 
> avg_vruntime_validate
> avg_vruntime_sub
> __dequeue_entity
> set_next_entity
> pick_task_fair
> pick_next_task_fair
> __pick_next_task
> pick_next_task
> __schedule
> schedule
> 
> 

When I was spot checking hosts, I only found these early in boot.  Rik,
did you find some that tripped later as well?

-chris


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-16 14:19             ` Rik van Riel
  2025-04-16 15:27               ` Chris Mason
@ 2025-04-18 15:44               ` Peter Zijlstra
  2025-04-18 23:49                 ` Omar Sandoval
  2025-04-19  2:53                 ` Mike Galbraith
  1 sibling, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-18 15:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chris Mason, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, Breno Leitao

On Wed, Apr 16, 2025 at 10:19:42AM -0400, Rik van Riel wrote:

> The most common warning by far, hitting
> about 90% of the time we hit anything
> in avg_vruntime_validate is the
> WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)
> 
> The most common code path to getting there,
> covering about 85% of the cases:
> 
> avg_vruntime_validate
> avg_vruntime_sub
> __dequeue_entity
> set_next_entity
> pick_task_fair
> pick_next_task_fair
> __pick_next_task
> pick_next_task
> __schedule
> schedule

(I'm assuming zero_vruntime patch here, the stock kernel has more
problems vs min_vruntime getting stuck in the future etc..)

So I have a theory about this. Key is that you're running a PREEMPT-NONE
kernel.

There is one important site the overflow patch does not cover:
avg_vruntime_update().

However, due to PREEMPT_NONE, it is possible (Chris mentioned direct
reclaim and OOM) to have very long (in-kernel) runtimes without
scheduling.

(I suppose this should be visible by tracing sched_switch)

Were this to happen, then avg_vruntime_add() gets to deal with 'key *
weight' for a relatively large key. But per the overflow checks there,
this isn't hitting (much).

But then we try and update zero_vruntime, and that is doing 'delta *
cfs_rq->avg_load', and avg_load is a larger number and might just cause
overflow. We don't have a check there.

If that overflows then avg_vruntime is buggered, but none of the
individual tree entities hit overflow, exactly as you observe.

I've modified the zero_vruntime patch a little (new one below) to update
zero_vruntime in update_curr(). Additionally, I have every tick re-align
it with the tree avg (the update and the tree can drift due to numerical
funnies).

This should ensure these very long in-kernel runtimes are broken up in
at most tick sized chunks, and by keeping zero_vruntime more or less
around the actual 0-lag point, the key values are minimized.

I should probably go play with a kernel module that spins for a few
seconds with preemption disabled, see if I can make it go BOOM :-) But
I've not yet done so.

FWIW, you can add something like:

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -652,6 +652,9 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
 static inline
 void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
 {
+	s64 tmp;
+	WARN_ON_ONCE(__builtin_mul_overflow(cfs_rq->avg_load, delta, &tmp);
+
 	/*
 	 * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load
 	 */


To the overflow patch, to see if this mult goes splat.


New zero_vruntime patch here:

---
 kernel/sched/debug.c |    8 ++--
 kernel/sched/fair.c  |   92 +++++++--------------------------------------------
 kernel/sched/sched.h |    2 -
 3 files changed, 19 insertions(+), 83 deletions(-)

Index: linux-2.6/kernel/sched/debug.c
===================================================================
--- linux-2.6.orig/kernel/sched/debug.c
+++ linux-2.6/kernel/sched/debug.c
@@ -807,7 +807,7 @@ static void print_rq(struct seq_file *m,
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -830,15 +830,15 @@ void print_cfs_rq(struct seq_file *m, in
 	last = __pick_last_entity(cfs_rq);
 	if (last)
 		right_vruntime = last->vruntime;
-	min_vruntime = cfs_rq->min_vruntime;
+	zero_vruntime = cfs_rq->zero_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
 			SPLIT_NS(left_deadline));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
 			SPLIT_NS(left_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
-			SPLIT_NS(min_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
+			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -526,24 +526,6 @@ void account_cfs_rq_runtime(struct cfs_r
  * Scheduling class tree data structure manipulation methods:
  */
 
-static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime)
-{
-	s64 delta = (s64)(vruntime - max_vruntime);
-	if (delta > 0)
-		max_vruntime = vruntime;
-
-	return max_vruntime;
-}
-
-static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime)
-{
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta < 0)
-		min_vruntime = vruntime;
-
-	return min_vruntime;
-}
-
 static inline bool entity_before(const struct sched_entity *a,
 				 const struct sched_entity *b)
 {
@@ -556,7 +538,7 @@ static inline bool entity_before(const s
 
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
 }
 
 #define __node_2_se(node) \
@@ -608,13 +590,13 @@ static inline s64 entity_key(struct cfs_
  *
  * Which we track using:
  *
- *                    v0 := cfs_rq->min_vruntime
+ *                    v0 := cfs_rq->zero_vruntime
  * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
  *              \Sum w_i := cfs_rq->avg_load
  *
- * Since min_vruntime is a monotonic increasing variable that closely tracks
- * the per-task service, these deltas: (v_i - v), will be in the order of the
- * maximal (virtual) lag induced in the system due to quantisation.
+ * Since zero_vruntime closely tracks the per-task service, these
+ * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
+ * induced in the system due to quantisation.
  *
  * Also, we use scale_load_down() to reduce the size.
  *
@@ -673,7 +655,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	return cfs_rq->zero_vruntime + avg;
 }
 
 /*
@@ -734,7 +716,7 @@ static int vruntime_eligible(struct cfs_
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -742,42 +724,21 @@ int entity_eligible(struct cfs_rq *cfs_r
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
-static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+static void __update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
 {
-	u64 min_vruntime = cfs_rq->min_vruntime;
-	/*
-	 * open coded max_vruntime() to allow updating avg_vruntime
-	 */
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta > 0) {
-		avg_vruntime_update(cfs_rq, delta);
-		min_vruntime = vruntime;
-	}
-	return min_vruntime;
+	if (!delta)
+		return;
+
+	avg_vruntime_update(cfs_rq, delta);
+	cfs_rq->zero_vruntime += delta;
 }
 
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
+static void update_zero_vruntime(struct cfs_rq *cfs_rq)
 {
-	struct sched_entity *se = __pick_root_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
 
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
-	}
-
-	if (se) {
-		if (!curr)
-			vruntime = se->min_vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->min_vruntime);
-	}
-
-	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
+	__update_zero_vruntime(cfs_rq, delta);
 }
 
 static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
@@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -861,6 +823,7 @@ static void __dequeue_entity(struct cfs_
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	avg_vruntime_sub(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 }
 
 struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
@@ -1238,8 +1201,15 @@ static void update_curr(struct cfs_rq *c
 		return;
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+
+	/*
+	 * The 0-lag point will progress at cfs_rq->load rate.
+	 * Move along in order to keep the deltas minimal.
+	 */
+	__update_zero_vruntime(cfs_rq,
+			       __calc_delta(delta_exec, NICE_0_LOAD, &cfs_rq->load));
+
 	resched = update_deadline(cfs_rq, curr);
-	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
 		struct task_struct *p = task_of(curr);
@@ -3825,15 +3795,6 @@ static void reweight_entity(struct cfs_r
 		place_entity(cfs_rq, se, 0);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
-
-		/*
-		 * The entity's vruntime has been adjusted, so let's check
-		 * whether the rq-wide min_vruntime needs updated too. Since
-		 * the calculations above require stable min_vruntime rather
-		 * than up-to-date one, we do the update at the end of the
-		 * reweight process.
-		 */
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -5504,15 +5465,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	update_cfs_group(se);
 
-	/*
-	 * Now advance min_vruntime if @se was the entity holding it back,
-	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-	 * put back on, and if we advance min_vruntime, we'll be placed back
-	 * further than we started -- i.e. we'll be penalized.
-	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
-		update_min_vruntime(cfs_rq);
-
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
@@ -5630,6 +5582,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_zero_vruntime(cfs_rq);
 
 	/*
 	 * Ensure that runnable average is periodically updated.
@@ -13305,7 +13258,7 @@ static void set_next_task_fair(struct rq
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+	cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
Index: linux-2.6/kernel/sched/sched.h
===================================================================
--- linux-2.6.orig/kernel/sched/sched.h
+++ linux-2.6/kernel/sched/sched.h
@@ -652,7 +652,7 @@ struct cfs_rq {
 	s64			avg_vruntime;
 	u64			avg_load;
 
-	u64			min_vruntime;
+	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
 	u64			min_vruntime_fi;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-18 15:44               ` Peter Zijlstra
@ 2025-04-18 23:49                 ` Omar Sandoval
  2025-04-22  0:06                   ` Omar Sandoval
  2025-04-22 13:36                   ` Peter Zijlstra
  2025-04-19  2:53                 ` Mike Galbraith
  1 sibling, 2 replies; 30+ messages in thread
From: Omar Sandoval @ 2025-04-18 23:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Chris Mason, Pat Cody, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, patcody, kernel-team, Breno Leitao

On Fri, Apr 18, 2025 at 05:44:38PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 10:19:42AM -0400, Rik van Riel wrote:
> 
> > The most common warning by far, hitting
> > about 90% of the time we hit anything
> > in avg_vruntime_validate is the
> > WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)
> > 
> > The most common code path to getting there,
> > covering about 85% of the cases:
> > 
> > avg_vruntime_validate
> > avg_vruntime_sub
> > __dequeue_entity
> > set_next_entity
> > pick_task_fair
> > pick_next_task_fair
> > __pick_next_task
> > pick_next_task
> > __schedule
> > schedule
> 
> (I'm assuming zero_vruntime patch here, the stock kernel has more
> problems vs min_vruntime getting stuck in the future etc..)
> 
> So I have a theory about this. Key is that you're running a PREEMPT-NONE
> kernel.
> 
> There is one important site the overflow patch does not cover:
> avg_vruntime_update().
> 
> However, due to PREEMPT_NONE, it is possible (Chris mentioned direct
> reclaim and OOM) to have very long (in-kernel) runtimes without
> scheduling.
> 
> (I suppose this should be visible by tracing sched_switch)
> 
> Were this to happen, then avg_vruntime_add() gets to deal with 'key *
> weight' for a relatively large key. But per the overflow checks there,
> this isn't hitting (much).
> 
> But then we try and update zero_vruntime, and that is doing 'delta *
> cfs_rq->avg_load', and avg_load is a larger number and might just cause
> overflow. We don't have a check there.
> 
> If that overflows then avg_vruntime is buggered, but none of the
> individual tree entities hit overflow, exactly as you observe.
> 
> I've modified the zero_vruntime patch a little (new one below) to update
> zero_vruntime in update_curr(). Additionally, I have every tick re-align
> it with the tree avg (the update and the tree can drift due to numerical
> funnies).
> 
> This should ensure these very long in-kernel runtimes are broken up in
> at most tick sized chunks, and by keeping zero_vruntime more or less
> around the actual 0-lag point, the key values are minimized.
> 
> I should probably go play with a kernel module that spins for a few
> seconds with preemption disabled, see if I can make it go BOOM :-) But
> I've not yet done so.
> 
> FWIW, you can add something like:
> 
> Index: linux-2.6/kernel/sched/fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/fair.c
> +++ linux-2.6/kernel/sched/fair.c
> @@ -652,6 +652,9 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
>  static inline
>  void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
>  {
> +	s64 tmp;
> +	WARN_ON_ONCE(__builtin_mul_overflow(cfs_rq->avg_load, delta, &tmp);
> +
>  	/*
>  	 * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load
>  	 */
> 
> 
> To the overflow patch, to see if this mult goes splat.
> 
> 
> New zero_vruntime patch here:

Hey, Peter, thanks for your help with this. I have a question/potential
bug below.

> ---
>  kernel/sched/debug.c |    8 ++--
>  kernel/sched/fair.c  |   92 +++++++--------------------------------------------
>  kernel/sched/sched.h |    2 -
>  3 files changed, 19 insertions(+), 83 deletions(-)
> 
> Index: linux-2.6/kernel/sched/debug.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/debug.c
> +++ linux-2.6/kernel/sched/debug.c
> @@ -807,7 +807,7 @@ static void print_rq(struct seq_file *m,
>  
>  void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  {
> -	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> +	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
>  	struct sched_entity *last, *first, *root;
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long flags;
> @@ -830,15 +830,15 @@ void print_cfs_rq(struct seq_file *m, in
>  	last = __pick_last_entity(cfs_rq);
>  	if (last)
>  		right_vruntime = last->vruntime;
> -	min_vruntime = cfs_rq->min_vruntime;
> +	zero_vruntime = cfs_rq->zero_vruntime;
>  	raw_spin_rq_unlock_irqrestore(rq, flags);
>  
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
>  			SPLIT_NS(left_deadline));
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
>  			SPLIT_NS(left_vruntime));
> -	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
> -			SPLIT_NS(min_vruntime));
> +	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
> +			SPLIT_NS(zero_vruntime));
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
>  			SPLIT_NS(avg_vruntime(cfs_rq)));
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
> Index: linux-2.6/kernel/sched/fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/fair.c
> +++ linux-2.6/kernel/sched/fair.c
> @@ -526,24 +526,6 @@ void account_cfs_rq_runtime(struct cfs_r
>   * Scheduling class tree data structure manipulation methods:
>   */
>  
> -static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime)
> -{
> -	s64 delta = (s64)(vruntime - max_vruntime);
> -	if (delta > 0)
> -		max_vruntime = vruntime;
> -
> -	return max_vruntime;
> -}
> -
> -static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime)
> -{
> -	s64 delta = (s64)(vruntime - min_vruntime);
> -	if (delta < 0)
> -		min_vruntime = vruntime;
> -
> -	return min_vruntime;
> -}
> -
>  static inline bool entity_before(const struct sched_entity *a,
>  				 const struct sched_entity *b)
>  {
> @@ -556,7 +538,7 @@ static inline bool entity_before(const s
>  
>  static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	return (s64)(se->vruntime - cfs_rq->min_vruntime);
> +	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
>  }
>  
>  #define __node_2_se(node) \
> @@ -608,13 +590,13 @@ static inline s64 entity_key(struct cfs_
>   *
>   * Which we track using:
>   *
> - *                    v0 := cfs_rq->min_vruntime
> + *                    v0 := cfs_rq->zero_vruntime
>   * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
>   *              \Sum w_i := cfs_rq->avg_load
>   *
> - * Since min_vruntime is a monotonic increasing variable that closely tracks
> - * the per-task service, these deltas: (v_i - v), will be in the order of the
> - * maximal (virtual) lag induced in the system due to quantisation.
> + * Since zero_vruntime closely tracks the per-task service, these
> + * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
> + * induced in the system due to quantisation.
>   *
>   * Also, we use scale_load_down() to reduce the size.
>   *
> @@ -673,7 +655,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>  		avg = div_s64(avg, load);
>  	}
>  
> -	return cfs_rq->min_vruntime + avg;
> +	return cfs_rq->zero_vruntime + avg;
>  }
>  
>  /*
> @@ -734,7 +716,7 @@ static int vruntime_eligible(struct cfs_
>  		load += weight;
>  	}
>  
> -	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
> +	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
>  }
>  
>  int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -742,42 +724,21 @@ int entity_eligible(struct cfs_rq *cfs_r
>  	return vruntime_eligible(cfs_rq, se->vruntime);
>  }
>  
> -static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
> +static void __update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
>  {
> -	u64 min_vruntime = cfs_rq->min_vruntime;
> -	/*
> -	 * open coded max_vruntime() to allow updating avg_vruntime
> -	 */
> -	s64 delta = (s64)(vruntime - min_vruntime);
> -	if (delta > 0) {
> -		avg_vruntime_update(cfs_rq, delta);
> -		min_vruntime = vruntime;
> -	}
> -	return min_vruntime;
> +	if (!delta)
> +		return;
> +
> +	avg_vruntime_update(cfs_rq, delta);
> +	cfs_rq->zero_vruntime += delta;
>  }
>  
> -static void update_min_vruntime(struct cfs_rq *cfs_rq)
> +static void update_zero_vruntime(struct cfs_rq *cfs_rq)
>  {
> -	struct sched_entity *se = __pick_root_entity(cfs_rq);
> -	struct sched_entity *curr = cfs_rq->curr;
> -	u64 vruntime = cfs_rq->min_vruntime;
> +	u64 vruntime = avg_vruntime(cfs_rq);
> +	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
>  
> -	if (curr) {
> -		if (curr->on_rq)
> -			vruntime = curr->vruntime;
> -		else
> -			curr = NULL;
> -	}
> -
> -	if (se) {
> -		if (!curr)
> -			vruntime = se->min_vruntime;
> -		else
> -			vruntime = min_vruntime(vruntime, se->min_vruntime);
> -	}
> -
> -	/* ensure we never gain time by being placed backwards. */
> -	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
> +	__update_zero_vruntime(cfs_rq, delta);
>  }
>  
>  static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> @@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
>  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	avg_vruntime_add(cfs_rq, se);
> +	update_zero_vruntime(cfs_rq);

Won't this double-count cfs_rq->curr in the avg_vruntime() calculation
in update_zero_vruntime()? When cfs_rq->curr->on_rq, put_prev_entity()
calls this with se == cfs_rq->curr _before_ setting cfs_rq->curr to
NULL. So the avg_vruntime_add() call will add entity_key(cfs_rq->curr)
to cfs_rq->avg_vruntime and se_weight(cfs_rq->curr) to cfs_rq->avg_load.
Then update_zero_vruntime() calls avg_vruntime(), which still sees
curr->on_rq and will add curr's key and weight again. This throws
zero_vruntime off, maybe by enough to bust zero_vruntime and/or
avg_vruntime?

Should the call to update_zero_vruntime() go before avg_vruntime_add()?

Thanks,
Omar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-18 23:49                 ` Omar Sandoval
@ 2025-04-22  0:06                   ` Omar Sandoval
  2025-04-22 14:13                     ` Peter Zijlstra
  2025-04-22 13:36                   ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Omar Sandoval @ 2025-04-22  0:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Chris Mason, Pat Cody, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, patcody, kernel-team, Breno Leitao

On Fri, Apr 18, 2025 at 04:49:08PM -0700, Omar Sandoval wrote:
> On Fri, Apr 18, 2025 at 05:44:38PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 16, 2025 at 10:19:42AM -0400, Rik van Riel wrote:
> > 
> > > The most common warning by far, hitting
> > > about 90% of the time we hit anything
> > > in avg_vruntime_validate is the
> > > WARN_ON_ONCE(cfs_rq->avg_vruntime != vruntime)
> > > 
> > > The most common code path to getting there,
> > > covering about 85% of the cases:
> > > 
> > > avg_vruntime_validate
> > > avg_vruntime_sub
> > > __dequeue_entity
> > > set_next_entity
> > > pick_task_fair
> > > pick_next_task_fair
> > > __pick_next_task
> > > pick_next_task
> > > __schedule
> > > schedule
> > 
> > (I'm assuming zero_vruntime patch here, the stock kernel has more
> > problems vs min_vruntime getting stuck in the future etc..)
> > 
> > So I have a theory about this. Key is that you're running a PREEMPT-NONE
> > kernel.
> > 
> > There is one important site the overflow patch does not cover:
> > avg_vruntime_update().
> > 
> > However, due to PREEMPT_NONE, it is possible (Chris mentioned direct
> > reclaim and OOM) to have very long (in-kernel) runtimes without
> > scheduling.
> > 
> > (I suppose this should be visible by tracing sched_switch)
> > 
> > Were this to happen, then avg_vruntime_add() gets to deal with 'key *
> > weight' for a relatively large key. But per the overflow checks there,
> > this isn't hitting (much).
> > 
> > But then we try and update zero_vruntime, and that is doing 'delta *
> > cfs_rq->avg_load', and avg_load is a larger number and might just cause
> > overflow. We don't have a check there.
> > 
> > If that overflows then avg_vruntime is buggered, but none of the
> > individual tree entities hit overflow, exactly as you observe.
> > 
> > I've modified the zero_vruntime patch a little (new one below) to update
> > zero_vruntime in update_curr(). Additionally, I have every tick re-align
> > it with the tree avg (the update and the tree can drift due to numerical
> > funnies).
> > 
> > This should ensure these very long in-kernel runtimes are broken up in
> > at most tick sized chunks, and by keeping zero_vruntime more or less
> > around the actual 0-lag point, the key values are minimized.
> > 
> > I should probably go play with a kernel module that spins for a few
> > seconds with preemption disabled, see if I can make it go BOOM :-) But
> > I've not yet done so.

[snip]

> >  static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> > @@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
> >  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >  	avg_vruntime_add(cfs_rq, se);
> > +	update_zero_vruntime(cfs_rq);
> 
> Won't this double-count cfs_rq->curr in the avg_vruntime() calculation
> in update_zero_vruntime()? When cfs_rq->curr->on_rq, put_prev_entity()
> calls this with se == cfs_rq->curr _before_ setting cfs_rq->curr to
> NULL. So the avg_vruntime_add() call will add entity_key(cfs_rq->curr)
> to cfs_rq->avg_vruntime and se_weight(cfs_rq->curr) to cfs_rq->avg_load.
> Then update_zero_vruntime() calls avg_vruntime(), which still sees
> curr->on_rq and will add curr's key and weight again. This throws
> zero_vruntime off, maybe by enough to bust zero_vruntime and/or
> avg_vruntime?
> 
> Should the call to update_zero_vruntime() go before avg_vruntime_add()?
> 
> Thanks,
> Omar

Hey, Peter,

We haven't been able to test your latest patch, but I dug through some
core dumps from crashes with your initial zero_vruntime patch. It looks
like on just about all of them, the entity vruntimes are way too spread
out, so we would get overflows regardless of what we picked as
zero_vruntime.

As a representative example, we have a cfs_rq with 3 entities with the
follow vruntimes and (scaled down) weights:

vruntime           weight
39052385155836636  2      (curr)
43658311782076206  2
42824722322062111  4886

The difference between the minimum and maximum is 4605926626239570,
which is 53 bits. The total load is 4890. Even if you picked
zero_vruntime to be equidistant from the minimum and maximum, the
(vruntime - zero_vruntime) * load calculation in entity_eligible() is
doomed to overflow.

That range in vruntime seems too absurd to be due to only to running too
long without preemption. We're only seeing these crashes on internal
node cgroups (i.e., cgroups whose children are cgroups, not tasks). This
all leads me to suspect reweight_entity().

Specifically, this line in reweight_entity():

	se->vlag = div_s64(se->vlag * se->load.weight, weight);

seems like it could create a very large vlag, which could cause
place_entity() to adjust vruntime by a large value. (place_entity() has
a similarly suspect adjustment on se->vlag, only update_entity_lag()
clamps it).

I'll try to reproduce something like this, but do you have any thoughts
on this theory in the meantime?

Thanks,
Omar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-22  0:06                   ` Omar Sandoval
@ 2025-04-22 14:13                     ` Peter Zijlstra
  2025-04-22 15:14                       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-22 14:13 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Rik van Riel, Chris Mason, Pat Cody, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, patcody, kernel-team, Breno Leitao

On Mon, Apr 21, 2025 at 05:06:45PM -0700, Omar Sandoval wrote:

> Hey, Peter,
> 
> We haven't been able to test your latest patch, but I dug through some
> core dumps from crashes with your initial zero_vruntime patch. It looks
> like on just about all of them, the entity vruntimes are way too spread
> out, so we would get overflows regardless of what we picked as
> zero_vruntime.
> 
> As a representative example, we have a cfs_rq with 3 entities with the
> follow vruntimes and (scaled down) weights:
> 
> vruntime           weight
> 39052385155836636  2      (curr)
> 43658311782076206  2
> 42824722322062111  4886
> 
> The difference between the minimum and maximum is 4605926626239570,

Right, that is quite beyond usable. The key question at this point
is how did we get here...

> which is 53 bits. The total load is 4890. Even if you picked
> zero_vruntime to be equidistant from the minimum and maximum, the
> (vruntime - zero_vruntime) * load calculation in entity_eligible() is
> doomed to overflow.
> 
> That range in vruntime seems too absurd to be due to only to running too
> long without preemption. We're only seeing these crashes on internal
> node cgroups (i.e., cgroups whose children are cgroups, not tasks). This
> all leads me to suspect reweight_entity().
> 
> Specifically, this line in reweight_entity():
> 
> 	se->vlag = div_s64(se->vlag * se->load.weight, weight);
> 
> seems like it could create a very large vlag, which could cause
> place_entity() to adjust vruntime by a large value.

Right, I fixed that not too long ago. At the time I convinced myself
clipping there wasn't needed (in fact, it would lead to some other
artifacts iirc). Let me go review that decision :-)

> (place_entity() has
> a similarly suspect adjustment on se->vlag, only update_entity_lag()
> clamps it).

place_entity() is a bit tricky -- but should be well behaved. The
problem is that by adding an entity, you affect the average, because the
average shifts, the lag after placement is different than the lag you
started with.

The scaling there is to ensure the lag after placement reflects the
initial lag. Much like how update_entity_lag() is the lag before
removal, place_entity() ensures the lag is measured after insertion.

> I'll try to reproduce something like this, but do you have any thoughts
> on this theory in the meantime?

It seems reasonable, but as said, let me go look too :-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-22 14:13                     ` Peter Zijlstra
@ 2025-04-22 15:14                       ` Peter Zijlstra
  2025-04-25  8:53                         ` Omar Sandoval
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-22 15:14 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Rik van Riel, Chris Mason, Pat Cody, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, patcody, kernel-team, Breno Leitao

On Tue, Apr 22, 2025 at 04:13:52PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 21, 2025 at 05:06:45PM -0700, Omar Sandoval wrote:
> 
> > Hey, Peter,
> > 
> > We haven't been able to test your latest patch, but I dug through some
> > core dumps from crashes with your initial zero_vruntime patch. It looks
> > like on just about all of them, the entity vruntimes are way too spread
> > out, so we would get overflows regardless of what we picked as
> > zero_vruntime.
> > 
> > As a representative example, we have a cfs_rq with 3 entities with the
> > follow vruntimes and (scaled down) weights:
> > 
> > vruntime           weight
> > 39052385155836636  2      (curr)
> > 43658311782076206  2
> > 42824722322062111  4886
> > 
> > The difference between the minimum and maximum is 4605926626239570,
> 
> Right, that is quite beyond usable. The key question at this point
> is how did we get here...
> 
> > which is 53 bits. The total load is 4890. Even if you picked
> > zero_vruntime to be equidistant from the minimum and maximum, the
> > (vruntime - zero_vruntime) * load calculation in entity_eligible() is
> > doomed to overflow.
> > 
> > That range in vruntime seems too absurd to be due to only to running too
> > long without preemption. We're only seeing these crashes on internal
> > node cgroups (i.e., cgroups whose children are cgroups, not tasks). This
> > all leads me to suspect reweight_entity().
> > 
> > Specifically, this line in reweight_entity():
> > 
> > 	se->vlag = div_s64(se->vlag * se->load.weight, weight);
> > 
> > seems like it could create a very large vlag, which could cause
> > place_entity() to adjust vruntime by a large value.
> 
> Right, I fixed that not too long ago. At the time I convinced myself
> clipping there wasn't needed (in fact, it would lead to some other
> artifacts iirc). Let me go review that decision :-)

In particular, the two most recent commits in this area are:

  https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
  https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net

(from the same thread).

Note that it does call update_entity_lag() which does clip. So after
that it's just scaling for the new weight.

Notably, virtual time = time / weight, and the clip limit is adjusted
for weight.

So if it is inside limits pre-scaling, it should still be in limits
after scaling.

l = max / w;

w->w' --> l' = l*w/w' = (max / w) * (w/w') = max / w'

I've stuck some trace_printk()s on again, and the numbers I get here
seem sane.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-22 15:14                       ` Peter Zijlstra
@ 2025-04-25  8:53                         ` Omar Sandoval
  0 siblings, 0 replies; 30+ messages in thread
From: Omar Sandoval @ 2025-04-25  8:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Chris Mason, Pat Cody, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, patcody, kernel-team, Breno Leitao

On Tue, Apr 22, 2025 at 05:14:21PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 22, 2025 at 04:13:52PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 21, 2025 at 05:06:45PM -0700, Omar Sandoval wrote:
> > 
> > > Hey, Peter,
> > > 
> > > We haven't been able to test your latest patch, but I dug through some
> > > core dumps from crashes with your initial zero_vruntime patch. It looks
> > > like on just about all of them, the entity vruntimes are way too spread
> > > out, so we would get overflows regardless of what we picked as
> > > zero_vruntime.
> > > 
> > > As a representative example, we have a cfs_rq with 3 entities with the
> > > follow vruntimes and (scaled down) weights:
> > > 
> > > vruntime           weight
> > > 39052385155836636  2      (curr)
> > > 43658311782076206  2
> > > 42824722322062111  4886
> > > 
> > > The difference between the minimum and maximum is 4605926626239570,
> > 
> > Right, that is quite beyond usable. The key question at this point
> > is how did we get here...
> > 
> > > which is 53 bits. The total load is 4890. Even if you picked
> > > zero_vruntime to be equidistant from the minimum and maximum, the
> > > (vruntime - zero_vruntime) * load calculation in entity_eligible() is
> > > doomed to overflow.
> > > 
> > > That range in vruntime seems too absurd to be due to only to running too
> > > long without preemption. We're only seeing these crashes on internal
> > > node cgroups (i.e., cgroups whose children are cgroups, not tasks). This
> > > all leads me to suspect reweight_entity().
> > > 
> > > Specifically, this line in reweight_entity():
> > > 
> > > 	se->vlag = div_s64(se->vlag * se->load.weight, weight);
> > > 
> > > seems like it could create a very large vlag, which could cause
> > > place_entity() to adjust vruntime by a large value.
> > 
> > Right, I fixed that not too long ago. At the time I convinced myself
> > clipping there wasn't needed (in fact, it would lead to some other
> > artifacts iirc). Let me go review that decision :-)
> 
> In particular, the two most recent commits in this area are:
> 
>   https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
>   https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
> 
> (from the same thread).
> 
> Note that it does call update_entity_lag() which does clip. So after
> that it's just scaling for the new weight.
> 
> Notably, virtual time = time / weight, and the clip limit is adjusted
> for weight.
> 
> So if it is inside limits pre-scaling, it should still be in limits
> after scaling.
> 
> l = max / w;
> 
> w->w' --> l' = l*w/w' = (max / w) * (w/w') = max / w'
> 
> I've stuck some trace_printk()s on again, and the numbers I get here
> seem sane.

For anyone following along, I found the source of the bad vruntimes and
sent a patch:

https://lore.kernel.org/all/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-18 23:49                 ` Omar Sandoval
  2025-04-22  0:06                   ` Omar Sandoval
@ 2025-04-22 13:36                   ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-04-22 13:36 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Rik van Riel, Chris Mason, Pat Cody, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, patcody, kernel-team, Breno Leitao

On Fri, Apr 18, 2025 at 04:49:08PM -0700, Omar Sandoval wrote:
> On Fri, Apr 18, 2025 at 05:44:38PM +0200, Peter Zijlstra wrote:

> > @@ -850,6 +811,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
> >  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >  	avg_vruntime_add(cfs_rq, se);
> > +	update_zero_vruntime(cfs_rq);
> 
> Won't this double-count cfs_rq->curr in the avg_vruntime() calculation
> in update_zero_vruntime()? When cfs_rq->curr->on_rq, put_prev_entity()
> calls this with se == cfs_rq->curr _before_ setting cfs_rq->curr to
> NULL.

Yep, right you are. I note that set_next_entity() sets cfs_rq->curr
late, and does not have this issue. I'll look at making
put_prev_entity() clear cfs_rq->curr early.

(obviously I'll have to check all that it does after is not using curr)

> So the avg_vruntime_add() call will add entity_key(cfs_rq->curr)
> to cfs_rq->avg_vruntime and se_weight(cfs_rq->curr) to cfs_rq->avg_load.
> Then update_zero_vruntime() calls avg_vruntime(), which still sees
> curr->on_rq and will add curr's key and weight again. This throws
> zero_vruntime off, maybe by enough to bust zero_vruntime and/or
> avg_vruntime?
> 
> Should the call to update_zero_vruntime() go before avg_vruntime_add()?

The thing is that adding/removing from the tree makes avg_vruntime jump
around a bit, and I wanted to adjust for that jump (in the lazy way).

So it needs to be after enqueue/dequeue.



Meanwhile, I've done a custom module that does:

	preempt_disable();
	/* 'spin' for a minute */
	for (int i = 0; i < 60*1000; i++)
		__ndelay(1000000);
	preempt_enable();

just to emulate some ridiculous PREEMPT_NONE kernel section, and while
it trips RCU and Soft Lockup watchdogs, it does not seem to trip any of
the __builtin_overflow bits, even when ran at nice 19.

(this was with the zero_vruntime thing on, I've yet to try with the
upstream min_vruntime thing)

So unless you've disabled those watchdogs (or pushed their limits up),
I'm fairly confident that there's no overflow to be had, even with that
curr issue.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
  2025-04-18 15:44               ` Peter Zijlstra
  2025-04-18 23:49                 ` Omar Sandoval
@ 2025-04-19  2:53                 ` Mike Galbraith
  1 sibling, 0 replies; 30+ messages in thread
From: Mike Galbraith @ 2025-04-19  2:53 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Chris Mason, Pat Cody, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, patcody, kernel-team, Breno Leitao

On Fri, 2025-04-18 at 17:44 +0200, Peter Zijlstra wrote:
> 
> However, due to PREEMPT_NONE, it is possible (Chris mentioned direct
> reclaim and OOM) to have very long (in-kernel) runtimes without
> scheduling.

Dunno what Chris considers 'very long', but FYI you can take some
pretty awful hits even sans PREEMPT_NONE.  In 6.12 time frame I caught
move_expired_inodes() eating 66ms, and isolate_lru_folios() holding
IRQs off for 78ms via dirt simple kbuild+bonnie on spinning rust.

(fugly bandaids for both linger in my 6.12+ trees compost heaps) 

> (I suppose this should be visible by tracing sched_switch)

(I used modified wakeup_rt that only traces max prio)

	-Mike

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <20250320173438.3562449-2-patcody@meta.com>]

* Re: [PATCH] sched/fair: Add null pointer check to pick_next_entity()
       [not found] <20250320173438.3562449-2-patcody@meta.com>
@ 2025-03-24 15:56 ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2025-03-24 15:56 UTC (permalink / raw)
  To: Pat Cody
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, vschneid, linux-kernel, riel, kernel-team,
	stable

On Thu, 20 Mar 2025 10:34:39 -0700
Pat Cody <patcody@meta.com> wrote:

> pick_eevdf() can return null, resulting in a null pointer dereference
> crash in pick_next_entity()
> 
> The other call site of pick_eevdf() can already handle a null pointer,
> and pick_next_entity() can already return null as well. Add an extra
> check to handle the null return here.

Have you actually triggered this?

> 
> Cc: stable@kernel.org
> Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue")


No signed-off-by.

> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a553181dc764..f2157298cbce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5560,6 +5560,8 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>  	}
>  
>  	struct sched_entity *se = pick_eevdf(cfs_rq);
> +	if (!se)
> +		return NULL;

So the place that this is called is in pick_task_fair() which has:

	cfs_rq = &rq->cfs;
	if (!cfs_rq->nr_queued)
		return NULL;

Where I believe it's a bug if cfs_rq->nr_queued > 0 and pick_eevdf() returns NULL.

Which means in should never happen. Now, to be more robust, we could add a
check for NULL, but this is not a fix, nor should it go to stable.

-- Steve



>  	if (se->sched_delayed) {
>  		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
>  		/*


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-04-25  8:53 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20 20:53 [PATCH] sched/fair: Add null pointer check to pick_next_entity() Pat Cody
2025-03-20 22:42 ` Christian Loehle
2025-03-21 17:52   ` Pat Cody
2025-03-24 11:56 ` Peter Zijlstra
2025-03-25 15:12   ` Pat Cody
2025-03-25 18:59     ` Peter Zijlstra
2025-03-26 19:26       ` Pat Cody
2025-04-02 14:59   ` Rik van Riel
2025-04-02 18:07     ` Peter Zijlstra
2025-04-09 14:29       ` Rik van Riel
2025-04-09 15:27         ` Peter Zijlstra
2025-04-11 14:51           ` Rik van Riel
2025-04-14  9:08             ` Peter Zijlstra
2025-04-14 15:38               ` Chris Mason
2025-04-15 10:07                 ` Peter Zijlstra
2025-04-16  7:59                   ` Peter Zijlstra
     [not found]   ` <7B2CFC16-1ADE-4565-B555-7525A50494C2@surriel.com>
     [not found]     ` <20250402082221.GT5880@noisy.programming.kicks-ass.net>
2025-04-14 19:57       ` Rik van Riel
2025-04-15  8:02         ` Peter Zijlstra
2025-04-16 12:44           ` Peter Zijlstra
2025-04-16 14:19             ` Rik van Riel
2025-04-16 15:27               ` Chris Mason
2025-04-18 15:44               ` Peter Zijlstra
2025-04-18 23:49                 ` Omar Sandoval
2025-04-22  0:06                   ` Omar Sandoval
2025-04-22 14:13                     ` Peter Zijlstra
2025-04-22 15:14                       ` Peter Zijlstra
2025-04-25  8:53                         ` Omar Sandoval
2025-04-22 13:36                   ` Peter Zijlstra
2025-04-19  2:53                 ` Mike Galbraith
     [not found] <20250320173438.3562449-2-patcody@meta.com>
2025-03-24 15:56 ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox