public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Zhan Xusheng <zhanxusheng1024@gmail.com>,
	Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	linux-kernel@vger.kernel.org,
	Zhan Xusheng <zhanxusheng@xiaomi.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
Date: Fri, 1 May 2026 12:40:06 +0200	[thread overview]
Message-ID: <20260501104006.GA3102624@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <4870ea19-78d4-44b2-9f18-14c3f8782726@amd.com>

On Wed, Apr 29, 2026 at 10:33:48AM +0530, K Prateek Nayak wrote:

> > Anyway... let me construct a worst case.
> > 
> > So if you have this cgroup crap on, then you can have an entity of
> > weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> > NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> > 
> > The other end is 100*NICE_0_LOAD, so lets wake that, then you get:
> > 
> > {key, weight}[] := {
> >   puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
> >   max:  { 0,                                 100*NICE_0_LOAD },
> > }
> 
> So MAX_SHARES is (1UL << 18) and we then scale it up so the weight can
> can go up to (1 << 28) I suppose in UP setting.
> 
> So let us assume there is a single task with the largest custom slice on
> each cgroup making the min_slice equal to 100ms. Things start off in
> such a way that puny having a vruntime of say -1 is picked over max which
> had a vruntime of 0 for the sake of simplicity.
> 
> Deadline then becomes:
> 
>     puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)
> 
> which is 52428800000000 - 1.
> 
> At 10HZ tick + RUN_TO_PARITY, we get 10ms error so that puts the
> entity_key() at 57671680000000 (46 bits)
> 
> > 
> > The avg_vruntime() would end up being very close to 0 (which is
> > zero_vruntime), so no real help making that more accurate.
> > 
> > vruntime_eligible(puny) ends up with:
> > 
> >  avg    = 2 * puny.key (+ 0)
> >  weight = 2 + 100 * NICE_0_LOAD
> > 
> >  avg >= puny.key * weight
> > 
> > And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100
> > > and yes, I suppose that can exceed 64bit :-(
> 
> Then the worst case right side would be:
> 
>     57671680000000 * load /* load = ((1 << 28) + 2) */
> 
> and yes that goes beyond 64-bits at 15481123719086080000000.
> 
> I'm running with this configurations:
> 
>     echo 2 > /sys/fs/cgroup/cg0/cpu.weight
>     echo 10000 > /sys/fs/cgroup/cg1/cpu.weight
> 
> Weight of 2 actually gets scaled up to 20480 for UP scenario so I go try
> to make it even smaller with SMP and wait until it gets picked again.

Make sure all of the actual activity of that cgroup happens 'elsewhere'
and you can get it down to an actual 2 per calc_group_shares().

> without any splat so I'm not sure if there is something that prevents
> a possible crash since a weight of 104857600 should have definitely
> made that entity_key() overflow.

Right.

Anyway, I had a poke around with godbolt, and the below seems to
generate the best code for things like x86_64 and arm64.

Specifically, the __builtin_mul_overflow() already has to compute the
128 bit product anyway for most architectures, so using that directly
then leads to saner asm and easier to understand code.

AFAICT HPPA64 is the only 64bit architecture that doesn't implement
__int128 and will thus be demoted to doing what we do on 32bit.

Now all I need to do is write a coherent Changelog to go with it ...
*sigh*

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..214fe9d99834 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..f15f4468efe5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
  * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
  * increase coverage and consistency always enable it on 64-bit platforms.
  */
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)					\
@@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
 		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
 	__w;							\
 })
+typedef __int128 sched_double_long_t;
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
+typedef s64 sched_double_long_t;
 #endif
 
 /*

  parent reply	other threads:[~2026-05-01 10:40 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-15 14:57 [PATCH] sched/fair: Fix overflow in vruntime_eligible() Zhan Xusheng
2026-04-28 14:49 ` [PATCH RESEND] " Zhan Xusheng
2026-04-28 16:17   ` K Prateek Nayak
2026-04-28 17:35     ` Peter Zijlstra
2026-04-29  5:03       ` K Prateek Nayak
2026-04-29  7:31         ` Zhan Xusheng
2026-05-01 10:40         ` Peter Zijlstra [this message]
2026-05-01 10:48           ` Peter Zijlstra
2026-05-01 13:05           ` David Laight
2026-05-01 13:43             ` Peter Zijlstra
2026-05-04 11:22           ` Peter Zijlstra
2026-05-04 13:16             ` Heiko Carstens
2026-05-04 14:57               ` Peter Zijlstra
2026-05-04 17:40               ` Stefan Schulze Frielinghaus

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260501104006.GA3102624@noisy.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=zhanxusheng1024@gmail.com \
    --cc=zhanxusheng@xiaomi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox