public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: David Laight <david.laight.linux@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
	Zhan Xusheng <zhanxusheng1024@gmail.com>,
	Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	linux-kernel@vger.kernel.org,
	Zhan Xusheng <zhanxusheng@xiaomi.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()
Date: Fri, 1 May 2026 14:05:21 +0100	[thread overview]
Message-ID: <20260501140521.27be883d@pumpkin> (raw)
In-Reply-To: <20260501104006.GA3102624@noisy.programming.kicks-ass.net>

On Fri, 1 May 2026 12:40:06 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Apr 29, 2026 at 10:33:48AM +0530, K Prateek Nayak wrote:
> 
> > > Anyway... let me construct a worst case.
> > > 
> > > So if you have this cgroup crap on, then you can have an entity of
> > > weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> > > NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> > > 
> > > The other end is 100*NICE_0_LOAD, so lets wake that, then you get:
> > > 
> > > {key, weight}[] := {
> > >   puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
> > >   max:  { 0,                                 100*NICE_0_LOAD },
> > > }  
> > 
> > So MAX_SHARES is (1UL << 18) and we then scale it up so the weight can
> > can go up to (1 << 28) I suppose in UP setting.
> > 
> > So let us assume there is a single task with the largest custom slice on
> > each cgroup making the min_slice equal to 100ms. Things start off in
> > such a way that puny having a vruntime of say -1 is picked over max which
> > had a vruntime of 0 for the sake of simplicity.
> > 
> > Deadline then becomes:
> > 
> >     puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)
> > 
> > which is 52428800000000 - 1.
> > 
> > At 10HZ tick + RUN_TO_PARITY, we get 10ms error so that puts the
> > entity_key() at 57671680000000 (46 bits)
> >   
> > > 
> > > The avg_vruntime() would end up being very close to 0 (which is
> > > zero_vruntime), so no real help making that more accurate.
> > > 
> > > vruntime_eligible(puny) ends up with:
> > > 
> > >  avg    = 2 * puny.key (+ 0)
> > >  weight = 2 + 100 * NICE_0_LOAD
> > > 
> > >  avg >= puny.key * weight
> > > 
> > > And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100  
> > > > and yes, I suppose that can exceed 64bit :-(  
> > 
> > Then the worst case right side would be:
> > 
> >     57671680000000 * load /* load = ((1 << 28) + 2) */
> > 
> > and yes that goes beyond 64-bits at 15481123719086080000000.
> > 
> > I'm running with this configurations:
> > 
> >     echo 2 > /sys/fs/cgroup/cg0/cpu.weight
> >     echo 10000 > /sys/fs/cgroup/cg1/cpu.weight
> > 
> > Weight of 2 actually gets scaled up to 20480 for UP scenario so I go try
> > to make it even smaller with SMP and wait until it gets picked again.  
> 
> Make sure all of the actual activity of that cgroup happens 'elsewhere'
> and you can get it down to an actual 2 per calc_group_shares().
> 
> > without any splat so I'm not sure if there is something that prevents
> > a possible crash since a weight of 104857600 should have definitely
> > made that entity_key() overflow.  
> 
> Right.
> 
> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
> 
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.

Older versions of gcc (maybe before 10?) generate a call to the full
128 bit multiply (_muldid3 ?) on mips64 - rather than just using the instruction
that generates the high part of the product.
The library function is now included but is slightly more complex because
it does the high*low multiples as well.

The same happens on sparc64 and the asm multiply code looks strange.

> 
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.
> 
> Now all I need to do is write a coherent Changelog to go with it ...
> *sigh*
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..214fe9d99834 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
>  		load += weight;
>  	}
>  
> -	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> +	return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
>  }
>  
>  int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..f15f4468efe5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
>   * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
>   * increase coverage and consistency always enable it on 64-bit platforms.
>   */
> -#ifdef CONFIG_64BIT
> +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
>  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load_down(w)					\
> @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
>  		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
>  	__w;							\
>  })
> +typedef __int128 sched_double_long_t;
>  #else
>  # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
> +typedef s64 sched_double_long_t;

Seems strange that this is signed but the 64bit one is unsigned.

-- David

>  #endif
>  
>  /*
> 


  parent reply	other threads:[~2026-05-01 13:05 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-15 14:57 [PATCH] sched/fair: Fix overflow in vruntime_eligible() Zhan Xusheng
2026-04-28 14:49 ` [PATCH RESEND] " Zhan Xusheng
2026-04-28 16:17   ` K Prateek Nayak
2026-04-28 17:35     ` Peter Zijlstra
2026-04-29  5:03       ` K Prateek Nayak
2026-04-29  7:31         ` Zhan Xusheng
2026-05-01 10:40         ` Peter Zijlstra
2026-05-01 10:48           ` Peter Zijlstra
2026-05-01 13:05           ` David Laight [this message]
2026-05-01 13:43             ` Peter Zijlstra
2026-05-04 11:22           ` Peter Zijlstra
2026-05-04 13:16             ` Heiko Carstens
2026-05-04 14:57               ` Peter Zijlstra
2026-05-04 17:40               ` Stefan Schulze Frielinghaus

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260501140521.27be883d@pumpkin \
    --to=david.laight.linux@gmail.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=zhanxusheng1024@gmail.com \
    --cc=zhanxusheng@xiaomi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox