public inbox for linux-kernel@vger.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	mingo@kernel.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	linux-kernel@vger.kernel.org, wangtao554@huawei.com,
	quzicheng@huawei.com, dsmythies@telus.net,
	shubhang@os.amperecomputing.com
Subject: Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
Date: Thu, 2 Apr 2026 12:22:15 +0200
Message-ID: <20260402102215.GT3738010@noisy.programming.kicks-ass.net>
In-Reply-To: <99fa12f9-71d3-4766-8742-a3adc9ce4271@amd.com>

On Thu, Apr 02, 2026 at 10:58:18AM +0530, K Prateek Nayak wrote:
> On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
> >     ------------[ cut here ]------------
> >     (w_vruntime >> 63) != (w_vruntime >> 62)
> >     WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
> 
> Back to this: I still see this with the latest set of changes on
> queue:sched/urgent but it doesn't go kaboom. Nonetheless, it suggests we
> are closing in on the s64 limits of "sum_w_vruntime", which isn't
> very comforting.

Yeah, we are pushing 64bit pretty hard :/ And if all we cared about
was x86_64 I'd have long since used the fact that imul has a 128bit
result and idiv actually divides a 128bit dividend. But even among 64bit
architectures that is somewhat rare :/

> Here is one scenario where it was triggered when running:
> 
>     stress-ng --yield=32 -t 10000000s&
>     while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
> 
> on a 256-CPU machine after about an hour into the run:
> 
> >     __enqueue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
>     cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
>     cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
> 
> The above comes from __enqueue_entity() after a place_entity(). Breaking
> this down:
> 
>     vlag_initial = 57498
>     vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
> 
>     vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
>     entity_key(se, cfs_rq) = -141,245,081,754
> 
> Now, multiplying the entity_key by its own weight results in
> 5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
> in Python, without overflow, this would be: -12,837,944,014,404,397,056

Oh gawd, this is a 'fun' case.

> One way to avoid the warning entirely would be to pull the zero_vruntime
> close to avg_vruntime if we are enqueuing a very heavy entity.
> 
> The correct way to do this would be to compute the actual avg_vruntime()
> and move the zero_vruntime to that point (but that requires at least one
> multiply + divide + update_zero_vruntime()).
> 
> One seemingly cheap way by which I've been able to avoid the warning is
> with:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..bc708bb8b5d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5329,6 +5329,7 @@ static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	u64 vslice, vruntime = avg_vruntime(cfs_rq);
> +	bool update_zero = false;
>  	s64 lag = 0;
>  
>  	if (!se->custom_slice)
> @@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
>  
>  		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
> +		/*
> +		 * If the entity_key() * sum_weight of all the enqueued entities
> +		 * is more than the sum_w_vruntime, move the zero_vruntime
> +		 * point to the vruntime of the entity which prevents using
> +		 * more bits than necessary for sum_w_vruntime until the
> +		 * next avg_vruntime().
> +		 *
> +		 * XXX: Cheap enough check?
> +		 */
> +		if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
> +			update_zero = true;
>  		if (WARN_ON_ONCE(!load))
>  			load = 1;
>  		lag = div64_long(lag, load);
> @@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	se->vruntime = vruntime - lag;
>  
> +	if (update_zero)
> +		update_zero_vruntime(cfs_rq, -lag);
> +
>  	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
>  		se->deadline += se->vruntime;
>  		se->rel_deadline = 0;
> ---
> 
> But I'm sure it'll make people nervous since we basically move the
> zero_vruntime to se->vruntime. It isn't too bad if:
> 
>     abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)
> 
> but we already know that the latter overflows, so is there any other,
> cheaper indicator that we can use to detect the necessity to adjust the
> avg_vruntime beforehand at place_entity()?

So in general I think it would be fine to move zero_vruntime to the
heaviest entity in the tree. And if there are multiple equal heaviest
weights, any one of them should be fine.

Per necessity heavy entities are more tightly clustered -- the lag is
inversely proportional to weight, and the spread is proportional to the
lag bound.

I suspect something simple like comparing the entity weight against the
sum_weight might be enough. If the pre-existing tree is, in aggregate,
heavier than the new element, the avg will not move very drastically.
However, if the new element is (significantly) heavier than the tree,
the avg will move significantly (as demonstrated here).

That is, something like the below... But with a comment ofc :-)

Does that make sense?


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..7fbd9538fe30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5329,6 +5329,7 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+	bool update_zero = false;
 	s64 lag = 0;
 
 	if (!se->custom_slice)
@@ -5345,7 +5346,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
-		long load;
+		long load, weight;
 
 		lag = se->vlag;
 
@@ -5405,14 +5406,21 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (curr && curr->on_rq)
 			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
 
-		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+		lag *= load + weight;
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div64_long(lag, load);
+
+		if (weight > load)
+			update_zero = true;
 	}
 
 	se->vruntime = vruntime - lag;
 
+	if (update_zero)
+		update_zero_vruntime(cfs_rq, -lag);
+
 	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;
