public inbox for linux-kernel@vger.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	mingo@kernel.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	linux-kernel@vger.kernel.org, wangtao554@huawei.com,
	quzicheng@huawei.com, dsmythies@telus.net,
	shubhang@os.amperecomputing.com
Subject: Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
Date: Tue, 7 Apr 2026 14:00:52 +0200	[thread overview]
Message-ID: <20260407120052.GG3738010@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <b004fa56-b4b8-4cd9-9431-a576f629f31d@amd.com>

On Fri, Apr 03, 2026 at 09:32:22AM +0530, K Prateek Nayak wrote:
> On 4/2/2026 4:26 PM, K Prateek Nayak wrote:
> >> That is, something like the below... But with a comment ofc :-)
> >>
> >> Does that make sense?
> > 
> > Let me go queue an overnight test to see if I trip that warning or
> > not.
> 
> Didn't trip any warning and the machine is still up and running
> after 15 Hours so feel free to include:
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> Perhaps the comment can read something like:
> 
> 		/*
> 		 * A heavy entity can pull the avg_vruntime close to its
> 		 * vruntime post enqueue but the zero_vruntime point is
> 		 * only updated at the next update_deadline() / enqueue
> 		 * / dequeue.
> 		 *
> 		 * Until then, the sum_w_vruntime grows quadratically,
> 		 * proportional to the entity's weight (w_i) as:
> 		 *
> 		 *   sum_w_vruntime -= (lag_i * (W + w_i) / W) * w_i
> 		 *
> 		 * If w_i > W, it is beneficial to pull the
> 		 * zero_vruntime towards the entity's vruntime (V_i)
> 		 * since the sum_w_vruntime would only grow by
> 		 * (lag_i * W), which consumes fewer bits than leaving
> 		 * the zero_vruntime at the pre-enqueue avg_vruntime.
> 		 */
> 		if (weight > load)
> 			update_zero = true;
> 
> Feel free to reword as you see fit :-)

I've made it like so. You did all the hard work after all. Thanks!

---
Subject: sched/fair: Avoid overflow in enqueue_entity()
From: K Prateek Nayak <kprateek.nayak@amd.com>
Date: Tue Apr  7 13:36:17 CEST 2026

Here is one scenario which was triggered when running:

    stress-ng --yield=32 -t 10000000s&
    while true; do perf bench sched messaging -p -t -l 100000 -g 16; done

on a 256CPUs machine after about an hour into the run:

    __enqueue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
    cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
    cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)

The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:

    vlag_initial = 57498
    vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754

    vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
    entity_key(se, cfs_rq) = -141,245,081,754

Now, multiplying the entity_key by the entity's weight results in
5,608,800,059,305,154,560 (the same value overflow_mul() reports), but
computed without overflow (e.g. in Python) the product would be:
-12,837,944,014,404,397,056

Avoid the overflow (without doing the division needed for avg_vruntime()) by
moving zero_vruntime to the new entity's vruntime when its weight exceeds the
combined weight already queued.

Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5352,6 +5352,7 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+	bool update_zero = false;
 	s64 lag = 0;
 
 	if (!se->custom_slice)
@@ -5368,7 +5369,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 */
 	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
-		long load;
+		long load, weight;
 
 		lag = se->vlag;
 
@@ -5428,14 +5429,41 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		if (curr && curr->on_rq)
 			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
 
-		lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+		lag *= load + weight;
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div64_long(lag, load);
+
+		/*
+		 * A heavy entity (relative to the tree) will pull the
+		 * avg_vruntime close to its vruntime position on enqueue. But
+		 * the zero_vruntime point is only updated at the next
+		 * update_deadline()/place_entity()/update_entity_lag().
+		 *
+		 * Specifically (see the comment near avg_vruntime_weight()):
+		 *
+		 *   sum_w_vruntime = \Sum (v_i - v0) * w_i
+		 *
+		 * Note that if v0 is near a light entity, both terms will be
+		 * small for the light entity, while in that case both terms
+		 * are large for the heavy entity, leading to risk of
+		 * overflow.
+		 *
+		 * OTOH if v0 is near the heavy entity, then the difference is
+		 * larger for the light entity, but the factor is small, while
+		 * for the heavy entity the difference is small but the factor
+		 * is large, avoiding the multiplication overflow.
+		 */
+		if (weight > load)
+			update_zero = true;
 	}
 
 	se->vruntime = vruntime - lag;
 
+	if (update_zero)
+		update_zero_vruntime(cfs_rq, -lag);
+
 	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;


Thread overview: 55+ messages
2026-02-19  7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
2026-02-19  7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
2026-02-23 10:56   ` Vincent Guittot
2026-02-23 13:09   ` Dietmar Eggemann
2026-02-23 14:15     ` Peter Zijlstra
2026-02-24  8:53       ` Dietmar Eggemann
2026-02-24  9:02         ` Peter Zijlstra
2026-03-28  5:44   ` John Stultz
2026-03-28 17:04     ` Steven Rostedt
2026-03-30 17:58       ` John Stultz
2026-03-30 18:27         ` Steven Rostedt
2026-03-30  9:43     ` Peter Zijlstra
2026-03-30 17:49       ` John Stultz
2026-03-30 10:10     ` Peter Zijlstra
2026-03-30 14:37       ` K Prateek Nayak
2026-03-30 14:40         ` Peter Zijlstra
2026-03-30 15:50           ` K Prateek Nayak
2026-03-30 19:11             ` Peter Zijlstra
2026-03-31  0:38               ` K Prateek Nayak
2026-03-31  4:58                 ` K Prateek Nayak
2026-03-31  7:08                 ` Peter Zijlstra
2026-03-31  7:14                   ` Peter Zijlstra
2026-03-31  8:49                     ` K Prateek Nayak
2026-03-31  9:29                       ` Peter Zijlstra
2026-03-31 12:20                         ` Peter Zijlstra
2026-03-31 16:14                           ` Peter Zijlstra
2026-03-31 17:02                             ` K Prateek Nayak
2026-03-31 22:40                             ` John Stultz
2026-03-30 19:40       ` John Stultz
2026-03-30 19:43         ` Peter Zijlstra
2026-03-30 21:45           ` John Stultz
2026-02-19  7:58 ` [PATCH v2 2/7] sched/fair: Only set slice protection at pick time Peter Zijlstra
2026-02-19  7:58 ` [PATCH v2 3/7] sched/eevdf: Update se->vprot in reweight_entity() Peter Zijlstra
2026-02-19  7:58 ` [PATCH v2 4/7] sched/fair: Fix lag clamp Peter Zijlstra
2026-02-23 10:23   ` Dietmar Eggemann
2026-02-23 10:57   ` Vincent Guittot
2026-02-19  7:58 ` [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime Peter Zijlstra
2026-02-23 10:56   ` Vincent Guittot
2026-02-23 11:51     ` Peter Zijlstra
2026-02-23 12:36       ` Peter Zijlstra
2026-02-23 13:06       ` Vincent Guittot
2026-03-30  7:55       ` K Prateek Nayak
2026-03-30  9:27         ` Peter Zijlstra
2026-04-02  5:28         ` K Prateek Nayak
2026-04-02 10:22           ` Peter Zijlstra
2026-04-02 10:56             ` K Prateek Nayak
2026-04-03  4:02               ` K Prateek Nayak
2026-04-07 12:00                 ` Peter Zijlstra [this message]
2026-04-07 13:42                   ` [tip: sched/core] sched/fair: Avoid overflow in enqueue_entity() tip-bot2 for K Prateek Nayak
2026-02-19  7:58 ` [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Peter Zijlstra
2026-02-23 10:57   ` Vincent Guittot
2026-03-24 10:01     ` William Montaz
2026-04-07 13:45       ` Peter Zijlstra
2026-02-19  7:58 ` [PATCH v2 7/7] sched/fair: Use full weight to __calc_delta() Peter Zijlstra
2026-02-23 10:57   ` Vincent Guittot
