From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 31 Mar 2026 14:20:35 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: John Stultz, mingo@kernel.org, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
 linux-kernel@vger.kernel.org, wangtao554@huawei.com, quzicheng@huawei.com,
 dsmythies@telus.net, shubhang@os.amperecomputing.com, Suleiman Souhlal
Subject: Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
Message-ID: <20260331122035.GO3739106@noisy.programming.kicks-ass.net>
References: <20260330101018.GN3738786@noisy.programming.kicks-ass.net>
 <73dab51a-650f-4c82-9e73-13236b2a26c2@amd.com>
 <20260330144005.GP3738786@noisy.programming.kicks-ass.net>
 <20260330191108.GU2872@noisy.programming.kicks-ass.net>
 <20260331070822.GC3739027@noisy.programming.kicks-ass.net>
 <20260331071402.GN3739106@noisy.programming.kicks-ass.net>
 <19667aac-99c4-40cf-bc0a-b1e6b9d32ede@amd.com>
 <20260331092909.GQ3738010@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260331092909.GQ3738010@noisy.programming.kicks-ass.net>

On Tue, Mar 31, 2026 at 11:29:09AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 31, 2026 at 02:19:54PM +0530, K Prateek Nayak wrote:
> > On 3/31/2026 12:44 PM, Peter Zijlstra wrote:
> > > On Tue, Mar 31, 2026 at 09:08:23AM +0200, Peter Zijlstra wrote:
> > >> On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
> > >>
> > >>> The above doesn't recover after an avg_vruntime(). Btw I'm running:
> > >>>
> > >>>   nice -n 19 stress-ng --yield 32 -t 1000000s&
> > >>>   while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
> > >>
> > >> And you're running that on a 16 cpu machine / vm ?
> > >
> > > W00t, it went b00m. Ok, let me go add some tracing.
> >
> > I could only repro it on baremetal after a few hours but good to know it
> > exploded effortlessly on your end! Was this a 16vCPU VM with the same
> > recipe?
>
> Yep. It almost insta triggers. Trying to make sense of the traces now.

So the thing I'm seeing is that avg_vruntime() is behind where it should
be; not by much, but every time it goes *boom* it is just far enough
behind that no entity is eligible.

sched-messaging-2192 [039] d..2.    77.136100: pick_task_fair: cfs_rq(39:ff4a5bc7bebeb680): sum_w_vruntime(194325882) sum_weight(5120) zero_vruntime(105210161141318) avg_vruntime(105210161179272)
sched-messaging-2192 [039] d..2.    77.136100: pick_task_fair: T    se(ff4a5bc79040c940): vruntime(105210161556539) deadline(105210164099443) weight(1048576) -- sched-messaging:2340
sched-messaging-2192 [039] d..2.    77.136101: pick_task_fair: T    se(ff4a5bc794ce98c0): vruntime(105210161435669) deadline(105210164235669) weight(1048576) -- sched-messaging:2212
sched-messaging-2192 [039] d..2.    77.136101: pick_task_fair: T    se(ff4a5bc7952d3100): vruntime(105210161580240) deadline(105210164380240) weight(1048576) -- sched-messaging:2381
sched-messaging-2192 [039] d..2.    77.136102: pick_task_fair: T    se(ff4a5bc794c318c0): vruntime(105210161818264) deadline(105210164518004) weight(1048576) -- sched-messaging:2306
sched-messaging-2192 [039] d..2.    77.136103: pick_task_fair: T    se(ff4a5bc796b4b100): vruntime(105210161831546) deadline(105210164631546) weight(1048576) -- sched-messaging:2551
sched-messaging-2192 [039] d..2.    77.136104: pick_task_fair:     min_lag(-652274) max_lag(0) limit(38000000)
sched-messaging-2192 [039] d..2.    77.136104: pick_task_fair: picked NULL!!

If we compute the avg_vruntime() manually, then we get a sum_w_vruntime
contribution for each task:

  (105210161556539-105210161141318)*1024
  425186304
  (105210161435669-105210161141318)*1024
  301415424
  (105210161580240-105210161141318)*1024
  449456128
  (105210161818264-105210161141318)*1024
  693192704
  (105210161831546-105210161141318)*1024
  706793472

Which combined is:

  425186304+301415424+449456128+693192704+706793472
  2576044032

NOTE: this is different (more) from sum_w_vruntime(194325882).

So divided, and added to zero gives:

  2576044032/5120
  503133.60000000000000000000
  105210161141318+503133.60000000000000000000
  105210161644451.60000000000000000000

Which is where avg_vruntime() *should* be, except it ends up being at:
avg_vruntime(105210161179272), which then results in no eligible
entities. Note that with the computed avg, the first 3 entities would be
eligible.

This suggests I go build a parallel infrastructure to double check when
and where this goes sideways.

... various attempts later ....

sched-messaging-1021 [009] d..2.    34.483159: update_curr: T<=> se(ff37d0bcd52718c0): vruntime(56921690782736, E) deadline(56921693563331) weight(1048576) -- sched-messaging:1021
sched-messaging-1021 [009] d..2.    34.483160: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(-48327) sum_w_vruntime(811471242) zero_vruntime(56921691575188)
sched-messaging-1021 [009] d..2.    34.483160: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(811471242) sum_weight(6159) zero_vruntime(56921691575188) avg_vruntime(56921691706941)
sched-messaging-1021 [009] d..2.    34.483160: pick_task_fair: T<   se(ff37d0bcd5c6c940): vruntime(56921691276707, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1021 [009] d..2.    34.483161: pick_task_fair: T    se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1021 [009] d..2.    34.483162: pick_task_fair: T    se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1021 [009] d..2.    34.483163: pick_task_fair: T    se(ff37d0bcd56dc940): vruntime(56921691637185, E) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1021 [009] d..2.    34.483164: pick_task_fair: T    se(ff37d0bcd43eb100): vruntime(56921691629067, E) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1021 [009] d..2.    34.483164: pick_task_fair: T    se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1021 [009] d..2.    34.483165: pick_task_fair: T    se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1021 [009] d..2.    34.483165: pick_task_fair:     min_lag(-42989869) max_lag(430234) limit(38000000)
sched-messaging-1021 [009] d..2.    34.483166: pick_task_fair:     swv(811471242)
sched-messaging-1021 [009] d..2.    34.483167: __dequeue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(1117115786) zero_vruntime(56921691575188)

set_next_task(1276): swv -= key * weight

  811471242 - (56921691276707-56921691575188)*1024
  1117115786

OK
sched-messaging-1276 [009] d.h2.    34.483168: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691285759, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d.h2.    34.483169: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(22156) sum_w_vruntime(319064896) zero_vruntime(56921691597344)

swv -= sw * delta

  1117115786 - 5135 * 22156
  1003344726

WTF!?!

zv += delta

  56921691575188 + 22156
  56921691597344

OK

sched-messaging-1276 [009] d.h2.    34.483169: place_entity: T<   se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d.h2.    34.483170: __enqueue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(-627321024) zero_vruntime(56921691597344)

swv += key * weight

Should be:

  1003344726 + (56921690673139 - 56921691597344) * 1024
  56958806    [*]

But is:

  319064896 + (56921690673139 - 56921691597344) * 1024
  -627321024

Consistent, but wrong.

sched-messaging-1276 [009] d..2.    34.483173: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691289762, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d..2.    34.483173: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(571) sum_w_vruntime(180635073) zero_vruntime(56921691466161)

This would be dequeue(1276) update_entity_lag(), but the numbers make no
sense...

swv -= sw * delta

  -627321024 - 6159 * 571
  -630837813 != 180635073

zv += delta

  56921691597344 + 571
  56921691597915 != 56921691466161

Also, the actual delta would be (zero_vruntime - prev zero_vruntime):

  56921691466161-56921691597344
  -131183

At which point we can construct the swv value from where we left off [*]:

  56958806 - -131183 * 6159
  864914903

But the actual state makes no frigging sense....

sched-messaging-1276 [009] d..2.    34.483174: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(180635073) sum_weight(6159) zero_vruntime(56921691466161) avg_vruntime(56921691495489)
sched-messaging-1276 [009] d..2.    34.483174: pick_task_fair: T<   se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d..2.    34.483175: pick_task_fair: T    se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1276 [009] d..2.    34.483175: pick_task_fair: T    se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1276 [009] d..2.    34.483176: pick_task_fair: T    se(ff37d0bcd56dc940): vruntime(56921691637185) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1276 [009] d..2.    34.483177: pick_task_fair: T    se(ff37d0bcd43eb100): vruntime(56921691629067) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1276 [009] d..2.    34.483177: pick_task_fair: T    se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1276 [009] d..2.    34.483178: pick_task_fair: T    se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1276 [009] d..2.    34.483178: pick_task_fair:     min_lag(-43201321) max_lag(822350) limit(38000000)
sched-messaging-1276 [009] d..2.    34.483178: pick_task_fair:     swv(864914903)
sched-messaging-1276 [009] d..2.    34.483179: pick_task_fair: FAIL

Generated with the below patch on top of -rc6.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..5462aeac1c45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -678,6 +678,11 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	cfs_rq->sum_w_vruntime += key * weight;
 	cfs_rq->sum_weight += weight;
+
+	trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     cfs_rq->sum_w_vruntime,
+		     cfs_rq->zero_vruntime);
 }
 
 static void
@@ -688,6 +693,11 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	cfs_rq->sum_w_vruntime -= key * weight;
 	cfs_rq->sum_weight -= weight;
+
+	trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     cfs_rq->sum_w_vruntime,
+		     cfs_rq->zero_vruntime);
 }
 
 static inline
@@ -698,6 +708,12 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
 	 */
 	cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
 	cfs_rq->zero_vruntime += delta;
+
+	trace_printk("cfs_rq(%d:%px): delta(%Ld) sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     delta,
+		     cfs_rq->sum_w_vruntime,
+		     cfs_rq->zero_vruntime);
 }
 
 /*
@@ -712,7 +728,7 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
  * This means it is one entry 'behind' but that puts it close enough to where
  * the bound on entity_key() is at most two lag bounds.
  */
-u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static u64 __avg_vruntime(struct cfs_rq *cfs_rq, bool update)
 {
 	struct sched_entity *curr = cfs_rq->curr;
 	long weight = cfs_rq->sum_weight;
@@ -743,9 +759,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		delta = curr->vruntime - cfs_rq->zero_vruntime;
 	}
 
-	update_zero_vruntime(cfs_rq, delta);
+	if (update) {
+		update_zero_vruntime(cfs_rq, delta);
+		return cfs_rq->zero_vruntime;
+	}
 
-	return cfs_rq->zero_vruntime;
+	return cfs_rq->zero_vruntime + delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	return __avg_vruntime(cfs_rq, true);
 }
 
 static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
@@ -1078,11 +1102,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
 	return best;
 }
 
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
-{
-	return __pick_eevdf(cfs_rq, true);
-}
-
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
@@ -1279,6 +1298,8 @@ s64 update_curr_common(struct rq *rq)
 	return update_se(rq, &rq->donor->se);
 }
 
+static void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick);
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -1304,6 +1325,10 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
 
+	if (resched)
+		avg_vruntime(cfs_rq);
+
+	print_se(cfs_rq, curr, true);
 	if (entity_is_task(curr)) {
 		/*
@@ -3849,6 +3874,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	bool rel_vprot = false;
 	u64 vprot;
 
+	print_se(cfs_rq, se, true);
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		update_curr(cfs_rq);
@@ -3896,6 +3923,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		__enqueue_entity(cfs_rq, se);
 		cfs_rq->nr_queued++;
 	}
+
+	print_se(cfs_rq, se, true);
 }
 
 static void reweight_task_fair(struct rq *rq, struct task_struct *p,
@@ -5251,6 +5280,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;
+		print_se(cfs_rq, se, true);
 		return;
 	}
@@ -5266,6 +5296,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * EEVDF: vd_i = ve_i + r_i/w_i
 	 */
 	se->deadline = se->vruntime + vslice;
+	print_se(cfs_rq, se, true);
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5529,31 +5560,6 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
-
-/*
- * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
- */
-static struct sched_entity *
-pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se;
-
-	se = pick_eevdf(cfs_rq);
-	if (se->sched_delayed) {
-		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
-		/*
-		 * Must not reference @se again, see __block_task().
-		 */
-		return NULL;
-	}
-	return se;
-}
-
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
@@ -8942,6 +8948,123 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
 		resched_curr_lazy(rq);
 }
 
+static __always_inline
+void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick)
+{
+	bool curr = (se == cfs_rq->curr);
+	bool el = entity_eligible(cfs_rq, se);
+	bool prot = protect_slice(se);
+	bool task = false;
+	char *comm = NULL;
+	int pid = -1;
+
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+		task = true;
+		comm = p->comm;
+		pid = p->pid;
+	}
+
+	trace_printk("%c%c%c%c se(%px): vruntime(%Ld%s) deadline(%Ld) weight(%ld) -- %s:%d\n",
+		     task ? 'T' : '@',
+		     pick ? '<' : ' ',
+		     curr && prot ? '=' : ' ',
+		     curr ? '>' : ' ',
+		     se, se->vruntime, el ? ", E" : "",
+		     se->deadline, se->load.weight,
+		     comm, pid);
+}
+
+static struct sched_entity *pick_debug(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *pick = __pick_eevdf(cfs_rq, true);
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 min_lag = 0, max_lag = 0;
+	u64 runtime, weight, z_vruntime, avg;
+	u64 swv = 0;
+
+	s64 limit = 10*(sysctl_sched_base_slice + TICK_NSEC);
+
+	if (curr && !curr->on_rq)
+		curr = NULL;
+
+	runtime = cfs_rq->sum_w_vruntime;
+	weight = cfs_rq->sum_weight;
+	z_vruntime = cfs_rq->zero_vruntime;
+	barrier();
+	avg = __avg_vruntime(cfs_rq, false);
+
+	trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) sum_weight(%Ld) zero_vruntime(%Ld) avg_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     runtime, weight,
+		     z_vruntime, avg);
+
+	for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost;
+	     node; node = rb_next(node)) {
+		struct sched_entity *se = __node_2_se(node);
+		if (se == curr)
+			curr = NULL;
+		print_se(cfs_rq, se, pick == se);
+
+		swv += (se->vruntime - z_vruntime) * scale_load_down(se->load.weight);
+
+		s64 vlag = avg - se->vruntime;
+		min_lag = min(min_lag, vlag);
+		max_lag = max(max_lag, vlag);
+	}
+
+	if (curr) {
+		print_se(cfs_rq, curr, pick == curr);
+
+		s64 vlag = avg - curr->vruntime;
+		min_lag = min(min_lag, vlag);
+		max_lag = max(max_lag, vlag);
+	}
+
+	trace_printk("    min_lag(%Ld) max_lag(%Ld) limit(%Ld)\n", min_lag, max_lag, limit);
+	trace_printk("    swv(%Ld)\n", swv);
+
+	if (swv != runtime) {
+		trace_printk("FAIL\n");
+		tracing_off();
+		printk("FAIL FAIL FAIL!!!\n");
+	}
+
+//	WARN_ON_ONCE(min_lag < -limit || max_lag > limit);
+
+	if (!pick) {
+		trace_printk("picked NULL!!\n");
+		tracing_off();
+		printk("FAIL FAIL FAIL!!!\n");
+		return __pick_first_entity(cfs_rq);
+	}
+
+	return pick;
+}
+
+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the "next" process, since someone really wants that to run
+ * 3) pick the "last" process, for cache locality
+ * 4) do not run the "skip" process, if something else is available
+ */
+static struct sched_entity *
+pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se;
+
+	se = pick_debug(cfs_rq);
+	if (se->sched_delayed) {
+		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+		/*
+		 * Must not reference @se again, see __block_task().
+		 */
+		return NULL;
+	}
+	return se;
+}
+
 static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 {
 	struct sched_entity *se;
@@ -9129,6 +9252,7 @@ static void yield_task_fair(struct rq *rq)
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
+		avg_vruntime(cfs_rq);
 	}
 }