From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 31 Mar 2026 14:20:35 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: John Stultz, mingo@kernel.org, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
 linux-kernel@vger.kernel.org, wangtao554@huawei.com, quzicheng@huawei.com,
 dsmythies@telus.net, shubhang@os.amperecomputing.com, Suleiman Souhlal
Subject: Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
Message-ID: <20260331122035.GO3739106@noisy.programming.kicks-ass.net>
References: <20260330101018.GN3738786@noisy.programming.kicks-ass.net>
 <73dab51a-650f-4c82-9e73-13236b2a26c2@amd.com>
 <20260330144005.GP3738786@noisy.programming.kicks-ass.net>
 <20260330191108.GU2872@noisy.programming.kicks-ass.net>
 <20260331070822.GC3739027@noisy.programming.kicks-ass.net>
 <20260331071402.GN3739106@noisy.programming.kicks-ass.net>
 <19667aac-99c4-40cf-bc0a-b1e6b9d32ede@amd.com>
 <20260331092909.GQ3738010@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260331092909.GQ3738010@noisy.programming.kicks-ass.net>

On Tue, Mar 31, 2026 at 11:29:09AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 31, 2026 at 02:19:54PM +0530, K Prateek Nayak wrote:
> > On 3/31/2026 12:44 PM, Peter Zijlstra wrote:
> > > On Tue, Mar 31, 2026 at 09:08:23AM +0200, Peter Zijlstra wrote:
> > >> On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
> > >>
> > >>> The above doesn't recover after an avg_vruntime(). Btw I'm running:
> > >>>
> > >>>   nice -n 19 stress-ng --yield 32 -t 1000000s&
> > >>>   while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
> > >>
> > >> And you're running that on a 16 cpu machine / vm ?
> > >
> > > W00t, it went b00m. Ok, let me go add some tracing.
> >
> > I could only repro it on baremetal after a few hours but good to know it
> > exploded effortlessly on your end! Was this a 16vCPU VM with the same
> > recipe?
>
> Yep. It almost insta triggers. Trying to make sense of the traces now.

So the thing I'm seeing is that avg_vruntime() is behind where it should
be; not by much, but every time it goes *boom* it is just far enough
behind that no entity is eligible.

sched-messaging-2192 [039] d..2.    77.136100: pick_task_fair: cfs_rq(39:ff4a5bc7bebeb680): sum_w_vruntime(194325882) sum_weight(5120) zero_vruntime(105210161141318) avg_vruntime(105210161179272)
sched-messaging-2192 [039] d..2.    77.136100: pick_task_fair: T    se(ff4a5bc79040c940): vruntime(105210161556539) deadline(105210164099443) weight(1048576) -- sched-messaging:2340
sched-messaging-2192 [039] d..2.    77.136101: pick_task_fair: T    se(ff4a5bc794ce98c0): vruntime(105210161435669) deadline(105210164235669) weight(1048576) -- sched-messaging:2212
sched-messaging-2192 [039] d..2.    77.136101: pick_task_fair: T    se(ff4a5bc7952d3100): vruntime(105210161580240) deadline(105210164380240) weight(1048576) -- sched-messaging:2381
sched-messaging-2192 [039] d..2.    77.136102: pick_task_fair: T    se(ff4a5bc794c318c0): vruntime(105210161818264) deadline(105210164518004) weight(1048576) -- sched-messaging:2306
sched-messaging-2192 [039] d..2.    77.136103: pick_task_fair: T    se(ff4a5bc796b4b100): vruntime(105210161831546) deadline(105210164631546) weight(1048576) -- sched-messaging:2551
sched-messaging-2192 [039] d..2.    77.136104: pick_task_fair:     min_lag(-652274) max_lag(0) limit(38000000)
sched-messaging-2192 [039] d..2.    77.136104: pick_task_fair: picked NULL!!

If we compute the avg_vruntime() manually, then we get a sum_w_vruntime
contribution for each task:

  (105210161556539-105210161141318)*1024
  425186304
  (105210161435669-105210161141318)*1024
  301415424
  (105210161580240-105210161141318)*1024
  449456128
  (105210161818264-105210161141318)*1024
  693192704
  (105210161831546-105210161141318)*1024
  706793472

Which combined is:

  425186304+301415424+449456128+693192704+706793472
  2576044032

NOTE: this is different (more) from sum_w_vruntime(194325882).

So divided, and added to zero gives:

  2576044032/5120
  503133.60000000000000000000
  105210161141318+503133.60000000000000000000
  105210161644451.60000000000000000000

Which is where avg_vruntime() *should* be, except it ends up being at:
avg_vruntime(105210161179272), which then results in no eligible
entities. Note that with the computed avg, the first 3 entities would be
eligible.

This suggests I go build a parallel infrastructure to double check when
and where this goes sideways.

... various attempts later ....

sched-messaging-1021 [009] d..2.    34.483159: update_curr: T<=> se(ff37d0bcd52718c0): vruntime(56921690782736, E) deadline(56921693563331) weight(1048576) -- sched-messaging:1021
sched-messaging-1021 [009] d..2.    34.483160: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(-48327) sum_w_vruntime(811471242) zero_vruntime(56921691575188)
sched-messaging-1021 [009] d..2.    34.483160: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(811471242) sum_weight(6159) zero_vruntime(56921691575188) avg_vruntime(56921691706941)
sched-messaging-1021 [009] d..2.    34.483160: pick_task_fair: T<   se(ff37d0bcd5c6c940): vruntime(56921691276707, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1021 [009] d..2.    34.483161: pick_task_fair: T    se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1021 [009] d..2.    34.483162: pick_task_fair: T    se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1021 [009] d..2.    34.483163: pick_task_fair: T    se(ff37d0bcd56dc940): vruntime(56921691637185, E) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1021 [009] d..2.    34.483164: pick_task_fair: T    se(ff37d0bcd43eb100): vruntime(56921691629067, E) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1021 [009] d..2.    34.483164: pick_task_fair: T    se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1021 [009] d..2.    34.483165: pick_task_fair: T    se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1021 [009] d..2.    34.483165: pick_task_fair:     min_lag(-42989869) max_lag(430234) limit(38000000)
sched-messaging-1021 [009] d..2.    34.483166: pick_task_fair:     swv(811471242)
sched-messaging-1021 [009] d..2.    34.483167: __dequeue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(1117115786) zero_vruntime(56921691575188)

set_next_task(1276): swv -= key * weight

  811471242 - (56921691276707-56921691575188)*1024
  1117115786

OK
sched-messaging-1276 [009] d.h2.    34.483168: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691285759, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d.h2.    34.483169: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(22156) sum_w_vruntime(319064896) zero_vruntime(56921691597344)

swv -= sw * delta

  1117115786 - 5135 * 22156
  1003344726

WTF!?!

zv += delta

  56921691575188 + 22156
  56921691597344

OK

sched-messaging-1276 [009] d.h2.    34.483169: place_entity: T<   se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d.h2.    34.483170: __enqueue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(-627321024) zero_vruntime(56921691597344)

swv += key * weight

Should be:

  1003344726 + (56921690673139 - 56921691597344) * 1024
  56958806    [*]

But is:

  319064896 + (56921690673139 - 56921691597344) * 1024
  -627321024

Consistent, but wrong.

sched-messaging-1276 [009] d..2.    34.483173: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691289762, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d..2.    34.483173: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(571) sum_w_vruntime(180635073) zero_vruntime(56921691466161)

This would be dequeue(1276) update_entity_lag(), but the numbers make no
sense...

swv -= sw * delta

  -627321024 - 6159 * 571
  -630837813 != 180635073

zv += delta

  56921691597344 + 571
  56921691597915 != 56921691466161

Also, the actual delta would be (zero_vruntime - prev zero_vruntime):

  56921691466161-56921691597344
  -131183

At which point we can construct the swv value from where we left off [*]:

  56958806 - -131183 * 6159
  864914903

But the actual state makes no frigging sense....

sched-messaging-1276 [009] d..2.    34.483174: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(180635073) sum_weight(6159) zero_vruntime(56921691466161) avg_vruntime(56921691495489)
sched-messaging-1276 [009] d..2.    34.483174: pick_task_fair: T<   se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d..2.    34.483175: pick_task_fair: T    se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1276 [009] d..2.    34.483175: pick_task_fair: T    se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1276 [009] d..2.    34.483176: pick_task_fair: T    se(ff37d0bcd56dc940): vruntime(56921691637185) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1276 [009] d..2.    34.483177: pick_task_fair: T    se(ff37d0bcd43eb100): vruntime(56921691629067) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1276 [009] d..2.    34.483177: pick_task_fair: T    se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1276 [009] d..2.    34.483178: pick_task_fair: T    se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1276 [009] d..2.    34.483178: pick_task_fair:     min_lag(-43201321) max_lag(822350) limit(38000000)
sched-messaging-1276 [009] d..2.    34.483178: pick_task_fair:     swv(864914903)
sched-messaging-1276 [009] d..2.    34.483179: pick_task_fair: FAIL

Generated with the below patch on top of -rc6.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..5462aeac1c45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -678,6 +678,11 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	cfs_rq->sum_w_vruntime += key * weight;
 	cfs_rq->sum_weight += weight;
+
+	trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     cfs_rq->sum_w_vruntime,
+		     cfs_rq->zero_vruntime);
 }
 
 static void
@@ -688,6 +693,11 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	cfs_rq->sum_w_vruntime -= key * weight;
 	cfs_rq->sum_weight -= weight;
+
+	trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     cfs_rq->sum_w_vruntime,
+		     cfs_rq->zero_vruntime);
 }
 
 static inline
@@ -698,6 +708,12 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
 	 */
 	cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
 	cfs_rq->zero_vruntime += delta;
+
+	trace_printk("cfs_rq(%d:%px): delta(%Ld) sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     delta,
+		     cfs_rq->sum_w_vruntime,
+		     cfs_rq->zero_vruntime);
 }
 
 /*
@@ -712,7 +728,7 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
  * This means it is one entry 'behind' but that puts it close enough to where
  * the bound on entity_key() is at most two lag bounds.
  */
-u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static u64 __avg_vruntime(struct cfs_rq *cfs_rq, bool update)
 {
 	struct sched_entity *curr = cfs_rq->curr;
 	long weight = cfs_rq->sum_weight;
@@ -743,9 +759,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		delta = curr->vruntime - cfs_rq->zero_vruntime;
 	}
 
-	update_zero_vruntime(cfs_rq, delta);
+	if (update) {
+		update_zero_vruntime(cfs_rq, delta);
+		return cfs_rq->zero_vruntime;
+	}
 
-	return cfs_rq->zero_vruntime;
+	return cfs_rq->zero_vruntime + delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	return __avg_vruntime(cfs_rq, true);
 }
 
 static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
@@ -1078,11 +1102,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
 	return best;
 }
 
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
-{
-	return __pick_eevdf(cfs_rq, true);
-}
-
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
@@ -1279,6 +1298,8 @@ s64 update_curr_common(struct rq *rq)
 	return update_se(rq, &rq->donor->se);
 }
 
+static void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick);
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -1304,6 +1325,10 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
 
+	if (resched)
+		avg_vruntime(cfs_rq);
+
+	print_se(cfs_rq, curr, true);
 	if (entity_is_task(curr)) {
 		/*
@@ -3849,6 +3874,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	bool rel_vprot = false;
 	u64 vprot;
 
+	print_se(cfs_rq, se, true);
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		update_curr(cfs_rq);
@@ -3896,6 +3923,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		__enqueue_entity(cfs_rq, se);
 		cfs_rq->nr_queued++;
 	}
+
+	print_se(cfs_rq, se, true);
 }
 
 static void reweight_task_fair(struct rq *rq, struct task_struct *p,
@@ -5251,6 +5280,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;
+		print_se(cfs_rq, se, true);
 		return;
 	}
@@ -5266,6 +5296,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * EEVDF: vd_i = ve_i + r_i/w_i
 	 */
 	se->deadline = se->vruntime + vslice;
+	print_se(cfs_rq, se, true);
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5529,31 +5560,6 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
-
-/*
- * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
- */
-static struct sched_entity *
-pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se;
-
-	se = pick_eevdf(cfs_rq);
-	if (se->sched_delayed) {
-		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
-		/*
-		 * Must not reference @se again, see __block_task().
-		 */
-		return NULL;
-	}
-	return se;
-}
-
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
@@ -8942,6 +8948,123 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
 		resched_curr_lazy(rq);
 }
 
+static __always_inline
+void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick)
+{
+	bool curr = (se == cfs_rq->curr);
+	bool el = entity_eligible(cfs_rq, se);
+	bool prot = protect_slice(se);
+	bool task = false;
+	char *comm = NULL;
+	int pid = -1;
+
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+		task = true;
+		comm = p->comm;
+		pid = p->pid;
+	}
+
+	trace_printk("%c%c%c%c se(%px): vruntime(%Ld%s) deadline(%Ld) weight(%ld) -- %s:%d\n",
+		     task ? 'T' : '@',
+		     pick ? '<' : ' ',
+		     curr && prot ? '=' : ' ',
+		     curr ? '>' : ' ',
+		     se, se->vruntime, el ? ", E" : "",
+		     se->deadline, se->load.weight,
+		     comm, pid);
+}
+
+static struct sched_entity *pick_debug(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *pick = __pick_eevdf(cfs_rq, true);
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 min_lag = 0, max_lag = 0;
+	u64 runtime, weight, z_vruntime, avg;
+	u64 swv = 0;
+
+	s64 limit = 10*(sysctl_sched_base_slice + TICK_NSEC);
+
+	if (curr && !curr->on_rq)
+		curr = NULL;
+
+	runtime = cfs_rq->sum_w_vruntime;
+	weight = cfs_rq->sum_weight;
+	z_vruntime = cfs_rq->zero_vruntime;
+	barrier();
+	avg = __avg_vruntime(cfs_rq, false);
+
+	trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) sum_weight(%Ld) zero_vruntime(%Ld) avg_vruntime(%Ld)\n",
+		     rq_of(cfs_rq)->cpu, cfs_rq,
+		     runtime, weight,
+		     z_vruntime, avg);
+
+	for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost;
+	     node; node = rb_next(node)) {
+		struct sched_entity *se = __node_2_se(node);
+		if (se == curr)
+			curr = NULL;
+		print_se(cfs_rq, se, pick == se);
+
+		swv += (se->vruntime - z_vruntime) * scale_load_down(se->load.weight);
+
+		s64 vlag = avg - se->vruntime;
+		min_lag = min(min_lag, vlag);
+		max_lag = max(max_lag, vlag);
+	}
+
+	if (curr) {
+		print_se(cfs_rq, curr, pick == curr);
+
+		s64 vlag = avg - curr->vruntime;
+		min_lag = min(min_lag, vlag);
+		max_lag = max(max_lag, vlag);
+	}
+
+	trace_printk("    min_lag(%Ld) max_lag(%Ld) limit(%Ld)\n", min_lag, max_lag, limit);
+	trace_printk("    swv(%Ld)\n", swv);
+
+	if (swv != runtime) {
+		trace_printk("FAIL\n");
+		tracing_off();
+		printk("FAIL FAIL FAIL!!!\n");
+	}
+
+//	WARN_ON_ONCE(min_lag < -limit || max_lag > limit);
+
+	if (!pick) {
+		trace_printk("picked NULL!!\n");
+		tracing_off();
+		printk("FAIL FAIL FAIL!!!\n");
+		return __pick_first_entity(cfs_rq);
+	}
+
+	return pick;
+}
+
+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the "next" process, since someone really wants that to run
+ * 3) pick the "last" process, for cache locality
+ * 4) do not run the "skip" process, if something else is available
+ */
+static struct sched_entity *
+pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se;
+
+	se = pick_debug(cfs_rq);
+	if (se->sched_delayed) {
+		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+		/*
+		 * Must not reference @se again, see __block_task().
+		 */
+		return NULL;
+	}
+	return se;
+}
+
 static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 {
 	struct sched_entity *se;
@@ -9129,6 +9252,7 @@ static void yield_task_fair(struct rq *rq)
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
+		avg_vruntime(cfs_rq);
 	}
 }