Date: Mon, 30 Mar 2026 12:10:18 +0200
From: Peter Zijlstra
To: John Stultz
Cc: mingo@kernel.org, juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org,
	wangtao554@huawei.com, quzicheng@huawei.com, kprateek.nayak@amd.com,
	dsmythies@telus.net, shubhang@os.amperecomputing.com, Suleiman Souhlal
Subject: Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
Message-ID: <20260330101018.GN3738786@noisy.programming.kicks-ass.net>
References: <20260219075840.162631716@infradead.org>
 <20260219080624.438854780@infradead.org>

On Fri, Mar 27, 2026 at 10:44:28PM -0700, John Stultz wrote:
> On Wed, Feb 18, 2026 at 11:58 PM Peter Zijlstra wrote:
> >
> > It turns out that zero_vruntime tracking is broken when there is but a single
> > task running.
> > Current update paths are through __{en,de}queue_entity(), and
> > when there is but a single task, pick_next_task() will always return that one
> > task, and put_prev_set_next_task() will end up in neither function.
> >
> > This can cause entity_key() to grow indefinitely large and cause overflows,
> > leading to much pain and suffering.
> >
> > Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> > are called from {set_next,put_prev}_entity(), has problems because:
> >
> >  - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> >    This means avg_vruntime() will see the removal but not current, missing
> >    the entity for accounting.
> >
> >  - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> >    NULL. This means avg_vruntime() will see the addition *and* current,
> >    leading to double accounting.
> >
> > Both cases are incorrect/inconsistent.
> >
> > Noting that avg_vruntime() is already called on each {en,de}queue, remove the
> > explicit avg_vruntime() calls (which removes an extra 64bit division for each
> > {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
> >
> > Additionally, have the tick call avg_vruntime() -- discarding the result, but
> > for the side-effect of updating zero_vruntime.
>
> Hey all,
>
> So in stress testing with my full proxy-exec series, I was
> occasionally tripping over a situation where __pick_eevdf() returns
> NULL, which quickly crashes.
> The backtrace is usually due to the stress-ng yield stressor:

Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must lie in
between these two entities. The eligible task gets to run, and on yield
is pushed forward a full slice (all the tasks do is yield, after all).
This causes it to jump over the other task: now the other task is
eligible and it no longer is. So we schedule.
Since both tasks stay runnable, there is no dequeue or enqueue. All we
have is the __enqueue_entity() and __dequeue_entity() from
put_prev_task() / set_next_task(). But per the fingered commit, those
two no longer move zero_vruntime. All that moves zero_vruntime now is
the tick and a full dequeue or enqueue.

This means that if the two tasks playing leapfrog can reach the
critical speed to hit the overflow point inside one tick's worth of
time, we're up a creek.

If this is indeed the case, then the below should cure things. It also
means that running a HZ=100 config will increase the chances of hitting
this vs HZ=1000.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..c7daaf941b26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
+		avg_vruntime(cfs_rq);
 	}
 }