Date: Mon, 30 Mar 2026 12:10:18 +0200
From: Peter Zijlstra
To: John Stultz
Cc: mingo@kernel.org, juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org,
	wangtao554@huawei.com, quzicheng@huawei.com, kprateek.nayak@amd.com,
	dsmythies@telus.net, shubhang@os.amperecomputing.com, Suleiman Souhlal
Subject: Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
Message-ID: <20260330101018.GN3738786@noisy.programming.kicks-ass.net>
References: <20260219075840.162631716@infradead.org>
 <20260219080624.438854780@infradead.org>

On Fri, Mar 27, 2026 at 10:44:28PM -0700, John Stultz wrote:
> On Wed, Feb 18, 2026 at 11:58 PM Peter Zijlstra wrote:
> >
> > It turns out that zero_vruntime tracking is broken when there is but a single
> > task running.
> > Current update paths are through __{en,de}queue_entity(), and
> > when there is but a single task, pick_next_task() will always return that one
> > task, and put_prev_set_next_task() will end up in neither function.
> >
> > This can cause entity_key() to grow indefinitely large and cause overflows,
> > leading to much pain and suffering.
> >
> > Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> > are called from {set_next,put_prev}_entity(), has problems because:
> >
> >  - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> >    This means avg_vruntime() will see the removal but not current, missing
> >    the entity for accounting.
> >
> >  - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> >    NULL. This means avg_vruntime() will see the addition *and* current,
> >    leading to double accounting.
> >
> > Both cases are incorrect/inconsistent.
> >
> > Noting that avg_vruntime() is already called on each {en,de}queue, remove the
> > explicit avg_vruntime() calls (which removes an extra 64bit division for each
> > {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
> >
> > Additionally, have the tick call avg_vruntime() -- discarding the result, but
> > for the side-effect of updating zero_vruntime.
>
> Hey all,
>
> So in stress testing with my full proxy-exec series, I was
> occasionally tripping over a situation where __pick_eevdf() returns
> NULL, which quickly crashes.
> The backtrace is usually due to the stress-ng yield stressor:

Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must lie in
between these two entities. The eligible task gets to run, and on yield
is pushed forward a full slice (all the tasks do is yield, after all).
This causes it to jump over the other task: now the other task is
eligible and it no longer is. So we schedule.
Since both tasks stay runnable, there is no dequeue or enqueue. All we
have is the __enqueue_entity() and __dequeue_entity() from
put_prev_task() / set_next_task(). But per the fingered commit, those
two no longer move zero_vruntime. All that moves zero_vruntime now is
the tick and a full dequeue or enqueue.

This means that if the two tasks playing leapfrog can reach the
critical speed to hit the overflow point inside one tick's worth of
time, we're up a creek.

If this is indeed the case, then the below should cure things. It also
means that running a HZ=100 config will increase the chances of hitting
this vs HZ=1000.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..c7daaf941b26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
+		avg_vruntime(cfs_rq);
 	}
 }