From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from va-2-115.ptr.blmpb.com (va-2-115.ptr.blmpb.com [209.127.231.115])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 18DC7372661
	for <linux-kernel@vger.kernel.org>; Wed, 27 May 2026 03:30:37 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.127.231.115
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779852641; cv=none; b=GrXCeojEJmMTAyTQFgnKfEUCf28jFhUFT3tqqMeu8NVNW2DQ0tcR+sKYlhdlwEuM6K+afYwQJ1E+DmTBbKXe6RdbUdwdg5Z+/YUaVz5LS/xjZw863jkDiwobbsdlzbaaqBWZB0KO40y4XBD0yqHTM1/9QJ2+58knthK2rUAhE/Q=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779852641; c=relaxed/simple;
	bh=7xYelRJkd+37UO9ST8b9RcJT8t1kkcbi01Jcy2Sxnt0=;
	h=Message-Id:Mime-Version:References:To:Cc:From:Subject:Date:
	 Content-Type:In-Reply-To:Content-Disposition; b=lOAYbn/Z7vZwr4T8CINgPcOv0AzTk3sFFStoffgOxuE8GFHFuI51TUyY5PWkevVUViUhhF/3NNOT7+2tq3b603gf/ICO4DIYmwYadNuOMDXvV4HLxWx8Bk9Kg2XIa7o/yGxK7QhPTqDJcWYIxvHGSTW1icmQaj9tiyfg/HALcmU=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=WsRks6nP; arc=none smtp.client-ip=209.127.231.115
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="WsRks6nP"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 s=2212171451; d=bytedance.com; t=1779852628; h=from:subject:
 mime-version:from:date:message-id:subject:to:cc:reply-to:content-type:
 mime-version:in-reply-to:message-id;
 bh=Kv88GbwhqXT/bL0pXTVdPhE9LSwOK1/MwMuCBPQFOkA=;
 b=WsRks6nP4YMfvZmWJ9YRASQKZ4j2DXcG7OMUoPjIbEwU2iyddkPQGhuEzqgS2y/WnJ2gX+
 h36z1H5V6g+EjDoA7WOeP/dl/QHaNv9wMgmhH0vRFKFouxQ57M/M/G8RxDt6rZ2UfCpPWY
 fbk8RaVIVMiqlGNf8dyFBvVQuIssdfmAhRiH6nbEKBexh0V/b6mTLdsTuIdTQRMBx5JmY7
 BZRCfeS0hrfg+w+Q8ZPstuzxOLe3XAxEfmN+QPOotmF+bVsB30+L8hMqnCwnrLBd86Ah0z
 Ak5xBXxjS8abPMbs9MRdjL1WZNGCo38yZW0ttVXCV7oH8M0TsgBrWZT0ccfi1Q==
Message-Id: <20260527032929.GA2437744@bytedance.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
X-Original-From: Aaron Lu <ziqianlu@bytedance.com>
References: <20260526213152.445ca27c@fangorn>
To: "Rik van Riel" <riel@surriel.com>
Cc: "Ingo Molnar" <mingo@redhat.com>, 
	"Peter Zijlstra" <peterz@infradead.org>, 
	"Juri Lelli" <juri.lelli@redhat.com>, "Jakub Kicinski" <kuba@kernel.org>, 
	"Vincent Guittot" <vincent.guittot@linaro.org>, 
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>, 
	"Steven Rostedt" <rostedt@goodmis.org>, 
	"Valentin Schneider" <vschneid@redhat.com>, 
	<linux-kernel@vger.kernel.org>
From: "Aaron Lu" <ziqianlu@bytedance.com>
Subject: Re: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit
Date: Wed, 27 May 2026 11:30:08 +0800
X-Lms-Return-Path: <lba+26a166552+36be70+vger.kernel.org+ziqianlu@bytedance.com>
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <20260526213152.445ca27c@fangorn>
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

On Tue, May 26, 2026 at 09:31:52PM -0400, Rik van Riel wrote:
> update_tg_load_avg() is called once per leaf cfs_rq from the
> __update_blocked_fair() walk that runs inside the NOHZ idle-balance
> softirq, and again from update_load_avg() with UPDATE_TG.  Its first
> operation after the trivial early-outs is unconditionally:
> 
> 	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> 		return;
> 
> Jakub ran into a system where nohz_idle_balance() was taking 75%
> of a CPU (which is handling network traffic and doing many irq_exit_cpu
> calls), with 35% of that CPU spent in update_load_avg, and 17% of the
> CPU in sched_clock_cpu(), reading the TSC.
> 
> There are some optimizations upstream already to reduce that overhead,
> but it also looks like those rdtsc calls may not me necessary at all,
> giving another easy win.
> 
> Switch the rate-limit to read rq_clock_task(rq_of(cfs_rq)) instead.
> This does two things:
> 
> 1. Eliminates the rdtsc.  rq->clock_task is already updated by the
>    enclosing update_rq_clock(rq), sits in a hot cacheline, and reads
>    as a single load.
> 
> 2. Aligns the rate-limit clock with the clock the rate-limited data
>    is computed on.  cfs_rq->avg.load_avg (the value being published
>    to tg->load_avg) is computed by
>      update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)
>    where cfs_rq_clock_pelt() is derived from rq->clock_pelt, which is
>    derived from rq->clock_task.  PELT intentionally excludes IRQ-handling
>    time and steal time from its decay (commit 23127296889f
>    "sched/fair: Update scale invariance of PELT"), so
>    cfs_rq->avg.load_avg already evolves in clock_task time.  The
>    rate-limit was the only outlier in the propagation chain using a
>    different clock (wall time, via sched_clock_cpu).
> 
>    Under normal load and on idle CPUs (where clock_task advances at
>    wall-clock rate) behaviour is unchanged.  Under heavy IRQ load
>    clock_task advances slower than wall time, so the rate-limit fires
>    less often -- consistent with the fact that the underlying
>    cfs_rq->avg.load_avg is also changing slower under the same
>    conditions.  The publish cadence tracks the signal.
> 
> Note: update_tg_load_avg() propagates the *already-decayed*
> cfs_rq->avg.load_avg to tg->load_avg; it does not drive decay.  Decay
> happens in update_cfs_rq_load_avg() on the PELT clock regardless of
> what clock the rate-limit uses, so this change cannot lose decay
> information.  The rate-limit governs how often we publish, not how
> fast load decays.
> 
> All callers of update_tg_load_avg() and clear_tg_load_avg() hold
> rq->lock and have called update_rq_clock(rq) within microseconds:
> 
>   caller                                   pre-state
>   __update_blocked_fair                    encloser did update_rq_clock(rq)
>   update_load_avg's three UPDATE_TG sites  under rq->lock after enqueue/dequeue/update_curr
>   attach_/detach_entity_cfs_rq             preceded by update_load_avg(...)
>   clear_tg_load_avg via offline path       rq_clock_start_loop_update(rq) upfront
> 
> so rq->clock_task is fresh at every call.  Since cfs_rqs are per-CPU
> per-task_group, cfs_rq->last_update_tg_load_avg is always compared
> against the same rq's clock; no cross-rq drift.
> 
> The same hoisting pattern was recently applied to find_new_ilb() in
> commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()").
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Assisted-by: Claude (Anthropic)

Looks good to me.
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>