From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from va-2-115.ptr.blmpb.com (va-2-115.ptr.blmpb.com [209.127.231.115]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 18DC7372661 for ; Wed, 27 May 2026 03:30:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.127.231.115 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779852641; cv=none; b=GrXCeojEJmMTAyTQFgnKfEUCf28jFhUFT3tqqMeu8NVNW2DQ0tcR+sKYlhdlwEuM6K+afYwQJ1E+DmTBbKXe6RdbUdwdg5Z+/YUaVz5LS/xjZw863jkDiwobbsdlzbaaqBWZB0KO40y4XBD0yqHTM1/9QJ2+58knthK2rUAhE/Q= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779852641; c=relaxed/simple; bh=7xYelRJkd+37UO9ST8b9RcJT8t1kkcbi01Jcy2Sxnt0=; h=Message-Id:Mime-Version:References:To:Cc:From:Subject:Date: Content-Type:In-Reply-To:Content-Disposition; b=lOAYbn/Z7vZwr4T8CINgPcOv0AzTk3sFFStoffgOxuE8GFHFuI51TUyY5PWkevVUViUhhF/3NNOT7+2tq3b603gf/ICO4DIYmwYadNuOMDXvV4HLxWx8Bk9Kg2XIa7o/yGxK7QhPTqDJcWYIxvHGSTW1icmQaj9tiyfg/HALcmU= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=WsRks6nP; arc=none smtp.client-ip=209.127.231.115 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="WsRks6nP" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; s=2212171451; d=bytedance.com; t=1779852628; h=from:subject: mime-version:from:date:message-id:subject:to:cc:reply-to:content-type: mime-version:in-reply-to:message-id; bh=Kv88GbwhqXT/bL0pXTVdPhE9LSwOK1/MwMuCBPQFOkA=; b=WsRks6nP4YMfvZmWJ9YRASQKZ4j2DXcG7OMUoPjIbEwU2iyddkPQGhuEzqgS2y/WnJ2gX+ h36z1H5V6g+EjDoA7WOeP/dl/QHaNv9wMgmhH0vRFKFouxQ57M/M/G8RxDt6rZ2UfCpPWY fbk8RaVIVMiqlGNf8dyFBvVQuIssdfmAhRiH6nbEKBexh0V/b6mTLdsTuIdTQRMBx5JmY7 BZRCfeS0hrfg+w+Q8ZPstuzxOLe3XAxEfmN+QPOotmF+bVsB30+L8hMqnCwnrLBd86Ah0z Ak5xBXxjS8abPMbs9MRdjL1WZNGCo38yZW0ttVXCV7oH8M0TsgBrWZT0ccfi1Q== Message-Id: <20260527032929.GA2437744@bytedance.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 X-Original-From: Aaron Lu References: <20260526213152.445ca27c@fangorn> To: "Rik van Riel" Cc: "Ingo Molnar" , "Peter Zijlstra" , "Juri Lelli" , "Jakub Kicinski" , "Vincent Guittot" , "Dietmar Eggemann" , "Steven Rostedt" , "Valentin Schneider" , From: "Aaron Lu" Subject: Re: [PATCH] sched/fair: use rq_clock_task() in update_tg_load_avg() rate-limit Date: Wed, 27 May 2026 11:30:08 +0800 X-Lms-Return-Path: Content-Type: text/plain; charset=UTF-8 In-Reply-To: <20260526213152.445ca27c@fangorn> Content-Transfer-Encoding: 7bit Content-Disposition: inline On Tue, May 26, 2026 at 09:31:52PM -0400, Rik van Riel wrote: > update_tg_load_avg() is called once per leaf cfs_rq from the > __update_blocked_fair() walk that runs inside the NOHZ idle-balance > softirq, and again from update_load_avg() with UPDATE_TG. Its first > operation after the trivial early-outs is unconditionally: > > now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > return; > > Jakub ran into a system where nohz_idle_balance() was taking 75% > of a CPU (which is handling network traffic and doing many irq_exit_cpu > calls), with 35% of that CPU spent in update_load_avg, and 17% of the > CPU in sched_clock_cpu(), reading the TSC. > > There are some optimizations upstream already to reduce that overhead, > but it also looks like those rdtsc calls may not me necessary at all, > giving another easy win. > > Switch the rate-limit to read rq_clock_task(rq_of(cfs_rq)) instead. > This does two things: > > 1. Eliminates the rdtsc. rq->clock_task is already updated by the > enclosing update_rq_clock(rq), sits in a hot cacheline, and reads > as a single load. > > 2. Aligns the rate-limit clock with the clock the rate-limited data > is computed on. cfs_rq->avg.load_avg (the value being published > to tg->load_avg) is computed by > update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq) > where cfs_rq_clock_pelt() is derived from rq->clock_pelt, which is > derived from rq->clock_task. PELT intentionally excludes IRQ-handling > time and steal time from its decay (commit 23127296889f > "sched/fair: Update scale invariance of PELT"), so > cfs_rq->avg.load_avg already evolves in clock_task time. The > rate-limit was the only outlier in the propagation chain using a > different clock (wall time, via sched_clock_cpu). > > Under normal load and on idle CPUs (where clock_task advances at > wall-clock rate) behaviour is unchanged. Under heavy IRQ load > clock_task advances slower than wall time, so the rate-limit fires > less often -- consistent with the fact that the underlying > cfs_rq->avg.load_avg is also changing slower under the same > conditions. The publish cadence tracks the signal. > > Note: update_tg_load_avg() propagates the *already-decayed* > cfs_rq->avg.load_avg to tg->load_avg; it does not drive decay. Decay > happens in update_cfs_rq_load_avg() on the PELT clock regardless of > what clock the rate-limit uses, so this change cannot lose decay > information. The rate-limit governs how often we publish, not how > fast load decays. > > All callers of update_tg_load_avg() and clear_tg_load_avg() hold > rq->lock and have called update_rq_clock(rq) within microseconds: > > caller pre-state > __update_blocked_fair encloser did update_rq_clock(rq) > update_load_avg's three UPDATE_TG sites under rq->lock after enqueue/dequeue/update_curr > attach_/detach_entity_cfs_rq preceded by update_load_avg(...) > clear_tg_load_avg via offline path rq_clock_start_loop_update(rq) upfront > > so rq->clock_task is fresh at every call. Since cfs_rqs are per-CPU > per-task_group, cfs_rq->last_update_tg_load_avg is always compared > against the same rq's clock; no cross-rq drift. > > The same hoisting pattern was recently applied to find_new_ilb() in > commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()"). > > Signed-off-by: Rik van Riel > Assisted-by: Claude (Anthropic) Looks good to me. Reviewed-by: Aaron Lu