From mboxrd@z Thu Jan 1 00:00:00 1970
From: Imran Khan
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org
Cc: dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org
Subject: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
Date: Tue, 21 Apr 2026 13:06:21 +0800
Message-Id: <20260421050622.19869-2-imran.f.khan@oracle.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260421050622.19869-1-imran.f.khan@oracle.com>
References: <20260421050622.19869-1-imran.f.khan@oracle.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

On large scale systems, for example with 768 CPUs and cpusets consisting of
380+ CPUs, there may always be some idle CPU whose rq->next_balance is close
to, or the same as, the current jiffy.
This causes nohz.next_balance to perpetually equal the current jiffies value,
which makes the time based check in nohz_balancer_kick() always fail. For
example, placing a dtrace probe at nohz_balancer_kick() on such a system shows
that nohz.next_balance is at the current jiffy on almost every tick:

447   9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
447   9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
447   9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
447   9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
447   9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
447   9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
447   9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
447   9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
447   9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
447   9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
447   9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
447   9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
447   9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
447   9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
447   9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
447   9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878

On such a system, setting nohz.next_balance to the next jiffy can cause
kick_ilb() to run on almost every tick, and this in turn can consume a lot of
CPU cycles in the subsequent nohz idle balancing. So set nohz.next_balance
based on the number of currently idle CPUs, such that for every doubling of
the idle-CPU count beyond 32, nohz.next_balance is advanced by one additional
jiffy. This lets nohz_balancer_kick() bail out early.
Signed-off-by: Imran Khan
---
 kernel/sched/fair.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be74..bd35275a05b38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
 	 * Increase nohz.next_balance only when if full ilb is triggered but
 	 * not if we only update stats.
 	 */
-	if (flags & NOHZ_BALANCE_KICK)
-		nohz.next_balance = jiffies+1;
+	if (flags & NOHZ_BALANCE_KICK) {
+		unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
+
+		/*
+		 * On large systems, there may always be some idle CPU(s) with
+		 * rq->next_balance close to or at current time, thus causing
+		 * frequent invocation of kick_ilb() from nohz_balancer_kick().
+		 * Adjust next_balance based on the number of idle CPUs.
+		 */
+		nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
+	}

 	ilb_cpu = find_new_ilb();

 	if (ilb_cpu < 0)
--
2.34.1