Subject: Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
From: Shrikanth Hegde
To: Imran Khan
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org
Date: Tue, 21 Apr 2026 23:00:26 +0530
Message-ID: <429667c2-f9cd-4c98-8f61-acb43bfd7ccd@linux.ibm.com>
In-Reply-To: <20260421050622.19869-2-imran.f.khan@oracle.com>
References: <20260421050622.19869-1-imran.f.khan@oracle.com> <20260421050622.19869-2-imran.f.khan@oracle.com>
Hi Imran,

On 4/21/26 10:36 AM, Imran Khan wrote:
> On large scale systems, for example with 768 CPUs and cpusets consisting
> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> close to or the same as now.
> This causes nohz.next_balance to be perpetually the same as current jiffies,
> thus causing the time-based check in nohz_balancer_kick() to always fail.

Some benchmarks will be happy with a faster idle load balance and some will
not. Could you share the performance numbers or benchmarks you have tried?

>
> For example, putting a dtrace probe at nohz_balancer_kick, on such a system,
> we can see that nohz.next_balance is at the current jiffy at almost every tick:
>

This depends on the system utilization too. When the system is idle, I see
nohz.next_balance increment randomly. But around 50% utilization, it
increments by 1-2 ticks, a similar observation to yours. What was the
utilization in the case below? Or was it a combination of a specific number
of threads and their utilization?
> 447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> 447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> 447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> 447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> 447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> 447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> 447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> 447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> 447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> 447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> 447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> 447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> 447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> 447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> 447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> 447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>
> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> to run almost every tick, and this in turn can consume a lot of CPU cycles in
> subsequent nohz idle balancing.
> So set nohz.next_balance based on the number of currently idle CPUs, such that
> for 32 idle CPUs nohz.next_balance is advanced further by 1 jiffy.
> This will allow nohz_balancer_kick() to bail out early.
>

I gave the patch series a go and observed at 25% load how the increments
happen. I have attached the tracing diff at the end. I still see
nohz.next_balance increment by 1-2 ticks under the same 25% load in some
places.
Overall it is better with the patch, but it is very difficult to observe the
improvement. How does nohz.next_balance increment in your case with the patch?

> Signed-off-by: Imran Khan
> ---
>  kernel/sched/fair.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ab4114712be74..bd35275a05b38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>  	 * Increase nohz.next_balance only when if full ilb is triggered but
>  	 * not if we only update stats.
>  	 */
> -	if (flags & NOHZ_BALANCE_KICK)
> -		nohz.next_balance = jiffies+1;
> +	if (flags & NOHZ_BALANCE_KICK) {
> +		unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> +
> +		/*
> +		 * On large systems, there may always be some idle CPU(s) with
> +		 * rq->next_balance close to or at current time, thus causing
> +		 * frequent invocation of kick_ilb() from nohz_balancer_kick().
> +		 * Adjust next_balance based on the number of idle CPUs.
> +		 */
> +		nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);

Also, I have seen with traces using the below patch that nohz.next_balance
sometimes goes backwards (without your patches too). I did WRITE_ONCE for all
nohz.next_balance writes and still see it. Shouldn't be a big concern, I
guess.

PS: I have used the below diff to print the values.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a298d149f29..452a981df48b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12525,6 +12525,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	 * But idle load balancing is not done as find_new_ilb fails.
 	 * That's very rare. So read nohz.nr_cpus only if time is due.
 	 */
+	trace_printk("cpu: %d, jiffies: %lu, next_balance: %lu\n", cpu, now, nohz.next_balance);
 	if (time_before(now, nohz.next_balance))
 		goto out;