From: Imran Khan <imran.f.khan@oracle.com>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org
Cc: dietmar.eggemann@arm.com, rostedt@goodmis.org,
bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
linux-kernel@vger.kernel.org
Subject: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
Date: Tue, 21 Apr 2026 13:06:21 +0800
Message-ID: <20260421050622.19869-2-imran.f.khan@oracle.com>
In-Reply-To: <20260421050622.19869-1-imran.f.khan@oracle.com>
On large-scale systems, for example with 768 CPUs and cpusets consisting
of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
close to or the same as now.
This causes nohz.next_balance to perpetually equal the current jiffies,
which makes the time-based check in nohz_balancer_kick() always fail.
For example, placing a dtrace probe at nohz_balancer_kick() on such a
system, we can see that nohz.next_balance is at the current jiffy at
almost every tick:
447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
On such a system, setting nohz.next_balance to the next jiffy can cause
kick_ilb() to run on almost every tick, and this in turn can consume a lot
of CPU cycles in the subsequent nohz idle balancing.
So set nohz.next_balance based on the number of currently idle CPUs, such
that each doubling of idle CPUs beyond 32 advances nohz.next_balance by one
additional jiffy. This allows nohz_balancer_kick() to bail out early.
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
kernel/sched/fair.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be74..bd35275a05b38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
* Increase nohz.next_balance only when if full ilb is triggered but
* not if we only update stats.
*/
- if (flags & NOHZ_BALANCE_KICK)
- nohz.next_balance = jiffies+1;
+ if (flags & NOHZ_BALANCE_KICK) {
+ unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
+
+ /*
+ * On large systems, there may always be some idle CPU(s) with
+ * rq->next_balance close to or at current time, thus causing
+ * frequent invocation of kick_ilb() from nohz_balancer_kick().
+ * Adjust next_balance based on the number of idle CPUs.
+ */
+ nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
+ }
ilb_cpu = find_new_ilb();
if (ilb_cpu < 0)
--
2.34.1
Thread overview: 9+ messages
2026-04-21 5:06 [PATCH 0/2] sched/fair: Reduce nohz_idle_balance CPU overhead on large systems Imran Khan
2026-04-21 5:06 ` Imran Khan [this message]
2026-04-21 17:30 ` [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs Shrikanth Hegde
2026-04-22 7:54 ` Vincent Guittot
2026-04-22 16:13 ` imran.f.khan
2026-04-24 9:46 ` Vincent Guittot
2026-04-28 10:52 ` imran.f.khan
2026-04-28 15:06 ` Vincent Guittot
2026-04-21 5:06 ` [PATCH 2/2] sched/fair: distribute nohz ILB work across " Imran Khan