public inbox for linux-kernel@vger.kernel.org
From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: Imran Khan <imran.f.khan@oracle.com>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
Date: Tue, 21 Apr 2026 23:00:26 +0530	[thread overview]
Message-ID: <429667c2-f9cd-4c98-8f61-acb43bfd7ccd@linux.ibm.com> (raw)
In-Reply-To: <20260421050622.19869-2-imran.f.khan@oracle.com>

Hi Imran,

On 4/21/26 10:36 AM, Imran Khan wrote:
> On large scale systems, for example with 768 CPUs and cpusets consisting
> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> close to or same as now.
> This causes nohz.next_balance to be perpetually same as current jiffies,
> and thus the time-based check in nohz_balancer_kick() always fails.

Some benchmarks will benefit from faster idle load balancing and some will not.
Could you share the performance numbers or benchmarks you have tried?

> 
> For example putting dtrace probe at nohz_balancer_kick, on such a system,
> we can see that nohz.next_balance is at current jiffy at almost each tick:
> 

This depends on the system utilization too. When the system is idle, I see
nohz.next_balance increment randomly. But at around 50% utilization, it increments
by 1-2 ticks, which matches your observation.

What was the utilization in the case below? Or was it a combination of a specific
number of threads and their utilization?

> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
> 
> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> to run almost every tick, and this in turn can consume a lot of CPU cycles in
> subsequent nohz idle balancing.
> So set nohz.next_balance based on the number of currently idle CPUs, such that
> for 32 idle CPUs nohz.next_balance is advanced further by 1 jiffy.
> This will allow nohz_balancer_kick() to bail out early.
> 

I gave the patch series a go and observed it at 25% load to see how the increments
happen. I have attached the tracing diff at the end.

I still see nohz.next_balance increment by 1-2 ticks under the same 25% load in
some places. Overall it is better with the patch, but it is very difficult to
observe the improvement.

How does nohz.next_balance increment in your case with the patch?

> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> ---
>   kernel/sched/fair.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ab4114712be74..bd35275a05b38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>   	 * Increase nohz.next_balance only when if full ilb is triggered but
>   	 * not if we only update stats.
>   	 */
> -	if (flags & NOHZ_BALANCE_KICK)
> -		nohz.next_balance = jiffies+1;
> +	if (flags & NOHZ_BALANCE_KICK) {
> +		unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> +
> +		/*
> +		 * On large systems, there may always be some idle CPU(s) with
> +		 * rq->next_balance close to or at current time, thus causing
> +		 * frequent invocation of kick_ilb() from nohz_balancer_kick().
> +		 * Adjust next_balance based on the number of idle CPUs.
> +		 */
> +		nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);


Also, with traces from the patch below, I have seen that nohz.next_balance
sometimes goes backwards (without your patches too).
I tried WRITE_ONCE for all nohz.next_balance writes and still see it.

Shouldn't be a big concern, I guess.


PS:
I have used below diff to print the values.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a298d149f29..452a981df48b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12525,6 +12525,7 @@ static void nohz_balancer_kick(struct rq *rq)
          * But idle load balancing is not done as find_new_ilb fails.
          * That's very rare. So read nohz.nr_cpus only if time is due.
          */
+       trace_printk("cpu: %d, jiffies: %lu, next_balance: %lu\n", cpu, now, nohz.next_balance);
         if (time_before(now, nohz.next_balance))
                 goto out;



Thread overview: 9+ messages
2026-04-21  5:06 [PATCH 0/2] sched/fair: Reduce nohz_idle_balance CPU overhead on large systems Imran Khan
2026-04-21  5:06 ` [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs Imran Khan
2026-04-21 17:30   ` Shrikanth Hegde [this message]
2026-04-22  7:54   ` Vincent Guittot
2026-04-22 16:13     ` imran.f.khan
2026-04-24  9:46       ` Vincent Guittot
2026-04-28 10:52         ` imran.f.khan
2026-04-28 15:06           ` Vincent Guittot
2026-04-21  5:06 ` [PATCH 2/2] sched/fair: distribute nohz ILB work across " Imran Khan
