public inbox for linux-kernel@vger.kernel.org
From: imran.f.khan@oracle.com
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
Date: Thu, 23 Apr 2026 00:13:03 +0800	[thread overview]
Message-ID: <610a00cc-e6be-42ab-9c70-e3f24d66e7a7@oracle.com> (raw)
In-Reply-To: <CAKfTPtAcF4PTMT7TDrq+_mpYp=ebWA5Ws0P-nZ6gOz0BkEGTkQ@mail.gmail.com>

Hello Vincent,
Thanks a lot for looking into this.
On 22/4/2026 3:54 pm, Vincent Guittot wrote:
> On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@oracle.com> wrote:
>>
>> On large scale systems, for example with 768 CPUs and cpusets consisting
>> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
>> close to or the same as now.
>> This causes nohz.next_balance to be perpetually the same as the current
>> jiffies, making the time-based check in nohz_balancer_kick() always fail.
>>
>> For example putting dtrace probe at nohz_balancer_kick, on such a system,
>> we can see that nohz.next_balance is at current jiffy at almost each tick:
>>
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>>
>> On such a system, setting nohz.next_balance to the next jiffy can cause
>> kick_ilb() to run almost every tick, and this in turn can consume a lot of
>> CPU cycles in the subsequent nohz idle balancing.
>> So set nohz.next_balance based on the number of currently idle CPUs, such
>> that beyond 32 idle CPUs each doubling of the idle-CPU count advances
>> nohz.next_balance by one more jiffy.
>> This lets nohz_balancer_kick() bail out early.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
>> ---
>>  kernel/sched/fair.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ab4114712be74..bd35275a05b38 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>>          * Increase nohz.next_balance only when if full ilb is triggered but
>>          * not if we only update stats.
>>          */
>> -       if (flags & NOHZ_BALANCE_KICK)
>> -               nohz.next_balance = jiffies+1;
> 
> This +1 only cheaply prevents multiple nohz_ilb from happening
> simultaneously during the current jiffies.
> 
> The actual update of nohz.next_balance is done in _nohz_idle_balance()
> and reflects the next balance of all idle rqs. You should look at the
> balance interval of your sched_domains. The min interval is the weight
> of the sched_domain, which can be 2 at the SMT level.
> 

I did not look at the balance interval of the involved sched domain.
IIUC, once nohz.next_balance has been updated in _nohz_idle_balance(),
we will see that updated value in nohz_balancer_kick(), and if it is further
from the current jiffies, the time_before(now, nohz.next_balance) test would
cause nohz_balancer_kick() to bail out without updating flags, and that in
turn would avoid the kick_ilb() path.
Since jiffies and nohz.next_balance were appearing close or equal in
nohz_balancer_kick(), and I could see that CPU 2 was executing
nohz_csd_func() almost instantly and pretty much at the frequency of each
tick (dtrace snippet shown below), my conclusion was that one or more CPUs
in the sched domain of CPU 2 must have had their rq->next_balance close to
or the same as the current jiffies.

ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3

Could you please let me know if this understanding is incorrect?

Regarding the question of sched_domain topology, this host
has 768 CPUs, and almost all of them (except 6) have been
divided between 2 cpusets (one per node). For example, for
the node0 CPUs we have:

# cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
root
# cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
2-191,386-575

and their sched_domains look as shown below:

cpu2:
  domain0: cpus=2,386
  domain1: cpus=2-15,386-399
  domain2: cpus=2-191,386-575
cpu3:
  domain0: cpus=3,387
  domain1: cpus=2-15,386-399
  domain2: cpus=2-191,386-575
cpu4:
  domain0: cpus=4,388
  domain1: cpus=2-15,386-399
  domain2: cpus=2-191,386-575
.....
.....

Could you please suggest whether updating rq->next_balance or
the final nohz.next_balance with some other logic could help reduce
the CPU usage of _nohz_idle_balance(), or should we just ignore it
because the CPU is idle anyway?

On these systems I can see that CPU 2 is doing most of this work.
Running a perf top on CPU 2 gives numbers like:

    21.69%  [kernel]       [k] __update_blocked_fair
    11.40%  [kernel]       [k] update_load_avg
     9.36%  [kernel]       [k] __update_load_avg_cfs_rq
     8.07%  [kernel]       [k] update_rq_clock
     7.09%  [kernel]       [k] __update_load_avg_se
     4.67%  [kernel]       [k] update_irq_load_avg

.....
.....
    22.26%  [kernel]       [k] __update_blocked_fair
    10.89%  [kernel]       [k] update_load_avg
     9.65%  [kernel]       [k] __update_load_avg_cfs_rq
     7.80%  [kernel]       [k] update_rq_clock
     7.23%  [kernel]       [k] __update_load_avg_se
     4.76%  [kernel]       [k] update_sg_lb_stats

and mpstat also shows softirq usage of around 20-25% on CPU 2,
most of which is due to SCHED_SOFTIRQ leading into
_nohz_idle_balance().

Thanks,
Imran

PS: I used the following dtrace snippets to get the nohz_balancer_kick()
data shown earlier and the nohz_csd_func() data shown in this message.

dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'

fbt::nohz_csd_func:entry
{
    this->rq = (struct rq *)arg0;
    this->rq_cpu = this->rq->cpu;
    this->rq_nohz_flags = this->rq->nohz_flags.counter;
    this->ts_ms = (unsigned long)(walltimestamp / 1000000);
    printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
}

> Which kind of sched_domain topology do you have?
> 
> 
>> +       if (flags & NOHZ_BALANCE_KICK) {
>> +               unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
>> +
>> +               /*
>> +                * On large systems, there may always be some idle CPU(s) with
>> +                * rq->next_balance close to or at current time, thus causing
>> +                * frequent invocation of kick_ilb() from nohz_balancer_kick().
>> +                * Adjust next_balance based on the number of idle CPUs.
>> +                */
>> +               nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
>> +       }
>>
>>         ilb_cpu = find_new_ilb();
>>         if (ilb_cpu < 0)
>> --
>> 2.34.1
>>



Thread overview: 9+ messages
2026-04-21  5:06 [PATCH 0/2] sched/fair: Reduce nohz_idle_balance CPU overhead on large systems Imran Khan
2026-04-21  5:06 ` [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs Imran Khan
2026-04-21 17:30   ` Shrikanth Hegde
2026-04-22  7:54   ` Vincent Guittot
2026-04-22 16:13     ` imran.f.khan [this message]
2026-04-24  9:46       ` Vincent Guittot
2026-04-28 10:52         ` imran.f.khan
2026-04-28 15:06           ` Vincent Guittot
2026-04-21  5:06 ` [PATCH 2/2] sched/fair: distribute nohz ILB work across " Imran Khan
