public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] sched/fair: Reduce nohz_idle_balance CPU overhead on large systems
@ 2026-04-21  5:06 Imran Khan
  2026-04-21  5:06 ` [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs Imran Khan
  2026-04-21  5:06 ` [PATCH 2/2] sched/fair: distribute nohz ILB work across " Imran Khan
  0 siblings, 2 replies; 9+ messages in thread
From: Imran Khan @ 2026-04-21  5:06 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

On large systems (700+ CPUs, 350+ CPUs in the root sched_domain/cpuset),
nohz idle balancing can consume a significant amount of CPU due to two
independent problems.

First, with a large number of CPUs there is a very good chance of
nohz.next_balance always being the same as, or very close to, the current
jiffies, causing nohz idle balance work to happen on almost every tick.

Second, find_new_ilb() uses for_each_cpu_and() to iterate idle_cpus_mask
from the lowest bit, so the lowest-numbered idle CPU in the cpuset bears
the full burden of nohz ILB work, and most of the time it is the same CPU.
Again, on large scale systems this work becomes significant and unfairly
consumes cycles on the same CPU.

Patch 1 addresses the first issue by advancing nohz.next_balance based on
the number of idle CPUs and patch 2 addresses the second issue by
distributing the nohz ILB work across eligible idle CPUs.

Imran Khan (2):
  sched/fair: scale nohz.next_balance according to number of idle CPUs.
  sched/fair: distribute nohz ILB work across idle CPUs.

 kernel/sched/fair.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)


base-commit: 591cd656a1bf5ea94a222af5ef2ee76df029c1d2
-- 
2.34.1



* [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
  2026-04-21  5:06 [PATCH 0/2] sched/fair: Reduce nohz_idle_balance CPU overhead on large systems Imran Khan
@ 2026-04-21  5:06 ` Imran Khan
  2026-04-21 17:30   ` Shrikanth Hegde
  2026-04-22  7:54   ` Vincent Guittot
  2026-04-21  5:06 ` [PATCH 2/2] sched/fair: distribute nohz ILB work across " Imran Khan
  1 sibling, 2 replies; 9+ messages in thread
From: Imran Khan @ 2026-04-21  5:06 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

On large scale systems, for example with 768 CPUs and cpusets consisting
of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
close to or the same as now.
This causes nohz.next_balance to perpetually equal the current jiffies,
making the time-based check in nohz_balancer_kick() always fail.

For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
we can see that nohz.next_balance is at the current jiffy on almost every tick:

447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878

On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
to run on almost every tick, and this in turn can consume a lot of CPU cycles in
subsequent nohz idle balancing.
So set nohz.next_balance based on the number of currently idle CPUs, such that
each doubling of the idle CPU count beyond 32 advances nohz.next_balance by one
additional jiffy. This allows nohz_balancer_kick() to bail out early.
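
For illustration, the added delay works out as below; this is a stand-alone
user-space sketch of the same arithmetic (the ilog2() helper here mimics the
kernel's and is not part of the patch):

#include <stdio.h>

/* floor(log2(n)) for n > 0, mimicking the kernel's ilog2() */
static unsigned int ilog2(unsigned int n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

int main(void)
{
	unsigned int samples[] = { 16, 32, 64, 128, 256, 512 };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		unsigned int n = samples[i];
		unsigned int extra = (n > 32) ? ilog2(n) - 4 : 0;

		/* 16 -> +1, 32 -> +1, 64 -> +3, 128 -> +4, 256 -> +5, 512 -> +6 */
		printf("nr_idle = %3u: next_balance = jiffies + %u\n", n, 1 + extra);
	}
	return 0;
}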

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 kernel/sched/fair.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be74..bd35275a05b38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
 	 * Increase nohz.next_balance only when if full ilb is triggered but
 	 * not if we only update stats.
 	 */
-	if (flags & NOHZ_BALANCE_KICK)
-		nohz.next_balance = jiffies+1;
+	if (flags & NOHZ_BALANCE_KICK) {
+		unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
+
+		/*
+		 * On large systems, there may always be some idle CPU(s) with
+		 * rq->next_balance close to or at current time, thus causing
+		 * frequent invocation of kick_ilb() from nohz_balancer_kick().
+		 * Adjust next_balance based on the number of idle CPUs.
+		 */
+		nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
+	}
 
 	ilb_cpu = find_new_ilb();
 	if (ilb_cpu < 0)
-- 
2.34.1



* [PATCH 2/2] sched/fair: distribute nohz ILB work across idle CPUs.
  2026-04-21  5:06 [PATCH 0/2] sched/fair: Reduce nohz_idle_balance CPU overhead on large systems Imran Khan
  2026-04-21  5:06 ` [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs Imran Khan
@ 2026-04-21  5:06 ` Imran Khan
  1 sibling, 0 replies; 9+ messages in thread
From: Imran Khan @ 2026-04-21  5:06 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

find_new_ilb() uses for_each_cpu_and() to iterate nohz.idle_cpus_mask
from the lowest bit upward, returning the first idle housekeeping CPU
it finds. This can (unfairly) select the lowest nohz idle CPU most of
the time.

Fix this by selecting the nohz ILB CPU in a round-robin fashion, thus
distributing the nohz ILB work (which can be significant on large
scale systems) across all eligible idle CPUs.
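
To illustrate the rotation, here is a minimal user-space sketch of the
wrap-around scan, with a plain array standing in for the cpumask (an
illustration only, not the kernel implementation):

#include <stdio.h>

#define NR_CPUS 8

/*
 * Scan idle[] starting just past 'last' and wrap around, the way
 * for_each_cpu_wrap() does; return the first idle CPU found, or -1.
 */
static int pick_ilb(const int idle[NR_CPUS], int last)
{
	int i;

	for (i = 1; i <= NR_CPUS; i++) {
		int cpu = (last + i) % NR_CPUS;

		if (idle[cpu])
			return cpu;
	}
	return -1;
}

int main(void)
{
	int idle[NR_CPUS] = { 0, 1, 1, 0, 1, 0, 1, 0 };
	int last = -1, k;

	/* Successive picks rotate across the idle CPUs: 1, 2, 4, 6, 1, ... */
	for (k = 0; k < 5; k++) {
		last = pick_ilb(idle, last);
		printf("pick %d: CPU %d\n", k, last);
	}
	return 0;
}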

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 kernel/sched/fair.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd35275a05b38..93bdb542ff714 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7213,6 +7213,7 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	int has_blocked_load;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
+	int ilb_cpu_last;		/* Last CPU selected for nohz ILB */
 	unsigned long next_balance;     /* in jiffy units */
 	unsigned long next_blocked;	/* Next update of blocked load in jiffies */
 } nohz ____cacheline_aligned;
@@ -12420,13 +12421,17 @@ static inline int find_new_ilb(void)
 
 	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
 
-	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
+	for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.ilb_cpu_last + 1) {
+		if (!cpumask_test_cpu(ilb_cpu, hk_mask))
+			continue;
 
 		if (ilb_cpu == smp_processor_id())
 			continue;
 
-		if (idle_cpu(ilb_cpu))
+		if (idle_cpu(ilb_cpu)) {
+			nohz.ilb_cpu_last = ilb_cpu;
 			return ilb_cpu;
+		}
 	}
 
 	return -1;
-- 
2.34.1



* Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
  2026-04-21  5:06 ` [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs Imran Khan
@ 2026-04-21 17:30   ` Shrikanth Hegde
  2026-04-22  7:54   ` Vincent Guittot
  1 sibling, 0 replies; 9+ messages in thread
From: Shrikanth Hegde @ 2026-04-21 17:30 UTC (permalink / raw)
  To: Imran Khan
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel

Hi Imran,

On 4/21/26 10:36 AM, Imran Khan wrote:
> On large scale systems, for example with 768 CPUs and cpusets consisting
> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> close to or the same as now.
> This causes nohz.next_balance to perpetually equal the current jiffies,
> making the time-based check in nohz_balancer_kick() always fail.

Some benchmarks will be happy with a faster idle load balance and some will not.
Could you share the performance numbers or the benchmarks you have tried?

> 
> For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
> we can see that nohz.next_balance is at the current jiffy on almost every tick:
> 

This depends on the system utilization too. When the system is idle, I see
nohz.next_balance increment randomly. But around 50% utilization, it increments
by 1-2 ticks. A similar observation to yours.

What was the utilization in the below case? Or was it a combination of a
specific number of threads and their utilization?

> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
> 
> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> to run on almost every tick, and this in turn can consume a lot of CPU cycles in
> subsequent nohz idle balancing.
> So set nohz.next_balance based on the number of currently idle CPUs, such that
> each doubling of the idle CPU count beyond 32 advances nohz.next_balance by one
> additional jiffy. This allows nohz_balancer_kick() to bail out early.
> 

I gave the patch series a go and observed it at 25% load to see how the
increments happen. I have attached the tracing diff at the end.

I still see nohz.next_balance increment by 1-2 ticks under the same 25% load
in some places. Overall it is better with the patch, but it is very difficult
to observe the improvement.

How does nohz.next_balance increment in your case with the patch?

> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> ---
>   kernel/sched/fair.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ab4114712be74..bd35275a05b38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>   	 * Increase nohz.next_balance only when if full ilb is triggered but
>   	 * not if we only update stats.
>   	 */
> -	if (flags & NOHZ_BALANCE_KICK)
> -		nohz.next_balance = jiffies+1;
> +	if (flags & NOHZ_BALANCE_KICK) {
> +		unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> +
> +		/*
> +		 * On large systems, there may always be some idle CPU(s) with
> +		 * rq->next_balance close to or at current time, thus causing
> +		 * frequent invocation of kick_ilb() from nohz_balancer_kick().
> +		 * Adjust next_balance based on the number of idle CPUs.
> +		 */
> +		nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);


Also, I have seen in traces, using the below patch, that nohz.next_balance goes
backwards sometimes (without your patches too).
I did WRITE_ONCE for all nohz.next_balance writes; it is still seen.

Shouldn't be a big concern, I guess.


PS:
I have used the below diff to print the values.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a298d149f29..452a981df48b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12525,6 +12525,7 @@ static void nohz_balancer_kick(struct rq *rq)
          * But idle load balancing is not done as find_new_ilb fails.
          * That's very rare. So read nohz.nr_cpus only if time is due.
          */
+       trace_printk("cpu: %d, jiffies: %lu, next_balance: %lu\n", cpu, now, nohz.next_balance);
         if (time_before(now, nohz.next_balance))
                 goto out;



* Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
  2026-04-21  5:06 ` [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs Imran Khan
  2026-04-21 17:30   ` Shrikanth Hegde
@ 2026-04-22  7:54   ` Vincent Guittot
  2026-04-22 16:13     ` imran.f.khan
  1 sibling, 1 reply; 9+ messages in thread
From: Vincent Guittot @ 2026-04-22  7:54 UTC (permalink / raw)
  To: Imran Khan
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, linux-kernel

On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@oracle.com> wrote:
>
> On large scale systems, for example with 768 CPUs and cpusets consisting
> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> close to or the same as now.
> This causes nohz.next_balance to perpetually equal the current jiffies,
> making the time-based check in nohz_balancer_kick() always fail.
>
> For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
> we can see that nohz.next_balance is at the current jiffy on almost every tick:
>
> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>
> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> to run on almost every tick, and this in turn can consume a lot of CPU cycles in
> subsequent nohz idle balancing.
> So set nohz.next_balance based on the number of currently idle CPUs, such that
> each doubling of the idle CPU count beyond 32 advances nohz.next_balance by one
> additional jiffy. This allows nohz_balancer_kick() to bail out early.
>
> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> ---
>  kernel/sched/fair.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ab4114712be74..bd35275a05b38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>          * Increase nohz.next_balance only when if full ilb is triggered but
>          * not if we only update stats.
>          */
> -       if (flags & NOHZ_BALANCE_KICK)
> -               nohz.next_balance = jiffies+1;

This +1 only cheaply prevents multiple nohz_ilb from happening
simultaneously during the current jiffies.

The actual update of nohz.next_balance is done in _nohz_idle_balance()
and reflects the next balance of all idle rqs. You should look at the
balance interval of your sched_domains. The min interval is the weight
of the sched_domain, which can be 2 at the SMT level.
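
For reference, the per-domain interval is computed roughly as below
(simplified from get_sd_balance_interval() in kernel/sched/fair.c; exact
details vary across kernel versions):

	interval = sd->balance_interval;	/* initialized to the domain weight, in ms */
	if (cpu_busy)
		interval *= sd->busy_factor;
	interval = msecs_to_jiffies(interval);
	interval = clamp(interval, 1UL, max_load_balance_interval);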

Which kind of sched_domain topology do you have?


> +       if (flags & NOHZ_BALANCE_KICK) {
> +               unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> +
> +               /*
> +                * On large systems, there may always be some idle CPU(s) with
> +                * rq->next_balance close to or at current time, thus causing
> +                * frequent invocation of kick_ilb() from nohz_balancer_kick().
> +                * Adjust next_balance based on the number of idle CPUs.
> +                */
> +               nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
> +       }
>
>         ilb_cpu = find_new_ilb();
>         if (ilb_cpu < 0)
> --
> 2.34.1
>


* Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
  2026-04-22  7:54   ` Vincent Guittot
@ 2026-04-22 16:13     ` imran.f.khan
  2026-04-24  9:46       ` Vincent Guittot
  0 siblings, 1 reply; 9+ messages in thread
From: imran.f.khan @ 2026-04-22 16:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, linux-kernel

Hello Vincent,
Thanks a lot for taking a look into this.
On 22/4/2026 3:54 pm, Vincent Guittot wrote:
> On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@oracle.com> wrote:
>>
>> On large scale systems, for example with 768 CPUs and cpusets consisting
>> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
>> close to or the same as now.
>> This causes nohz.next_balance to perpetually equal the current jiffies,
>> making the time-based check in nohz_balancer_kick() always fail.
>>
>> For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
>> we can see that nohz.next_balance is at the current jiffy on almost every tick:
>>
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
>> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>>
>> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
>> to run on almost every tick, and this in turn can consume a lot of CPU cycles in
>> subsequent nohz idle balancing.
>> So set nohz.next_balance based on the number of currently idle CPUs, such that
>> each doubling of the idle CPU count beyond 32 advances nohz.next_balance by one
>> additional jiffy. This allows nohz_balancer_kick() to bail out early.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
>> ---
>>  kernel/sched/fair.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ab4114712be74..bd35275a05b38 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>>          * Increase nohz.next_balance only when if full ilb is triggered but
>>          * not if we only update stats.
>>          */
>> -       if (flags & NOHZ_BALANCE_KICK)
>> -               nohz.next_balance = jiffies+1;
> 
> This +1 only cheaply prevents multiple nohz_ilb from happening
> simultaneously during the current jiffies.
> 
> The actual update of nohz.next_balance is done in _nohz_idle_balance()
> and reflects the next balance of all idle rqs. You should look at the
> balance interval of your sched_domains. The min interval is the weight
> of the sched_domain, which can be 2 at the SMT level.
> 

I did not look at the balance interval of the involved sched domain.
IIUC, once nohz.next_balance has been updated in _nohz_idle_balance(),
we will see that updated value in nohz_balancer_kick(), and if it is further
from the current jiffies, the time_before(now, nohz.next_balance) test would
cause nohz_balancer_kick() to bail out without updating flags, and that in
turn would avoid the kick_ilb() path.
Since jiffies and nohz.next_balance were appearing close or the same in
nohz_balancer_kick(), and I could see that CPU 2 was executing nohz_csd_func()
almost instantly and pretty much at the frequency of each tick (dtrace snippet
shown below), my conclusion was that one or more CPUs in the sched domain of
CPU 2 must have had their rq->next_balance close to or the same as the current
jiffies.

ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3

Could you please let me know if this understanding is incorrect?

Regarding the question of sched_domain topology, this host
has 768 CPUs and almost all of them (except 6) have been divided
between 2 cpusets (one for each node). For example, for node0
CPUs we have:

# cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
root
# cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
2-191,386-575

and their sched_domains look like, as shown below:

cpu2:
  domain0: cpus=2,386
  domain1: cpus=2-15,386-399
  domain2: cpus=2-191,386-575
cpu3:
  domain0: cpus=3,387
  domain1: cpus=2-15,386-399
  domain2: cpus=2-191,386-575
cpu4:
  domain0: cpus=4,388
  domain1: cpus=2-15,386-399
  domain2: cpus=2-191,386-575
.....
.....

Could you please suggest whether updating rq->next_balance or
the final nohz.next_balance with some other logic could help reduce the
CPU usage of _nohz_idle_balance(), or should we just ignore it
because the CPU is idle anyway?

On these systems I can see that CPU 2 is doing most of this work.
Running a perf top on CPU 2 gives numbers like:

    21.69%  [kernel]       [k] __update_blocked_fair
    11.40%  [kernel]       [k] update_load_avg
     9.36%  [kernel]       [k] __update_load_avg_cfs_rq
     8.07%  [kernel]       [k] update_rq_clock
     7.09%  [kernel]       [k] __update_load_avg_se
     4.67%  [kernel]       [k] update_irq_load_avg

.....
.....
    22.26%  [kernel]       [k] __update_blocked_fair
    10.89%  [kernel]       [k] update_load_avg
     9.65%  [kernel]       [k] __update_load_avg_cfs_rq
     7.80%  [kernel]       [k] update_rq_clock
     7.23%  [kernel]       [k] __update_load_avg_se
     4.76%  [kernel]       [k] update_sg_lb_stats

and mpstat also shows softirq usage of around 20-25% on CPU 2 and 
most of that is due to SCHED_SOFTIRQ leading into 
_nohz_idle_balance.

Thanks,
Imran

PS: I used the following dtrace snippets to get nohz_balancer_kick
data shown earlier and nohz_csd_func() data shown in this message.

dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'



fbt::nohz_csd_func:entry
{
    this->rq = (struct rq *)arg0;
    this->rq_cpu = this->rq->cpu;
    this->rq_nohz_flags = this->rq->nohz_flags.counter;
    this->ts_ms = (unsigned long)(walltimestamp / 1000000);
    printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
    /*printf("[%lu] IPI received on cpu=%d\n",
           this->ts_ms, cpu);*/
    /*@ipi_rate[cpu] = count();*/
}

> Which kind of sched_domain topology do you have?
> 
> 
>> +       if (flags & NOHZ_BALANCE_KICK) {
>> +               unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
>> +
>> +               /*
>> +                * On large systems, there may always be some idle CPU(s) with
>> +                * rq->next_balance close to or at current time, thus causing
>> +                * frequent invocation of kick_ilb() from nohz_balancer_kick().
>> +                * Adjust next_balance based on the number of idle CPUs.
>> +                */
>> +               nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
>> +       }
>>
>>         ilb_cpu = find_new_ilb();
>>         if (ilb_cpu < 0)
>> --
>> 2.34.1
>>



* Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
  2026-04-22 16:13     ` imran.f.khan
@ 2026-04-24  9:46       ` Vincent Guittot
  2026-04-28 10:52         ` imran.f.khan
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Guittot @ 2026-04-24  9:46 UTC (permalink / raw)
  To: imran.f.khan
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, linux-kernel

On Wed, 22 Apr 2026 at 18:13, <imran.f.khan@oracle.com> wrote:
>
> Hello Vincent,
> Thanks a lot for taking a look into this.
> On 22/4/2026 3:54 pm, Vincent Guittot wrote:
> > On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@oracle.com> wrote:
> >>
> >> On large scale systems, for example with 768 CPUs and cpusets consisting
> >> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> >> close to or the same as now.
> >> This causes nohz.next_balance to perpetually equal the current jiffies,
> >> making the time-based check in nohz_balancer_kick() always fail.
> >>
> >> For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
> >> we can see that nohz.next_balance is at the current jiffy on almost every tick:
> >>
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> >> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
> >>
> >> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> >> to run on almost every tick, and this in turn can consume a lot of CPU cycles in
> >> subsequent nohz idle balancing.
> >> So set nohz.next_balance based on the number of currently idle CPUs, such that
> >> each doubling of the idle CPU count beyond 32 advances nohz.next_balance by one
> >> additional jiffy. This allows nohz_balancer_kick() to bail out early.
> >>
> >> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> >> ---
> >>  kernel/sched/fair.c | 13 +++++++++++--
> >>  1 file changed, 11 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ab4114712be74..bd35275a05b38 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
> >>          * Increase nohz.next_balance only when if full ilb is triggered but
> >>          * not if we only update stats.
> >>          */
> >> -       if (flags & NOHZ_BALANCE_KICK)
> >> -               nohz.next_balance = jiffies+1;
> >
> > This +1 only cheaply prevents multiple nohz_ilb from happening
> > simultaneously during the current jiffies.
> >
> > The actual update of nohz.next_balance is done in _nohz_idle_balance()
> > and reflects the next balance of all idle rqs. You should look at the
> > balance interval of your sched_domains. The min interval is the weight
> > of the sched_domain, which can be 2 at the SMT level.
> >
>
> I did not look at the balance interval of the involved sched domain.
> IIUC, once nohz.next_balance has been updated in _nohz_idle_balance(),
> we will see that updated value in nohz_balancer_kick(), and if it is further
> from the current jiffies, the time_before(now, nohz.next_balance) test would
> cause nohz_balancer_kick() to bail out without updating flags, and that in
> turn would avoid the kick_ilb() path.

yes

> Since jiffies and nohz.next_balance were appearing close or the same in
> nohz_balancer_kick(), and I could see that CPU 2 was executing nohz_csd_func()
> almost instantly and pretty much at the frequency of each tick (dtrace snippet
> shown below), my conclusion was that one or more CPUs in the sched domain of
> CPU 2 must have had their rq->next_balance close to or the same as the current
> jiffies.

Yes

>
> ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3
>
> Could you please let me know if this understanding is incorrect?

yes, it is correct.

The ILB is kicked for several reasons:
- NOHZ_BALANCE_KICK: periodic load balance, based on the
balance_interval of each sched_domain
- NOHZ_STATS_KICK: update of statistics, i.e. decaying the blocked load
- NOHZ_NEXT_KICK: loop over the idle CPUs to update nohz.next_balance when a
CPU becomes idle.

NOHZ_NEXT_KICK and NOHZ_STATS_KICK can be set independently for a
"cheap" idle load balance,

and NOHZ_STATS_KICK is set whenever NOHZ_BALANCE_KICK is set, to take
advantage of the ILB to update the blocked load instead of kicking
another one just for updating the stats.
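
For reference, these are individual bits (see kernel/sched/sched.h; the
values below assume the current layout):

#define NOHZ_BALANCE_KICK_BIT	0
#define NOHZ_STATS_KICK_BIT	1
#define NOHZ_NEWILB_KICK_BIT	2
#define NOHZ_NEXT_KICK_BIT	3

so the nohz_flags = 3 seen in your earlier dtrace output would decode to
NOHZ_BALANCE_KICK | NOHZ_STATS_KICK, i.e. a full balance plus a stats update.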


>
> Regarding the question of sched_domain topology, this host
> has 768 CPUs and almost all (except 6) have been divided
> between 2 cpusets (one for each node). For example for node0
> CPUs we have:
>
> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
> root
> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
> 2-191,386-575
>
> and their sched_domains look like, as shown below:
>
> cpu2:
>   domain0: cpus=2,386
>   domain1: cpus=2-15,386-399
>   domain2: cpus=2-191,386-575
> cpu3:
>   domain0: cpus=3,387
>   domain1: cpus=2-15,386-399
>   domain2: cpus=2-191,386-575
> cpu4:
>   domain0: cpus=4,388
>   domain1: cpus=2-15,386-399
>   domain2: cpus=2-191,386-575
> .....
> .....
>
> Could you please suggest whether updating rq->next_balance or
> the final nohz.next_balance with some other logic could help reduce the
> CPU usage of _nohz_idle_balance(), or should we just ignore it
> because the CPU is idle anyway?

With an SMT domain, the idle load balance will be kicked every 2 ms for
each core domain. If the load balance of all cores is not aligned on
the same tick, you will have an ILB every tick whenever there is activity
on some CPUs and we need to check whether load can be pulled onto an idle
CPU. But it should be light.
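
As a rough illustration with your topology (assuming HZ=1000): the cpuset
contains about 190 SMT domains, each due to re-balance every 2 ms, so with
their phases spread uniformly roughly 190 / 2 = 95 of them come due on any
given 1 ms tick, which is why some idle rq with rq->next_balance <= now
exists on essentially every tick.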

>
> On these systems I can see that CPU 2 is doing most of this work.
> Running a perf top on CPU 2 gives numbers like:
>
>     21.69%  [kernel]       [k] __update_blocked_fair
>     11.40%  [kernel]       [k] update_load_avg
>      9.36%  [kernel]       [k] __update_load_avg_cfs_rq
>      8.07%  [kernel]       [k] update_rq_clock
>      7.09%  [kernel]       [k] __update_load_avg_se
>      4.67%  [kernel]       [k] update_irq_load_avg
>
> .....
> .....
>     22.26%  [kernel]       [k] __update_blocked_fair
>     10.89%  [kernel]       [k] update_load_avg
>      9.65%  [kernel]       [k] __update_load_avg_cfs_rq
>      7.80%  [kernel]       [k] update_rq_clock
>      7.23%  [kernel]       [k] __update_load_avg_se
>      4.76%  [kernel]       [k] update_sg_lb_stats
>
> and mpstat also shows softirq usage of around 20-25% on CPU 2 and
> most of that is due to SCHED_SOFTIRQ leading into
> _nohz_idle_balance.

The time to update the blocked loads increases with the cgroup
hierarchy because we must walk the hierarchy.

Does it generate problems for your system? As you mentioned above, if
CPU2 is idle, running such background activities should not cause
harm.

>
> Thanks,
> Imran
>
> PS: I used the following dtrace snippets to get nohz_balancer_kick
> data shown earlier and nohz_csd_func() data shown in this message.
>
> dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'
>
>
>
> fbt::nohz_csd_func:entry
> {
>     this->rq = (struct rq *)arg0;
>     this->rq_cpu = this->rq->cpu;
>     this->rq_nohz_flags = this->rq->nohz_flags.counter;
>     this->ts_ms = (unsigned long)(walltimestamp / 1000000);
>     printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
>     /*printf("[%lu] IPI received on cpu=%d\n",
>            this->ts_ms, cpu);*/
>     /*@ipi_rate[cpu] = count();*/
> }
>
> > Which kind of sched_domain topology do you have?
> >
> >
> >> +       if (flags & NOHZ_BALANCE_KICK) {
> >> +               unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> >> +
> >> +               /*
> >> +                * On large systems, there may always be some idle CPU(s) with
> >> +                * rq->next_balance close to or at current time, thus causing
> >> +                * frequent invocation of kick_ilb() from nohz_balancer_kick().
> >> +                * Adjust next_balance based on the number of idle CPUs.
> >> +                */
> >> +               nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
> >> +       }
> >>
> >>         ilb_cpu = find_new_ilb();
> >>         if (ilb_cpu < 0)
> >> --
> >> 2.34.1
> >>
>


* Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
  2026-04-24  9:46       ` Vincent Guittot
@ 2026-04-28 10:52         ` imran.f.khan
  2026-04-28 15:06           ` Vincent Guittot
  0 siblings, 1 reply; 9+ messages in thread
From: imran.f.khan @ 2026-04-28 10:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, linux-kernel

Hello Vincent,
Thanks so much for clarifying my queries.
On 24/4/2026 5:46 pm, Vincent Guittot wrote:
> On Wed, 22 Apr 2026 at 18:13, <imran.f.khan@oracle.com> wrote:
>>
>> Hello Vincent,
>> Thanks a lot for taking a look into this.
>> On 22/4/2026 3:54 pm, Vincent Guittot wrote:
>>> On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@oracle.com> wrote:
>>>>
>>>> On large scale systems, for example with 768 CPUs and cpusets consisting
>>>> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
>>>> close to or the same as now.
>>>> This causes nohz.next_balance to perpetually equal the current jiffies,
>>>> making the time-based check in nohz_balancer_kick() always fail.
>>>>
>>>> For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
>>>> we can see that nohz.next_balance is at the current jiffy on almost every tick:
>>>>
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
>>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>>>>
>>>> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
>>>> to run on almost every tick, and this in turn can consume a lot of CPU cycles in
>>>> subsequent nohz idle balancing.
>>>> So set nohz.next_balance based on the number of currently idle CPUs, such that
>>>> each doubling of the idle CPU count beyond 32 advances nohz.next_balance by one
>>>> additional jiffy. This allows nohz_balancer_kick() to bail out early.
>>>>
>>>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
>>>> ---
>>>>  kernel/sched/fair.c | 13 +++++++++++--
>>>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index ab4114712be74..bd35275a05b38 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>>>>          * Increase nohz.next_balance only when if full ilb is triggered but
>>>>          * not if we only update stats.
>>>>          */
>>>> -       if (flags & NOHZ_BALANCE_KICK)
>>>> -               nohz.next_balance = jiffies+1;
>>>
>>> This +1 only cheaply prevents multiple nohz_ilb from happening
>>> simultaneously during the current jiffies.
>>>
>>> The actual update of nohz.next_balance is done in _nohz_idle_balance()
>>> and reflects the next balance of all idle rqs. You should look at the
>>> balance interval of your sched_domains. The min interval is the weight
>>> of the sched_domain, which can be 2 at the SMT level.
>>>
>>
>> I did not look at the balance interval of the involved sched domain.
>> IIUC, once nohz.next_balance has been updated in _nohz_idle_balance(),
>> we will see that updated value in nohz_balancer_kick(), and if it is further
>> from the current jiffies, the time_before(now, nohz.next_balance) test would
>> cause nohz_balancer_kick() to bail out without updating flags, and that in
>> turn would avoid the kick_ilb() path.
> 
> yes
> 
>> Since jiffies and nohz.next_balance were appearing close or the same in
>> nohz_balancer_kick(), and I could see that CPU 2 was executing nohz_csd_func()
>> almost instantly and pretty much at the frequency of each tick (dtrace snippet
>> shown below), my conclusion was that one or more CPUs in the sched domain of
>> CPU 2 must have had their rq->next_balance close to or the same as the current
>> jiffies.
> 
> Yes
> 
>>
>> ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3
>>
>> Could you please let me know if this understanding is incorrect?
> 
> yes, it is correct.
> 
> The ILB is kicked for several reasons:
> - NOHZ_BALANCE_KICK : periodic load balance based on the
> balance_interval of each sched_domain
> - NOHZ_STATS_KICK: update of statistics i.e. decaying the blocked load
> - NOHZ_NEXT_KICK: loop on idle cpu to update nohz.next_balance when a
> cpu becomes idle.
> 
> NOHZ_NEXT_KICK and NOHZ_STATS_KICK can be set independently for
> "cheap" idle load balance
> 
> and NOHZ_STATS_KICK is set whenever NOHZ_BALANCE_KICK is set, to take
> advantage of the ILB to update the blocked load instead of kicking
> another one just for updating the stats.
> 
> 
>>
>> Regarding the question of sched_domain topology, this host
>> has 768 CPUs and almost all (except 6) have been divided
>> between 2 cpusets (one for each node). For example for node0
>> CPUs we have:
>>
>> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
>> root
>> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
>> 2-191,386-575
>>
>> and their sched_domains look like, as shown below:
>>
>> cpu2:
>>   domain0: cpus=2,386
>>   domain1: cpus=2-15,386-399
>>   domain2: cpus=2-191,386-575
>> cpu3:
>>   domain0: cpus=3,387
>>   domain1: cpus=2-15,386-399
>>   domain2: cpus=2-191,386-575
>> cpu4:
>>   domain0: cpus=4,388
>>   domain1: cpus=2-15,386-399
>>   domain2: cpus=2-191,386-575
>> .....
>> .....
>>
>> Could you please suggest whether updating rq->next_balance or
>> the final nohz.next_balance with some other logic could help reduce the
>> CPU usage of _nohz_idle_balance(), or should we just ignore it
>> because the CPU is idle anyway?
> 
> With SMT domain, the idle load balance will be kicked every 2 ms for
> each core domain. If the load balance of all cores is not aligned on
> the same tick, you will have an ILB every tick if there are activities
> on some CPUs and we need to check whether it can be pulled on an idle
> CPU. But it should be light
> 
>>
>> On these systems I can see that CPU 2 is doing most of this work.
>> Running a perf top on CPU 2 gives numbers like:
>>
>>     21.69%  [kernel]       [k] __update_blocked_fair
>>     11.40%  [kernel]       [k] update_load_avg
>>      9.36%  [kernel]       [k] __update_load_avg_cfs_rq
>>      8.07%  [kernel]       [k] update_rq_clock
>>      7.09%  [kernel]       [k] __update_load_avg_se
>>      4.67%  [kernel]       [k] update_irq_load_avg
>>
>> .....
>> .....
>>     22.26%  [kernel]       [k] __update_blocked_fair
>>     10.89%  [kernel]       [k] update_load_avg
>>      9.65%  [kernel]       [k] __update_load_avg_cfs_rq
>>      7.80%  [kernel]       [k] update_rq_clock
>>      7.23%  [kernel]       [k] __update_load_avg_se
>>      4.76%  [kernel]       [k] update_sg_lb_stats
>>
>> and mpstat also shows softirq usage of around 20-25% on CPU 2 and
>> most of that is due to SCHED_SOFTIRQ leading into
>> _nohz_idle_balance.
> 
> The time to update the blocked loads increases with the cgroup
> hierarchy because we must walk the hierarchy.
> 
> Does it generate problems for your system? As you mentioned above, if
> CPU2 is idle, running such background activities should not cause
> harm.
> 

No, it's not causing any issues. Does this mean that the second patch of this
set can be dropped as well? I could see that despite multiple CPUs being idle
in this domain, it was CPU 2 that was doing the nohz idle balance most of the time.

Thanks,
Imran
 
>>
>> Thanks,
>> Imran
>>
>> PS: I used the following dtrace snippets to get nohz_balancer_kick
>> data shown earlier and nohz_csd_func() data shown in this message.
>>
>> dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'
>>
>>
>>
>> fbt::nohz_csd_func:entry
>> {
>>     this->rq = (struct rq *)arg0;
>>     this->rq_cpu = this->rq->cpu;
>>     this->rq_nohz_flags = this->rq->nohz_flags.counter;
>>     this->ts_ms = (unsigned long)(walltimestamp / 1000000);
>>     printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
>>     /*printf("[%lu] IPI received on cpu=%d\n",
>>            this->ts_ms, cpu);*/
>>     /*@ipi_rate[cpu] = count();*/
>> }
>>
>>> Which kind of sched_domain topology do you have?
>>>
>>>
>>>> +       if (flags & NOHZ_BALANCE_KICK) {
>>>> +               unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
>>>> +
>>>> +               /*
>>>> +                * On large systems, there may always be some idle CPU(s) with
>>>> +                * rq->next_balance close to or at current time, thus causing
>>>> +                * frequent invocation of kick_ilb() from nohz_balancer_kick().
>>>> +                * Adjust next_balance based on the number of idle CPUs.
>>>> +                */
>>>> +               nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
>>>> +       }
>>>>
>>>>         ilb_cpu = find_new_ilb();
>>>>         if (ilb_cpu < 0)
>>>> --
>>>> 2.34.1
>>>>
>>



* Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
  2026-04-28 10:52         ` imran.f.khan
@ 2026-04-28 15:06           ` Vincent Guittot
  0 siblings, 0 replies; 9+ messages in thread
From: Vincent Guittot @ 2026-04-28 15:06 UTC (permalink / raw)
  To: imran.f.khan
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, linux-kernel

On Tue, 28 Apr 2026 at 12:53, <imran.f.khan@oracle.com> wrote:
>
> Hello Vincent,
> Thanks so much for clarifying my queries.
> On 24/4/2026 5:46 pm, Vincent Guittot wrote:
> > On Wed, 22 Apr 2026 at 18:13, <imran.f.khan@oracle.com> wrote:
> >>
> >> Hello Vincent,
> >> Thanks a lot for taking a look into this.
> >> On 22/4/2026 3:54 pm, Vincent Guittot wrote:
> >>> On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@oracle.com> wrote:
> >>>>
> >>>> On large scale systems, for example with 768 CPUs and cpusets consisting
> >>>> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> >>>> close to or the same as now.
> >>>> This causes nohz.next_balance to perpetually equal the current jiffies,
> >>>> making the time-based check in nohz_balancer_kick() always fail.
> >>>>
> >>>> For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
> >>>> we can see that nohz.next_balance is at the current jiffy on almost every tick:
> >>>>
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> >>>> 447 9536  nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
> >>>>
> >>>> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> >>>> to run on almost every tick, and this in turn can consume a lot of CPU cycles in
> >>>> subsequent nohz idle balancing.
> >>>> So set nohz.next_balance based on the number of currently idle CPUs, such that
> >>>> each doubling of the idle CPU count beyond 32 advances nohz.next_balance by one
> >>>> additional jiffy. This allows nohz_balancer_kick() to bail out early.
> >>>>
> >>>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> >>>> ---
> >>>>  kernel/sched/fair.c | 13 +++++++++++--
> >>>>  1 file changed, 11 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index ab4114712be74..bd35275a05b38 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
> >>>>          * Increase nohz.next_balance only when if full ilb is triggered but
> >>>>          * not if we only update stats.
> >>>>          */
> >>>> -       if (flags & NOHZ_BALANCE_KICK)
> >>>> -               nohz.next_balance = jiffies+1;
> >>>
> >>> This +1 only cheaply prevents multiple nohz_ilb from happening
> >>> simultaneously during the current jiffies.
> >>>
> >>> The actual update of nohz.next_balance is done in _nohz_idle_balance()
> >>> and reflects the next balance of all idle rqs. You should look at the
> >>> balance interval of your sched_domains. The min interval is the weight
> >>> of the sched_domain, which can be 2 at the SMT level.
> >>>
> >>
> >> I did not look at the balance interval of the involved sched domain.
> >> IIUC, once nohz.next_balance has been updated in _nohz_idle_balance(),
> >> we will see that updated value in nohz_balancer_kick(), and if it is further
> >> from the current jiffies, the time_before(now, nohz.next_balance) test would
> >> cause nohz_balancer_kick() to bail out without updating flags, and that in
> >> turn would avoid the kick_ilb() path.
> >
> > yes
> >
> >> Since jiffies and nohz.next_balance were appearing close or the same in
> >> nohz_balancer_kick(), and I could see that CPU 2 was executing nohz_csd_func()
> >> almost instantly and pretty much at the frequency of each tick (dtrace snippet
> >> shown below), my conclusion was that one or more CPUs in the sched domain of
> >> CPU 2 must have had their rq->next_balance close to or the same as the current
> >> jiffies.
> >
> > Yes
> >
> >>
> >> ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3
> >>
> >> Could you please let me know if this understanding is incorrect?
> >
> > yes, it is correct.
> >
> > The ILB is kicked for several reasons:
> > - NOHZ_BALANCE_KICK : periodic load balance based on the
> > balance_interval of each sched_domain
> > - NOHZ_STATS_KICK: update of statistics i.e. decaying the blocked load
> > - NOHZ_NEXT_KICK: loop on idle cpu to update nohz.next_balance when a
> > cpu becomes idle.
> >
> > NOHZ_NEXT_KICK and NOHZ_STATS_KICK can be set independently for
> > "cheap" idle load balance
> >
> > and NOHZ_STATS_KICK is set whenever NOHZ_BALANCE_KICK is set, to take
> > advantage of the ILB to update the blocked load instead of kicking
> > another one just for updating the stats.
> >
> >
> >>
> >> Regarding the question of sched_domain topology, this host
> >> has 768 CPUs and almost all (except 6) have been divided
> >> between 2 cpusets (one for each node). For example for node0
> >> CPUs we have:
> >>
> >> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
> >> root
> >> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
> >> 2-191,386-575
> >>
> >> and their sched_domains look like, as shown below:
> >>
> >> cpu2:
> >>   domain0: cpus=2,386
> >>   domain1: cpus=2-15,386-399
> >>   domain2: cpus=2-191,386-575
> >> cpu3:
> >>   domain0: cpus=3,387
> >>   domain1: cpus=2-15,386-399
> >>   domain2: cpus=2-191,386-575
> >> cpu4:
> >>   domain0: cpus=4,388
> >>   domain1: cpus=2-15,386-399
> >>   domain2: cpus=2-191,386-575
> >> .....
> >> .....
> >>
> >> Could you please suggest whether updating rq->next_balance or
> >> the final nohz.next_balance with some other logic could help reduce the
> >> CPU usage of _nohz_idle_balance(), or should we just ignore it
> >> because the CPU is idle anyway?
> >
> > With SMT domain, the idle load balance will be kicked every 2 ms for
> > each core domain. If the load balance of all cores is not aligned on
> > the same tick, you will have an ILB every tick if there are activities
> > on some CPUs and we need to check whether it can be pulled on an idle
> > CPU. But it should be light
> >
> >>
> >> On these systems I can see that CPU 2 is doing most of this work.
> >> Running a perf top on CPU 2 gives numbers like:
> >>
> >>     21.69%  [kernel]       [k] __update_blocked_fair
> >>     11.40%  [kernel]       [k] update_load_avg
> >>      9.36%  [kernel]       [k] __update_load_avg_cfs_rq
> >>      8.07%  [kernel]       [k] update_rq_clock
> >>      7.09%  [kernel]       [k] __update_load_avg_se
> >>      4.67%  [kernel]       [k] update_irq_load_avg
> >>
> >> .....
> >> .....
> >>     22.26%  [kernel]       [k] __update_blocked_fair
> >>     10.89%  [kernel]       [k] update_load_avg
> >>      9.65%  [kernel]       [k] __update_load_avg_cfs_rq
> >>      7.80%  [kernel]       [k] update_rq_clock
> >>      7.23%  [kernel]       [k] __update_load_avg_se
> >>      4.76%  [kernel]       [k] update_sg_lb_stats
> >>
> >> and mpstat also shows softirq usage of around 20-25% on CPU 2 and
> >> most of that is due to SCHED_SOFTIRQ leading into
> >> _nohz_idle_balance.
> >
> > The time to update the blocked loads increases with the cgroup
> > hierarchy because we must walk the hierarchy.
> >
> > Does it generate problems for your system? As you mentioned above, if
> > CPU2 is idle, running such background activities should not cause
> > harm.
> >
>
> No, it's not causing any issues. Does this mean that the second patch of this
> set can be dropped as well? I could see that despite multiple CPUs being idle
> in this domain, it was CPU 2 that was doing the nohz idle balance most of the time.

Yes, we can drop patch 2 as well. The fact that CPU2 handles most of
the nohz idle balance is not a problem by itself.

Vincent

>
> Thanks,
> Imran
>
> >>
> >> Thanks,
> >> Imran
> >>
> >> PS: I used the following dtrace snippets to get nohz_balancer_kick
> >> data shown earlier and nohz_csd_func() data shown in this message.
> >>
> >> dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'
> >>
> >>
> >>
> >> fbt::nohz_csd_func:entry
> >> {
> >>     this->rq = (struct rq *)arg0;
> >>     this->rq_cpu = this->rq->cpu;
> >>     this->rq_nohz_flags = this->rq->nohz_flags.counter;
> >>     this->ts_ms = (unsigned long)(walltimestamp / 1000000);
> >>     printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
> >>     /*printf("[%lu] IPI received on cpu=%d\n",
> >>            this->ts_ms, cpu);*/
> >>     /*@ipi_rate[cpu] = count();*/
> >> }
> >>
> >>> Which kind of sched_domain topology do you have?
> >>>
> >>>
> >>>> +       if (flags & NOHZ_BALANCE_KICK) {
> >>>> +               unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> >>>> +
> >>>> +               /*
> >>>> +                * On large systems, there may always be some idle CPU(s) with
> >>>> +                * rq->next_balance close to or at current time, thus causing
> >>>> +                * frequent invocation of kick_ilb() from nohz_balancer_kick().
> >>>> +                * Adjust next_balance based on the number of idle CPUs.
> >>>> +                */
> >>>> +               nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
> >>>> +       }
> >>>>
> >>>>         ilb_cpu = find_new_ilb();
> >>>>         if (ilb_cpu < 0)
> >>>> --
> >>>> 2.34.1
> >>>>
> >>
>

