linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
@ 2025-08-19  2:57 Adam Li
  2025-08-19 14:00 ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Adam Li @ 2025-08-19  2:57 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, cl,
	frederic, linux-kernel, patches, Adam Li

A qualified CPU to run NOHZ idle load balancing (ILB) has to be:
1) housekeeping CPU in housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
2) and in nohz.idle_cpus_mask
3) and idle
4) and not current CPU

If most CPUs are in the nohz_full CPU list, few housekeeping CPUs are left.
In the worst case, if all CPUs are in nohz_full, only the boot CPU is used
for housekeeping. The housekeeping CPU is usually busier, so it is unlikely
to be added to nohz.idle_cpus_mask.

Therefore, if there are few housekeeping CPUs, find_new_ilb() is likely to
fail to find any CPU to do NOHZ idle load balancing. Some NOHZ CPUs may
stay idle while other CPUs are busy.

This patch adds fallback options when looking for an ILB CPU if no CPU
meets the above requirements. It then searches in the following order:
1) Try looking for the first idle housekeeping CPU
2) Try looking for the first idle CPU in nohz.idle_cpus_mask if no SMT.
3) Select the first housekeeping CPU even if it is busy.

With this patch the NOHZ idle balancing happens more frequently.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
---
 kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..12bcc3f81f9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12194,19 +12194,45 @@ static inline int on_null_domain(struct rq *rq)
 static inline int find_new_ilb(void)
 {
 	const struct cpumask *hk_mask;
-	int ilb_cpu;
+	struct cpumask ilb_mask;
+	int ilb_cpu, this_cpu = smp_processor_id();
 
 	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
 
-	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
+	/*
+	 * Look for an idle cpu who is both NOHZ_idle and housekeeping.
+	 * If no such cpu, look for an idle housekeeping cpu.
+	 */
+	if (!cpumask_and(&ilb_mask, nohz.idle_cpus_mask, hk_mask))
+		cpumask_copy(&ilb_mask, hk_mask);
 
-		if (ilb_cpu == smp_processor_id())
+	for_each_cpu(ilb_cpu, &ilb_mask) {
+		if (ilb_cpu == this_cpu)
 			continue;
 
 		if (idle_cpu(ilb_cpu))
 			return ilb_cpu;
 	}
 
+	/*
+	 * If CPU has no SMT, look for an idle NOHZ_idle cpu.
+	 * Run NOHZ ILB may cause jitter on SMT sibling CPU.
+	 */
+	if (!sched_smt_active()) {
+		for_each_cpu(ilb_cpu, nohz.idle_cpus_mask) {
+			if (ilb_cpu == this_cpu)
+				continue;
+
+			if (idle_cpu(ilb_cpu))
+				return ilb_cpu;
+		}
+	}
+
+	/* Select the first housekeeping cpu anyway. */
+	ilb_cpu = cpumask_first(hk_mask);
+	if (ilb_cpu < nr_cpu_ids)
+		return ilb_cpu;
+
 	return -1;
 }
 
-- 
2.34.1
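
The search order listed in the commit message can be modelled in user space.
The following is a minimal sketch, not kernel code: `struct ilb_model`,
`first_cpu_except()` and `model_find_new_ilb()` are hypothetical names, CPUs
are represented as bits in a 64-bit word, and the steps follow the order given
in the commit message rather than the exact control flow of the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model: one bit per CPU, at most 64 CPUs. */
struct ilb_model {
	uint64_t hk_mask;        /* housekeeping CPUs */
	uint64_t nohz_idle_mask; /* stands in for nohz.idle_cpus_mask */
	uint64_t idle_mask;      /* CPUs for which idle_cpu() would be true */
	bool smt_active;         /* stands in for sched_smt_active() */
};

/* Return the lowest CPU set in mask other than skip, or -1 if none. */
static int first_cpu_except(uint64_t mask, int skip)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		if (cpu != skip && (mask & (UINT64_C(1) << cpu)))
			return cpu;
	}
	return -1;
}

static int model_find_new_ilb(const struct ilb_model *m, int this_cpu)
{
	int cpu;

	assert(m != NULL);

	/* Original rule: idle, NOHZ-idle housekeeping CPU, not the current CPU. */
	cpu = first_cpu_except(m->hk_mask & m->nohz_idle_mask & m->idle_mask,
			       this_cpu);
	if (cpu >= 0)
		return cpu;

	/* Fallback 1: any idle housekeeping CPU. */
	cpu = first_cpu_except(m->hk_mask & m->idle_mask, this_cpu);
	if (cpu >= 0)
		return cpu;

	/* Fallback 2: without SMT, any idle CPU in the NOHZ idle mask. */
	if (!m->smt_active) {
		cpu = first_cpu_except(m->nohz_idle_mask & m->idle_mask,
				       this_cpu);
		if (cpu >= 0)
			return cpu;
	}

	/* Fallback 3: first housekeeping CPU, even if busy (may be this_cpu). */
	return first_cpu_except(m->hk_mask, -1);
}
```

With a single busy housekeeping CPU 0 and idle NOHZ CPUs 2 and 3, the model
picks CPU 2 when SMT is off and falls back to CPU 0 when SMT is on.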



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-19  2:57 [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU Adam Li
@ 2025-08-19 14:00 ` Valentin Schneider
  2025-08-20  3:35   ` Adam Li
  0 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2025-08-19 14:00 UTC (permalink / raw)
  To: Adam Li, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, cl, frederic,
	linux-kernel, patches, Adam Li

On 19/08/25 02:57, Adam Li wrote:
> A qualified CPU to run NOHZ idle load balancing (ILB) has to be:
> 1) housekeeping CPU in housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
> 2) and in nohz.idle_cpus_mask
> 3) and idle
> 4) and not current CPU
>
> If most CPUs are in nohz_full CPU list there is few housekeeping CPU left.
> In the worst case if all CPUs are in nohz_full only the boot CPU is used
> for housekeeping. And the housekeeping CPU is usually busier so it will
> be unlikely added to nohz.idle_cpus_mask.
>
> Therefore if there is few housekeeping CPUs, find_new_ilb() may likely
> failed to find any CPU to do NOHZ idle load balancing. Some NOHZ CPUs may
> stay idle while other CPUs are busy.
>
> This patch adds fallback options when looking for ILB CPU if there is
> no CPU meeting above requirements. Then it searches in bellow order:
> 1) Try looking for the first idle housekeeping CPU
> 2) Try looking for the first idle CPU in nohz.idle_cpus_mask if no SMT.
> 3) Select the first housekeeping CPU even if it is busy.
>
> With this patch the NOHZ idle balancing happens more frequently.
>

I'm not understanding why, in the scenarios outlined above, more NOHZ idle
balancing is a good thing.

Considering only housekeeping CPUs, they're all covered by wakeup, periodic
and idle balancing (on top of NOHZ idle balancing when relevant). So if
find_new_ilb() never finds a NOHZ-idle CPU, then that means your HK CPUs
are either always busy or never stopping the tick when going idle, IOW they
always have some work to do within a jiffy boundary.

Am I missing something?



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-19 14:00 ` Valentin Schneider
@ 2025-08-20  3:35   ` Adam Li
  2025-08-20  8:43     ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Adam Li @ 2025-08-20  3:35 UTC (permalink / raw)
  To: Valentin Schneider, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, cl, frederic,
	linux-kernel, patches

Hi Valentin,

Thanks for your comments.
On 8/19/2025 10:00 PM, Valentin Schneider wrote:
> On 19/08/25 02:57, Adam Li wrote:
>> A qualified CPU to run NOHZ idle load balancing (ILB) has to be:
>> 1) housekeeping CPU in housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
>> 2) and in nohz.idle_cpus_mask
>> 3) and idle
>> 4) and not current CPU
>>
>> If most CPUs are in nohz_full CPU list there is few housekeeping CPU left.
>> In the worst case if all CPUs are in nohz_full only the boot CPU is used
>> for housekeeping. And the housekeeping CPU is usually busier so it will
>> be unlikely added to nohz.idle_cpus_mask.
>>
>> Therefore if there is few housekeeping CPUs, find_new_ilb() may likely
>> failed to find any CPU to do NOHZ idle load balancing. Some NOHZ CPUs may
>> stay idle while other CPUs are busy.
>>
>> This patch adds fallback options when looking for ILB CPU if there is
>> no CPU meeting above requirements. Then it searches in bellow order:
>> 1) Try looking for the first idle housekeeping CPU
>> 2) Try looking for the first idle CPU in nohz.idle_cpus_mask if no SMT.
>> 3) Select the first housekeeping CPU even if it is busy.
>>
>> With this patch the NOHZ idle balancing happens more frequently.
>>
> 
> I'm not understanding why, in the scenarios outlined above, more NOHZ idle
> balancing is a good thing.
> 
> Considering only housekeeping CPUs, they're all covered by wakeup, periodic
> and idle balancing (on top of NOHZ idle balancing when relevant). So if
> find_new_ilb() never finds a NOHZ-idle CPU, then that means your HK CPUs
> are either always busy or never stopping the tick when going idle, IOW they
> always have some work to do within a jiffy boundary.
>
> Am I missing something?
>

I agree with your description of the housekeeping CPUs. In the worst case,
the system has only one housekeeping CPU, and this housekeeping CPU is so busy
that:
1) This housekeeping CPU is unlikely to be idle;
2) and this housekeeping CPU is unlikely to be in 'nohz.idle_cpus_mask' because
its tick is not stopped.
Therefore find_new_ilb() will very likely return -1. *No* CPU can be selected
to do NOHZ idle load balancing.

This patch tries to fix the imbalance of NOHZ idle CPUs (CPUs in nohz.idle_cpus_mask).
Here is more background:

When running llama on an arm64 server, some CPUs *stay* idle while others
are 100% busy. All CPUs are in the 'nohz_full=' CPU list, and CONFIG_NO_HZ_FULL
is set.

The problem is caused by two issues:
1) Some idle CPUs cannot be added to 'nohz.idle_cpus_mask',
this bug is fixed by another patch:
https://lore.kernel.org/all/20250815065115.289337-2-adamli@os.amperecomputing.com/

2) Even if the idle CPUs are in 'nohz.idle_cpus_mask', *no* CPU can be selected to
do NOHZ idle load balancing because the conditions in find_new_ilb() are too strict.
This patch tries to solve this issue.

Hope this information helps.

Thanks,
-adam

 




* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-20  3:35   ` Adam Li
@ 2025-08-20  8:43     ` Valentin Schneider
  2025-08-20 11:05       ` Adam Li
  2025-08-20 17:31       ` Christoph Lameter (Ampere)
  0 siblings, 2 replies; 14+ messages in thread
From: Valentin Schneider @ 2025-08-20  8:43 UTC (permalink / raw)
  To: Adam Li, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, cl, frederic,
	linux-kernel, patches

On 20/08/25 11:35, Adam Li wrote:
> On 8/19/2025 10:00 PM, Valentin Schneider wrote:
>>
>> I'm not understanding why, in the scenarios outlined above, more NOHZ idle
>> balancing is a good thing.
>>
>> Considering only housekeeping CPUs, they're all covered by wakeup, periodic
>> and idle balancing (on top of NOHZ idle balancing when relevant). So if
>> find_new_ilb() never finds a NOHZ-idle CPU, then that means your HK CPUs
>> are either always busy or never stopping the tick when going idle, IOW they
>> always have some work to do within a jiffy boundary.
>>
>> Am I missing something?
>>
>
> I agree with your description about the housekeeping CPUs. In the worst case,
> the system only has one housekeeping CPU and this housekeeping CPU is so busy
> that:
> 1) This housekeeping CPU is unlikely idle;
> 2) and this housekeeping CPU is unlikely in 'nohz.idle_cpus_mask' because tick
> is not stopped.
> Therefore find_new_ilb() may very likely return -1. *No* CPU can be selected
> to do NOHZ idle load balancing.
>
> This patch tries to fix the imbalance of NOHZ idle CPUs (CPUs in nohz.idle_cpus_mask).
> Here is more background:
>
> When running llama on arm64 server, some CPUs *keep* idle while others
> are 100% busy. All CPUs are in 'nohz_full=' cpu list, and CONFIG_NO_HZ_FULL
> is set.
>

I assume you mean all but one CPU is in 'nohz_full=' since you need at
least one housekeeping CPU. But in that case this becomes a slightly
different problem, since no CPU in 'nohz_full' will be in

  housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)

> The problem is caused by two issues:
> 1) Some idle CPUs cannot be added to 'nohz.idle_cpus_mask',
> this bug is fixed by another patch:
> https://lore.kernel.org/all/20250815065115.289337-2-adamli@os.amperecomputing.com/
>
> 2) Even if the idle CPUs are in 'nohz.idle_cpus_mask', *no* CPU can be selected to
> do NOHZ idle load balancing because conditions in find_new_ilb() is too strict.
> This patch tries to solve this issue.
>
> Hope this information helps.
>

I hadn't seen that patch; that cclist is quite small, you'll want to add
the scheduler people to your next submission.

So IIUC:
- Pretty much all your CPUs are NOHZ_FULL
- When they go idle they remain so for a while despite work being available

My first question would be: is NOHZ_FULL really right for your workload?
It's mainly designed to be used with always-running userspace tasks,
generally affined to a CPU by the system administrator.
Here AIUI you're relying on the scheduler load balancing to distribute work
to the NOHZ_FULL CPUs, so you're going to be penalized a lot by the
NOHZ_FULL context switch overheads. What's the point? Wouldn't you have
less overhead with just NOHZ_IDLE?

As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
won't do the periodic balance; the residual 1Hz remote tick doesn't do that
either. But they should still do the newidle balance to pull work before
going tickless idle, and wakeup balance should help as well, albeit that
also depends on your topology.

Could you share your system topology and your actual nohz_full cmdline?

> Thanks,
> -adam
>
>



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-20  8:43     ` Valentin Schneider
@ 2025-08-20 11:05       ` Adam Li
  2025-08-20 11:46         ` Valentin Schneider
  2025-08-20 17:31       ` Christoph Lameter (Ampere)
  1 sibling, 1 reply; 14+ messages in thread
From: Adam Li @ 2025-08-20 11:05 UTC (permalink / raw)
  To: Valentin Schneider, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, cl, frederic,
	linux-kernel, patches

On 8/20/2025 4:43 PM, Valentin Schneider wrote:
> On 20/08/25 11:35, Adam Li wrote:
>> On 8/19/2025 10:00 PM, Valentin Schneider wrote:
>>>
>>> I'm not understanding why, in the scenarios outlined above, more NOHZ idle
>>> balancing is a good thing.
>>>
>>> Considering only housekeeping CPUs, they're all covered by wakeup, periodic
>>> and idle balancing (on top of NOHZ idle balancing when relevant). So if
>>> find_new_ilb() never finds a NOHZ-idle CPU, then that means your HK CPUs
>>> are either always busy or never stopping the tick when going idle, IOW they
>>> always have some work to do within a jiffy boundary.
>>>
>>> Am I missing something?
>>>
>>
>> I agree with your description about the housekeeping CPUs. In the worst case,
>> the system only has one housekeeping CPU and this housekeeping CPU is so busy
>> that:
>> 1) This housekeeping CPU is unlikely idle;
>> 2) and this housekeeping CPU is unlikely in 'nohz.idle_cpus_mask' because tick
>> is not stopped.
>> Therefore find_new_ilb() may very likely return -1. *No* CPU can be selected
>> to do NOHZ idle load balancing.
>>
>> This patch tries to fix the imbalance of NOHZ idle CPUs (CPUs in nohz.idle_cpus_mask).
>> Here is more background:
>>
>> When running llama on arm64 server, some CPUs *keep* idle while others
>> are 100% busy. All CPUs are in 'nohz_full=' cpu list, and CONFIG_NO_HZ_FULL
>> is set.
>>
> 
> I assume you mean all but one CPU is in 'nohz_full=' since you need at
> least one housekeeping CPU. But in that case this becomes a slightly
> different problem, since no CPU in 'nohz_full' will be in
> 
>   housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
> 

I ran the llama workload on a system with 192 CPUs. I set "nohz_full=0-191" so all
CPUs are in the 'nohz_full' list. In this case, the kernel uses the boot CPU for
housekeeping:

Kernel message: "Housekeeping: must include one present CPU, using boot CPU:0"

find_new_ilb() looks for a qualified CPU among the housekeeping CPUs. The search
is likely to fail if there is only one housekeeping CPU.
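
To check how many CPUs a 'nohz_full=' list actually covers, a CPU list string
such as "0-191" can be counted in user space. The sketch below is only an
illustration: `cpulist_count()` and `nohz_full_cpu_count()` are hypothetical
helpers, only plain "a" and "a-b" entries separated by commas are handled
(no stride or group syntax), and it assumes the running kernel exposes the
list in /sys/devices/system/cpu/nohz_full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Count the CPUs named by a list such as "0-191" or "1-3,8,10-11".
 * Returns -1 if the string cannot be parsed. An empty string counts as 0.
 */
static int cpulist_count(const char *s)
{
	int count = 0;

	assert(s != NULL);

	while (*s && *s != '\n') {
		char *end;
		long lo, hi;

		lo = strtol(s, &end, 10);
		if (end == s || lo < 0)
			return -1;
		hi = lo;
		s = end;

		if (*s == '-') {
			s++;
			hi = strtol(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
			s = end;
		}
		count += (int)(hi - lo + 1);

		if (*s == ',')
			s++;
		else if (*s && *s != '\n')
			return -1;
	}
	return count;
}

/* Read the nohz_full list from sysfs; returns -1 if it is unavailable. */
static int nohz_full_cpu_count(void)
{
	char buf[4096];
	FILE *f = fopen("/sys/devices/system/cpu/nohz_full", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return cpulist_count(buf);
}
```

For the setup described here, `cpulist_count("0-191")` returns 192. The boot
CPU is forced into housekeeping by the kernel, so the sysfs list on such a
system may differ from what was passed on the command line.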

>> The problem is caused by two issues:
>> 1) Some idle CPUs cannot be added to 'nohz.idle_cpus_mask',
>> this bug is fixed by another patch:
>> https://lore.kernel.org/all/20250815065115.289337-2-adamli@os.amperecomputing.com/
>>
>> 2) Even if the idle CPUs are in 'nohz.idle_cpus_mask', *no* CPU can be selected to
>> do NOHZ idle load balancing because conditions in find_new_ilb() is too strict.
>> This patch tries to solve this issue.
>>
>> Hope this information helps.
>>
> 
> I hadn't seen that patch; that cclist is quite small, you'll want to add
> the scheduler people to our next submission.
> 

Sure. The first patch involves both the 'tick' and 'scheduler' subsystems. I can resend
the first patch to a broader set of reviewers if you don't mind.

> So IIUC:
> - Pretty much all your CPUs are NOHZ_FULL
> - When they go idle they remain so for a while despite work being available
> 

Exactly.

> My first question would be: is NOHZ_FULL really right for your workload?
> It's mainly designed to be used with always-running userspace tasks,
> generally affined to a CPU by the system administrator.
> Here AIUI you're relying on the scheduler load balancing to distribute work
> to the NOHZ_FULL CPUs, so you're going to be penalized a lot by the
> NOHZ_FULL context switch overheads. What's the point? Wouldn't you have
> less overhead with just NOHZ_IDLE?
> 

I ran the llama workload to do 'Large Language Model' inference.
The workload creates 'always-running userspace' threads doing math computation.
There are *few* sleeps, wakeups and context switches. IIUC NOHZ_IDLE cannot help
an always-running task?

'nohz_full' option is supposed to benefit performance by reducing kernel
noise I think. Could you please give more detail on
'NOHZ_FULL context switch overhead'?
  
> As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
> won't do the periodic balance; the residual 1Hz remote tick doesn't do that
> either. But they should still do the newidle balance to pull work before
> going tickless idle, and wakeup balance should help as well, albeit that
> also depends on your topology.
>

I think the newidle balance and wakeup balance do not help in this case
because the workload has few sleeps and wakeups.
 
> Could you share your system topology and your actual nohz_full cmdline?
>

The system has 192 CPUs. I set "nohz_full=0-191".

Thanks,
-adam


* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-20 11:05       ` Adam Li
@ 2025-08-20 11:46         ` Valentin Schneider
  2025-08-21 11:18           ` Adam Li
  0 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2025-08-20 11:46 UTC (permalink / raw)
  To: Adam Li, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, cl, frederic,
	linux-kernel, patches

On 20/08/25 19:05, Adam Li wrote:
> On 8/20/2025 4:43 PM, Valentin Schneider wrote:
>> On 20/08/25 11:35, Adam Li wrote:
>>> I agree with your description about the housekeeping CPUs. In the worst case,
>>> the system only has one housekeeping CPU and this housekeeping CPU is so busy
>>> that:
>>> 1) This housekeeping CPU is unlikely idle;
>>> 2) and this housekeeping CPU is unlikely in 'nohz.idle_cpus_mask' because tick
>>> is not stopped.
>>> Therefore find_new_ilb() may very likely return -1. *No* CPU can be selected
>>> to do NOHZ idle load balancing.
>>>
>>> This patch tries to fix the imbalance of NOHZ idle CPUs (CPUs in nohz.idle_cpus_mask).
>>> Here is more background:
>>>
>>> When running llama on arm64 server, some CPUs *keep* idle while others
>>> are 100% busy. All CPUs are in 'nohz_full=' cpu list, and CONFIG_NO_HZ_FULL
>>> is set.
>>>
>>
>> I assume you mean all but one CPU is in 'nohz_full=' since you need at
>> least one housekeeping CPU. But in that case this becomes a slightly
>> different problem, since no CPU in 'nohz_full' will be in
>>
>>   housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
>>
>
> I ran llama workload on a system with 192 CPUs. I set "nohz_full=0-191" so all CPUs
> are in 'nohz_full' list. In this case, kernel uses the boot CPU for housekeeping:
>
> Kernel message: "Housekeeping: must include one present CPU, using boot CPU:0"
>
> find_new_ilb() looks for qualified CPU from housekeeping CPUs. The searching
> is likely to fail if there is only one housekeeping CPU.
>

Right

>>> The problem is caused by two issues:
>>> 1) Some idle CPUs cannot be added to 'nohz.idle_cpus_mask',
>>> this bug is fixed by another patch:
>>> https://lore.kernel.org/all/20250815065115.289337-2-adamli@os.amperecomputing.com/
>>>
>>> 2) Even if the idle CPUs are in 'nohz.idle_cpus_mask', *no* CPU can be selected to
>>> do NOHZ idle load balancing because conditions in find_new_ilb() is too strict.
>>> This patch tries to solve this issue.
>>>
>>> Hope this information helps.
>>>
>>
>> I hadn't seen that patch; that cclist is quite small, you'll want to add
>> the scheduler people to our next submission.
>>
>
> Sure. The first patch involves both 'tick' and 'scheduler' subsystem. I can resend
> the first patch to broader reviewers if you don't mind.
>

I'd say resend the whole series with the right folks cc'd.

>> So IIUC:
>> - Pretty much all your CPUs are NOHZ_FULL
>> - When they go idle they remain so for a while despite work being available
>>
>
> Exactly.
>
>> My first question would be: is NOHZ_FULL really right for your workload?
>> It's mainly designed to be used with always-running userspace tasks,
>> generally affined to a CPU by the system administrator.
>> Here AIUI you're relying on the scheduler load balancing to distribute work
>> to the NOHZ_FULL CPUs, so you're going to be penalized a lot by the
>> NOHZ_FULL context switch overheads. What's the point? Wouldn't you have
>> less overhead with just NOHZ_IDLE?
>>
>
> I ran the llama workload to do 'Large Language Model' reference.
> The workload creates 'always-running userspace' threads doing math computing.
> There is *few* sleep, wakeup and context switch. IIUC NOHZ_IDLE cannot help
> always-running task?
>

Right, NOHZ_IDLE is really about power savings while a CPU is idle (and
IIRC it helps some virtualization cases).

> 'nohz_full' option is supposed to benefit performance by reducing kernel
> noise I think. Could you please give more detail on
> 'NOHZ_FULL context switch overhead'?
>

The doc briefly touches on that:

  https://docs.kernel.org/timers/no_hz.html#omit-scheduling-clock-ticks-for-cpus-with-only-one-runnable-task

The longer story: have a look at kernel/context_tracking.c; every
transition into and out of the kernel to and from user or idle requires
additional atomic operations and synchronization.

It would be worth quantifying how much these processes
sleep/context switch; it could be that keeping the tick enabled incurs a lower
throughput penalty than the NO_HZ_FULL overheads.
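
One way to get a first number for this is getrusage(2), which reports
voluntary and involuntary context switches. The following is a minimal sketch
under that assumption; `struct ctxsw` and `read_ctxsw()` are hypothetical
names, the counts are process-wide (RUSAGE_SELF), and kernel entries that do
not lead to a context switch, such as plain system calls, are not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Context switch counters for the calling process (hypothetical helper type). */
struct ctxsw {
	long voluntary;   /* task gave up the CPU, e.g. by sleeping or blocking */
	long involuntary; /* task was preempted */
};

/* Fill *out from getrusage(); returns 0 on success, -1 on failure. */
static int read_ctxsw(struct ctxsw *out)
{
	struct rusage ru;

	assert(out != NULL);

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;

	out->voluntary = ru.ru_nvcsw;
	out->involuntary = ru.ru_nivcsw;
	return 0;
}
```

Sampling these counters before and after a run of the workload and taking the
difference gives the number of switches over that interval.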

>> As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
>> won't do the periodic balance; the residual 1Hz remote tick doesn't do that
>> either. But they should still do the newidle balance to pull work before
>> going tickless idle, and wakeup balance should help as well, albeit that
>> also depends on your topology.
>>
>
> I think the newidle balance and wakeup balance do not help in this case
> because the workload has few sleep and wakeup.
>

Right. So other than the NO_HZ_FULL vs NO_HZ_IDLE considerations above, you
could manually affine the threads of the workload. Depending on how much
control you have over how many threads it spawns, you could either pin one
thread per CPU, or just spawn the workload into a cpuset covering the
NO_HZ_FULL CPUs.
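
Pinning one thread per CPU can be done from within the workload with
sched_setaffinity(2). The following is a minimal sketch of that option, not
something taken from the workload discussed here; `pin_self_to_cpu()` is a
hypothetical helper, and it assumes a Linux/glibc build where _GNU_SOURCE
exposes the cpu_set_t macros.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for cpu_set_t, CPU_SET() and sched_setaffinity() */
#endif
#include <assert.h>
#include <sched.h>

/*
 * Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure with errno set by sched_setaffinity().
 */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	/* A pid of 0 applies the mask to the calling thread. */
	return sched_setaffinity(0, sizeof(set), &set);
}
```

Each worker thread would call `pin_self_to_cpu(i)` with its own CPU number at
startup. The cpuset alternative needs no code changes and is configured by the
administrator instead.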

Having the scheduler do the balancing is a bit of a precarious
situation. Your single housekeeping CPU is pretty much going to be always
running things, does it make sense to have it run the NOHZ idle balance
when there are available idle NOHZ_FULL CPUs? And in the same sense, does
it make sense to disturb an idle NOHZ_FULL CPU to get it to spread load on
other NOHZ_FULL CPUs? Admins that manually affine their threads will
probably say no.

9b019acb72e4 ("sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs")
also mentions SMT being an issue.

>> Could you share your system topology and your actual nohz_full cmdline?
>>
>
> The system has 192 CPUs. I set "nohz_full=0-191".
>
> Thanks,
> -adam



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-20  8:43     ` Valentin Schneider
  2025-08-20 11:05       ` Adam Li
@ 2025-08-20 17:31       ` Christoph Lameter (Ampere)
  2025-08-21  9:01         ` Valentin Schneider
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-08-20 17:31 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Adam Li, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, frederic,
	linux-kernel, patches

On Wed, 20 Aug 2025, Valentin Schneider wrote:

> My first question would be: is NOHZ_FULL really right for your workload?

Yes performance is improved. AI workloads are like HPC workloads in that
they need to do compute and then rendezvous for data exchange. Variations
in the runtime due to timer ticks cause idle periods where the rendezvous
cannot be completed because some cpus are delayed.

The more frequently the rendezvous can be performed, the better the performance
numbers will be.

> It's mainly designed to be used with always-running userspace tasks,
> generally affined to a CPU by the system administrator.

nohz_full has been reworked somewhat since the early days and works in a
more general way today.

> Here AIUI you're relying on the scheduler load balancing to distribute work
> to the NOHZ_FULL CPUs, so you're going to be penalized a lot by the
> NOHZ_FULL context switch overheads. What's the point? Wouldn't you have
> less overhead with just NOHZ_IDLE?

The benchmarks show a regression of 10-20% if the tick is operational.
The context switch overhead is negligible since the cpus are doing compute
and not system calls.

> As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
> won't do the periodic balance; the residual 1Hz remote tick doesn't do that
> either. But they should still do the newidle balance to pull work before
> going tickless idle, and wakeup balance should help as well, albeit that
> also depends on your topology.

That should work in general and not depend on any hardware topology. In
this case we have a linear sched domain including all processors.


* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-20 17:31       ` Christoph Lameter (Ampere)
@ 2025-08-21  9:01         ` Valentin Schneider
  0 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2025-08-21  9:01 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Adam Li, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, frederic,
	linux-kernel, patches

On 20/08/25 10:31, Christoph Lameter (Ampere) wrote:
> On Wed, 20 Aug 2025, Valentin Schneider wrote:
>
>> My first question would be: is NOHZ_FULL really right for your workload?
>
> Yes performance is improved. AI workloads are like HPC workloads in that
> they need to do compute and then rendezvous for data exchange. Variations
> in the runtime due to timer ticks cause idle periods where the rendezvous
> cannot be completed because some cpus are delayed.
>
> The more frequent rendezvous can be performed the better the performance
> numbers will be.

[...]

> The benchmarks show a regression of 10-20% if the tick is operational.
> The context switch overhead is negligible since the cpus are doing compute
> and not system calls.
>

Ah good, that's useful information, thanks!

>> As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
>> won't do the periodic balance; the residual 1Hz remote tick doesn't do that
>> either. But they should still do the newidle balance to pull work before
>> going tickless idle, and wakeup balance should help as well, albeit that
>> also depends on your topology.
>
> That should work in general and not depend on any hardware topology. In
> this case we have a linear sched domain including all processors.

Wakeup balance not so much, select_idle_sibling() won't move tasks outside
of the waker's LLC - but AIUI in your case you have just a single node and
I'm assuming one big LLC.



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-20 11:46         ` Valentin Schneider
@ 2025-08-21 11:18           ` Adam Li
  2025-08-28 10:56             ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Adam Li @ 2025-08-21 11:18 UTC (permalink / raw)
  To: Valentin Schneider, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, cl, frederic,
	linux-kernel, patches

On 8/20/2025 7:46 PM, Valentin Schneider wrote:
> 
> I'd say resend the whole series with the right folks cc'd.
>
OK. I resent the patch series.
Please refer to: https://lore.kernel.org/all/20250821042707.62993-1-adamli@os.amperecomputing.com/

>> 'nohz_full' option is supposed to benefit performance by reducing kernel
>> noise I think. Could you please give more detail on
>> 'NOHZ_FULL context switch overhead'?
>>
> 
> The doc briefly touches on that:
> 
>   https://docs.kernel.org/timers/no_hz.html#omit-scheduling-clock-ticks-for-cpus-with-only-one-runnable-task
> 
> The longer story is have a look at kernel/context_tracking.c; every
> transition into and out of the kernel to and from user or idle requires
> additional atomic operations and synchronization.
> 
> It would be worth for you to quantify how much these processes
> sleep/context switch, it could be that keep the tick enabled incurs a lower
> throughput penalty than the NO_HZ_FULL overheads.
> 

Thanks for the information.

>>> As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
>>> won't do the periodic balance; the residual 1Hz remote tick doesn't do that
>>> either. But they should still do the newidle balance to pull work before
>>> going tickless idle, and wakeup balance should help as well, albeit that
>>> also depends on your topology.
>>>
>>
>> I think the newidle balance and wakeup balance do not help in this case
>> because the workload has few sleep and wakeup.
>>
> 
> Right. So other than the NO_HZ_FULL vs NO_HZ_IDLE considerations above, you
> could manually affine the threads of the workload. Depending on how much
> control you have over how many threads it spawn, you could either pin on
> thread per CPU, or just spawn the workload into a cpuset covering the
> NO_HZ_FULL CPUs.
> 

Yes, binding the threads to CPUs can work around the performance
issue caused by load imbalance. Should we document that 'nohz_full' may keep
scheduler load balancing from working well and that CPU affinity is preferred?

> Having the scheduler do the balancing is bit of a precarious
> situation. Your single housekeeping CPU is pretty much going to be always
> running things, does it make sense to have it run the NOHZ idle balance
> when there are available idle NOHZ_FULL CPUs? And in the same sense, does
> it make sense to disturb an idle NOHZ_FULL CPU to get it to spread load on
> other NOHZ_FULL CPUs? Admins that manually affine their threads will
> probably say no.
> 

I think when the NOHZ_FULL CPU is added to nohz.idle_cpus_mask and
its tick is stopped, the CPU is 'very' idle. We can safely assign some work to it.

> 9b019acb72e4 ("sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs")
> also mentions SMT being an issue.
> 

From the commit message of 9b019acb72e4:
"The problem was observed with increased jitter on an application
running on CPU0, caused by NOHZ idle load balancing being run on
CPU1 (an SMT sibling)."

Can we say that without SMT it is safe to run NOHZ idle load balancing
on a CPU in nohz.idle_cpus_mask? My patch checks '!sched_smt_active()' when
searching nohz.idle_cpus_mask.

Thanks,
-adam



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-21 11:18           ` Adam Li
@ 2025-08-28 10:56             ` Valentin Schneider
  2025-08-28 15:44               ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2025-08-28 10:56 UTC (permalink / raw)
  To: Adam Li, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, cl, frederic,
	linux-kernel, patches

On 21/08/25 19:18, Adam Li wrote:
> On 8/20/2025 7:46 PM, Valentin Schneider wrote:
>> Right. So other than the NO_HZ_FULL vs NO_HZ_IDLE considerations above, you
>> could manually affine the threads of the workload. Depending on how much
>> control you have over how many threads it spawns, you could either pin one
>> thread per CPU, or just spawn the workload into a cpuset covering the
>> NO_HZ_FULL CPUs.
>>
>
> Yes, binding the threads to CPU can work around the performance
> issue caused by load imbalance. Should we document that 'nohz_full' may cause
> the scheduler load balancing not working well and CPU affinity is preferred?
>

Yeah I guess we could highlight that.

I think it's kind of a gray area; technically we could change load
balancing to make NO_HZ_FULL CPUs better at pulling tasks, but that only
works up to the point where, if you have N NO_HZ_FULL CPUs, you have pulled
N tasks. So there is an underlying assumption that the workload threading
matches your NO_HZ_FULL topology; and if that's the case, you might as well
affine the tasks by hand and avoid any surprises.

Put in another way: yes we can probably make load balancing better
for NO_HZ_FULL CPUs, but that only really works if we have one task to pull
per NO_HZ_FULL CPU, in which case manual affinity binding works just as
well, and I prefer that approach since it means we don't have to add a
NO_HZ_FULL load balancing logic which may end up interfering with
NO_HZ_FULL itself. At least, that is my opinion.

>> Having the scheduler do the balancing is a bit of a precarious
>> situation. Your single housekeeping CPU is pretty much going to be always
>> running things, does it make sense to have it run the NOHZ idle balance
>> when there are available idle NOHZ_FULL CPUs? And in the same sense, does
>> it make sense to disturb an idle NOHZ_FULL CPU to get it to spread load on
>> other NOHZ_FULL CPUs? Admins that manually affine their threads will
>> probably say no.
>>
>
> I think when the NOHZ_FULL CPU is added to nohz.idle_cpus_mask and
> its tick is stopped, the CPU is 'very' idle. We can safely assign some work to it.
>
>> 9b019acb72e4 ("sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs")
>> also mentions SMT being an issue.
>>
>
> From the commit message of 9b019acb72e4:
> "The problem was observed with increased jitter on an application
> running on CPU0, caused by NOHZ idle load balancing being run on
> CPU1 (an SMT sibling)."
>
> Can we say if *no* SMT, it is safe to run NOHZ idle load balancing
> on CPU in nohz.idle_cpus_mask? My patch checks '!sched_smt_active()' when
> searching from nohz.idle_cpus_mask.
>

I suppose we could still make this work for SMT with e.g. is_core_idle(),
but see my point above.

> Thanks,
> -adam
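
For readers unfamiliar with the is_core_idle() idea mentioned above, here
is a minimal userspace sketch of it: a CPU only qualifies as an ILB target
when every other SMT sibling on its core is idle too. Everything here is a
hypothetical model (the names model_is_core_idle, model_pick_ilb_core_idle,
the bitmask layout and the smt_siblings[] table are assumptions made for
illustration), not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

/*
 * Hypothetical model: bit N of a mask stands for CPU N.
 * smt_siblings[cpu] holds the CPUs sharing a core with cpu (cpu included).
 */
static bool model_is_core_idle(const uint64_t smt_siblings[MODEL_NR_CPUS],
			       uint64_t idle_mask, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	/* Look only at the *other* threads of this core. */
	uint64_t others = smt_siblings[cpu] & ~(UINT64_C(1) << cpu);

	/* The core is idle if none of those siblings is busy. */
	return (others & ~idle_mask) == 0;
}

/*
 * Pick the first CPU in the NOHZ idle mask that is idle, is not the
 * caller, and sits on a core whose other threads are all idle.
 * Returns -1 when no CPU qualifies.
 */
static int model_pick_ilb_core_idle(const uint64_t smt_siblings[MODEL_NR_CPUS],
				    uint64_t nohz_idle_mask,
				    uint64_t idle_mask, int this_cpu)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		uint64_t bit = UINT64_C(1) << cpu;

		if (cpu == this_cpu)
			continue;
		if (!(nohz_idle_mask & idle_mask & bit))
			continue;
		if (model_is_core_idle(smt_siblings, idle_mask, cpu))
			return cpu;
	}
	return -1;
}
```

With cores made of CPU pairs (0,1) and (2,3) and only CPU0 busy, this model
skips CPU1 (its sibling is busy) and settles on CPU2.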



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-28 10:56             ` Valentin Schneider
@ 2025-08-28 15:44               ` Christoph Lameter (Ampere)
  2025-09-03 12:35                 ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-08-28 15:44 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Adam Li, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, frederic,
	linux-kernel, patches

On Thu, 28 Aug 2025, Valentin Schneider wrote:

> > Yes, binding the threads to CPU can work around the performance
> > issue caused by load imbalance. Should we document that 'nohz_full' may cause
> > the scheduler load balancing not working well and CPU affinity is preferred?
> >
>
> Yeah I guess we could highlight that.

We need to make sure that the idle cpus are used when available and
needed. Otherwise the scheduler is buggy.

Such a load balancing action means that there is a cpu that is running
multiple processes. Therefore the timer interrupt and the scheduler
processing are active on at least one cpu. We can therefore do something
about the situation.

The scheduler needs to move one of the processes onto the idle cpu.


* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-08-28 15:44               ` Christoph Lameter (Ampere)
@ 2025-09-03 12:35                 ` Valentin Schneider
  2025-09-03 14:14                   ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2025-09-03 12:35 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Adam Li, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, frederic,
	linux-kernel, patches

On 28/08/25 08:44, Christoph Lameter (Ampere) wrote:
> On Thu, 28 Aug 2025, Valentin Schneider wrote:
>
>> > Yes, binding the threads to CPU can work around the performance
>> > issue caused by load imbalance. Should we document that 'nohz_full' may cause
>> > the scheduler load balancing not working well and CPU affinity is preferred?
>> >
>>
>> Yeah I guess we could highlight that.
>
> We need to make sure that the idle cpus are used when available and
> needed. Otherwise the scheduler is buggy.
>
> Such a load balancing action means that there is a cpu that is running
> multiple processes. Therefore the timer interrupt and the scheduler
> processing are active on at least one cpu. We can therefore do something
> about the situation.
>
> The scheduler needs to move one of the processes onto the idle cpu.

AFAICT we have (at least) two options:
1) Trigger NOHZ balancing on a busy housekeeping CPU (what this patch does)

   This is somewhat against idle load balancing rules (only spend CPU time
   on that if there is no "genuine" work to run), but I guess from a CPU
   isolation PoV this can be tallied as just another housekeeping activity

2) Trigger NOHZ balancing on an idle NOHZ_FULL CPU

   That doesn't steal useful CPU time, but that also potentially causes
   interference, albeit only if racing with the NOHZ_FULL workload spawning
   (which shouldn't be the steady state).

The more I think about it the more I'm leaning towards 1), but I'd like
other folks' opinion.
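
To make the two options concrete, the selection order described in the
patch changelog can be modeled in userspace roughly as below. This is only
a sketch under stated assumptions: the struct, the helper names
(ilb_first_cpu, ilb_model_find) and the 64-bit masks are hypothetical, and
excluding the current CPU at every step is a modeling choice, not something
the changelog states. Fallback 3 corresponds to option 1) and fallback 2 to
option 2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: bit N of each mask stands for CPU N. */
struct ilb_model {
	uint64_t hk_mask;        /* housekeeping CPUs */
	uint64_t nohz_idle_mask; /* stands in for nohz.idle_cpus_mask */
	uint64_t idle_mask;      /* CPUs that are idle right now */
	bool smt_active;         /* stands in for sched_smt_active() */
};

/* Index of the lowest set bit, or -1 for an empty mask. */
static int ilb_first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

static int ilb_model_find(const struct ilb_model *m, int this_cpu)
{
	assert(this_cpu >= 0 && this_cpu < 64);

	/* Assumption: the current CPU is excluded at every step. */
	uint64_t others = ~(UINT64_C(1) << this_cpu);
	int cpu;

	/* Preferred: idle housekeeping CPU that is in the NOHZ idle mask. */
	cpu = ilb_first_cpu(m->hk_mask & m->nohz_idle_mask & m->idle_mask & others);
	if (cpu >= 0)
		return cpu;

	/* Fallback 1: first idle housekeeping CPU. */
	cpu = ilb_first_cpu(m->hk_mask & m->idle_mask & others);
	if (cpu >= 0)
		return cpu;

	/* Fallback 2: first idle CPU in the NOHZ idle mask, only without SMT. */
	if (!m->smt_active) {
		cpu = ilb_first_cpu(m->nohz_idle_mask & m->idle_mask & others);
		if (cpu >= 0)
			return cpu;
	}

	/* Fallback 3: first housekeeping CPU even if it is busy. */
	return ilb_first_cpu(m->hk_mask & others);
}
```

With a single busy housekeeping CPU0 and idle NOHZ_FULL CPUs 1-3, the model
picks CPU1 when SMT is off and falls through to the busy CPU0 when SMT is on.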



* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-09-03 12:35                 ` Valentin Schneider
@ 2025-09-03 14:14                   ` Vincent Guittot
  2025-09-03 20:33                     ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2025-09-03 14:14 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Christoph Lameter (Ampere), Adam Li, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, frederic,
	linux-kernel, patches

On Wed, 3 Sept 2025 at 14:35, Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 28/08/25 08:44, Christoph Lameter (Ampere) wrote:
> > On Thu, 28 Aug 2025, Valentin Schneider wrote:
> >
> >> > Yes, binding the threads to CPU can work around the performance
> >> > issue caused by load imbalance. Should we document that 'nohz_full' may cause
> >> > the scheduler load balancing not working well and CPU affinity is preferred?
> >> >
> >>
> >> Yeah I guess we could highlight that.
> >
> > We need to make sure that the idle cpus are used when available and
> > needed. Otherwise the scheduler is buggy.
> >
> > Such a load balancing action means that there is a cpu that is running
> > multiple processes. Therefore the timer interrupt and the scheduler
> > processing are active on at least one cpu. We can therefore do something
> > about the situation.
> >
> > The scheduler needs to move one of the processes onto the idle cpu.
>
> AFAICT we have (at least) two options:
> 1) Trigger NOHZ balancing on a busy housekeeping CPU (what this patch does)
>
>    This is somewhat against idle load balancing rules (only spend CPU time
>    on that if there is no "genuine" work to run), but I guess from a CPU
>    isolation PoV this can be tallied as just another housekeeping activity

In this case, this should only be done for the full nohz case and not for
other cases, because the ILB overhead is not negligible on a busy cpu,
and I don't see anything that enables 1) only for full nohz.

>
> 2) Trigger NOHZ balancing on an idle NOHZ_FULL CPU

This patch also does 2) for the no-SMT case.

I wonder why this happens only for the no-SMT case? If the sibling is
used by another thread with full nohz, it already interferes with this
one.

But we might want to do is_core_idle() instead

>
>    That doesn't steal useful CPU time, but that also potentially causes
>    interference, albeit only if racing with the NOHZ_FULL workload spawning
>    (which shouldn't be the steady state).
>
> The more I think about it the more I'm leaning towards 1), but I'd like
> other folks' opinion.
>


* Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU
  2025-09-03 14:14                   ` Vincent Guittot
@ 2025-09-03 20:33                     ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-09-03 20:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Valentin Schneider, Adam Li, mingo, peterz, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, frederic,
	linux-kernel, patches

On Wed, 3 Sep 2025, Vincent Guittot wrote:

> > AFAICT we have (at least) two options:
> > 1) Trigger NOHZ balancing on a busy housekeeping CPU (what this patch does)


Isn't there a third option?

3) Trigger load balancing if on a NOHZ_FULL cpu and multiple processes are
running on it.
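
As a rough illustration of the condition in 3), the check could look like
the userspace sketch below. The struct model_rq, the function
model_should_kick_balance and the nohz_full_mask bitmask are hypothetical
names chosen for this example; the thread does not say where such a check
would live in the scheduler or what it would trigger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-CPU runqueue. */
struct model_rq {
	int cpu;                 /* CPU this runqueue belongs to */
	unsigned int nr_running; /* runnable tasks on this CPU */
};

/*
 * Option 3 as a predicate: balancing is worth triggering when the CPU is
 * in the NOHZ_FULL set (bit N of nohz_full_mask stands for CPU N) and has
 * more than one runnable task.
 */
static bool model_should_kick_balance(const struct model_rq *rq,
				      uint64_t nohz_full_mask)
{
	assert(rq->cpu >= 0 && rq->cpu < 64);

	bool is_nohz_full = (nohz_full_mask & (UINT64_C(1) << rq->cpu)) != 0;

	return is_nohz_full && rq->nr_running > 1;
}
```

A NOHZ_FULL CPU running two tasks satisfies the predicate; a housekeeping
CPU, or a NOHZ_FULL CPU running a single task, does not.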



Thread overview: 14+ messages
2025-08-19  2:57 [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB CPU Adam Li
2025-08-19 14:00 ` Valentin Schneider
2025-08-20  3:35   ` Adam Li
2025-08-20  8:43     ` Valentin Schneider
2025-08-20 11:05       ` Adam Li
2025-08-20 11:46         ` Valentin Schneider
2025-08-21 11:18           ` Adam Li
2025-08-28 10:56             ` Valentin Schneider
2025-08-28 15:44               ` Christoph Lameter (Ampere)
2025-09-03 12:35                 ` Valentin Schneider
2025-09-03 14:14                   ` Vincent Guittot
2025-09-03 20:33                     ` Christoph Lameter (Ampere)
2025-08-20 17:31       ` Christoph Lameter (Ampere)
2025-08-21  9:01         ` Valentin Schneider
