* [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
@ 2026-02-04 0:49 Shubhang Kaushik
2026-02-12 14:33 ` Frederic Weisbecker
0 siblings, 1 reply; 8+ messages in thread
From: Shubhang Kaushik @ 2026-02-04 0:49 UTC (permalink / raw)
To: Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Vincent Guittot, Valentin Schneider
Cc: dietmar.eggemann, bsegall, mgorman, rostedt, Shubhang Kaushik,
Christoph Lameter, linux-kernel, Shubhang Kaushik, Adam Li
Under CONFIG_NO_HZ_FULL, the scheduler tick can get stopped earlier via
tick_nohz_full_stop_tick() before the CPU subsequently enters the idle
path. In this case, tick_nohz_idle_stop_tick() observes TS_FLAG_STOPPED
already set and skips nohz_balance_enter_idle() because the !was_stopped
condition assumes tick-stop and idle-entry are coupled.
This leaves a tickless idle CPU absent from nohz.idle_cpus_mask, making
it invisible to NOHZ idle load balancing while periodic balancing is
also suppressed.
The patch fixes this by decoupling tick-stop transition accounting from
scheduler bookkeeping. idle_jiffies remains updated only on the
tick-stop transition, while nohz_balance_enter_idle() is invoked
whenever a CPU enters idle with the tick already stopped, relying on its
existing idempotent guard to avoid duplicate registration.
Tested on Ampere Altra on 6.19.0-rc8 with CONFIG_NO_HZ_FULL enabled:
- This change improves load distribution by ensuring that tickless idle
CPUs are visible to NOHZ idle load balancing. In llama-batched-bench,
throughput improves by up to ~14% across multiple thread counts.
- Hackbench single-process results improve by 5% and multi-process
results improve by up to ~26%, consistent with reduced scheduler
jitter and earlier utilization of fully idle cores.
No regressions observed.
Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Reviewed-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
This is a resend of the original patch to ensure visibility.
Previous resend: https://lkml.org/lkml/2025/8/21/170
Original thread: https://lkml.org/lkml/2025/8/21/171
The patch addresses a performance regression in NOHZ idle load balancing
observed under CONFIG_NO_HZ_FULL, where idle CPUs were becoming
invisible to the balancer.
---
kernel/time/tick-sched.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2f8a7923fa279409ffe950f770ff2eac868f6ece..eee6fcebe78c2f8d93464a55fe332e12fe9c164e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1250,8 +1250,9 @@ void tick_nohz_idle_stop_tick(void)
ts->idle_sleeps++;
ts->idle_expires = expires;
- if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
- ts->idle_jiffies = ts->last_jiffies;
+ if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+ if (!was_stopped)
+ ts->idle_jiffies = ts->last_jiffies;
nohz_balance_enter_idle(cpu);
}
} else {
---
base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
change-id: 20260203-fix-nohz-idle-b2838276cb91
Best regards,
--
Shubhang Kaushik <shubhang@os.amperecomputing.com>
* Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
2026-02-04 0:49 [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state Shubhang Kaushik
@ 2026-02-12 14:33 ` Frederic Weisbecker
2026-02-12 19:36 ` Shubhang Kaushik
0 siblings, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2026-02-12 14:33 UTC (permalink / raw)
To: Shubhang Kaushik
Cc: Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Vincent Guittot,
Valentin Schneider, dietmar.eggemann, bsegall, mgorman, rostedt,
Shubhang Kaushik, Christoph Lameter, linux-kernel, Adam Li
On Tue, Feb 03, 2026 at 04:49:03PM -0800, Shubhang Kaushik wrote:
> Under CONFIG_NO_HZ_FULL, the scheduler tick can get stopped earlier via
> tick_nohz_full_stop_tick() before the CPU subsequently enters the idle
> path. In this case, tick_nohz_idle_stop_tick() observes TS_FLAG_STOPPED
> already set and skips nohz_balance_enter_idle() because the !was_stopped
> condition assumes tick-stop and idle-entry are coupled.
> This leaves a tickless idle CPU absent from nohz.idle_cpus_mask, making
> it invisible to NOHZ idle load balancing while periodic balancing is
> also suppressed.
>
> The patch fixes this by decoupling tick-stop transition accounting from
> scheduler bookkeeping. idle_jiffies remains updated only on the
> tick-stop transition, while nohz_balance_enter_idle() is invoked
> whenever a CPU enters idle with the tick already stopped, relying on its
> existing idempotent guard to avoid duplicate registration.
>
> Tested on Ampere Altra on 6.19.0-rc8 with CONFIG_NO_HZ_FULL enabled:
> - This change improves load distribution by ensuring that tickless idle
> CPUs are visible to NOHZ idle load balancing. In llama-batched-bench,
> throughput improves by up to ~14% across multiple thread counts.
> - Hackbench single-process results improve by 5% and multi-process
> results improve by up to ~26%, consistent with reduced scheduler
> jitter and earlier utilization of fully idle cores.
> No regressions observed.
Because you rely on dynamic placement of isolated tasks across isolated
CPUs by the scheduler.
But nohz_full is designed for running only one task per isolated CPU without
any disturbance. And migration is a significant disturbance. This is why
nohz_full tries not to be too smart and assumes that task placement is entirely
in the hands of the user.
So I have to ask, what prevents you from using static task placement in your
workload?
I'm not saying it's undesirable or impossible to do adaptive userspace dyntick
for users that don't rely on ultra low latency but rather on high CPU-bound
performance. In fact the initial purpose of nohz_full was for HPC and not
real-time. Turns out that real time covers all the use cases I have seen so far
and you're the first HPC one. But adapting nohz_full dynamically for that will
involve much more than just load balancing. For now, static affinity should
work for everyone.
Thanks.
>
> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
> Reviewed-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> This is a resend of the original patch to ensure visibility.
> Previous resend: https://lkml.org/lkml/2025/8/21/170
> Original thread: https://lkml.org/lkml/2025/8/21/171
>
> The patch addresses a performance regression in NOHZ idle load balancing
> observed under CONFIG_NO_HZ_FULL, where idle CPUs were becoming
> invisible to the balancer.
> ---
> kernel/time/tick-sched.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 2f8a7923fa279409ffe950f770ff2eac868f6ece..eee6fcebe78c2f8d93464a55fe332e12fe9c164e 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1250,8 +1250,9 @@ void tick_nohz_idle_stop_tick(void)
> ts->idle_sleeps++;
> ts->idle_expires = expires;
>
> - if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> - ts->idle_jiffies = ts->last_jiffies;
> + if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> + if (!was_stopped)
> + ts->idle_jiffies = ts->last_jiffies;
> nohz_balance_enter_idle(cpu);
> }
> } else {
>
> ---
> base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
> change-id: 20260203-fix-nohz-idle-b2838276cb91
>
> Best regards,
> --
> Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
--
Frederic Weisbecker
SUSE Labs
* Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
2026-02-12 14:33 ` Frederic Weisbecker
@ 2026-02-12 19:36 ` Shubhang Kaushik
2026-02-12 20:04 ` Shubhang Kaushik
2026-02-13 12:56 ` Frederic Weisbecker
0 siblings, 2 replies; 8+ messages in thread
From: Shubhang Kaushik @ 2026-02-12 19:36 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Vincent Guittot,
Valentin Schneider, dietmar.eggemann, bsegall, mgorman, rostedt,
Christoph Lameter, linux-kernel, Adam Li
Hi Frederic,
On Thu, 12 Feb 2026, Frederic Weisbecker wrote:
>>
>> Tested on Ampere Altra on 6.19.0-rc8 with CONFIG_NO_HZ_FULL enabled:
>> - This change improves load distribution by ensuring that tickless idle
>> CPUs are visible to NOHZ idle load balancing. In llama-batched-bench,
>> throughput improves by up to ~14% across multiple thread counts.
>> - Hackbench single-process results improve by 5% and multi-process
>> results improve by up to ~26%, consistent with reduced scheduler
>> jitter and earlier utilization of fully idle cores.
>> No regressions observed.
>
> Because you rely on dynamic placement of isolated tasks throughout isolated
> CPUs by the scheduler.
>
> But nohz_full is designed for running only one task per isolated CPU without
> any disturbance. And migration is a significant disturbance. This is why
> nohz_full tries not to be too smart and assumes that task placement is entirely
> within the hands of the user.
>
> So I have to ask, what prevents you from using static task placement in your
> workload?
Actually, the llama-batched-bench results I shared already included
static affinity testing via numactl -C.
Even with static placement, we observe this ~14% throughput improvement.
This suggests that the issue isn't about the scheduler trying to be
smart with task migration, but rather about the side effects of an idle
CPU being absent from nohz.idle_cpus_mask.
When nohz_full CPUs enter idle but aren't correctly accounted for in the
idle mask, it appears to cause unnecessary overhead or interference in the
NOHZ load balancing logic for the CPUs that are still running tasks. By
ensuring the idle state is correctly tracked, we're not encouraging
migration, but rather ensuring the scheduler's global state accurately
reflects reality.
AFAICT this seems to be a case where correcting the bookkeeping benefits
HPC throughput even when the user handles all task placement manually.
Regards,
Shubhang Kaushik
>
> I'm not saying it's undesirable or impossible to do adaptive userspace dyntick
> for users that don't rely on ultra low latency but rather on high CPU-bound
> performance. In fact the initial purpose of nohz_full was for HPC and not
> real-time. Turns out that real time covers all the use cases I have seen so far and
> you're the first HPC one. But adapting nohz_full dynamically for that will involve
> much more than just load balancing. Now the static affinity should work for
> everyone.
>
> Thanks.
>
>
>>
>> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>> Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
>> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
>> Reviewed-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>> ---
>> This is a resend of the original patch to ensure visibility.
>> Previous resend: https://lkml.org/lkml/2025/8/21/170
>> Original thread: https://lkml.org/lkml/2025/8/21/171
>>
>> The patch addresses a performance regression in NOHZ idle load balancing
>> observed under CONFIG_NO_HZ_FULL, where idle CPUs were becoming
>> invisible to the balancer.
>> ---
>> kernel/time/tick-sched.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 2f8a7923fa279409ffe950f770ff2eac868f6ece..eee6fcebe78c2f8d93464a55fe332e12fe9c164e 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -1250,8 +1250,9 @@ void tick_nohz_idle_stop_tick(void)
>> ts->idle_sleeps++;
>> ts->idle_expires = expires;
>>
>> - if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
>> - ts->idle_jiffies = ts->last_jiffies;
>> + if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
>> + if (!was_stopped)
>> + ts->idle_jiffies = ts->last_jiffies;
>> nohz_balance_enter_idle(cpu);
>> }
>> } else {
>>
>> ---
>> base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
>> change-id: 20260203-fix-nohz-idle-b2838276cb91
>>
>> Best regards,
>> --
>> Shubhang Kaushik <shubhang@os.amperecomputing.com>
>>
>
> --
> Frederic Weisbecker
> SUSE Labs
>
* Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
2026-02-12 19:36 ` Shubhang Kaushik
@ 2026-02-12 20:04 ` Shubhang Kaushik
2026-02-13 13:11 ` Frederic Weisbecker
2026-02-13 12:56 ` Frederic Weisbecker
1 sibling, 1 reply; 8+ messages in thread
From: Shubhang Kaushik @ 2026-02-12 20:04 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Vincent Guittot,
Valentin Schneider, dietmar.eggemann, bsegall, mgorman, rostedt,
Christoph Lameter, linux-kernel, Adam Li
On Thu, 12 Feb 2026, Shubhang Kaushik wrote:
>> Because you rely on dynamic placement of isolated tasks throughout
>> isolated
>> CPUs by the scheduler.
>>
>> But nohz_full is designed for running only one task per isolated CPU
>> without
>> any disturbance. And migration is a significant disturbance. This is why
>> nohz_full tries not to be too smart and assumes that task placement is
>> entirely
>> within the hands of the user.
>>
>> So I have to ask, what prevents you from using static task placement in
>> your
>> workload?
>
> Actually, the llama-batched-bench results I shared already included static
> affinity testing via numactl -C.
What I mean by that is even when tasks are strictly pinned to individual
cores, the performance gap remains.
IIUC, the current implementation assumes tick-stop and idle-entry are
coupled. While this holds for standard NOHZ, nohz_full decouples them,
causing idle CPUs to be omitted from nohz.idle_cpus_mask.
This hides idle capacity from the NOHZ idle balancer, forcing
housekeeping tasks onto active cores. By decoupling these transitions in
the code, we ensure accurate state accounting.
>
> Even with static placement, we observe this ~14% throughput improvement. This
> suggests that the issue isn't about the scheduler trying to be smart with
> task migration, but rather about the side effects of an idle CPU being absent
> from nohz.idle_cpus_mask.
>
> When nohz_full CPUs enter idle but aren't correctly accounted for in the idle
> mask, it appears to cause unnecessary overhead or interference in the NOHZ
> load balancing logic for the CPUs that are still running tasks. By ensuring
> the idle state is correctly tracked, we're not encouraging migration, but
> rather ensuring the scheduler's global state accurately reflects reality.
>
> AFAICT this seems to be a case where correcting the bookkeeping benefits HPC
> throughput even when the user handles all task placement manually.
>
> Regards,
> Shubhang Kaushik
>>
>> I'm not saying it's undesirable or impossible to do adaptive userspace
>> dyntick
>> for users that don't rely on ultra low latency but rather on high
>> CPU-bound
>> performance. In fact the initial purpose of nohz_full was for HPC and not
>> real-time. Turns out that real time covers all the use cases I have seen so far
>> and
>> you're the first HPC one. But adapting nohz_full dynamically for that will
>> involve
>> much more than just load balancing. Now the static affinity should work
>> for
>> everyone.
>>
>> Thanks.
>>
>>
>>>
>>> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>>> Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
>>> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
>>> Reviewed-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>>> ---
>>> This is a resend of the original patch to ensure visibility.
>>> Previous resend: https://lkml.org/lkml/2025/8/21/170
>>> Original thread: https://lkml.org/lkml/2025/8/21/171
>>>
>>> The patch addresses a performance regression in NOHZ idle load balancing
>>> observed under CONFIG_NO_HZ_FULL, where idle CPUs were becoming
>>> invisible to the balancer.
>>> ---
>>> kernel/time/tick-sched.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>>> index
>>> 2f8a7923fa279409ffe950f770ff2eac868f6ece..eee6fcebe78c2f8d93464a55fe332e12fe9c164e
>>> 100644
>>> --- a/kernel/time/tick-sched.c
>>> +++ b/kernel/time/tick-sched.c
>>> @@ -1250,8 +1250,9 @@ void tick_nohz_idle_stop_tick(void)
>>> ts->idle_sleeps++;
>>> ts->idle_expires = expires;
>>>
>>> - if (!was_stopped && tick_sched_flag_test(ts,
>>> TS_FLAG_STOPPED)) {
>>> - ts->idle_jiffies = ts->last_jiffies;
>>> + if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
>>> + if (!was_stopped)
>>> + ts->idle_jiffies = ts->last_jiffies;
>>> nohz_balance_enter_idle(cpu);
>>> }
>>> } else {
>>>
>>> ---
>>> base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
>>> change-id: 20260203-fix-nohz-idle-b2838276cb91
>>>
>>> Best regards,
>>> --
>>> Shubhang Kaushik <shubhang@os.amperecomputing.com>
>>>
>>
>> --
>> Frederic Weisbecker
>> SUSE Labs
>>
>
>
* Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
2026-02-12 19:36 ` Shubhang Kaushik
2026-02-12 20:04 ` Shubhang Kaushik
@ 2026-02-13 12:56 ` Frederic Weisbecker
2026-02-13 18:15 ` Christoph Lameter (Ampere)
1 sibling, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2026-02-13 12:56 UTC (permalink / raw)
To: Shubhang Kaushik
Cc: Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Vincent Guittot,
Valentin Schneider, dietmar.eggemann, bsegall, mgorman, rostedt,
Christoph Lameter, linux-kernel, Adam Li
On Thu, Feb 12, 2026 at 11:36:06AM -0800, Shubhang Kaushik wrote:
> Hi Frederic,
>
> On Thu, 12 Feb 2026, Frederic Weisbecker wrote:
>
> > >
> > > Tested on Ampere Altra on 6.19.0-rc8 with CONFIG_NO_HZ_FULL enabled:
> > > - This change improves load distribution by ensuring that tickless idle
> > > CPUs are visible to NOHZ idle load balancing. In llama-batched-bench,
> > > throughput improves by up to ~14% across multiple thread counts.
> > > - Hackbench single-process results improve by 5% and multi-process
> > > results improve by up to ~26%, consistent with reduced scheduler
> > > jitter and earlier utilization of fully idle cores.
> > > No regressions observed.
> >
> > Because you rely on dynamic placement of isolated tasks throughout isolated
> > CPUs by the scheduler.
> >
> > But nohz_full is designed for running only one task per isolated CPU without
> > any disturbance. And migration is a significant disturbance. This is why
> > nohz_full tries not to be too smart and assumes that task placement is entirely
> > within the hands of the user.
> >
> > So I have to ask, what prevents you from using static task placement in your
> > workload?
>
> Actually, the llama-batched-bench results I shared already included static
> affinity testing via numactl -C.
>
> Even with static placement, we observe this ~14% throughput improvement.
> This suggests that the issue isn't about the scheduler trying to be smart
> with task migration, but rather about the side effects of an idle CPU being
> absent from nohz.idle_cpus_mask.
>
> When nohz_full CPUs enter idle but aren't correctly accounted for in the
> idle mask, it appears to cause unnecessary overhead or interference in the
> NOHZ load balancing logic for the CPUs that are still running tasks. By
> ensuring the idle state is correctly tracked, we're not encouraging
> migration, but rather ensuring the scheduler's global state accurately
> reflects reality.
Then there seems to be something else going on that we don't fully understand,
because isolated CPUs run 1 pinned task per CPU and the only housekeeping CPU
is CPU 0. So there is nothing to balance here.
Perhaps some CPUs spend too much time scanning through all isolated CPUs to
see if there is balancing to do. I don't know, this needs further investigation.
But if the nohz_full CPUs are correctly domain isolated as they should
(through isolcpus=domain or cpuset isolated partitions), they should be
invisible to ilb anyway.
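For reference, a hedged sketch of the kind of boot-time domain isolation described above (the CPU range is illustrative, not taken from this thread):

```
# Keep CPUs 0-1 for housekeeping; domain-isolate CPUs 2-79 and run them tickless
isolcpus=domain,2-79 nohz_full=2-79
```

With a setup like this the isolated CPUs have no scheduler domains, so the idle load balancer never considers them in the first place.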
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
2026-02-12 20:04 ` Shubhang Kaushik
@ 2026-02-13 13:11 ` Frederic Weisbecker
0 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2026-02-13 13:11 UTC (permalink / raw)
To: Shubhang Kaushik
Cc: Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Vincent Guittot,
Valentin Schneider, dietmar.eggemann, bsegall, mgorman, rostedt,
Christoph Lameter, linux-kernel, Adam Li
On Thu, Feb 12, 2026 at 12:04:11PM -0800, Shubhang Kaushik wrote:
> On Thu, 12 Feb 2026, Shubhang Kaushik wrote:
>
> > > Because you rely on dynamic placement of isolated tasks throughout
> > > isolated
> > > CPUs by the scheduler.
> > >
> > > But nohz_full is designed for running only one task per isolated CPU
> > > without
> > > any disturbance. And migration is a significant disturbance. This is why
> > > nohz_full tries not to be too smart and assumes that task placement is
> > > entirely
> > > within the hands of the user.
> > >
> > > So I have to ask, what prevents you from using static task placement in
> > > your
> > > workload?
> >
> > Actually, the llama-batched-bench results I shared already included
> > static affinity testing via numactl -C.
>
> What I mean by that is even when tasks are strictly pinned to individual
> cores, the performance gap remains.
>
> IIUC, the current implementation assumes tick-stop and idle-entry are
> coupled. While this holds for standard NOHZ, nohz_full decouples them,
> causing idle CPUs to be omitted from nohz.idle_cpus_mask.
>
> This hides idle capacity from the NOHZ idle balancer, forcing housekeeping
> tasks onto active cores. By decoupling these transitions in the code, we
> ensure accurate state accounting.
You mean housekeeping tasks are moved to isolated CPUs? With proper
isolation setting (ie: domain + nohz_full) this shouldn't happen.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
2026-02-13 12:56 ` Frederic Weisbecker
@ 2026-02-13 18:15 ` Christoph Lameter (Ampere)
2026-03-11 11:06 ` Frederic Weisbecker
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-13 18:15 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Shubhang Kaushik, Anna-Maria Behnsen, Ingo Molnar,
Thomas Gleixner, Vincent Guittot, Valentin Schneider,
dietmar.eggemann, bsegall, mgorman, rostedt, linux-kernel,
Adam Li
On Fri, 13 Feb 2026, Frederic Weisbecker wrote:
> Then there seems to be something else going on that we don't fully understand
> because isolated CPUs run 1 pinned task per CPU and the only housekeeping CPU
> is CPU 0. So there is nothing to balance here.
>
> Perhaps some CPUs spend too much time scanning through all isolated CPUs to
> see if there is balancing to do. I don't know, this needs further investigation.
> But if the nohz_full CPUs are correctly domain isolated as they should
> (through isolcpus=domain or cpuset isolated partitions), they should be
> invisible to ilb anyway.
"balancing" would mean moving tasks from busy cpus (that are not in
NOHZ_FULL state) to idle cpus that can then be in NOHZ_FULL state.
If the move from a busy cpu to an idle cpu succeeds then both cpus may
only run one process and be able to enter NOHZ_FULL.
This is e.g. the case with threadpools used by certain AI apps. Before
the app starts, numactl is used to set up a group of cpus that the app can use.
One may optimize and allow NOHZ_FULL for these cpus.
The app will then create a number of threads during its startup phase.
These should all be placed on idle cpus in the allowed cpu range.
If this is configured the right way then each thread is on a different cpu
and there is one thread per cpu so that we can use NOHZ_FULL.
This is sometimes broken because not all idle cpus are used. Instead some
cpus get two threads and other cpus stay idle. That is why idle load
balancing is needed.
There is no cpu isolation/cgroups or other black magic involved here.
* Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
2026-02-13 18:15 ` Christoph Lameter (Ampere)
@ 2026-03-11 11:06 ` Frederic Weisbecker
0 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2026-03-11 11:06 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Shubhang Kaushik, Anna-Maria Behnsen, Ingo Molnar,
Thomas Gleixner, Vincent Guittot, Valentin Schneider,
dietmar.eggemann, bsegall, mgorman, rostedt, linux-kernel,
Adam Li
On Fri, Feb 13, 2026 at 10:15:15AM -0800, Christoph Lameter (Ampere) wrote:
> On Fri, 13 Feb 2026, Frederic Weisbecker wrote:
>
> > Then there seems to be something else going on that we don't fully understand
> > because isolated CPUs run 1 pinned task per CPU and the only housekeeping CPU
> > is CPU 0. So there is nothing to balance here.
> >
> > Perhaps some CPUs spend too much time scanning through all isolated CPUs to
> > see if there is balancing to do. I don't know, this needs further investigation.
> > But if the nohz_full CPUs are correctly domain isolated as they should
> > (through isolcpus=domain or cpuset isolated partitions), they should be
> > invisible to ilb anyway.
>
>
> "balancing" would mean moving tasks from busy cpus (that are not in
> NOHZ_FULL state) to idle cpus that can then be in NOHZ_FULL state.
>
> If the move from a busy cpu to an idle cpu succeeds then both cpus may
> only run one process and be able to enter NOHZ_FULL.
>
> This is e.g. the case with threadpools used by certain AI apps. Before
> the app starts, numactl is used to set up a group of cpus that the app can use.
>
> One may optimize and allow NOHZ_FULL for these cpus.
>
> The app will then create a number of threads during its startup phase.
> These should be all placed on idle cpus in the allowed cpu range.
>
> If this is configured the right way then each thread is on a different cpu
> and there is one thread per cpu so that we can use NOHZ_FULL.
>
> This is sometimes broken because not all idle cpus are used. Instead some
> cpus get two threads and other cpus stay idle. That is why idle load
> balancing is needed.
Which means you guys eventually rely on load balancing...
So I can only repeat what I said there:
https://lore.kernel.org/lkml/aY3k1_JJjPFUhPd4@localhost.localdomain/
> There is no cpu isolation/cgroups or other black magic involved here.
Too bad, static task placement would fix your issue and domain isolation
would improve your workload.
Thanks.
--
Frederic Weisbecker
SUSE Labs