public inbox for linux-kernel@vger.kernel.org
* [patch 0/2] sched/idle: Prevent pointless NOHZ transitions in default_idle_call()
@ 2026-03-01 19:30 Thomas Gleixner
  2026-03-01 19:30 ` [patch 1/2] sched/idle: Make default_idle_call() static Thomas Gleixner
  2026-03-01 19:30 ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Thomas Gleixner
  0 siblings, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-03-01 19:30 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Rafael J. Wysocki, Frederic Weisbecker,
	Christian Loehle

default_idle_call() is used when cpuidle is not available. That's the case
on most virtual machines.

It unconditionally tries to transition to NOHZ idle mode on every
invocation, which allows the hypervisor to go into long idle sleeps.

But that's counterproductive on a loaded system where CPUs go briefly idle
for a couple of microseconds. It requires reprogramming the clock event
device twice: once on entry and again when leaving idle a few microseconds
later. That's especially hurtful for VMs as programming the clock event
device implies a VM exit.

See also the related discussion here:

   https://lore.kernel.org/875x7mv8wd.ffs@tglx

Cure this by implementing a moving average tracking idle time in
default_idle_call() and only stop the tick when the resulting average idle
time is larger than a tick.
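
The filter described above can be sketched in userspace C. This is a minimal
illustration, not the kernel code: struct idle_avg, idle_avg_record(),
idle_avg_should_stop_tick() and the 1 ms TICK_NS value are hypothetical
stand-ins chosen for the example. Only the four-entry ring and the
"sum larger than tick times entries" comparison are taken from patch 2/2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; the values are illustrative only. */
#define ENTRIES	4u			/* mirrors IDLE_DUR_ENTRIES */
#define MASK	(ENTRIES - 1u)		/* valid because ENTRIES is a power of two */
#define TICK_NS	1000000ull		/* assumes a 1 ms tick (HZ=1000) */

struct idle_avg {
	uint64_t	duration[ENTRIES];	/* last ENTRIES idle durations */
	uint64_t	sum;			/* running sum of duration[] */
	unsigned int	idx;			/* slot to overwrite next */
};

/* Replace the oldest sample with delta_ns and keep the sum in step. */
static void idle_avg_record(struct idle_avg *a, uint64_t delta_ns)
{
	a->sum += delta_ns - a->duration[a->idx];
	a->duration[a->idx] = delta_ns;
	a->idx = (a->idx + 1u) & MASK;
}

/* True when the average exceeds one tick, i.e. sum / ENTRIES > TICK_NS. */
static bool idle_avg_should_stop_tick(const struct idle_avg *a)
{
	return a->sum > TICK_NS * ENTRIES;
}
```

Comparing the sum against the tick length multiplied by the number of entries
avoids a division on every idle entry, and the power-of-two ring size lets
the index wrap with a mask.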

The series applies on v7.0-rc1.

Thanks,

	tglx
---
 include/linux/cpuidle.h |    1 
 kernel/sched/idle.c     |   65 ++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 57 insertions(+), 9 deletions(-)



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 1/2] sched/idle: Make default_idle_call() static
  2026-03-01 19:30 [patch 0/2] sched/idle: Prevent pointless NOHZ transitions in default_idle_call() Thomas Gleixner
@ 2026-03-01 19:30 ` Thomas Gleixner
  2026-03-01 19:30 ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Thomas Gleixner
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-03-01 19:30 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Rafael J. Wysocki, Frederic Weisbecker,
	Christian Loehle

Nothing outside of idle.c uses it.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/cpuidle.h |    1 -
 kernel/sched/idle.c     |    2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -267,7 +267,6 @@ static inline void cpuidle_use_deepest_s
 
 /* kernel/sched/idle.c */
 extern void sched_idle_set_state(struct cpuidle_state *idle_state);
-extern void default_idle_call(void);
 
 #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
 void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -110,7 +110,7 @@ static inline void cond_tick_broadcast_e
  *
  * To use when the cpuidle framework cannot be used.
  */
-void __cpuidle default_idle_call(void)
+static void __cpuidle default_idle_call(void)
 {
 	instrumentation_begin();
 	if (!current_clr_polling_and_test()) {



* [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-01 19:30 [patch 0/2] sched/idle: Prevent pointless NOHZ transitions in default_idle_call() Thomas Gleixner
  2026-03-01 19:30 ` [patch 1/2] sched/idle: Make default_idle_call() static Thomas Gleixner
@ 2026-03-01 19:30 ` Thomas Gleixner
  2026-03-02  6:05   ` K Prateek Nayak
                     ` (3 more replies)
  1 sibling, 4 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-03-01 19:30 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Rafael J. Wysocki, Frederic Weisbecker,
	Christian Loehle

Guests fall back to default_idle_call() as there is no cpuidle driver
available to them by default. That causes a problem in fully loaded
scenarios where CPUs go briefly idle for a couple of microseconds:

tick_nohz_idle_stop_tick() is invoked unconditionally, which means that
unless there is a timer pending in the next tick, the tick is stopped and
then restarted a couple of microseconds later when the idle condition goes
away. That requires programming the clockevent device twice, which implies
a VM exit for each reprogramming.

It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
the default idle code, but that would be counterproductive. It would not
allow the host to go into deeper idle states when the guest CPU is fully
idle, as the guest has to maintain the periodic tick.

Cure this by implementing a trivial moving average filter which keeps track
of the recent idle residency time and only stops the tick when the average
is larger than a tick.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/sched/idle.c |   65 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 8 deletions(-)

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
 static inline void cond_tick_broadcast_exit(void) { }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
 
-/**
- * default_idle_call - Default CPU idle routine.
- *
- * To use when the cpuidle framework cannot be used.
- */
-static void __cpuidle default_idle_call(void)
+static void __cpuidle __default_idle_call(void)
 {
 	instrumentation_begin();
 	if (!current_clr_polling_and_test()) {
@@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
 	instrumentation_end();
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+
+/* Limit to 4 entries so it fits in a cache line */
+#define IDLE_DUR_ENTRIES	4
+#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)
+
+struct idle_nohz_data {
+	u64		duration[IDLE_DUR_ENTRIES];
+	u64		entry_time;
+	u64		sum;
+	unsigned int	idx;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
+
+/**
+ * default_idle_call - Default CPU idle routine.
+ *
+ * To use when the cpuidle framework cannot be used.
+ */
+static void default_idle_call(void)
+{
+	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
+	unsigned int idx = nd->idx;
+	s64 delta;
+
+	/*
+	 * If the CPU spends more than a tick on average in idle, try to stop
+	 * the tick.
+	 */
+	if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
+		tick_nohz_idle_stop_tick();
+
+	__default_idle_call();
+
+	/*
+	 * Build a moving average of the time spent in idle to prevent stopping
+	 * the tick on a loaded system which only goes idle briefly.
+	 */
+	delta = max(sched_clock() - nd->entry_time, 0);
+	nd->sum += delta - nd->duration[idx];
+	nd->duration[idx] = delta;
+	nd->idx = (idx + 1) & IDLE_DUR_MASK;
+}
+
+static void default_idle_enter(void)
+{
+	this_cpu_write(nohz_data.entry_time, sched_clock());
+}
+
+#else  /* CONFIG_NO_HZ_COMMON */
+static inline void default_idle_call(void { __default_idle_call(); }
+static inline void default_idle_enter(void) { }
+#endif /* !CONFIG_NO_HZ_COMMON */
+
 static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
 			       struct cpuidle_device *dev,
 			       u64 max_latency_ns)
@@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
 	}
 
 	if (cpuidle_not_available(drv, dev)) {
-		tick_nohz_idle_stop_tick();
-
 		default_idle_call();
 		goto exit_idle;
 	}
@@ -276,6 +324,7 @@ static void do_idle(void)
 
 	__current_set_polling();
 	tick_nohz_idle_enter();
+	default_idle_enter();
 
 	while (!need_resched()) {
 



* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-01 19:30 ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Thomas Gleixner
@ 2026-03-02  6:05   ` K Prateek Nayak
  2026-03-02 10:43   ` Frederic Weisbecker
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: K Prateek Nayak @ 2026-03-02  6:05 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Rafael J. Wysocki, Frederic Weisbecker,
	Christian Loehle

Hello Thomas,

On 3/2/2026 1:00 AM, Thomas Gleixner wrote:
> +static void default_idle_enter(void)
> +{
> +	this_cpu_write(nohz_data.entry_time, sched_clock());
> +}
> +
> +#else  /* CONFIG_NO_HZ_COMMON */
> +static inline void default_idle_call(void { __default_idle_call(); }

                                            ^
s/void/void)/ to add a closing bracket here for !CONFIG_NO_HZ_COMMON.

> +static inline void default_idle_enter(void) { }
> +#endif /* !CONFIG_NO_HZ_COMMON */
> +
>  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
>  			       struct cpuidle_device *dev,
>  			       u64 max_latency_ns)
> @@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
>  	}
>  
>  	if (cpuidle_not_available(drv, dev)) {
> -		tick_nohz_idle_stop_tick();
> -
>  		default_idle_call();
>  		goto exit_idle;
>  	}
> @@ -276,6 +324,7 @@ static void do_idle(void)
>  
>  	__current_set_polling();
>  	tick_nohz_idle_enter();
> +	default_idle_enter();

Can we defer this until after do_idle() disables IRQs, provided that is
semantically the same? That would avoid a per-CPU write when we exit
immediately. As a bonus, we can then use __this_cpu_write().

>  
>  	while (!need_resched()) {
>  
> 
> 

-- 
Thanks and Regards,
Prateek



* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-01 19:30 ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Thomas Gleixner
  2026-03-02  6:05   ` K Prateek Nayak
@ 2026-03-02 10:43   ` Frederic Weisbecker
  2026-03-02 11:03     ` Christian Loehle
  2026-03-02 11:03   ` Christian Loehle
  2026-03-02 12:17   ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Peter Zijlstra
  3 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2026-03-02 10:43 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, Peter Zijlstra, Rafael J. Wysocki, Christian Loehle

On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
> Guests fall back to default_idle_call() as there is no cpuidle driver
> available to them by default. That causes a problem in fully loaded
> scenarios where CPUs go briefly idle for a couple of microseconds:
> 
> tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> there is timer pending in the next tick, the tick is stopped and a couple
> of microseconds later when the idle condition goes away restarted. That
> requires to program the clockevent device twice which implies a VM exit for
> each reprogramming.
> 
> It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> the default idle code, but would be counterproductive. It would not allow
> the host to go into deeper idle states when the guest CPU is fully idle as
> it has to maintain the periodic tick.
> 
> Cure this by implementing a trivial moving average filter which keeps track
> of the recent idle residency time and only stop the tick when the average
> is larger than a tick.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>

Shouldn't there be instead a new dedicated cpuidle driver with proper governor support?

Thanks.


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 10:43   ` Frederic Weisbecker
@ 2026-03-02 11:03     ` Christian Loehle
  2026-03-02 11:11       ` Frederic Weisbecker
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2026-03-02 11:03 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Rafael J. Wysocki

On 3/2/26 10:43, Frederic Weisbecker wrote:
> On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
>> Guests fall back to default_idle_call() as there is no cpuidle driver
>> available to them by default. That causes a problem in fully loaded
>> scenarios where CPUs go briefly idle for a couple of microseconds:
>>
>> tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
>> there is timer pending in the next tick, the tick is stopped and a couple
>> of microseconds later when the idle condition goes away restarted. That
>> requires to program the clockevent device twice which implies a VM exit for
>> each reprogramming.
>>
>> It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
>> the default idle code, but would be counterproductive. It would not allow
>> the host to go into deeper idle states when the guest CPU is fully idle as
>> it has to maintain the periodic tick.
>>
>> Cure this by implementing a trivial moving average filter which keeps track
>> of the recent idle residency time and only stop the tick when the average
>> is larger than a tick.
>>
>> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> 
> Shouldn't there be instead a new dedicated cpuidle driver with proper governor support?

I think a dummy cpuidle driver is an option, but calling into any governor
seems overkill IMO. It presents an option to the user where there really is
none (after all, the cpuidle governor would just make a boolean decision as
there are no states).


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-01 19:30 ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Thomas Gleixner
  2026-03-02  6:05   ` K Prateek Nayak
  2026-03-02 10:43   ` Frederic Weisbecker
@ 2026-03-02 11:03   ` Christian Loehle
  2026-03-02 21:25     ` Rafael J. Wysocki
  2026-03-02 12:17   ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Peter Zijlstra
  3 siblings, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2026-03-02 11:03 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Rafael J. Wysocki, Frederic Weisbecker

On 3/1/26 19:30, Thomas Gleixner wrote:
> Guests fall back to default_idle_call() as there is no cpuidle driver
> available to them by default. That causes a problem in fully loaded
> scenarios where CPUs go briefly idle for a couple of microseconds:
> 
> tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> there is timer pending in the next tick, the tick is stopped and a couple
> of microseconds later when the idle condition goes away restarted. That
> requires to program the clockevent device twice which implies a VM exit for
> each reprogramming.
> 
> It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> the default idle code, but would be counterproductive. It would not allow
> the host to go into deeper idle states when the guest CPU is fully idle as
> it has to maintain the periodic tick.
> 
> Cure this by implementing a trivial moving average filter which keeps track
> of the recent idle residency time and only stop the tick when the average
> is larger than a tick.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> ---
>  kernel/sched/idle.c |   65 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 57 insertions(+), 8 deletions(-)
> 
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
>  static inline void cond_tick_broadcast_exit(void) { }
>  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
>  
> -/**
> - * default_idle_call - Default CPU idle routine.
> - *
> - * To use when the cpuidle framework cannot be used.
> - */
> -static void __cpuidle default_idle_call(void)
> +static void __cpuidle __default_idle_call(void)
>  {
>  	instrumentation_begin();
>  	if (!current_clr_polling_and_test()) {
> @@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
>  	instrumentation_end();
>  }
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> +
> +/* Limit to 4 entries so it fits in a cache line */
> +#define IDLE_DUR_ENTRIES	4
> +#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)
> +
> +struct idle_nohz_data {
> +	u64		duration[IDLE_DUR_ENTRIES];
> +	u64		entry_time;
> +	u64		sum;
> +	unsigned int	idx;
> +};
> +
> +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> +
> +/**
> + * default_idle_call - Default CPU idle routine.
> + *
> + * To use when the cpuidle framework cannot be used.
> + */
> +static void default_idle_call(void)
> +{
> +	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> +	unsigned int idx = nd->idx;
> +	s64 delta;
> +
> +	/*
> +	 * If the CPU spends more than a tick on average in idle, try to stop
> +	 * the tick.
> +	 */
> +	if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
> +		tick_nohz_idle_stop_tick();
> +
> +	__default_idle_call();
> +
> +	/*
> +	 * Build a moving average of the time spent in idle to prevent stopping
> +	 * the tick on a loaded system which only goes idle briefly.
> +	 */
> +	delta = max(sched_clock() - nd->entry_time, 0);
> +	nd->sum += delta - nd->duration[idx];
> +	nd->duration[idx] = delta;
> +	nd->idx = (idx + 1) & IDLE_DUR_MASK;
> +}
> +
> +static void default_idle_enter(void)
> +{
> +	this_cpu_write(nohz_data.entry_time, sched_clock());
> +}
> +
> +#else  /* CONFIG_NO_HZ_COMMON */
> +static inline void default_idle_call(void { __default_idle_call(); }
> +static inline void default_idle_enter(void) { }
> +#endif /* !CONFIG_NO_HZ_COMMON */
> +
>  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
>  			       struct cpuidle_device *dev,
>  			       u64 max_latency_ns)
> @@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
>  	}
>  
>  	if (cpuidle_not_available(drv, dev)) {
> -		tick_nohz_idle_stop_tick();
> -
>  		default_idle_call();
>  		goto exit_idle;
>  	}
> @@ -276,6 +324,7 @@ static void do_idle(void)
>  
>  	__current_set_polling();
>  	tick_nohz_idle_enter();
> +	default_idle_enter();
>  
>  	while (!need_resched()) {
>  
> 

How does this work? We don't stop the tick until the average idle time is larger,
but if we don't stop the tick how is that possible?

Why don't we just require one or two consecutive tick wakeups before stopping?
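
One possible reading of the quoted code, offered as an assumption rather than
an answer given in this message: default_idle_enter() records entry_time once
in do_idle() before the need_resched() loop, while default_idle_call()
measures every delta against that same entry_time. If that reading holds,
consecutive tick wakeups inside one idle period yield growing deltas (one
tick, two ticks, three ticks), so the sum can cross the threshold while the
tick is still running. The sketch below simulates this in userspace C.
tick_wakeups_before_stop() is a hypothetical helper, and it assumes an
initially empty history, idle entry exactly on a tick boundary and no
wakeups other than the tick.

```c
#include <stdint.h>

#define ENTRIES	4u		/* mirrors IDLE_DUR_ENTRIES */
#define MASK	(ENTRIES - 1u)

/*
 * Count the tick wakeups that pass before the stop-tick condition
 * (sum > tick_ns * ENTRIES) becomes true, with every delta measured
 * from a single entry time at t = 0. Returns 0 for a zero tick length.
 */
static unsigned int tick_wakeups_before_stop(uint64_t tick_ns)
{
	uint64_t duration[ENTRIES] = { 0 };
	const uint64_t entry_time = 0;
	uint64_t sum = 0, now = 0;
	unsigned int idx = 0, wakeups = 0;

	if (tick_ns == 0)
		return 0;

	while (!(sum > tick_ns * ENTRIES)) {
		uint64_t delta;

		now += tick_ns;			/* the periodic tick ends this idle call */
		wakeups++;

		delta = now - entry_time;	/* entry_time is not refreshed per call */
		sum += delta - duration[idx];
		duration[idx] = delta;
		idx = (idx + 1u) & MASK;
	}
	return wakeups;
}
```

Under these assumptions the sum runs through 1, 3 and 6 tick lengths, so the
condition first holds after three tick wakeups and the fourth idle call in
the same idle period would try to stop the tick.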


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 11:03     ` Christian Loehle
@ 2026-03-02 11:11       ` Frederic Weisbecker
  2026-03-02 11:39         ` Christian Loehle
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2026-03-02 11:11 UTC (permalink / raw)
  To: Christian Loehle; +Cc: Thomas Gleixner, LKML, Peter Zijlstra, Rafael J. Wysocki

On Mon, Mar 02, 2026 at 11:03:00AM +0000, Christian Loehle wrote:
> On 3/2/26 10:43, Frederic Weisbecker wrote:
> > On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
> >> Guests fall back to default_idle_call() as there is no cpuidle driver
> >> available to them by default. That causes a problem in fully loaded
> >> scenarios where CPUs go briefly idle for a couple of microseconds:
> >>
> >> tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> >> there is timer pending in the next tick, the tick is stopped and a couple
> >> of microseconds later when the idle condition goes away restarted. That
> >> requires to program the clockevent device twice which implies a VM exit for
> >> each reprogramming.
> >>
> >> It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> >> the default idle code, but would be counterproductive. It would not allow
> >> the host to go into deeper idle states when the guest CPU is fully idle as
> >> it has to maintain the periodic tick.
> >>
> >> Cure this by implementing a trivial moving average filter which keeps track
> > >> of the recent idle residency time and only stop the tick when the average
> >> is larger than a tick.
> >>
> >> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > 
> > Shouldn't there be instead a new dedicated cpuidle driver with proper governor support?
> 
> I think a dummy cpuidle driver is an option, but calling into any governor
> seems overkill IMO, it presents an option to the user where there really is
> none (after all the cpuidle governor would just make a boolean decision as
> there are no states).

I must confess I don't fully understand the picture with the non-existent states,
but what Thomas is doing in his patch is basically an ad-hoc implementation of
the cpuidle governor's decision whether or not to stop the tick.

Thanks.


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 11:11       ` Frederic Weisbecker
@ 2026-03-02 11:39         ` Christian Loehle
  2026-03-04  3:35           ` Qais Yousef
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2026-03-02 11:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Rafael J. Wysocki

On 3/2/26 11:11, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2026 at 11:03:00AM +0000, Christian Loehle wrote:
>> On 3/2/26 10:43, Frederic Weisbecker wrote:
>>> On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
>>>> Guests fall back to default_idle_call() as there is no cpuidle driver
>>>> available to them by default. That causes a problem in fully loaded
>>>> scenarios where CPUs go briefly idle for a couple of microseconds:
>>>>
>>>> tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
>>>> there is timer pending in the next tick, the tick is stopped and a couple
>>>> of microseconds later when the idle condition goes away restarted. That
>>>> requires to program the clockevent device twice which implies a VM exit for
>>>> each reprogramming.
>>>>
>>>> It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
>>>> the default idle code, but would be counterproductive. It would not allow
>>>> the host to go into deeper idle states when the guest CPU is fully idle as
>>>> it has to maintain the periodic tick.
>>>>
>>>> Cure this by implementing a trivial moving average filter which keeps track
>>>> of the recent idle residency time and only stop the tick when the average
>>>> is larger than a tick.
>>>>
>>>> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
>>>
>>> Shouldn't there be instead a new dedicated cpuidle driver with proper governor support?
>>
>> I think a dummy cpuidle driver is an option, but calling into any governor
>> seems overkill IMO, it presents an option to the user where there really is
>> none (after all the cpuidle governor would just make a boolean decision as
>> there are no states).
> 
> I must confess I don't fully understand the picture with the non-existent states
> but what Thomas is doing in his patch is basically an ad-hoc implementation of
> cpuidle governor decision whether or not to stop the tick.
> 

Yup and if we put that into the cpuidle governor then we have to duplicate
that logic for all governors even though for <= 1 states they hopefully
should be the same.

A dummy driver would allow for this logic to live in drivers/cpuidle/ but
I don't have a preference either way.


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-01 19:30 ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Thomas Gleixner
                     ` (2 preceding siblings ...)
  2026-03-02 11:03   ` Christian Loehle
@ 2026-03-02 12:17   ` Peter Zijlstra
  2026-03-02 12:19     ` Peter Zijlstra
  3 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-03-02 12:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Rafael J. Wysocki, Frederic Weisbecker, Christian Loehle

On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
> Guests fall back to default_idle_call() as there is no cpuidle driver
> available to them by default. That causes a problem in fully loaded
> scenarios where CPUs go briefly idle for a couple of microseconds:
> 
> tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> there is timer pending in the next tick, the tick is stopped and a couple
> of microseconds later when the idle condition goes away restarted. That
> requires to program the clockevent device twice which implies a VM exit for
> each reprogramming.
> 
> It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> the default idle code, but would be counterproductive. It would not allow
> the host to go into deeper idle states when the guest CPU is fully idle as
> it has to maintain the periodic tick.
> 
> Cure this by implementing a trivial moving average filter which keeps track
> of the recent idle residency time and only stop the tick when the average
> is larger than a tick.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>

How about this? There is no reason not to also pass this into the idle
governors. This way it becomes the common minimum functionality. A governor
can override it, but it had better have a good reason.

---
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -6,10 +6,12 @@
  * (NOTE: these are not related to SCHED_IDLE batch scheduled
  *        tasks which are handled in sched/fair.c )
  */
+#include <linux/sched/clock.h>
 #include <linux/cpuidle.h>
 #include <linux/suspend.h>
 #include <linux/livepatch.h>
 #include "sched.h"
+#include "pelt.h"
 #include "smp.h"
 
 /* Linker adds these: start and end of __cpuidle functions */
@@ -105,12 +107,7 @@ static inline void cond_tick_broadcast_e
 static inline void cond_tick_broadcast_exit(void) { }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
 
-/**
- * default_idle_call - Default CPU idle routine.
- *
- * To use when the cpuidle framework cannot be used.
- */
-static void __cpuidle default_idle_call(void)
+static void __cpuidle __default_idle_call(void)
 {
 	instrumentation_begin();
 	if (!current_clr_polling_and_test()) {
@@ -130,6 +127,63 @@ static void __cpuidle default_idle_call(
 	instrumentation_end();
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+
+/* Limit to 4 entries so it fits in a cache line */
+#define IDLE_DUR_ENTRIES	4
+#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)
+
+struct idle_nohz_data {
+	u64		duration[IDLE_DUR_ENTRIES];
+	u64		entry_time;
+	u64		sum;
+	unsigned int	idx;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
+
+static void default_idle_enter(void)
+{
+	this_cpu_write(nohz_data.entry_time, sched_clock());
+}
+
+static inline bool default_stop_tick(void)
+{
+	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
+	return nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES;
+}
+
+static void default_reflect(void)
+{
+	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
+	unsigned int idx = nd->idx;
+	s64 delta;
+
+	/*
+	 * Build a moving average of the time spent in idle to prevent stopping
+	 * the tick on a loaded system which only goes idle briefly.
+	 */
+	delta = max(sched_clock() - nd->entry_time, 0);
+	nd->sum += delta - nd->duration[idx];
+	nd->duration[idx] = delta;
+	nd->idx = (idx + 1) & IDLE_DUR_MASK;
+}
+#else  /* CONFIG_NO_HZ_COMMON */
+static inline void default_idle_enter(void) { }
+static inline bool default_stop_tick(void) { return false; }
+static inline void default_reflect(void) { }
+#endif /* !CONFIG_NO_HZ_COMMON */
+
+static inline void default_idle_call(void)
+{
+	if (default_stop_tick())
+		tick_nohz_idle_stop_tick();
+
+	__default_idle_call();
+
+	default_reflect();
+}
+
 static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
 			       struct cpuidle_device *dev,
 			       u64 max_latency_ns)
@@ -186,8 +240,6 @@ static void cpuidle_idle_call(void)
 	}
 
 	if (cpuidle_not_available(drv, dev)) {
-		tick_nohz_idle_stop_tick();
-
 		default_idle_call();
 		goto exit_idle;
 	}
@@ -222,7 +274,7 @@ static void cpuidle_idle_call(void)
 		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
 		call_cpuidle(drv, dev, next_state);
 	} else {
-		bool stop_tick = true;
+		bool stop_tick = default_stop_tick();
 
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
@@ -238,6 +290,7 @@ static void cpuidle_idle_call(void)
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
 		 */
+		default_reflect();
 		cpuidle_reflect(dev, entered_state);
 	}
 
@@ -276,6 +329,7 @@ static void do_idle(void)
 
 	__current_set_polling();
 	tick_nohz_idle_enter();
+	default_idle_enter();
 
 	while (!need_resched()) {
 


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 12:17   ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Peter Zijlstra
@ 2026-03-02 12:19     ` Peter Zijlstra
  2026-03-02 21:23       ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-03-02 12:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Rafael J. Wysocki, Frederic Weisbecker, Christian Loehle

On Mon, Mar 02, 2026 at 01:17:55PM +0100, Peter Zijlstra wrote:
> On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
> > Guests fall back to default_idle_call() as there is no cpuidle driver
> > available to them by default. That causes a problem in fully loaded
> > scenarios where CPUs go briefly idle for a couple of microseconds:
> > 
> > tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> > there is timer pending in the next tick, the tick is stopped and a couple
> > of microseconds later when the idle condition goes away restarted. That
> > requires to program the clockevent device twice which implies a VM exit for
> > each reprogramming.
> > 
> > It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> > the default idle code, but would be counterproductive. It would not allow
> > the host to go into deeper idle states when the guest CPU is fully idle as
> > it has to maintain the periodic tick.
> > 
> > Cure this by implementing a trivial moving average filter which keeps track
> > of the recent idle residency time and only stop the tick when the average
> > is larger than a tick.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> 
> How about so? No reason to not also pass this into the idle governors.
> This way it becomes a common least functionality. Governor can override,
> but it had better have a good reason.
> 
> ---
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -6,10 +6,12 @@
>   * (NOTE: these are not related to SCHED_IDLE batch scheduled
>   *        tasks which are handled in sched/fair.c )
>   */
> +#include <linux/sched/clock.h>
>  #include <linux/cpuidle.h>
>  #include <linux/suspend.h>
>  #include <linux/livepatch.h>
>  #include "sched.h"
> +#include "pelt.h"
>  #include "smp.h"
>  
>  /* Linker adds these: start and end of __cpuidle functions */
> @@ -105,12 +107,7 @@ static inline void cond_tick_broadcast_e
>  static inline void cond_tick_broadcast_exit(void) { }
>  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
>  
> -/**
> - * default_idle_call - Default CPU idle routine.
> - *
> - * To use when the cpuidle framework cannot be used.
> - */
> -static void __cpuidle default_idle_call(void)
> +static void __cpuidle __default_idle_call(void)
>  {
>  	instrumentation_begin();
>  	if (!current_clr_polling_and_test()) {
> @@ -130,6 +127,63 @@ static void __cpuidle default_idle_call(
>  	instrumentation_end();
>  }
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> +
> +/* Limit to 4 entries so it fits in a cache line */
> +#define IDLE_DUR_ENTRIES	4
> +#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)
> +
> +struct idle_nohz_data {
> +	u64		duration[IDLE_DUR_ENTRIES];
> +	u64		entry_time;
> +	u64		sum;
> +	unsigned int	idx;
> +};
> +
> +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> +
> +static void default_idle_enter(void)
> +{
> +	this_cpu_write(nohz_data.entry_time, sched_clock());
> +}
> +
> +static inline bool default_stop_tick(void)
> +{
> +	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> +	return nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES;
> +}
> +
> +static void default_reflect(void)
> +{
> +	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> +	unsigned int idx = nd->idx;
> +	s64 delta;
> +
> +	/*
> +	 * Build a moving average of the time spent in idle to prevent stopping
> +	 * the tick on a loaded system which only goes idle briefly.
> +	 */
> +	delta = max(sched_clock() - nd->entry_time, 0);
> +	nd->sum += delta - nd->duration[idx];
> +	nd->duration[idx] = delta;
> +	nd->idx = (idx + 1) & IDLE_DUR_MASK;
> +}
> +#else  /* CONFIG_NO_HZ_COMMON */
> +static inline void default_idle_enter(void) { }
> +static inline bool default_stop_tick(void) { return false; }
> +static inline void default_reflect(void) { }
> +#endif /* !CONFIG_NO_HZ_COMMON */
> +
> +static inline void default_idle_call(void)
> +{
> +	if (default_stop_tick())
> +		tick_nohz_idle_stop_tick();
> +
> +	__default_idle_call();
> +
> +	default_reflect();
> +}
> +
>  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
>  			       struct cpuidle_device *dev,
>  			       u64 max_latency_ns)
> @@ -186,8 +240,6 @@ static void cpuidle_idle_call(void)
>  	}
>  
>  	if (cpuidle_not_available(drv, dev)) {
> -		tick_nohz_idle_stop_tick();
> -
>  		default_idle_call();
>  		goto exit_idle;
>  	}
> @@ -222,7 +274,7 @@ static void cpuidle_idle_call(void)
>  		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
>  		call_cpuidle(drv, dev, next_state);
>  	} else {
> -		bool stop_tick = true;
> +		bool stop_tick = default_stop_tick();
>  
>  		/*
>  		 * Ask the cpuidle framework to choose a convenient idle state.
> @@ -238,6 +290,7 @@ static void cpuidle_idle_call(void)
>  		/*
>  		 * Give the governor an opportunity to reflect on the outcome
>  		 */
> +		default_reflect();
>  		cpuidle_reflect(dev, entered_state);
>  	}
>  
> @@ -276,6 +329,7 @@ static void do_idle(void)
>  
>  	__current_set_polling();
>  	tick_nohz_idle_enter();
> +	default_idle_enter();
>  
>  	while (!need_resched()) {
>  

Damn, lost hunk:


diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 65fbb8e807b9..c7876e9e024f 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct cpuidle_device *dev,
 int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		   bool *stop_tick)
 {
-	/*
-	 * If there is only a single idle state (or none), there is nothing
-	 * meaningful for the governor to choose. Skip the governor and
-	 * always use state 0 with the tick running.
-	 */
-	if (drv->state_count <= 1) {
-		*stop_tick = false;
-		return 0;
-	}
-
 	return cpuidle_curr_governor->select(drv, dev, stop_tick);
 }
 

Also, I suppose menu wants this?

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 899ff16ff1fe..a75fe1fca65d 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -290,7 +290,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		 * it right away and keep the tick running if state[0] is a
 		 * polling one.
 		 */
-		*stop_tick = !(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
+		if (drv->states[0].flags & CPUIDLE_FLAG_POLLING)
+			*stop_tick = false;
 		return 0;
 	}
 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 12:19     ` Peter Zijlstra
@ 2026-03-02 21:23       ` Rafael J. Wysocki
  0 siblings, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-02 21:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Rafael J. Wysocki, Frederic Weisbecker,
	Christian Loehle

On Mon, Mar 2, 2026 at 1:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 02, 2026 at 01:17:55PM +0100, Peter Zijlstra wrote:
> > On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
> > > Guests fall back to default_idle_call() as there is no cpuidle driver
> > > available to them by default. That causes a problem in fully loaded
> > > scenarios where CPUs go briefly idle for a couple of microseconds:
> > >
> > > tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> > > there is timer pending in the next tick, the tick is stopped and a couple
> > > of microseconds later when the idle condition goes away restarted. That
> > > requires to program the clockevent device twice which implies a VM exit for
> > > each reprogramming.
> > >
> > > It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> > > the default idle code, but would be counterproductive. It would not allow
> > > the host to go into deeper idle states when the guest CPU is fully idle as
> > > it has to maintain the periodic tick.
> > >
> > > Cure this by implementing a trivial moving average filter which keeps track
> > > of the recent idle residency time and only stop the tick when the average
> > > is larger than a tick.
> > >
> > > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> >
> > How about so? No reason to not also pass this into the idle governors.
> > This way it becomes a common least functionality. Governor can override,
> > but it had better have a good reason.

First, I hope you have seen the responses from Christian and Frederic.

Second, in the cases when the governor decides, it is better to leave
it to decide or somebody somewhere will complain even if you are
absolutely convinced that you can do better.

> > ---
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -6,10 +6,12 @@
> >   * (NOTE: these are not related to SCHED_IDLE batch scheduled
> >   *        tasks which are handled in sched/fair.c )
> >   */
> > +#include <linux/sched/clock.h>
> >  #include <linux/cpuidle.h>
> >  #include <linux/suspend.h>
> >  #include <linux/livepatch.h>
> >  #include "sched.h"
> > +#include "pelt.h"
> >  #include "smp.h"
> >
> >  /* Linker adds these: start and end of __cpuidle functions */
> > @@ -105,12 +107,7 @@ static inline void cond_tick_broadcast_e
> >  static inline void cond_tick_broadcast_exit(void) { }
> >  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
> >
> > -/**
> > - * default_idle_call - Default CPU idle routine.
> > - *
> > - * To use when the cpuidle framework cannot be used.
> > - */
> > -static void __cpuidle default_idle_call(void)
> > +static void __cpuidle __default_idle_call(void)
> >  {
> >       instrumentation_begin();
> >       if (!current_clr_polling_and_test()) {
> > @@ -130,6 +127,63 @@ static void __cpuidle default_idle_call(
> >       instrumentation_end();
> >  }
> >
> > +#ifdef CONFIG_NO_HZ_COMMON
> > +
> > +/* Limit to 4 entries so it fits in a cache line */
> > +#define IDLE_DUR_ENTRIES     4
> > +#define IDLE_DUR_MASK                (IDLE_DUR_ENTRIES - 1)
> > +
> > +struct idle_nohz_data {
> > +     u64             duration[IDLE_DUR_ENTRIES];
> > +     u64             entry_time;
> > +     u64             sum;
> > +     unsigned int    idx;
> > +};
> > +
> > +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> > +
> > +static void default_idle_enter(void)
> > +{
> > +     this_cpu_write(nohz_data.entry_time, sched_clock());
> > +}
> > +
> > +static inline bool default_stop_tick(void)
> > +{
> > +     struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> > +     return nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES;
> > +}
> > +
> > +static void default_reflect(void)
> > +{
> > +     struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> > +     unsigned int idx = nd->idx;
> > +     s64 delta;
> > +
> > +     /*
> > +      * Build a moving average of the time spent in idle to prevent stopping
> > +      * the tick on a loaded system which only goes idle briefly.
> > +      */
> > +     delta = max(sched_clock() - nd->entry_time, 0);
> > +     nd->sum += delta - nd->duration[idx];
> > +     nd->duration[idx] = delta;
> > +     nd->idx = (idx + 1) & IDLE_DUR_MASK;
> > +}

So I'd prefer to do something even simpler as suggested by Christian.

> > +#else  /* CONFIG_NO_HZ_COMMON */
> > +static inline void default_idle_enter(void) { }
> > +static inline bool default_stop_tick(void) { return false; }
> > +static inline void default_reflect(void) { }
> > +#endif /* !CONFIG_NO_HZ_COMMON */
> > +
> > +static inline void default_idle_call(void)
> > +{
> > +     if (default_stop_tick())
> > +             tick_nohz_idle_stop_tick();
> > +
> > +     __default_idle_call();
> > +
> > +     default_reflect();
> > +}
> > +
> >  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
> >                              struct cpuidle_device *dev,
> >                              u64 max_latency_ns)
> > @@ -186,8 +240,6 @@ static void cpuidle_idle_call(void)
> >       }
> >
> >       if (cpuidle_not_available(drv, dev)) {
> > -             tick_nohz_idle_stop_tick();
> > -
> >               default_idle_call();
> >               goto exit_idle;
> >       }
> > @@ -222,7 +274,7 @@ static void cpuidle_idle_call(void)
> >               next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> >               call_cpuidle(drv, dev, next_state);
> >       } else {
> > -             bool stop_tick = true;
> > +             bool stop_tick = default_stop_tick();
> >
> >               /*
> >                * Ask the cpuidle framework to choose a convenient idle state.
> > @@ -238,6 +290,7 @@ static void cpuidle_idle_call(void)
> >               /*
> >                * Give the governor an opportunity to reflect on the outcome
> >                */
> > +             default_reflect();
> >               cpuidle_reflect(dev, entered_state);
> >       }
> >
> > @@ -276,6 +329,7 @@ static void do_idle(void)
> >
> >       __current_set_polling();
> >       tick_nohz_idle_enter();
> > +     default_idle_enter();
> >
> >       while (!need_resched()) {
> >
>
> Damn, lost hunk:
>
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 65fbb8e807b9..c7876e9e024f 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct cpuidle_device *dev,
>  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                    bool *stop_tick)
>  {
> -       /*
> -        * If there is only a single idle state (or none), there is nothing
> -        * meaningful for the governor to choose. Skip the governor and
> -        * always use state 0 with the tick running.
> -        */
> -       if (drv->state_count <= 1) {
> -               *stop_tick = false;
> -               return 0;
> -       }
> -
>         return cpuidle_curr_governor->select(drv, dev, stop_tick);
>  }
>
>
> Also, I suppose menu wants this?

Well, as I said, I think it's better to leave the governors alone at
this point and maybe update them later in a separate patch (or
patches) that can be reverted individually if need be.

> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 899ff16ff1fe..a75fe1fca65d 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -290,7 +290,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                  * it right away and keep the tick running if state[0] is a
>                  * polling one.
>                  */
> -               *stop_tick = !(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
> +               if (drv->states[0].flags & CPUIDLE_FLAG_POLLING)
> +                       *stop_tick = false;
>                 return 0;
>         }
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 11:03   ` Christian Loehle
@ 2026-03-02 21:25     ` Rafael J. Wysocki
  2026-03-04  3:03       ` Qais Yousef
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-02 21:25 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Rafael J. Wysocki,
	Frederic Weisbecker

On Mon, Mar 2, 2026 at 12:04 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 3/1/26 19:30, Thomas Gleixner wrote:
> > Guests fall back to default_idle_call() as there is no cpuidle driver
> > available to them by default. That causes a problem in fully loaded
> > scenarios where CPUs go briefly idle for a couple of microseconds:
> >
> > tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> > there is timer pending in the next tick, the tick is stopped and a couple
> > of microseconds later when the idle condition goes away restarted. That
> > requires to program the clockevent device twice which implies a VM exit for
> > each reprogramming.
> >
> > It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> > the default idle code, but would be counterproductive. It would not allow
> > the host to go into deeper idle states when the guest CPU is fully idle as
> > it has to maintain the periodic tick.
> >
> > Cure this by implementing a trivial moving average filter which keeps track
> > of the recent idle residency time and only stop the tick when the average
> > is larger than a tick.
> >
> > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > ---
> >  kernel/sched/idle.c |   65 +++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 57 insertions(+), 8 deletions(-)
> >
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
> >  static inline void cond_tick_broadcast_exit(void) { }
> >  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
> >
> > -/**
> > - * default_idle_call - Default CPU idle routine.
> > - *
> > - * To use when the cpuidle framework cannot be used.
> > - */
> > -static void __cpuidle default_idle_call(void)
> > +static void __cpuidle __default_idle_call(void)
> >  {
> >       instrumentation_begin();
> >       if (!current_clr_polling_and_test()) {
> > @@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
> >       instrumentation_end();
> >  }
> >
> > +#ifdef CONFIG_NO_HZ_COMMON
> > +
> > +/* Limit to 4 entries so it fits in a cache line */
> > +#define IDLE_DUR_ENTRIES     4
> > +#define IDLE_DUR_MASK                (IDLE_DUR_ENTRIES - 1)
> > +
> > +struct idle_nohz_data {
> > +     u64             duration[IDLE_DUR_ENTRIES];
> > +     u64             entry_time;
> > +     u64             sum;
> > +     unsigned int    idx;
> > +};
> > +
> > +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> > +
> > +/**
> > + * default_idle_call - Default CPU idle routine.
> > + *
> > + * To use when the cpuidle framework cannot be used.
> > + */
> > +static void default_idle_call(void)
> > +{
> > +     struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> > +     unsigned int idx = nd->idx;
> > +     s64 delta;
> > +
> > +     /*
> > +      * If the CPU spends more than a tick on average in idle, try to stop
> > +      * the tick.
> > +      */
> > +     if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
> > +             tick_nohz_idle_stop_tick();
> > +
> > +     __default_idle_call();
> > +
> > +     /*
> > +      * Build a moving average of the time spent in idle to prevent stopping
> > +      * the tick on a loaded system which only goes idle briefly.
> > +      */
> > +     delta = max(sched_clock() - nd->entry_time, 0);
> > +     nd->sum += delta - nd->duration[idx];
> > +     nd->duration[idx] = delta;
> > +     nd->idx = (idx + 1) & IDLE_DUR_MASK;
> > +}
> > +
> > +static void default_idle_enter(void)
> > +{
> > +     this_cpu_write(nohz_data.entry_time, sched_clock());
> > +}
> > +
> > +#else  /* CONFIG_NO_HZ_COMMON */
> > +static inline void default_idle_call(void) { __default_idle_call(); }
> > +static inline void default_idle_enter(void) { }
> > +#endif /* !CONFIG_NO_HZ_COMMON */
> > +
> >  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
> >                              struct cpuidle_device *dev,
> >                              u64 max_latency_ns)
> > @@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
> >       }
> >
> >       if (cpuidle_not_available(drv, dev)) {
> > -             tick_nohz_idle_stop_tick();
> > -
> >               default_idle_call();
> >               goto exit_idle;
> >       }
> > @@ -276,6 +324,7 @@ static void do_idle(void)
> >
> >       __current_set_polling();
> >       tick_nohz_idle_enter();
> > +     default_idle_enter();
> >
> >       while (!need_resched()) {
> >
> >
>
> How does this work? We don't stop the tick until the average idle time is larger,
> but if we don't stop the tick how is that possible?
>
> Why don't we just require one or two consecutive tick wakeups before stopping?

Exactly my thought and I think one should be sufficient.
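
A user-space sketch of the filter from the quoted patch may help with the
question above. It assumes (none of this is stated in the patch) a 1 ms tick,
that the timestamp taken by default_idle_enter() is not refreshed while
do_idle() keeps looping, and that every tick wakeup returns to the loop with
need_resched() clear. wakeups_until_stop() and its parameters are hypothetical
helpers for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC		1000000ULL	/* assumption: HZ=1000 */
#define IDLE_DUR_ENTRIES	4
#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)

/* Same layout as in the quoted patch, with user-space types */
struct idle_nohz_data {
	uint64_t	duration[IDLE_DUR_ENTRIES];
	uint64_t	entry_time;
	uint64_t	sum;
	unsigned int	idx;
};

/* Threshold check done before going idle */
static bool should_stop_tick(const struct idle_nohz_data *nd)
{
	return nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES;
}

/* Window update done after waking up; 'now' stands in for sched_clock() */
static void reflect(struct idle_nohz_data *nd, uint64_t now)
{
	unsigned int idx = nd->idx;
	uint64_t delta = now - nd->entry_time;

	nd->sum += delta - nd->duration[idx];
	nd->duration[idx] = delta;
	nd->idx = (idx + 1) & IDLE_DUR_MASK;
}

/*
 * Hypothetical helper: simulate one long idle period in which only the
 * tick wakes the CPU. Returns the number of tick wakeups taken before
 * the filter allows stopping the tick, or -1 if that never happens
 * within max_wakeups.
 */
static int wakeups_until_stop(struct idle_nohz_data *nd, uint64_t start,
			      int max_wakeups)
{
	uint64_t now = start;
	int n;

	nd->entry_time = start;		/* default_idle_enter() */

	for (n = 0; n <= max_wakeups; n++) {
		if (should_stop_tick(nd))
			return n;
		now += TICK_NSEC;	/* idle until the next tick fires */
		reflect(nd, now);
	}
	return -1;
}
```

Under these assumptions each sample is longer than the previous one, because
delta is measured from the entry into do_idle() rather than from the last
wakeup. Starting from a zeroed window the samples are 1, 2 and 3 ticks, so the
sum exceeds the 4-tick threshold after three tick wakeups.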

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 21:25     ` Rafael J. Wysocki
@ 2026-03-04  3:03       ` Qais Yousef
  2026-03-06 21:21         ` Rafael J. Wysocki
  2026-03-07 16:12         ` [PATCH v1] sched: idle: Make skipping governor callbacks more consistent Rafael J. Wysocki
  0 siblings, 2 replies; 29+ messages in thread
From: Qais Yousef @ 2026-03-04  3:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Thomas Gleixner, LKML, Peter Zijlstra,
	Frederic Weisbecker

On 03/02/26 22:25, Rafael J. Wysocki wrote:
> On Mon, Mar 2, 2026 at 12:04 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > On 3/1/26 19:30, Thomas Gleixner wrote:
> > > Guests fall back to default_idle_call() as there is no cpuidle driver
> > > available to them by default. That causes a problem in fully loaded
> > > scenarios where CPUs go briefly idle for a couple of microseconds:
> > >
> > > tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> > > there is timer pending in the next tick, the tick is stopped and a couple
> > > of microseconds later when the idle condition goes away restarted. That
> > > requires to program the clockevent device twice which implies a VM exit for
> > > each reprogramming.
> > >
> > > It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> > > the default idle code, but would be counterproductive. It would not allow
> > > the host to go into deeper idle states when the guest CPU is fully idle as
> > > it has to maintain the periodic tick.
> > >
> > > Cure this by implementing a trivial moving average filter which keeps track
> > > of the recent idle residency time and only stop the tick when the average
> > > is larger than a tick.
> > >
> > > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > > ---
> > >  kernel/sched/idle.c |   65 +++++++++++++++++++++++++++++++++++++++++++++-------
> > >  1 file changed, 57 insertions(+), 8 deletions(-)
> > >
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
> > >  static inline void cond_tick_broadcast_exit(void) { }
> > >  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
> > >
> > > -/**
> > > - * default_idle_call - Default CPU idle routine.
> > > - *
> > > - * To use when the cpuidle framework cannot be used.
> > > - */
> > > -static void __cpuidle default_idle_call(void)
> > > +static void __cpuidle __default_idle_call(void)
> > >  {
> > >       instrumentation_begin();
> > >       if (!current_clr_polling_and_test()) {
> > > @@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
> > >       instrumentation_end();
> > >  }
> > >
> > > +#ifdef CONFIG_NO_HZ_COMMON
> > > +
> > > +/* Limit to 4 entries so it fits in a cache line */
> > > +#define IDLE_DUR_ENTRIES     4
> > > +#define IDLE_DUR_MASK                (IDLE_DUR_ENTRIES - 1)
> > > +
> > > +struct idle_nohz_data {
> > > +     u64             duration[IDLE_DUR_ENTRIES];
> > > +     u64             entry_time;
> > > +     u64             sum;
> > > +     unsigned int    idx;
> > > +};
> > > +
> > > +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> > > +
> > > +/**
> > > + * default_idle_call - Default CPU idle routine.
> > > + *
> > > + * To use when the cpuidle framework cannot be used.
> > > + */
> > > +static void default_idle_call(void)
> > > +{
> > > +     struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> > > +     unsigned int idx = nd->idx;
> > > +     s64 delta;
> > > +
> > > +     /*
> > > +      * If the CPU spends more than a tick on average in idle, try to stop
> > > +      * the tick.
> > > +      */
> > > +     if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
> > > +             tick_nohz_idle_stop_tick();
> > > +
> > > +     __default_idle_call();
> > > +
> > > +     /*
> > > +      * Build a moving average of the time spent in idle to prevent stopping
> > > +      * the tick on a loaded system which only goes idle briefly.
> > > +      */
> > > +     delta = max(sched_clock() - nd->entry_time, 0);
> > > +     nd->sum += delta - nd->duration[idx];
> > > +     nd->duration[idx] = delta;
> > > +     nd->idx = (idx + 1) & IDLE_DUR_MASK;
> > > +}
> > > +
> > > +static void default_idle_enter(void)
> > > +{
> > > +     this_cpu_write(nohz_data.entry_time, sched_clock());
> > > +}
> > > +
> > > +#else  /* CONFIG_NO_HZ_COMMON */
> > > +static inline void default_idle_call(void) { __default_idle_call(); }
> > > +static inline void default_idle_enter(void) { }
> > > +#endif /* !CONFIG_NO_HZ_COMMON */
> > > +
> > >  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
> > >                              struct cpuidle_device *dev,
> > >                              u64 max_latency_ns)
> > > @@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
> > >       }
> > >
> > >       if (cpuidle_not_available(drv, dev)) {
> > > -             tick_nohz_idle_stop_tick();
> > > -
> > >               default_idle_call();
> > >               goto exit_idle;
> > >       }
> > > @@ -276,6 +324,7 @@ static void do_idle(void)
> > >
> > >       __current_set_polling();
> > >       tick_nohz_idle_enter();
> > > +     default_idle_enter();
> > >
> > >       while (!need_resched()) {
> > >
> > >
> >
> > How does this work? We don't stop the tick until the average idle time is larger,
> > but if we don't stop the tick how is that possible?
> >
> > Why don't we just require one or two consecutive tick wakeups before stopping?
> 
> Exactly my thought and I think one should be sufficient.

I concur. From our experience with TEO util threshold these averages can
backfire. I think one tick is sufficient delay to not be obviously broken. But
IMO the setup is broken too. Having no cpuidle driver with nohz enabled while
performance is important is not a good combination. Since this has proven to
have both power and performance impact, ensuring there's a sensible cpuidle
driver is the right thing to do IMHO.
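
For comparison, the simpler heuristic discussed above (stop the tick only
after a tick wakeup has been seen) needs a single flag instead of a sample
window. The following is a user-space sketch of one reading of that
suggestion, not code from any patch in this thread. struct idle_tick_state,
want_stop_tick(), note_wakeup() and count_tick_stops() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state: did the previous idle period end on a tick? */
struct idle_tick_state {
	bool got_tick;
};

/* Only ask for the tick to be stopped if the last wakeup came from the tick */
static bool want_stop_tick(const struct idle_tick_state *st)
{
	return st->got_tick;
}

/* Record why the CPU left idle */
static void note_wakeup(struct idle_tick_state *st, bool woken_by_tick)
{
	st->got_tick = woken_by_tick;
}

/*
 * Count how often the tick would be stopped across n idle periods, where
 * woken_by_tick[i] says whether period i was ended by the tick.
 */
static unsigned int count_tick_stops(const bool *woken_by_tick, unsigned int n)
{
	struct idle_tick_state st = { .got_tick = false };
	unsigned int stops = 0;
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (want_stop_tick(&st))
			stops++;
		note_wakeup(&st, woken_by_tick[i]);
	}
	return stops;
}
```

In this model a CPU that is always woken by something other than the tick
never requests a tick stop, while a single tick wakeup is enough to allow
stopping the tick on the next idle entry.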

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-02 11:39         ` Christian Loehle
@ 2026-03-04  3:35           ` Qais Yousef
  0 siblings, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2026-03-04  3:35 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Frederic Weisbecker, Thomas Gleixner, LKML, Peter Zijlstra,
	Rafael J. Wysocki

On 03/02/26 11:39, Christian Loehle wrote:
> On 3/2/26 11:11, Frederic Weisbecker wrote:
> > On Mon, Mar 02, 2026 at 11:03:00AM +0000, Christian Loehle wrote:
> >> On 3/2/26 10:43, Frederic Weisbecker wrote:
> >>> On Sun, Mar 01, 2026 at 08:30:51PM +0100, Thomas Gleixner wrote:
> >>>> Guests fall back to default_idle_call() as there is no cpuidle driver
> >>>> available to them by default. That causes a problem in fully loaded
> >>>> scenarios where CPUs go briefly idle for a couple of microseconds:
> >>>>
> >>>> tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> >>>> there is timer pending in the next tick, the tick is stopped and a couple
> >>>> of microseconds later when the idle condition goes away restarted. That
> >>>> requires to program the clockevent device twice which implies a VM exit for
> >>>> each reprogramming.
> >>>>
> >>>> It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> >>>> the default idle code, but would be counterproductive. It would not allow
> >>>> the host to go into deeper idle states when the guest CPU is fully idle as
> >>>> it has to maintain the periodic tick.
> >>>>
> >>>> Cure this by implementing a trivial moving average filter which keeps track
> >>>> of the recent idle residency time and only stop the tick when the average
> >>>> is larger than a tick.
> >>>>
> >>>> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> >>>
> >>> Shouldn't there be instead a new dedicated cpuidle driver with proper governor support?
> >>
> >> I think a dummy cpuidle driver is an option, but calling into any governor
> >> seems overkill IMO, it presents an option to the user where there really is
> >> none (after all the cpuidle governor would just make a boolean decision as
> >> there are no states).
> > 
> > I must confess I don't fully understand the picture with the non-existent states
> > but what Thomas is doing in his patch is basically an ad-hoc implementation of
> > cpuidle governor decision whether or not to stop the tick.
> > 
> 
> Yup and if we put that into the cpuidle governor then we have to duplicate
> that logic for all governors even though for <= 1 states they hopefully
> should be the same.
> 
> A dummy driver would allow for this logic to live in drivers/cpuidle/ but
> I don't have a preference either way.

I am not sure about all the details, but a VM exit seems akin to a deep idle
state with a sizeable latency hit. Not sure how the power impact can be modeled
though. It seems purely associated with stopping the tick, so maybe it can be
modeled the same as allowing the physical CPU to enter a deep idle state, since
not stopping the tick means the host CPU can't enter it either? I.e. copy the
min residency from the first deep idle state of the host.

Haven't thought this through to be honest, but seems there's room for some
sensible model. Whether worth it or not, I don't know either :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-04  3:03       ` Qais Yousef
@ 2026-03-06 21:21         ` Rafael J. Wysocki
  2026-03-06 21:31           ` Rafael J. Wysocki
  2026-03-07 16:12         ` [PATCH v1] sched: idle: Make skipping governor callbacks more consistent Rafael J. Wysocki
  1 sibling, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-06 21:21 UTC (permalink / raw)
  To: Qais Yousef, Christian Loehle
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Frederic Weisbecker,
	Linux PM

On Wednesday, March 4, 2026 4:03:06 AM CET Qais Yousef wrote:
> On 03/02/26 22:25, Rafael J. Wysocki wrote:
> > On Mon, Mar 2, 2026 at 12:04 PM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> > >
> > > On 3/1/26 19:30, Thomas Gleixner wrote:
> > > > Guests fall back to default_idle_call() as there is no cpuidle driver
> > > > available to them by default. That causes a problem in fully loaded
> > > > scenarios where CPUs go briefly idle for a couple of microseconds:
> > > >
> > > > tick_nohz_idle_stop_tick() is invoked unconditionally which means unless
> > > > there is timer pending in the next tick, the tick is stopped and a couple
> > > > of microseconds later when the idle condition goes away restarted. That
> > > > requires to program the clockevent device twice which implies a VM exit for
> > > > each reprogramming.
> > > >
> > > > It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> > > > the default idle code, but would be counterproductive. It would not allow
> > > > the host to go into deeper idle states when the guest CPU is fully idle as
> > > > it has to maintain the periodic tick.
> > > >
> > > > Cure this by implementing a trivial moving average filter which keeps track
> > > > of the recent idle residency time and only stop the tick when the average
> > > > is larger than a tick.
> > > >
> > > > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > > > ---
> > > >  kernel/sched/idle.c |   65 +++++++++++++++++++++++++++++++++++++++++++++-------
> > > >  1 file changed, 57 insertions(+), 8 deletions(-)
> > > >
> > > > --- a/kernel/sched/idle.c
> > > > +++ b/kernel/sched/idle.c
> > > > @@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
> > > >  static inline void cond_tick_broadcast_exit(void) { }
> > > >  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
> > > >
> > > > -/**
> > > > - * default_idle_call - Default CPU idle routine.
> > > > - *
> > > > - * To use when the cpuidle framework cannot be used.
> > > > - */
> > > > -static void __cpuidle default_idle_call(void)
> > > > +static void __cpuidle __default_idle_call(void)
> > > >  {
> > > >       instrumentation_begin();
> > > >       if (!current_clr_polling_and_test()) {
> > > > @@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
> > > >       instrumentation_end();
> > > >  }
> > > >
> > > > +#ifdef CONFIG_NO_HZ_COMMON
> > > > +
> > > > +/* Limit to 4 entries so it fits in a cache line */
> > > > +#define IDLE_DUR_ENTRIES     4
> > > > +#define IDLE_DUR_MASK                (IDLE_DUR_ENTRIES - 1)
> > > > +
> > > > +struct idle_nohz_data {
> > > > +     u64             duration[IDLE_DUR_ENTRIES];
> > > > +     u64             entry_time;
> > > > +     u64             sum;
> > > > +     unsigned int    idx;
> > > > +};
> > > > +
> > > > +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> > > > +
> > > > +/**
> > > > + * default_idle_call - Default CPU idle routine.
> > > > + *
> > > > + * To use when the cpuidle framework cannot be used.
> > > > + */
> > > > +static void default_idle_call(void)
> > > > +{
> > > > +     struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> > > > +     unsigned int idx = nd->idx;
> > > > +     s64 delta;
> > > > +
> > > > +     /*
> > > > +      * If the CPU spends more than a tick on average in idle, try to stop
> > > > +      * the tick.
> > > > +      */
> > > > +     if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
> > > > +             tick_nohz_idle_stop_tick();
> > > > +
> > > > +     __default_idle_call();
> > > > +
> > > > +     /*
> > > > +      * Build a moving average of the time spent in idle to prevent stopping
> > > > +      * the tick on a loaded system which only goes idle briefly.
> > > > +      */
> > > > +     delta = max(sched_clock() - nd->entry_time, 0);
> > > > +     nd->sum += delta - nd->duration[idx];
> > > > +     nd->duration[idx] = delta;
> > > > +     nd->idx = (idx + 1) & IDLE_DUR_MASK;
> > > > +}
> > > > +
> > > > +static void default_idle_enter(void)
> > > > +{
> > > > +     this_cpu_write(nohz_data.entry_time, sched_clock());
> > > > +}
> > > > +
> > > > +#else  /* CONFIG_NO_HZ_COMMON */
> > > > +static inline void default_idle_call(void) { __default_idle_call(); }
> > > > +static inline void default_idle_enter(void) { }
> > > > +#endif /* !CONFIG_NO_HZ_COMMON */
> > > > +
> > > >  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
> > > >                              struct cpuidle_device *dev,
> > > >                              u64 max_latency_ns)
> > > > @@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
> > > >       }
> > > >
> > > >       if (cpuidle_not_available(drv, dev)) {
> > > > -             tick_nohz_idle_stop_tick();
> > > > -
> > > >               default_idle_call();
> > > >               goto exit_idle;
> > > >       }
> > > > @@ -276,6 +324,7 @@ static void do_idle(void)
> > > >
> > > >       __current_set_polling();
> > > >       tick_nohz_idle_enter();
> > > > +     default_idle_enter();
> > > >
> > > >       while (!need_resched()) {
> > > >
> > > >
> > >
> > > How does this work? We don't stop the tick until the average idle time is larger,
> > > but if we don't stop the tick how is that possible?
> > >
> > > Why don't we just require one or two consecutive tick wakeups before stopping?
> > 
> > Exactly my thought and I think one should be sufficient.
> 
> I concur. From our experience with TEO util threshold these averages can
> backfire. I think one tick is sufficient delay to not be obviously broken.

So if I'm not mistaken, it would be something like the appended prototype
(completely untested, but it builds for me).

---
 drivers/cpuidle/cpuidle.c |   10 ----------
 kernel/sched/idle.c       |   32 ++++++++++++++++++++++++--------
 2 files changed, 24 insertions(+), 18 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
 int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		   bool *stop_tick)
 {
-	/*
-	 * If there is only a single idle state (or none), there is nothing
-	 * meaningful for the governor to choose. Skip the governor and
-	 * always use state 0 with the tick running.
-	 */
-	if (drv->state_count <= 1) {
-		*stop_tick = false;
-		return 0;
-	}
-
 	return cpuidle_curr_governor->select(drv, dev, stop_tick);
 }
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
 	return cpuidle_enter(drv, dev, next_state);
 }
 
+static void idle_call_stop_or_retain_tick(bool stop_tick)
+{
+	if (stop_tick || tick_nohz_tick_stopped())
+		tick_nohz_idle_stop_tick();
+	else
+		tick_nohz_idle_retain_tick();
+}
+
 /**
  * cpuidle_idle_call - the main idle function
  *
@@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
  * set, and it returns with polling set.  If it ever stops polling, it
  * must clear the polling bit.
  */
-static void cpuidle_idle_call(void)
+static void cpuidle_idle_call(bool got_tick)
 {
 	struct cpuidle_device *dev = cpuidle_get_device();
 	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
@@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
 	}
 
 	if (cpuidle_not_available(drv, dev)) {
-		tick_nohz_idle_stop_tick();
+		idle_call_stop_or_retain_tick(!got_tick);
 
 		default_idle_call();
 		goto exit_idle;
@@ -221,7 +229,7 @@ static void cpuidle_idle_call(void)
 
 		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
 		call_cpuidle(drv, dev, next_state);
-	} else {
+	} else if (drv->state_count > 1) {
 		bool stop_tick = true;
 
 		/*
@@ -229,16 +237,22 @@ static void cpuidle_idle_call(void)
 		 */
 		next_state = cpuidle_select(drv, dev, &stop_tick);
 
-		if (stop_tick || tick_nohz_tick_stopped())
-			tick_nohz_idle_stop_tick();
-		else
-			tick_nohz_idle_retain_tick();
+		idle_call_stop_or_retain_tick(stop_tick);
 
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
 		 */
 		cpuidle_reflect(dev, entered_state);
+	} else {
+		/*
+		 * If there is only a single idle state (or none), there is
+		 * nothing meaningful for the governor to choose.  Skip the
+		 * governor and always use state 0.
+		 */
+		idle_call_stop_or_retain_tick(!got_tick);
+
+		call_cpuidle(drv, dev, 0);
 	}
 
 exit_idle:
@@ -259,6 +273,7 @@ exit_idle:
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	bool got_tick = false;
 
 	/*
 	 * Check if we need to update blocked load
@@ -329,8 +344,9 @@ static void do_idle(void)
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
-			cpuidle_idle_call();
+			cpuidle_idle_call(got_tick);
 		}
+		got_tick = tick_nohz_idle_got_tick();
 		arch_cpu_idle_exit();
 	}
 





* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-06 21:21         ` Rafael J. Wysocki
@ 2026-03-06 21:31           ` Rafael J. Wysocki
  2026-03-07 16:25             ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-06 21:31 UTC (permalink / raw)
  To: Qais Yousef, Christian Loehle
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Frederic Weisbecker,
	Linux PM

On Fri, Mar 6, 2026 at 10:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wednesday, March 4, 2026 4:03:06 AM CET Qais Yousef wrote:
> > On 03/02/26 22:25, Rafael J. Wysocki wrote:
> > > On Mon, Mar 2, 2026 at 12:04 PM Christian Loehle
> > > <christian.loehle@arm.com> wrote:
> > > >
> > > > On 3/1/26 19:30, Thomas Gleixner wrote:
> > > > > Guests fall back to default_idle_call() as there is no cpuidle driver
> > > > > available to them by default. That causes a problem in fully loaded
> > > > > scenarios where CPUs go briefly idle for a couple of microseconds:
> > > > >
> > > > > > tick_nohz_idle_stop_tick() is invoked unconditionally, which means that
> > > > > > unless a timer is pending in the next tick, the tick is stopped and then
> > > > > > restarted a couple of microseconds later when the idle condition goes
> > > > > > away. That requires programming the clockevent device twice, which
> > > > > > implies a VM exit for each reprogramming.
> > > > >
> > > > > It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> > > > > > the default idle code, but that would be counterproductive. It would not allow
> > > > > the host to go into deeper idle states when the guest CPU is fully idle as
> > > > > it has to maintain the periodic tick.
> > > > >
> > > > > Cure this by implementing a trivial moving average filter which keeps track
> > > > > > of the recent idle residency time and only stop the tick when the average
> > > > > is larger than a tick.
> > > > >
> > > > > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > > > > ---
> > > > >  kernel/sched/idle.c |   65 +++++++++++++++++++++++++++++++++++++++++++++-------
> > > > >  1 file changed, 57 insertions(+), 8 deletions(-)
> > > > >
> > > > > --- a/kernel/sched/idle.c
> > > > > +++ b/kernel/sched/idle.c
> > > > > @@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
> > > > >  static inline void cond_tick_broadcast_exit(void) { }
> > > > >  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
> > > > >
> > > > > -/**
> > > > > - * default_idle_call - Default CPU idle routine.
> > > > > - *
> > > > > - * To use when the cpuidle framework cannot be used.
> > > > > - */
> > > > > -static void __cpuidle default_idle_call(void)
> > > > > +static void __cpuidle __default_idle_call(void)
> > > > >  {
> > > > >       instrumentation_begin();
> > > > >       if (!current_clr_polling_and_test()) {
> > > > > @@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
> > > > >       instrumentation_end();
> > > > >  }
> > > > >
> > > > > +#ifdef CONFIG_NO_HZ_COMMON
> > > > > +
> > > > > +/* Limit to 4 entries so it fits in a cache line */
> > > > > +#define IDLE_DUR_ENTRIES     4
> > > > > +#define IDLE_DUR_MASK                (IDLE_DUR_ENTRIES - 1)
> > > > > +
> > > > > +struct idle_nohz_data {
> > > > > +     u64             duration[IDLE_DUR_ENTRIES];
> > > > > +     u64             entry_time;
> > > > > +     u64             sum;
> > > > > +     unsigned int    idx;
> > > > > +};
> > > > > +
> > > > > +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> > > > > +
> > > > > +/**
> > > > > + * default_idle_call - Default CPU idle routine.
> > > > > + *
> > > > > + * To use when the cpuidle framework cannot be used.
> > > > > + */
> > > > > +static void default_idle_call(void)
> > > > > +{
> > > > > +     struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> > > > > +     unsigned int idx = nd->idx;
> > > > > +     s64 delta;
> > > > > +
> > > > > +     /*
> > > > > +      * If the CPU spends more than a tick on average in idle, try to stop
> > > > > +      * the tick.
> > > > > +      */
> > > > > +     if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
> > > > > +             tick_nohz_idle_stop_tick();
> > > > > +
> > > > > +     __default_idle_call();
> > > > > +
> > > > > +     /*
> > > > > +      * Build a moving average of the time spent in idle to prevent stopping
> > > > > +      * the tick on a loaded system which only goes idle briefly.
> > > > > +      */
> > > > > +     delta = max(sched_clock() - nd->entry_time, 0);
> > > > > +     nd->sum += delta - nd->duration[idx];
> > > > > +     nd->duration[idx] = delta;
> > > > > +     nd->idx = (idx + 1) & IDLE_DUR_MASK;
> > > > > +}
> > > > > +
> > > > > +static void default_idle_enter(void)
> > > > > +{
> > > > > +     this_cpu_write(nohz_data.entry_time, sched_clock());
> > > > > +}
> > > > > +
> > > > > +#else  /* CONFIG_NO_HZ_COMMON */
> > > > > > +static inline void default_idle_call(void) { __default_idle_call(); }
> > > > > +static inline void default_idle_enter(void) { }
> > > > > +#endif /* !CONFIG_NO_HZ_COMMON */
> > > > > +
> > > > >  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
> > > > >                              struct cpuidle_device *dev,
> > > > >                              u64 max_latency_ns)
> > > > > @@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
> > > > >       }
> > > > >
> > > > >       if (cpuidle_not_available(drv, dev)) {
> > > > > -             tick_nohz_idle_stop_tick();
> > > > > -
> > > > >               default_idle_call();
> > > > >               goto exit_idle;
> > > > >       }
> > > > > @@ -276,6 +324,7 @@ static void do_idle(void)
> > > > >
> > > > >       __current_set_polling();
> > > > >       tick_nohz_idle_enter();
> > > > > +     default_idle_enter();
> > > > >
> > > > >       while (!need_resched()) {
> > > > >
> > > > >
> > > >
> > > > How does this work? We don't stop the tick until the average idle time is larger,
> > > > but if we don't stop the tick how is that possible?
> > > >
> > > > Why don't we just require one or two consecutive tick wakeups before stopping?
> > >
> > > Exactly my thought and I think one should be sufficient.
> >
> > I concur. From our experience with TEO util threshold these averages can
> > backfire. I think one tick is sufficient delay to not be obviously broken.
>
> So if I'm not mistaken, it would be something like the appended prototype
> (completely untested, but it builds for me).
>
> ---
>  drivers/cpuidle/cpuidle.c |   10 ----------
>  kernel/sched/idle.c       |   32 ++++++++++++++++++++++++--------
>  2 files changed, 24 insertions(+), 18 deletions(-)
>
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
>  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                    bool *stop_tick)
>  {
> -       /*
> -        * If there is only a single idle state (or none), there is nothing
> -        * meaningful for the governor to choose. Skip the governor and
> -        * always use state 0 with the tick running.
> -        */
> -       if (drv->state_count <= 1) {
> -               *stop_tick = false;
> -               return 0;
> -       }
> -
>         return cpuidle_curr_governor->select(drv, dev, stop_tick);
>  }
>
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
>         return cpuidle_enter(drv, dev, next_state);
>  }
>
> +static void idle_call_stop_or_retain_tick(bool stop_tick)
> +{
> +       if (stop_tick || tick_nohz_tick_stopped())
> +               tick_nohz_idle_stop_tick();
> +       else
> +               tick_nohz_idle_retain_tick();
> +}
> +
>  /**
>   * cpuidle_idle_call - the main idle function
>   *
> @@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
>   * set, and it returns with polling set.  If it ever stops polling, it
>   * must clear the polling bit.
>   */
> -static void cpuidle_idle_call(void)
> +static void cpuidle_idle_call(bool got_tick)
>  {
>         struct cpuidle_device *dev = cpuidle_get_device();
>         struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> @@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
>         }
>
>         if (cpuidle_not_available(drv, dev)) {
> -               tick_nohz_idle_stop_tick();
> +               idle_call_stop_or_retain_tick(!got_tick);

Oh, I got this backwards (here and below).

The tick should be stopped if we've got the tick previously, but you
get the idea.

[Note to self: Don't send patches in the night.]
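To spell out the corrected condition, here is a small user-space sketch of
the decision. The enum and the function names are invented for illustration;
only the (stop_tick || tick already stopped) test is taken from
idle_call_stop_or_retain_tick() in the prototype above.

```c
#include <stdbool.h>

enum tick_action { TICK_RETAIN, TICK_STOP };

/*
 * Same test as in idle_call_stop_or_retain_tick(): take the stop path if
 * the caller asks for it or if the tick is already stopped, otherwise
 * retain the tick.
 */
static enum tick_action stop_or_retain(bool stop_tick,
				       bool tick_already_stopped)
{
	return (stop_tick || tick_already_stopped) ? TICK_STOP : TICK_RETAIN;
}

/*
 * Corrected call for the "no governor" paths: pass got_tick, not
 * !got_tick, so a stop is requested only after the previous idle
 * period was ended by the tick.
 */
static enum tick_action default_idle_tick_action(bool got_tick,
						 bool tick_already_stopped)
{
	return stop_or_retain(got_tick, tick_already_stopped);
}
```

In this model the tick is retained only when the previous wakeup was not
the tick and the tick is still running.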

>                 default_idle_call();
>                 goto exit_idle;
> @@ -221,7 +229,7 @@ static void cpuidle_idle_call(void)
>
>                 next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
>                 call_cpuidle(drv, dev, next_state);
> -       } else {
> +       } else if (drv->state_count > 1) {
>                 bool stop_tick = true;
>
>                 /*
> @@ -229,16 +237,22 @@ static void cpuidle_idle_call(void)
>                  */
>                 next_state = cpuidle_select(drv, dev, &stop_tick);
>
> -               if (stop_tick || tick_nohz_tick_stopped())
> -                       tick_nohz_idle_stop_tick();
> -               else
> -                       tick_nohz_idle_retain_tick();
> +               idle_call_stop_or_retain_tick(stop_tick);
>
>                 entered_state = call_cpuidle(drv, dev, next_state);
>                 /*
>                  * Give the governor an opportunity to reflect on the outcome
>                  */
>                 cpuidle_reflect(dev, entered_state);
> +       } else {
> +               /*
> +                * If there is only a single idle state (or none), there is
> +                * nothing meaningful for the governor to choose.  Skip the
> +                * governor and always use state 0.
> +                */
> +               idle_call_stop_or_retain_tick(!got_tick);
> +
> +               call_cpuidle(drv, dev, 0);
>         }
>
>  exit_idle:
> @@ -259,6 +273,7 @@ exit_idle:
>  static void do_idle(void)
>  {
>         int cpu = smp_processor_id();
> +       bool got_tick = false;
>
>         /*
>          * Check if we need to update blocked load
> @@ -329,8 +344,9 @@ static void do_idle(void)
>                         tick_nohz_idle_restart_tick();
>                         cpu_idle_poll();
>                 } else {
> -                       cpuidle_idle_call();
> +                       cpuidle_idle_call(got_tick);
>                 }
> +               got_tick = tick_nohz_idle_got_tick();
>                 arch_cpu_idle_exit();
>         }


* [PATCH v1] sched: idle: Make skipping governor callbacks more consistent
  2026-03-04  3:03       ` Qais Yousef
  2026-03-06 21:21         ` Rafael J. Wysocki
@ 2026-03-07 16:12         ` Rafael J. Wysocki
  2026-03-09  9:13           ` Christian Loehle
                             ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-07 16:12 UTC (permalink / raw)
  To: Linux PM
  Cc: Qais Yousef, Christian Loehle, Thomas Gleixner, LKML,
	Peter Zijlstra, Frederic Weisbecker, Aboorva Devarajan

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If the cpuidle governor .select() callback is skipped because there
is only one idle state in the cpuidle driver, the .reflect() callback
should be skipped as well, at least for consistency (if not for
correctness), so do it.

Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/cpuidle.c |   10 ----------
 kernel/sched/idle.c       |   11 ++++++++++-
 2 files changed, 10 insertions(+), 11 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
 int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		   bool *stop_tick)
 {
-	/*
-	 * If there is only a single idle state (or none), there is nothing
-	 * meaningful for the governor to choose. Skip the governor and
-	 * always use state 0 with the tick running.
-	 */
-	if (drv->state_count <= 1) {
-		*stop_tick = false;
-		return 0;
-	}
-
 	return cpuidle_curr_governor->select(drv, dev, stop_tick);
 }
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -221,7 +221,7 @@ static void cpuidle_idle_call(void)
 
 		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
 		call_cpuidle(drv, dev, next_state);
-	} else {
+	} else if (drv->state_count > 1) {
 		bool stop_tick = true;
 
 		/*
@@ -239,6 +239,15 @@ static void cpuidle_idle_call(void)
 		 * Give the governor an opportunity to reflect on the outcome
 		 */
 		cpuidle_reflect(dev, entered_state);
+	} else {
+		tick_nohz_idle_retain_tick();
+
+		/*
+		 * If there is only a single idle state (or none), there is
+		 * nothing meaningful for the governor to choose.  Skip the
+		 * governor and always use state 0.
+		 */
+		call_cpuidle(drv, dev, 0);
 	}
 
 exit_idle:
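The pairing this patch is after can be illustrated with a user-space mock.
Everything below is hypothetical (mock_governor, mock_select, mock_reflect,
idle_dispatch); it only mirrors the branch structure of the change, where
.select() and .reflect() are either both called or both skipped. The mock
"governor" simply picks the deepest state, which is an arbitrary choice made
for the sketch.

```c
/* Hypothetical mock that just counts callback invocations. */
struct mock_governor {
	int select_calls;
	int reflect_calls;
};

/* Stand-in for .select(): arbitrarily picks the deepest state. */
static int mock_select(struct mock_governor *g, int state_count)
{
	g->select_calls++;
	return state_count - 1;
}

/* Stand-in for .reflect(): only records that it was called. */
static void mock_reflect(struct mock_governor *g, int entered_state)
{
	(void)entered_state;
	g->reflect_calls++;
}

/* Returns the state that was "entered". */
static int idle_dispatch(struct mock_governor *g, int state_count)
{
	if (state_count > 1) {
		int state = mock_select(g, state_count);

		mock_reflect(g, state);
		return state;
	}

	/* Single state (or none): skip both callbacks, use state 0. */
	return 0;
}
```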





* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-06 21:31           ` Rafael J. Wysocki
@ 2026-03-07 16:25             ` Rafael J. Wysocki
  2026-03-10  3:54               ` Qais Yousef
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-07 16:25 UTC (permalink / raw)
  To: Qais Yousef, Christian Loehle
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Frederic Weisbecker,
	Linux PM

On Friday, March 6, 2026 10:31:49 PM CET Rafael J. Wysocki wrote:
> On Fri, Mar 6, 2026 at 10:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Wednesday, March 4, 2026 4:03:06 AM CET Qais Yousef wrote:
> > > On 03/02/26 22:25, Rafael J. Wysocki wrote:
> > > > On Mon, Mar 2, 2026 at 12:04 PM Christian Loehle

[cut]

> > > > >
> > > > > Why don't we just require one or two consecutive tick wakeups before stopping?
> > > >
> > > > Exactly my thought and I think one should be sufficient.
> > >
> > > I concur. From our experience with TEO util threshold these averages can
> > > backfire. I think one tick is sufficient delay to not be obviously broken.
> >
> > So if I'm not mistaken, it would be something like the appended prototype
> > (completely untested, but it builds for me).
> >
> > ---
> >  drivers/cpuidle/cpuidle.c |   10 ----------
> >  kernel/sched/idle.c       |   32 ++++++++++++++++++++++++--------
> >  2 files changed, 24 insertions(+), 18 deletions(-)
> >
> > --- a/drivers/cpuidle/cpuidle.c
> > +++ b/drivers/cpuidle/cpuidle.c
> > @@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
> >  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >                    bool *stop_tick)
> >  {
> > -       /*
> > -        * If there is only a single idle state (or none), there is nothing
> > -        * meaningful for the governor to choose. Skip the governor and
> > -        * always use state 0 with the tick running.
> > -        */
> > -       if (drv->state_count <= 1) {
> > -               *stop_tick = false;
> > -               return 0;
> > -       }
> > -
> >         return cpuidle_curr_governor->select(drv, dev, stop_tick);
> >  }
> >
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
> >         return cpuidle_enter(drv, dev, next_state);
> >  }
> >
> > +static void idle_call_stop_or_retain_tick(bool stop_tick)
> > +{
> > +       if (stop_tick || tick_nohz_tick_stopped())
> > +               tick_nohz_idle_stop_tick();
> > +       else
> > +               tick_nohz_idle_retain_tick();
> > +}
> > +
> >  /**
> >   * cpuidle_idle_call - the main idle function
> >   *
> > @@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
> >   * set, and it returns with polling set.  If it ever stops polling, it
> >   * must clear the polling bit.
> >   */
> > -static void cpuidle_idle_call(void)
> > +static void cpuidle_idle_call(bool got_tick)
> >  {
> >         struct cpuidle_device *dev = cpuidle_get_device();
> >         struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> > @@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
> >         }
> >
> >         if (cpuidle_not_available(drv, dev)) {
> > -               tick_nohz_idle_stop_tick();
> > +               idle_call_stop_or_retain_tick(!got_tick);
> 
> Oh, I got this backwards (here and below).
> 
> The tick should be stopped if we've got the tick previously, but you
> get the idea.

In the meantime I realized that if the .select() governor
callback is skipped, its .reflect() callback should be skipped
as well, so I've posted this:

https://lkml.org/lkml/2026/3/7/569

and here's a fixed version of the last patch on top of the above (for
completeness):

---
 kernel/sched/idle.c |   25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
 	return cpuidle_enter(drv, dev, next_state);
 }
 
+static void idle_call_stop_or_retain_tick(bool stop_tick)
+{
+	if (stop_tick || tick_nohz_tick_stopped())
+		tick_nohz_idle_stop_tick();
+	else
+		tick_nohz_idle_retain_tick();
+}
+
 /**
  * cpuidle_idle_call - the main idle function
  *
@@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
  * set, and it returns with polling set.  If it ever stops polling, it
  * must clear the polling bit.
  */
-static void cpuidle_idle_call(void)
+static void cpuidle_idle_call(bool stop_tick)
 {
 	struct cpuidle_device *dev = cpuidle_get_device();
 	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
@@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
 	}
 
 	if (cpuidle_not_available(drv, dev)) {
-		tick_nohz_idle_stop_tick();
+		idle_call_stop_or_retain_tick(stop_tick);
 
 		default_idle_call();
 		goto exit_idle;
@@ -222,17 +230,14 @@ static void cpuidle_idle_call(void)
 		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
 		call_cpuidle(drv, dev, next_state);
 	} else if (drv->state_count > 1) {
-		bool stop_tick = true;
+		stop_tick = true;
 
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
 		next_state = cpuidle_select(drv, dev, &stop_tick);
 
-		if (stop_tick || tick_nohz_tick_stopped())
-			tick_nohz_idle_stop_tick();
-		else
-			tick_nohz_idle_retain_tick();
+		idle_call_stop_or_retain_tick(stop_tick);
 
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
@@ -240,7 +245,7 @@ static void cpuidle_idle_call(void)
 		 */
 		cpuidle_reflect(dev, entered_state);
 	} else {
-		tick_nohz_idle_retain_tick();
+		idle_call_stop_or_retain_tick(stop_tick);
 
 		/*
 		 * If there is only a single idle state (or none), there is
@@ -268,6 +273,7 @@ exit_idle:
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	bool got_tick = false;
 
 	/*
 	 * Check if we need to update blocked load
@@ -338,8 +344,9 @@ static void do_idle(void)
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
-			cpuidle_idle_call();
+			cpuidle_idle_call(got_tick);
 		}
+		got_tick = tick_nohz_idle_got_tick();
 		arch_cpu_idle_exit();
 	}
 





* Re: [PATCH v1] sched: idle: Make skipping governor callbacks more consistent
  2026-03-07 16:12         ` [PATCH v1] sched: idle: Make skipping governor callbacks more consistent Rafael J. Wysocki
@ 2026-03-09  9:13           ` Christian Loehle
  2026-03-09 12:26             ` Rafael J. Wysocki
  2026-03-09 12:44           ` Aboorva Devarajan
  2026-03-10 14:28           ` Frederic Weisbecker
  2 siblings, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2026-03-09  9:13 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Qais Yousef, Thomas Gleixner, LKML, Peter Zijlstra,
	Frederic Weisbecker, Aboorva Devarajan

On 3/7/26 16:12, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If the cpuidle governor .select() callback is skipped because there
> is only one idle state in the cpuidle driver, the .reflect() callback
> should be skipped as well, at least for consistency (if not for
> correctness), so do it.
> 
> Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available")
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/cpuidle.c |   10 ----------
>  kernel/sched/idle.c       |   11 ++++++++++-
>  2 files changed, 10 insertions(+), 11 deletions(-)
> 
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
>  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  		   bool *stop_tick)
>  {
> -	/*
> -	 * If there is only a single idle state (or none), there is nothing
> -	 * meaningful for the governor to choose. Skip the governor and
> -	 * always use state 0 with the tick running.
> -	 */
> -	if (drv->state_count <= 1) {
> -		*stop_tick = false;
> -		return 0;
> -	}
> -
>  	return cpuidle_curr_governor->select(drv, dev, stop_tick);
>  }
>  
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -221,7 +221,7 @@ static void cpuidle_idle_call(void)
>  
>  		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
>  		call_cpuidle(drv, dev, next_state);
> -	} else {
> +	} else if (drv->state_count > 1) {
>  		bool stop_tick = true;
>  
>  		/*
> @@ -239,6 +239,15 @@ static void cpuidle_idle_call(void)
>  		 * Give the governor an opportunity to reflect on the outcome
>  		 */
>  		cpuidle_reflect(dev, entered_state);
> +	} else {
> +		tick_nohz_idle_retain_tick();
> +
> +		/*
> +		 * If there is only a single idle state (or none), there is
> +		 * nothing meaningful for the governor to choose.  Skip the
> +		 * governor and always use state 0.
> +		 */
> +		call_cpuidle(drv, dev, 0);
>  	}
>  
>  exit_idle:
> 
> 
> 

Duh, good catch.
Reviewed-by: Christian Loehle <christian.loehle@arm.com>


* Re: [PATCH v1] sched: idle: Make skipping governor callbacks more consistent
  2026-03-09  9:13           ` Christian Loehle
@ 2026-03-09 12:26             ` Rafael J. Wysocki
  2026-03-10  3:57               ` Qais Yousef
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-09 12:26 UTC (permalink / raw)
  To: Christian Loehle, Linux PM
  Cc: Qais Yousef, Thomas Gleixner, LKML, Peter Zijlstra,
	Frederic Weisbecker, Aboorva Devarajan

On Mon, Mar 9, 2026 at 10:13 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 3/7/26 16:12, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > If the cpuidle governor .select() callback is skipped because there
> > is only one idle state in the cpuidle driver, the .reflect() callback
> > should be skipped as well, at least for consistency (if not for
> > correctness), so do it.
> >
> > Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available")
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/cpuidle/cpuidle.c |   10 ----------
> >  kernel/sched/idle.c       |   11 ++++++++++-
> >  2 files changed, 10 insertions(+), 11 deletions(-)
> >
> > --- a/drivers/cpuidle/cpuidle.c
> > +++ b/drivers/cpuidle/cpuidle.c
> > @@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
> >  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >                  bool *stop_tick)
> >  {
> > -     /*
> > -      * If there is only a single idle state (or none), there is nothing
> > -      * meaningful for the governor to choose. Skip the governor and
> > -      * always use state 0 with the tick running.
> > -      */
> > -     if (drv->state_count <= 1) {
> > -             *stop_tick = false;
> > -             return 0;
> > -     }
> > -
> >       return cpuidle_curr_governor->select(drv, dev, stop_tick);
> >  }
> >
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -221,7 +221,7 @@ static void cpuidle_idle_call(void)
> >
> >               next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> >               call_cpuidle(drv, dev, next_state);
> > -     } else {
> > +     } else if (drv->state_count > 1) {
> >               bool stop_tick = true;
> >
> >               /*
> > @@ -239,6 +239,15 @@ static void cpuidle_idle_call(void)
> >                * Give the governor an opportunity to reflect on the outcome
> >                */
> >               cpuidle_reflect(dev, entered_state);
> > +     } else {
> > +             tick_nohz_idle_retain_tick();
> > +
> > +             /*
> > +              * If there is only a single idle state (or none), there is
> > +              * nothing meaningful for the governor to choose.  Skip the
> > +              * governor and always use state 0.
> > +              */
> > +             call_cpuidle(drv, dev, 0);
> >       }
> >
> >  exit_idle:
> >
> >
> >
>
> Duh, good catch.
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>

OK, so any objections or concerns from anyone?

I'm about to queue this up for the next -rc.


* Re: [PATCH v1] sched: idle: Make skipping governor callbacks more consistent
  2026-03-07 16:12         ` [PATCH v1] sched: idle: Make skipping governor callbacks more consistent Rafael J. Wysocki
  2026-03-09  9:13           ` Christian Loehle
@ 2026-03-09 12:44           ` Aboorva Devarajan
  2026-03-10 14:28           ` Frederic Weisbecker
  2 siblings, 0 replies; 29+ messages in thread
From: Aboorva Devarajan @ 2026-03-09 12:44 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Qais Yousef, Christian Loehle, Thomas Gleixner, LKML,
	Peter Zijlstra, Frederic Weisbecker

On Sat, 2026-03-07 at 17:12 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If the cpuidle governor .select() callback is skipped because there
> is only one idle state in the cpuidle driver, the .reflect() callback
> should be skipped as well, at least for consistency (if not for
> correctness), so do it.
> 
> Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available")
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/cpuidle.c |   10 ----------
>  kernel/sched/idle.c       |   11 ++++++++++-
>  2 files changed, 10 insertions(+), 11 deletions(-)
> 
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
>  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  		   bool *stop_tick)
>  {
> -	/*
> -	 * If there is only a single idle state (or none), there is nothing
> -	 * meaningful for the governor to choose. Skip the governor and
> -	 * always use state 0 with the tick running.
> -	 */
> -	if (drv->state_count <= 1) {
> -		*stop_tick = false;
> -		return 0;
> -	}
> -
>  	return cpuidle_curr_governor->select(drv, dev, stop_tick);
>  }
>  
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -221,7 +221,7 @@ static void cpuidle_idle_call(void)
>  
>  		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
>  		call_cpuidle(drv, dev, next_state);
> -	} else {
> +	} else if (drv->state_count > 1) {
>  		bool stop_tick = true;
>  
>  		/*
> @@ -239,6 +239,15 @@ static void cpuidle_idle_call(void)
>  		 * Give the governor an opportunity to reflect on the outcome
>  		 */
>  		cpuidle_reflect(dev, entered_state);
> +	} else {
> +		tick_nohz_idle_retain_tick();
> +
> +		/*
> +		 * If there is only a single idle state (or none), there is
> +		 * nothing meaningful for the governor to choose.  Skip the
> +		 * governor and always use state 0.
> +		 */
> +		call_cpuidle(drv, dev, 0);
>  	}
>  
>  exit_idle:
> 
> 

Hi Rafael,

Thanks for fixing this, sorry I missed it earlier.

Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>

Regards,
Aboorva


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-07 16:25             ` Rafael J. Wysocki
@ 2026-03-10  3:54               ` Qais Yousef
  2026-03-10  9:18                 ` Christian Loehle
  0 siblings, 1 reply; 29+ messages in thread
From: Qais Yousef @ 2026-03-10  3:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Thomas Gleixner, LKML, Peter Zijlstra,
	Frederic Weisbecker, Linux PM

On 03/07/26 17:25, Rafael J. Wysocki wrote:

> In the meantime I realized that if the .select() governor
> callback is skipped, its .reflect() callback should be skipped
> either, so I've posted this:
> 
> https://lkml.org/lkml/2026/3/7/569
> 
> and here's a fixed version of the last patch on top of the above (for
> completeness):
> 
> ---
>  kernel/sched/idle.c |   25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
>  	return cpuidle_enter(drv, dev, next_state);
>  }
>  
> +static void idle_call_stop_or_retain_tick(bool stop_tick)
> +{
> +	if (stop_tick || tick_nohz_tick_stopped())
> +		tick_nohz_idle_stop_tick();
> +	else
> +		tick_nohz_idle_retain_tick();
> +}
> +
>  /**
>   * cpuidle_idle_call - the main idle function
>   *
> @@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
>   * set, and it returns with polling set.  If it ever stops polling, it
>   * must clear the polling bit.
>   */
> -static void cpuidle_idle_call(void)
> +static void cpuidle_idle_call(bool stop_tick)
>  {
>  	struct cpuidle_device *dev = cpuidle_get_device();
>  	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> @@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
>  	}
>  
>  	if (cpuidle_not_available(drv, dev)) {
> -		tick_nohz_idle_stop_tick();
> +		idle_call_stop_or_retain_tick(stop_tick);
>  
>  		default_idle_call();
>  		goto exit_idle;
> @@ -222,17 +230,14 @@ static void cpuidle_idle_call(void)
>  		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
>  		call_cpuidle(drv, dev, next_state);
>  	} else if (drv->state_count > 1) {
> -		bool stop_tick = true;
> +		stop_tick = true;

Silly question, but wouldn't the normal path also benefit from delaying the
tick stop by one tick? This would only matter in cases where the governor
doesn't explicitly set stop_tick to either true or false, and I am not sure
which cases those are :)

>  
>  		/*
>  		 * Ask the cpuidle framework to choose a convenient idle state.
>  		 */
>  		next_state = cpuidle_select(drv, dev, &stop_tick);
>  
> -		if (stop_tick || tick_nohz_tick_stopped())
> -			tick_nohz_idle_stop_tick();
> -		else
> -			tick_nohz_idle_retain_tick();
> +		idle_call_stop_or_retain_tick(stop_tick);
>  
>  		entered_state = call_cpuidle(drv, dev, next_state);
>  		/*
> @@ -240,7 +245,7 @@ static void cpuidle_idle_call(void)
>  		 */
>  		cpuidle_reflect(dev, entered_state);
>  	} else {
> -		tick_nohz_idle_retain_tick();
> +		idle_call_stop_or_retain_tick(stop_tick);
>  
>  		/*
>  		 * If there is only a single idle state (or none), there is
> @@ -268,6 +273,7 @@ exit_idle:
>  static void do_idle(void)
>  {
>  	int cpu = smp_processor_id();
> +	bool got_tick = false;
>  
>  	/*
>  	 * Check if we need to update blocked load
> @@ -338,8 +344,9 @@ static void do_idle(void)
>  			tick_nohz_idle_restart_tick();
>  			cpu_idle_poll();
>  		} else {
> -			cpuidle_idle_call();
> +			cpuidle_idle_call(got_tick);
>  		}
> +		got_tick = tick_nohz_idle_got_tick();
>  		arch_cpu_idle_exit();
>  	}
>  
> 
> 
> 
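
For readers following the quoted hunks, here is a small user-space model of
how got_tick feeds back into the stop-or-retain decision on the "cpuidle not
available" path. It is only a sketch: enum wake_reason, should_stop_tick()
and model_idle_loop() are hypothetical names, not kernel code, and the model
assumes (as a simplification) that a tick can only be observed while the
tick is still running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wakeup reasons for this model; not kernel identifiers. */
enum wake_reason { WAKE_TICK, WAKE_OTHER };

/* Same condition as in the quoted idle_call_stop_or_retain_tick(). */
static bool should_stop_tick(bool stop_tick, bool tick_already_stopped)
{
	return stop_tick || tick_already_stopped;
}

/*
 * Model of the inner idle loop: each array entry is one idle entry that
 * ends with the given wakeup reason and no reschedule.
 * Returns the number of idle entries made with the tick stopped.
 */
static size_t model_idle_loop(const enum wake_reason *wakes, size_t n)
{
	bool got_tick = false;		/* starts out false, as in the patch */
	bool tick_stopped = false;
	size_t stopped_entries = 0;

	assert(wakes || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (should_stop_tick(got_tick, tick_stopped))
			tick_stopped = true;
		if (tick_stopped)
			stopped_entries++;

		/* Simplifying assumption: no tick is seen once it is stopped. */
		got_tick = !tick_stopped && wakes[i] == WAKE_TICK;
	}
	return stopped_entries;
}
```

In this model, short idle periods that end before the next tick never stop
the tick. The tick is stopped only on the idle entry that follows one
interrupted by the tick, and it then stays stopped for the rest of the loop.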

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1] sched: idle: Make skipping governor callbacks more consistent
  2026-03-09 12:26             ` Rafael J. Wysocki
@ 2026-03-10  3:57               ` Qais Yousef
  0 siblings, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2026-03-10  3:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Linux PM, Thomas Gleixner, LKML, Peter Zijlstra,
	Frederic Weisbecker, Aboorva Devarajan

On 03/09/26 13:26, Rafael J. Wysocki wrote:
> On Mon, Mar 9, 2026 at 10:13 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > On 3/7/26 16:12, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > If the cpuidle governor .select() callback is skipped because there
> > > is only one idle state in the cpuidle driver, the .reflect() callback
> > > should be skipped as well, at least for consistency (if not for
> > > correctness), so do it.
> > >
> > > Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available")
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >  drivers/cpuidle/cpuidle.c |   10 ----------
> > >  kernel/sched/idle.c       |   11 ++++++++++-
> > >  2 files changed, 10 insertions(+), 11 deletions(-)
> > >
> > > --- a/drivers/cpuidle/cpuidle.c
> > > +++ b/drivers/cpuidle/cpuidle.c
> > > @@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
> > >  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > >                  bool *stop_tick)
> > >  {
> > > -     /*
> > > -      * If there is only a single idle state (or none), there is nothing
> > > -      * meaningful for the governor to choose. Skip the governor and
> > > -      * always use state 0 with the tick running.
> > > -      */
> > > -     if (drv->state_count <= 1) {
> > > -             *stop_tick = false;
> > > -             return 0;
> > > -     }
> > > -
> > >       return cpuidle_curr_governor->select(drv, dev, stop_tick);
> > >  }
> > >
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -221,7 +221,7 @@ static void cpuidle_idle_call(void)
> > >
> > >               next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> > >               call_cpuidle(drv, dev, next_state);
> > > -     } else {
> > > +     } else if (drv->state_count > 1) {
> > >               bool stop_tick = true;
> > >
> > >               /*
> > > @@ -239,6 +239,15 @@ static void cpuidle_idle_call(void)
> > >                * Give the governor an opportunity to reflect on the outcome
> > >                */
> > >               cpuidle_reflect(dev, entered_state);
> > > +     } else {
> > > +             tick_nohz_idle_retain_tick();
> > > +
> > > +             /*
> > > +              * If there is only a single idle state (or none), there is
> > > +              * nothing meaningful for the governor to choose.  Skip the
> > > +              * governor and always use state 0.
> > > +              */
> > > +             call_cpuidle(drv, dev, 0);
> > >       }
> > >
> > >  exit_idle:
> > >
> > >
> > >
> >
> > Duh, good catch.
> > Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> 
> OK, so any objections or concerns from anyone?
> 
> I'm about to queue this up for the next -rc.

LGTM too
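
For reference, the select/reflect pairing that the quoted patch establishes
can be modeled in user space roughly as below. This is a sketch under stated
assumptions: struct model_governor, model_select(), model_reflect() and
model_idle_call() are hypothetical stand-ins, and the state index and
stop_tick value chosen by the model governor are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a governor; counts how often each callback runs. */
struct model_governor {
	int select_calls;
	int reflect_calls;
};

static int model_select(struct model_governor *gov, bool *stop_tick)
{
	gov->select_calls++;
	*stop_tick = false;	/* arbitrary choice for the model */
	return 1;		/* arbitrary non-zero state index */
}

static void model_reflect(struct model_governor *gov, int entered_state)
{
	(void)entered_state;
	gov->reflect_calls++;
}

/* Returns the index of the idle state that the model "entered". */
static int model_idle_call(struct model_governor *gov, int state_count)
{
	assert(gov);

	if (state_count > 1) {
		bool stop_tick = true;
		int state = model_select(gov, &stop_tick);

		model_reflect(gov, state);
		return state;
	}

	/* Single state (or none): skip both callbacks and use state 0. */
	return 0;
}
```

The point of the model is only that both callbacks are skipped together, so
the governor never gets a reflect call for a state it did not select.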


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-10  3:54               ` Qais Yousef
@ 2026-03-10  9:18                 ` Christian Loehle
  2026-03-10 15:03                   ` Qais Yousef
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2026-03-10  9:18 UTC (permalink / raw)
  To: Qais Yousef, Rafael J. Wysocki
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Frederic Weisbecker,
	Linux PM

On 3/10/26 03:54, Qais Yousef wrote:
> On 03/07/26 17:25, Rafael J. Wysocki wrote:
> 
>> In the meantime I realized that if the .select() governor
>> callback is skipped, its .reflect() callback should be skipped
>> either, so I've posted this:
>>
>> https://lkml.org/lkml/2026/3/7/569
>>
>> and here's a fixed version of the last patch on top of the above (for
>> completeness):
>>
>> ---
>>  kernel/sched/idle.c |   25 ++++++++++++++++---------
>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
>>  	return cpuidle_enter(drv, dev, next_state);
>>  }
>>  
>> +static void idle_call_stop_or_retain_tick(bool stop_tick)
>> +{
>> +	if (stop_tick || tick_nohz_tick_stopped())
>> +		tick_nohz_idle_stop_tick();
>> +	else
>> +		tick_nohz_idle_retain_tick();
>> +}
>> +
>>  /**
>>   * cpuidle_idle_call - the main idle function
>>   *
>> @@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
>>   * set, and it returns with polling set.  If it ever stops polling, it
>>   * must clear the polling bit.
>>   */
>> -static void cpuidle_idle_call(void)
>> +static void cpuidle_idle_call(bool stop_tick)
>>  {
>>  	struct cpuidle_device *dev = cpuidle_get_device();
>>  	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
>> @@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
>>  	}
>>  
>>  	if (cpuidle_not_available(drv, dev)) {
>> -		tick_nohz_idle_stop_tick();
>> +		idle_call_stop_or_retain_tick(stop_tick);
>>  
>>  		default_idle_call();
>>  		goto exit_idle;
>> @@ -222,17 +230,14 @@ static void cpuidle_idle_call(void)
>>  		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
>>  		call_cpuidle(drv, dev, next_state);
>>  	} else if (drv->state_count > 1) {
>> -		bool stop_tick = true;
>> +		stop_tick = true;
> 
> Silly question, but wouldn't this benefit the normal path too to delay for one
> tick? This will only matter for the cases where the governor doesn't explicitly
> set stop_tick to either true or false - which I am not sure what they are :)
> 
Right now the governors will always set stop_tick explicitly (and overriding
that might confuse the governor-internal state).


* Re: [PATCH v1] sched: idle: Make skipping governor callbacks more consistent
  2026-03-07 16:12         ` [PATCH v1] sched: idle: Make skipping governor callbacks more consistent Rafael J. Wysocki
  2026-03-09  9:13           ` Christian Loehle
  2026-03-09 12:44           ` Aboorva Devarajan
@ 2026-03-10 14:28           ` Frederic Weisbecker
  2 siblings, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2026-03-10 14:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, Qais Yousef, Christian Loehle, Thomas Gleixner, LKML,
	Peter Zijlstra, Aboorva Devarajan

Le Sat, Mar 07, 2026 at 05:12:05PM +0100, Rafael J. Wysocki a écrit :
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If the cpuidle governor .select() callback is skipped because there
> is only one idle state in the cpuidle driver, the .reflect() callback
> should be skipped as well, at least for consistency (if not for
> correctness), so do it.
> 
> Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available")
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

-- 
Frederic Weisbecker
SUSE Labs


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-10  9:18                 ` Christian Loehle
@ 2026-03-10 15:03                   ` Qais Yousef
  2026-03-10 15:09                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Qais Yousef @ 2026-03-10 15:03 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, Thomas Gleixner, LKML, Peter Zijlstra,
	Frederic Weisbecker, Linux PM

On 03/10/26 09:18, Christian Loehle wrote:
> On 3/10/26 03:54, Qais Yousef wrote:
> > On 03/07/26 17:25, Rafael J. Wysocki wrote:
> > 
> >> In the meantime I realized that if the .select() governor
> >> callback is skipped, its .reflect() callback should be skipped
> >> either, so I've posted this:
> >>
> >> https://lkml.org/lkml/2026/3/7/569
> >>
> >> and here's a fixed version of the last patch on top of the above (for
> >> completeness):
> >>
> >> ---
> >>  kernel/sched/idle.c |   25 ++++++++++++++++---------
> >>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
> >>  	return cpuidle_enter(drv, dev, next_state);
> >>  }
> >>  
> >> +static void idle_call_stop_or_retain_tick(bool stop_tick)
> >> +{
> >> +	if (stop_tick || tick_nohz_tick_stopped())
> >> +		tick_nohz_idle_stop_tick();
> >> +	else
> >> +		tick_nohz_idle_retain_tick();
> >> +}
> >> +
> >>  /**
> >>   * cpuidle_idle_call - the main idle function
> >>   *
> >> @@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
> >>   * set, and it returns with polling set.  If it ever stops polling, it
> >>   * must clear the polling bit.
> >>   */
> >> -static void cpuidle_idle_call(void)
> >> +static void cpuidle_idle_call(bool stop_tick)
> >>  {
> >>  	struct cpuidle_device *dev = cpuidle_get_device();
> >>  	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> >> @@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
> >>  	}
> >>  
> >>  	if (cpuidle_not_available(drv, dev)) {
> >> -		tick_nohz_idle_stop_tick();
> >> +		idle_call_stop_or_retain_tick(stop_tick);
> >>  
> >>  		default_idle_call();
> >>  		goto exit_idle;
> >> @@ -222,17 +230,14 @@ static void cpuidle_idle_call(void)
> >>  		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> >>  		call_cpuidle(drv, dev, next_state);
> >>  	} else if (drv->state_count > 1) {
> >> -		bool stop_tick = true;
> >> +		stop_tick = true;
> > 
> > Silly question, but wouldn't this benefit the normal path too to delay for one
> > tick? This will only matter for the cases where the governor doesn't explicitly
> > set stop_tick to either true or false - which I am not sure what they are :)
> > 
> Right now the governors will always set stop_tick explicitly (and overriding
> that might confuse the governor-internal state).

So we can drop this hunk, then.


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-10 15:03                   ` Qais Yousef
@ 2026-03-10 15:09                     ` Rafael J. Wysocki
  2026-03-10 15:14                       ` Qais Yousef
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2026-03-10 15:09 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Christian Loehle, Rafael J. Wysocki, Thomas Gleixner, LKML,
	Peter Zijlstra, Frederic Weisbecker, Linux PM

On Tue, Mar 10, 2026 at 4:03 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 03/10/26 09:18, Christian Loehle wrote:
> > On 3/10/26 03:54, Qais Yousef wrote:
> > > On 03/07/26 17:25, Rafael J. Wysocki wrote:
> > >
> > >> In the meantime I realized that if the .select() governor
> > >> callback is skipped, its .reflect() callback should be skipped
> > >> either, so I've posted this:
> > >>
> > >> https://lkml.org/lkml/2026/3/7/569
> > >>
> > >> and here's a fixed version of the last patch on top of the above (for
> > >> completeness):
> > >>
> > >> ---
> > >>  kernel/sched/idle.c |   25 ++++++++++++++++---------
> > >>  1 file changed, 16 insertions(+), 9 deletions(-)
> > >>
> > >> --- a/kernel/sched/idle.c
> > >> +++ b/kernel/sched/idle.c
> > >> @@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
> > >>    return cpuidle_enter(drv, dev, next_state);
> > >>  }
> > >>
> > >> +static void idle_call_stop_or_retain_tick(bool stop_tick)
> > >> +{
> > >> +  if (stop_tick || tick_nohz_tick_stopped())
> > >> +          tick_nohz_idle_stop_tick();
> > >> +  else
> > >> +          tick_nohz_idle_retain_tick();
> > >> +}
> > >> +
> > >>  /**
> > >>   * cpuidle_idle_call - the main idle function
> > >>   *
> > >> @@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
> > >>   * set, and it returns with polling set.  If it ever stops polling, it
> > >>   * must clear the polling bit.
> > >>   */
> > >> -static void cpuidle_idle_call(void)
> > >> +static void cpuidle_idle_call(bool stop_tick)
> > >>  {
> > >>    struct cpuidle_device *dev = cpuidle_get_device();
> > >>    struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> > >> @@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
> > >>    }
> > >>
> > >>    if (cpuidle_not_available(drv, dev)) {
> > >> -          tick_nohz_idle_stop_tick();
> > >> +          idle_call_stop_or_retain_tick(stop_tick);
> > >>
> > >>            default_idle_call();
> > >>            goto exit_idle;
> > >> @@ -222,17 +230,14 @@ static void cpuidle_idle_call(void)
> > >>            next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> > >>            call_cpuidle(drv, dev, next_state);
> > >>    } else if (drv->state_count > 1) {
> > >> -          bool stop_tick = true;
> > >> +          stop_tick = true;
> > >
> > > Silly question, but wouldn't this benefit the normal path too to delay for one
> > > tick? This will only matter for the cases where the governor doesn't explicitly
> > > set stop_tick to either true or false - which I am not sure what they are :)
> > >
> > Right now the governors will always set stop_tick explicitly (and overriding
> > that might confuse the governor-internal state).
>
> So we can drop this hunk then

Not really.

The governors expect stop_tick to be true by default, and they clear it
if needed/desired.  They may be confused if it is false to start with
(theoretically, a governor may then select an idle state with a target
residency beyond the tick period length, which won't make sense).
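
A user-space sketch of that expectation, for illustration only: the
governor below is hypothetical (struct model_state and model_select() are
made-up names, not kernel code). It picks the deepest state whose target
residency fits the predicted idle duration and only ever clears stop_tick,
relying on the caller to have set it to true beforehand.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical idle state table entry; only the field needed here. */
struct model_state {
	long target_residency_ns;
};

/*
 * Hypothetical governor: selects the deepest state whose target residency
 * fits the predicted idle duration and clears *stop_tick for short idle
 * periods. It never sets *stop_tick to true by itself.
 */
static int model_select(const struct model_state *states, int count,
			long predicted_idle_ns, long tick_ns, bool *stop_tick)
{
	int idx = 0;

	assert(states && count > 0 && stop_tick);

	for (int i = 1; i < count; i++) {
		if (states[i].target_residency_ns <= predicted_idle_ns)
			idx = i;
	}

	if (predicted_idle_ns < tick_ns)
		*stop_tick = false;

	return idx;
}
```

With states of 0, 1 ms and 8 ms target residency, a 4 ms tick and a 10 ms
predicted idle duration, this model returns the 8 ms state either way. If
the caller passed stop_tick as true, the tick gets stopped and the choice is
consistent. If the caller passed false, the tick keeps running every 4 ms
and the 8 ms state cannot reach its target residency.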


* Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
  2026-03-10 15:09                     ` Rafael J. Wysocki
@ 2026-03-10 15:14                       ` Qais Yousef
  0 siblings, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2026-03-10 15:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Thomas Gleixner, LKML, Peter Zijlstra,
	Frederic Weisbecker, Linux PM

On 03/10/26 16:09, Rafael J. Wysocki wrote:
> On Tue, Mar 10, 2026 at 4:03 PM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 03/10/26 09:18, Christian Loehle wrote:
> > > On 3/10/26 03:54, Qais Yousef wrote:
> > > > On 03/07/26 17:25, Rafael J. Wysocki wrote:
> > > >
> > > >> In the meantime I realized that if the .select() governor
> > > >> callback is skipped, its .reflect() callback should be skipped
> > > >> either, so I've posted this:
> > > >>
> > > >> https://lkml.org/lkml/2026/3/7/569
> > > >>
> > > >> and here's a fixed version of the last patch on top of the above (for
> > > >> completeness):
> > > >>
> > > >> ---
> > > >>  kernel/sched/idle.c |   25 ++++++++++++++++---------
> > > >>  1 file changed, 16 insertions(+), 9 deletions(-)
> > > >>
> > > >> --- a/kernel/sched/idle.c
> > > >> +++ b/kernel/sched/idle.c
> > > >> @@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
> > > >>    return cpuidle_enter(drv, dev, next_state);
> > > >>  }
> > > >>
> > > >> +static void idle_call_stop_or_retain_tick(bool stop_tick)
> > > >> +{
> > > >> +  if (stop_tick || tick_nohz_tick_stopped())
> > > >> +          tick_nohz_idle_stop_tick();
> > > >> +  else
> > > >> +          tick_nohz_idle_retain_tick();
> > > >> +}
> > > >> +
> > > >>  /**
> > > >>   * cpuidle_idle_call - the main idle function
> > > >>   *
> > > >> @@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
> > > >>   * set, and it returns with polling set.  If it ever stops polling, it
> > > >>   * must clear the polling bit.
> > > >>   */
> > > >> -static void cpuidle_idle_call(void)
> > > >> +static void cpuidle_idle_call(bool stop_tick)
> > > >>  {
> > > >>    struct cpuidle_device *dev = cpuidle_get_device();
> > > >>    struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> > > >> @@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
> > > >>    }
> > > >>
> > > >>    if (cpuidle_not_available(drv, dev)) {
> > > >> -          tick_nohz_idle_stop_tick();
> > > >> +          idle_call_stop_or_retain_tick(stop_tick);
> > > >>
> > > >>            default_idle_call();
> > > >>            goto exit_idle;
> > > >> @@ -222,17 +230,14 @@ static void cpuidle_idle_call(void)
> > > >>            next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> > > >>            call_cpuidle(drv, dev, next_state);
> > > >>    } else if (drv->state_count > 1) {
> > > >> -          bool stop_tick = true;
> > > >> +          stop_tick = true;
> > > >
> > > > Silly question, but wouldn't this benefit the normal path too to delay for one
> > > > tick? This will only matter for the cases where the governor doesn't explicitly
> > > > set stop_tick to either true or false - which I am not sure what they are :)
> > > >
> > > Right now the governors will always set stop_tick explicitly (and overriding
> > > that might confuse the governor-internal state).
> >
> > So we can drop this hunk then
> 
> Not really.
> 
> The governors expect that stop_tick is true by default and they clear
> it if needed/desired.  They may be confused if it is false to start
> with (theoretically, a governor may select an idle state with target
> residency beyond the tick period length then which won't make sense).

I see, thanks for the explanation. It could be just me, but if you think a
comment documenting this expectation is worthwhile, it would be nice to have.

LGTM anyway.

Cheers


end of thread, other threads:[~2026-03-10 15:14 UTC | newest]

Thread overview: 29+ messages
2026-03-01 19:30 [patch 0/2] sched/idle: Prevent pointless NOHZ transitions in default_idle_call() Thomas Gleixner
2026-03-01 19:30 ` [patch 1/2] sched/idle: Make default_idle_call() static Thomas Gleixner
2026-03-01 19:30 ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Thomas Gleixner
2026-03-02  6:05   ` K Prateek Nayak
2026-03-02 10:43   ` Frederic Weisbecker
2026-03-02 11:03     ` Christian Loehle
2026-03-02 11:11       ` Frederic Weisbecker
2026-03-02 11:39         ` Christian Loehle
2026-03-04  3:35           ` Qais Yousef
2026-03-02 11:03   ` Christian Loehle
2026-03-02 21:25     ` Rafael J. Wysocki
2026-03-04  3:03       ` Qais Yousef
2026-03-06 21:21         ` Rafael J. Wysocki
2026-03-06 21:31           ` Rafael J. Wysocki
2026-03-07 16:25             ` Rafael J. Wysocki
2026-03-10  3:54               ` Qais Yousef
2026-03-10  9:18                 ` Christian Loehle
2026-03-10 15:03                   ` Qais Yousef
2026-03-10 15:09                     ` Rafael J. Wysocki
2026-03-10 15:14                       ` Qais Yousef
2026-03-07 16:12         ` [PATCH v1] sched: idle: Make skipping governor callbacks more consistent Rafael J. Wysocki
2026-03-09  9:13           ` Christian Loehle
2026-03-09 12:26             ` Rafael J. Wysocki
2026-03-10  3:57               ` Qais Yousef
2026-03-09 12:44           ` Aboorva Devarajan
2026-03-10 14:28           ` Frederic Weisbecker
2026-03-02 12:17   ` [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware Peter Zijlstra
2026-03-02 12:19     ` Peter Zijlstra
2026-03-02 21:23       ` Rafael J. Wysocki
