* [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
@ 2025-08-13 10:25 ` Rafael J. Wysocki
2025-08-13 19:13 ` Christian Loehle
` (2 more replies)
2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
` (2 subsequent siblings)
3 siblings, 3 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-13 10:25 UTC (permalink / raw)
To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Occasionally, the exit latency of the idle state selected by the menu
governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the
scheduler tick has been stopped already and predicted_ns is greater than
the tick period length, the governor may return an idle state whose exit
latency exceeds latency_req because that decision is made before
checking the current idle state's exit latency.
For instance, say that there are 3 idle states, 0, 1, and 2. For idle
states 0 and 1, the exit latency is equal to the target residency and
the values are 0 and 5 us, respectively. State 2 is deeper and has the
exit latency and target residency of 200 us and 2 ms (which is greater
than the tick period length), respectively.
Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
limit is 20 us. After the first two iterations of the main loop in
menu_select(), idx becomes 1 and in the third iteration of it the target
residency of the current state (state 2) is greater than predicted_ns.
State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
so the check on whether or not the tick has been stopped is done. Say
that the tick has been stopped already and there are no imminent timers
(that is, delta_tick is greater than the target residency of state 2).
In that case, idx becomes 2 and it is returned immediately, but the exit
latency of state 2 exceeds the latency limit.
Address this issue by modifying the code to compare the exit latency of
the current idle state (idle state i) with the latency limit before
comparing its target residecy with predicted_ns, which allows one
more exit_latency_ns check that becomes redundant to be dropped.
However, after the above change, latency_req cannot take the predicted_ns
value any more, which takes place after commit 38f83090f515 ("cpuidle:
menu: Remove iowait influence"), because it may cause a polling state
to be returned prematurely.
In the context of the previous example say that predicted_ns is 3000 and
the PM QoS latency limit is still 20 us. Additionally, say that idle
state 0 is a polling one. Moving the exit_latency_ns check before the
target_residency_ns one causes the loop to terminate in the second
iteration, before the target_residency_ns check, so idle state 0 will be
returned even though previously state 1 would be returned if there were
no imminent timers.
For this reason, remove the assignment of the predicted_ns value to
latency_req from the code.
Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpuidle/governors/menu.c | 29 ++++++++++++-----------------
1 file changed, 12 insertions(+), 17 deletions(-)
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,20 +287,15 @@
return 0;
}
- if (tick_nohz_tick_stopped()) {
- /*
- * If the tick is already stopped, the cost of possible short
- * idle duration misprediction is much higher, because the CPU
- * may be stuck in a shallow idle state for a long time as a
- * result of it. In that case say we might mispredict and use
- * the known time till the closest timer event for the idle
- * state selection.
- */
- if (predicted_ns < TICK_NSEC)
- predicted_ns = data->next_timer_ns;
- } else if (latency_req > predicted_ns) {
- latency_req = predicted_ns;
- }
+ /*
+ * If the tick is already stopped, the cost of possible short idle
+ * duration misprediction is much higher, because the CPU may be stuck
+ * in a shallow idle state for a long time as a result of it. In that
+ * case, say we might mispredict and use the known time till the closest
+ * timer event for the idle state selection.
+ */
+ if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
+ predicted_ns = data->next_timer_ns;
/*
* Find the idle state with the lowest power while satisfying
@@ -316,13 +311,15 @@
if (idx == -1)
idx = i; /* first enabled state */
+ if (s->exit_latency_ns > latency_req)
+ break;
+
if (s->target_residency_ns > predicted_ns) {
/*
* Use a physical idle state, not busy polling, unless
* a timer is going to trigger soon enough.
*/
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
- s->exit_latency_ns <= latency_req &&
s->target_residency_ns <= data->next_timer_ns) {
predicted_ns = s->target_residency_ns;
idx = i;
@@ -354,8 +351,6 @@
return idx;
}
- if (s->exit_latency_ns > latency_req)
- break;
idx = i;
}
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
@ 2025-08-13 19:13 ` Christian Loehle
2025-08-18 17:08 ` Rafael J. Wysocki
2025-09-11 13:37 ` Frederic Weisbecker
2025-10-23 3:05 ` Doug Smythies
2 siblings, 1 reply; 27+ messages in thread
From: Christian Loehle @ 2025-08-13 19:13 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra
On 8/13/25 11:25, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Occasionally, the exit latency of the idle state selected by the menu
> governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the
> scheduler tick has been stopped already and predicted_ns is greater than
> the tick period length, the governor may return an idle state whose exit
> latency exceeds latency_req because that decision is made before
> checking the current idle state's exit latency.
>
> For instance, say that there are 3 idle states, 0, 1, and 2. For idle
> states 0 and 1, the exit latency is equal to the target residency and
> the values are 0 and 5 us, respectively. State 2 is deeper and has the
> exit latency and target residency of 200 us and 2 ms (which is greater
> than the tick period length), respectively.
>
> Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
> limit is 20 us. After the first two iterations of the main loop in
> menu_select(), idx becomes 1 and in the third iteration of it the target
Can drop "of it" here?
> residency of the current state (state 2) is greater than predicted_ns.
> State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
> so the check on whether or not the tick has been stopped is done. Say
> that the tick has been stopped already and there are no imminent timers
> (that is, delta_tick is greater than the target residency of state 2).
> In that case, idx becomes 2 and it is returned immediately, but the exit
> latency of state 2 exceeds the latency limit.
>
> Address this issue by modifying the code to compare the exit latency of
> the current idle state (idle state i) with the latency limit before
> comparing its target residecy with predicted_ns, which allows one
residency
> more exit_latency_ns check that becomes redundant to be dropped.
>
> However, after the above change, latency_req cannot take the predicted_ns
> value any more, which takes place after commit 38f83090f515 ("cpuidle:
> menu: Remove iowait influence"), because it may cause a polling state
> to be returned prematurely.
>
> In the context of the previous example say that predicted_ns is 3000 and
> the PM QoS latency limit is still 20 us. Additionally, say that idle
> state 0 is a polling one. Moving the exit_latency_ns check before the
> target_residency_ns one causes the loop to terminate in the second
> iteration, before the target_residency_ns check, so idle state 0 will be
> returned even though previously state 1 would be returned if there were
> no imminent timers.
>
> For this reason, remove the assignment of the predicted_ns value to
> latency_req from the code.
>
> Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> drivers/cpuidle/governors/menu.c | 29 ++++++++++++-----------------
> 1 file changed, 12 insertions(+), 17 deletions(-)
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -287,20 +287,15 @@
> return 0;
> }
>
> - if (tick_nohz_tick_stopped()) {
> - /*
> - * If the tick is already stopped, the cost of possible short
> - * idle duration misprediction is much higher, because the CPU
> - * may be stuck in a shallow idle state for a long time as a
> - * result of it. In that case say we might mispredict and use
> - * the known time till the closest timer event for the idle
> - * state selection.
> - */
> - if (predicted_ns < TICK_NSEC)
> - predicted_ns = data->next_timer_ns;
> - } else if (latency_req > predicted_ns) {
> - latency_req = predicted_ns;
> - }
> + /*
> + * If the tick is already stopped, the cost of possible short idle
> + * duration misprediction is much higher, because the CPU may be stuck
> + * in a shallow idle state for a long time as a result of it. In that
> + * case, say we might mispredict and use the known time till the closest
> + * timer event for the idle state selection.
> + */
> + if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> + predicted_ns = data->next_timer_ns;
>
> /*
> * Find the idle state with the lowest power while satisfying
> @@ -316,13 +311,15 @@
> if (idx == -1)
> idx = i; /* first enabled state */
>
> + if (s->exit_latency_ns > latency_req)
> + break;
> +
> if (s->target_residency_ns > predicted_ns) {
> /*
> * Use a physical idle state, not busy polling, unless
> * a timer is going to trigger soon enough.
> */
> if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> - s->exit_latency_ns <= latency_req &&
> s->target_residency_ns <= data->next_timer_ns) {
> predicted_ns = s->target_residency_ns;
> idx = i;
> @@ -354,8 +351,6 @@
>
> return idx;
> }
> - if (s->exit_latency_ns > latency_req)
> - break;
>
> idx = i;
> }
>
>
>
Good catch!
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
* Re: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-08-13 19:13 ` Christian Loehle
@ 2025-08-18 17:08 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-18 17:08 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
Peter Zijlstra
On Wed, Aug 13, 2025 at 9:13 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/13/25 11:25, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Occasionally, the exit latency of the idle state selected by the menu
> > governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the
> > scheduler tick has been stopped already and predicted_ns is greater than
> > the tick period length, the governor may return an idle state whose exit
> > latency exceeds latency_req because that decision is made before
> > checking the current idle state's exit latency.
> >
> > For instance, say that there are 3 idle states, 0, 1, and 2. For idle
> > states 0 and 1, the exit latency is equal to the target residency and
> > the values are 0 and 5 us, respectively. State 2 is deeper and has the
> > exit latency and target residency of 200 us and 2 ms (which is greater
> > than the tick period length), respectively.
> >
> > Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
> > limit is 20 us. After the first two iterations of the main loop in
> > menu_select(), idx becomes 1 and in the third iteration of it the target
> Can drop "of it" here?
But it also doesn't really hurt, I think.
> > residency of the current state (state 2) is greater than predicted_ns.
> > State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
> > so the check on whether or not the tick has been stopped is done. Say
> > that the tick has been stopped already and there are no imminent timers
> > (that is, delta_tick is greater than the target residency of state 2).
> > In that case, idx becomes 2 and it is returned immediately, but the exit
> > latency of state 2 exceeds the latency limit.
> >
> > Address this issue by modifying the code to compare the exit latency of
> > the current idle state (idle state i) with the latency limit before
> > comparing its target residecy with predicted_ns, which allows one
> residency
Fixed while applying.
> > more exit_latency_ns check that becomes redundant to be dropped.
> >
> > However, after the above change, latency_req cannot take the predicted_ns
> > value any more, which takes place after commit 38f83090f515 ("cpuidle:
> > menu: Remove iowait influence"), because it may cause a polling state
> > to be returned prematurely.
> >
> > In the context of the previous example say that predicted_ns is 3000 and
> > the PM QoS latency limit is still 20 us. Additionally, say that idle
> > state 0 is a polling one. Moving the exit_latency_ns check before the
> > target_residency_ns one causes the loop to terminate in the second
> > iteration, before the target_residency_ns check, so idle state 0 will be
> > returned even though previously state 1 would be returned if there were
> > no imminent timers.
> >
> > For this reason, remove the assignment of the predicted_ns value to
> > latency_req from the code.
> >
> > Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
> > Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > drivers/cpuidle/governors/menu.c | 29 ++++++++++++-----------------
> > 1 file changed, 12 insertions(+), 17 deletions(-)
> >
> > --- a/drivers/cpuidle/governors/menu.c
> > +++ b/drivers/cpuidle/governors/menu.c
> > @@ -287,20 +287,15 @@
> > return 0;
> > }
> >
> > - if (tick_nohz_tick_stopped()) {
> > - /*
> > - * If the tick is already stopped, the cost of possible short
> > - * idle duration misprediction is much higher, because the CPU
> > - * may be stuck in a shallow idle state for a long time as a
> > - * result of it. In that case say we might mispredict and use
> > - * the known time till the closest timer event for the idle
> > - * state selection.
> > - */
> > - if (predicted_ns < TICK_NSEC)
> > - predicted_ns = data->next_timer_ns;
> > - } else if (latency_req > predicted_ns) {
> > - latency_req = predicted_ns;
> > - }
> > + /*
> > + * If the tick is already stopped, the cost of possible short idle
> > + * duration misprediction is much higher, because the CPU may be stuck
> > + * in a shallow idle state for a long time as a result of it. In that
> > + * case, say we might mispredict and use the known time till the closest
> > + * timer event for the idle state selection.
> > + */
> > + if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> > + predicted_ns = data->next_timer_ns;
> >
> > /*
> > * Find the idle state with the lowest power while satisfying
> > @@ -316,13 +311,15 @@
> > if (idx == -1)
> > idx = i; /* first enabled state */
> >
> > + if (s->exit_latency_ns > latency_req)
> > + break;
> > +
> > if (s->target_residency_ns > predicted_ns) {
> > /*
> > * Use a physical idle state, not busy polling, unless
> > * a timer is going to trigger soon enough.
> > */
> > if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> > - s->exit_latency_ns <= latency_req &&
> > s->target_residency_ns <= data->next_timer_ns) {
> > predicted_ns = s->target_residency_ns;
> > idx = i;
> > @@ -354,8 +351,6 @@
> >
> > return idx;
> > }
> > - if (s->exit_latency_ns > latency_req)
> > - break;
> >
> > idx = i;
> > }
> >
> >
> >
>
> Good catch!
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Thank you!
* Re: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
2025-08-13 19:13 ` Christian Loehle
@ 2025-09-11 13:37 ` Frederic Weisbecker
2025-10-23 3:05 ` Doug Smythies
2 siblings, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2025-09-11 13:37 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM, LKML, Peter Zijlstra, Christian Loehle
Le Wed, Aug 13, 2025 at 12:25:58PM +0200, Rafael J. Wysocki a écrit :
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Occasionally, the exit latency of the idle state selected by the menu
> governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the
> scheduler tick has been stopped already and predicted_ns is greater than
> the tick period length, the governor may return an idle state whose exit
> latency exceeds latency_req because that decision is made before
> checking the current idle state's exit latency.
>
> For instance, say that there are 3 idle states, 0, 1, and 2. For idle
> states 0 and 1, the exit latency is equal to the target residency and
> the values are 0 and 5 us, respectively. State 2 is deeper and has the
> exit latency and target residency of 200 us and 2 ms (which is greater
> than the tick period length), respectively.
>
> Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
> limit is 20 us. After the first two iterations of the main loop in
> menu_select(), idx becomes 1 and in the third iteration of it the target
> residency of the current state (state 2) is greater than predicted_ns.
> State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
> so the check on whether or not the tick has been stopped is done. Say
> that the tick has been stopped already and there are no imminent timers
> (that is, delta_tick is greater than the target residency of state 2).
> In that case, idx becomes 2 and it is returned immediately, but the exit
> latency of state 2 exceeds the latency limit.
>
> Address this issue by modifying the code to compare the exit latency of
> the current idle state (idle state i) with the latency limit before
> comparing its target residecy with predicted_ns, which allows one
> more exit_latency_ns check that becomes redundant to be dropped.
>
> However, after the above change, latency_req cannot take the predicted_ns
> value any more, which takes place after commit 38f83090f515 ("cpuidle:
> menu: Remove iowait influence"), because it may cause a polling state
> to be returned prematurely.
>
> In the context of the previous example say that predicted_ns is 3000 and
> the PM QoS latency limit is still 20 us. Additionally, say that idle
> state 0 is a polling one. Moving the exit_latency_ns check before the
> target_residency_ns one causes the loop to terminate in the second
> iteration, before the target_residency_ns check, so idle state 0 will be
> returned even though previously state 1 would be returned if there were
> no imminent timers.
>
> For this reason, remove the assignment of the predicted_ns value to
> latency_req from the code.
>
> Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Too late I guess but meanwhile:
Acked-by: Frederic Weisbecker <frederic@kernel.org>
--
Frederic Weisbecker
SUSE Labs
* RE: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
2025-08-13 19:13 ` Christian Loehle
2025-09-11 13:37 ` Frederic Weisbecker
@ 2025-10-23 3:05 ` Doug Smythies
2025-10-23 14:51 ` Rafael J. Wysocki
2 siblings, 1 reply; 27+ messages in thread
From: Doug Smythies @ 2025-10-23 3:05 UTC (permalink / raw)
To: 'Rafael J. Wysocki'
Cc: 'Frederic Weisbecker', 'LKML',
'Peter Zijlstra', 'Christian Loehle',
'Linux PM', Doug Smythies
Hi Rafael,
Recent email communications about other patches had me
looking at this one again.
On 2025.08.13 03:26 Rafael wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
... snip...
> However, after the above change, latency_req cannot take the predicted_ns
> value any more, which takes place after commit 38f83090f515 ("cpuidle:
> menu: Remove iowait influence"), because it may cause a polling state
> to be returned prematurely.
>
> In the context of the previous example say that predicted_ns is 3000 and
> the PM QoS latency limit is still 20 us. Additionally, say that idle
> state 0 is a polling one. Moving the exit_latency_ns check before the
> target_residency_ns one causes the loop to terminate in the second
> iteration, before the target_residency_ns check, so idle state 0 will be
> returned even though previously state 1 would be returned if there were
> no imminent timers.
>
> For this reason, remove the assignment of the predicted_ns value to
> latency_req from the code.
Which is okay for timer-based workflow,
but what about non-timer based, or interrupt driven, workflow?
Under conditions where idle state 0, or polling, would be used a lot,
I am observing about an 11% throughput regression with this patch,
and idle state 0, polling, usage going from 20% to 0%.
From my testing of kernels 6.17-rc1, rc2, and rc3 in August and September
and again now. I missed this in August/September:
779b1a1cb13a cpuidle: governors: menu: Avoid selecting states with too much latency - v6.17-rc3
fa3fa55de0d6 cpuidle: governors: menu: Avoid using invalid recent intervals data - v6.17-rc2
baseline reference: v6.17-rc1
teo was included also. As far as I can recall its response has always been similar to rc3. At least, recently.
Three graphs are attached:
Sampling data once per 20 seconds, the test is started after the first idle sample,
and at least one sample is taken after the system returns to idle after the test.
The faster the test runs the better.
Test computer:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Distro: Ubuntu 24.04.3, server, no desktop GUI.
CPU frequency scaling driver: intel_pstate
HWP: disabled.
CPU frequency scaling governor: performance
Idle driver: intel_idle
Idle governor: menu (except teo for one compare test run)
Idle states: 4: name : description:
state0/name:POLL desc:CPUIDLE CORE POLL IDLE
state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
[-- Attachment #2: power.png --]
[-- Type: image/png, Size: 42648 bytes --]
[-- Attachment #3: 0_residency.png --]
[-- Type: image/png, Size: 34397 bytes --]
[-- Attachment #4: 1_residency.png --]
[-- Type: image/png, Size: 39974 bytes --]
* Re: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-10-23 3:05 ` Doug Smythies
@ 2025-10-23 14:51 ` Rafael J. Wysocki
2025-10-23 16:02 ` Doug Smythies
0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-10-23 14:51 UTC (permalink / raw)
To: Doug Smythies
Cc: 'Frederic Weisbecker', 'LKML',
'Peter Zijlstra', 'Christian Loehle',
'Linux PM', Doug Smythies
Hi Doug,
On Thursday, October 23, 2025 5:05:44 AM CEST Doug Smythies wrote:
> Hi Rafael,
>
> Recent email communications about other patches had me
> looking at this one again.
>
> On 2025.08.13 03:26 Rafael wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> ... snip...
>
> > However, after the above change, latency_req cannot take the predicted_ns
> > value any more, which takes place after commit 38f83090f515 ("cpuidle:
> > menu: Remove iowait influence"), because it may cause a polling state
> > to be returned prematurely.
> >
> > In the context of the previous example say that predicted_ns is 3000 and
> > the PM QoS latency limit is still 20 us. Additionally, say that idle
> > state 0 is a polling one. Moving the exit_latency_ns check before the
> > target_residency_ns one causes the loop to terminate in the second
> > iteration, before the target_residency_ns check, so idle state 0 will be
> > returned even though previously state 1 would be returned if there were
> > no imminent timers.
> >
> > For this reason, remove the assignment of the predicted_ns value to
> > latency_req from the code.
>
> Which is okay for timer-based workflow,
> but what about non-timer based, or interrupt driven, workflow?
>
> Under conditions where idle state 0, or Polling, would be used a lot,
> I am observing about a 11 % throughput regression with this patch
> And idle state 0, polling, usage going from 20% to 0%.
>
> From my testing of kernels 6.17-rc1, rc2,rc3 in August and September
> and again now. I missed this in August/September:
>
> 779b1a1cb13a cpuidle: governors: menu: Avoid selecting states with too much latency - v6.17-rc3
> fa3fa55de0d6 cpuidle: governors: menu: Avoid using invalid recent intervals data - v6.17-rc2
> baseline reference: v6.17-rc1
>
> teo was included also. As far as I can recall its response has always been similar to rc3. At least, recently.
>
> Three graphs are attached:
> Sampling data once per 20 seconds, the test is started after the first idle sample,
> and at least one sample is taken after the system returns to idle after the test.
> The faster the test runs the better.
>
> Test computer:
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> Distro: Ubuntu 24.04.3, server, no desktop GUI.
> CPU frequency scaling driver: intel_pstate
> HWP: disabled.
> CPU frequency scaling governor: performance
> Ilde driver: intel_idle
> Idle governor: menu (except teo for one compare test run)
> Idle states: 4: name : description:
> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
OK, so since the exit latency of an idle state cannot exceed its target
residency, the appended change (on top of 6.18-rc2) should make the throughput
regression go away.
---
drivers/cpuidle/governors/menu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -321,10 +321,13 @@ static int menu_select(struct cpuidle_dr
/*
* Use a physical idle state, not busy polling, unless a timer
- * is going to trigger soon enough.
+ * is going to trigger soon enough or the exit latency of the
+ * idle state in question is greater than the predicted idle
+ * duration.
*/
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
- s->target_residency_ns <= data->next_timer_ns) {
+ s->target_residency_ns <= data->next_timer_ns &&
+ s->exit_latency_ns <= predicted_ns) {
predicted_ns = s->target_residency_ns;
idx = i;
break;
* RE: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-10-23 14:51 ` Rafael J. Wysocki
@ 2025-10-23 16:02 ` Doug Smythies
2025-10-23 16:52 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Doug Smythies @ 2025-10-23 16:02 UTC (permalink / raw)
To: 'Rafael J. Wysocki'
Cc: 'Frederic Weisbecker', 'LKML',
'Peter Zijlstra', 'Christian Loehle',
'Linux PM', Doug Smythies
On 2025.10.23 07:51 Rafael wrote:
> Hi Doug,
>
> On Thursday, October 23, 2025 5:05:44 AM CEST Doug Smythies wrote:
>> Hi Rafael,
>>
>> Recent email communications about other patches had me
>> looking at this one again.
>>
>> On 2025.08.13 03:26 Rafael wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>> ... snip...
>>
>>> However, after the above change, latency_req cannot take the predicted_ns
>>> value any more, which takes place after commit 38f83090f515 ("cpuidle:
>>> menu: Remove iowait influence"), because it may cause a polling state
>>> to be returned prematurely.
>>>
>>> In the context of the previous example say that predicted_ns is 3000 and
>>> the PM QoS latency limit is still 20 us. Additionally, say that idle
>>> state 0 is a polling one. Moving the exit_latency_ns check before the
>>> target_residency_ns one causes the loop to terminate in the second
>>> iteration, before the target_residency_ns check, so idle state 0 will be
>>> returned even though previously state 1 would be returned if there were
>>> no imminent timers.
>>>
>>> For this reason, remove the assignment of the predicted_ns value to
>>> latency_req from the code.
>>
>> Which is okay for timer-based workflow,
>> but what about non-timer based, or interrupt driven, workflow?
>>
>> Under conditions where idle state 0, or Polling, would be used a lot,
>> I am observing about a 11 % throughput regression with this patch
>> And idle state 0, polling, usage going from 20% to 0%.
>>
>> From my testing of kernels 6.17-rc1, rc2,rc3 in August and September
>> and again now. I missed this in August/September:
>>
>> 779b1a1cb13a cpuidle: governors: menu: Avoid selecting states with too much latency - v6.17-rc3
>> fa3fa55de0d6 cpuidle: governors: menu: Avoid using invalid recent intervals data - v6.17-rc2
>> baseline reference: v6.17-rc1
>>
>> teo was included also. As far as I can recall its response has always been similar to rc3. At least, recently.
>>
>> Three graphs are attached:
>> Sampling data once per 20 seconds, the test is started after the first idle sample,
>> and at least one sample is taken after the system returns to idle after the test.
>> The faster the test runs the better.
>>
>> Test computer:
>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
>> Distro: Ubuntu 24.04.3, server, no desktop GUI.
>> CPU frequency scaling driver: intel_pstate
>> HWP: disabled.
>> CPU frequency scaling governor: performance
>> Ilde driver: intel_idle
>> Idle governor: menu (except teo for one compare test run)
>> Idle states: 4: name : description:
>> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
>> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
>> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
>> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
>
> OK, so since the exit residency of an idle state cannot exceed its target
> residency, the appended change (on top of 6.18-rc2) should make the throughput
> regression go away.
Indeed, the patch you appended below did make the
throughput regression go away.
Thank you.
>
> ---
> drivers/cpuidle/governors/menu.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -321,10 +321,13 @@ static int menu_select(struct cpuidle_dr
>
> /*
> * Use a physical idle state, not busy polling, unless a timer
> - * is going to trigger soon enough.
> + * is going to trigger soon enough or the exit latency of the
> + * idle state in question is greater than the predicted idle
> + * duration.
> */
> if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> - s->target_residency_ns <= data->next_timer_ns) {
> + s->target_residency_ns <= data->next_timer_ns &&
> + s->exit_latency_ns <= predicted_ns) {
> predicted_ns = s->target_residency_ns;
> idx = i;
> break;
* Re: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
2025-10-23 16:02 ` Doug Smythies
@ 2025-10-23 16:52 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-10-23 16:52 UTC (permalink / raw)
To: Doug Smythies
Cc: Rafael J. Wysocki, Frederic Weisbecker, LKML, Peter Zijlstra,
Christian Loehle, Linux PM
On Thu, Oct 23, 2025 at 6:03 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2025.10.23 07:51 Rafael wrote:
>
> > Hi Doug,
> >
> > On Thursday, October 23, 2025 5:05:44 AM CEST Doug Smythies wrote:
> >> Hi Rafael,
> >>
> >> Recent email communications about other patches had me
> >> looking at this one again.
> >>
> >> On 2025.08.13 03:26 Rafael wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >> ... snip...
> >>
> >>> However, after the above change, latency_req cannot take the predicted_ns
> >>> value any more, which takes place after commit 38f83090f515 ("cpuidle:
> >>> menu: Remove iowait influence"), because it may cause a polling state
> >>> to be returned prematurely.
> >>>
> >>> In the context of the previous example say that predicted_ns is 3000 and
> >>> the PM QoS latency limit is still 20 us. Additionally, say that idle
> >>> state 0 is a polling one. Moving the exit_latency_ns check before the
> >>> target_residency_ns one causes the loop to terminate in the second
> >>> iteration, before the target_residency_ns check, so idle state 0 will be
> >>> returned even though previously state 1 would be returned if there were
> >>> no imminent timers.
> >>>
> >>> For this reason, remove the assignment of the predicted_ns value to
> >>> latency_req from the code.
> >>
> >> Which is okay for timer-based workflow,
> >> but what about non-timer based, or interrupt driven, workflow?
> >>
> >> Under conditions where idle state 0, or Polling, would be used a lot,
> >> I am observing about an 11% throughput regression with this patch,
> >> and idle state 0 (polling) usage going from 20% to 0%.
> >>
> >> From my testing of kernels 6.17-rc1, rc2 and rc3 in August and September,
> >> and again now. I missed this in August/September:
> >>
> >> 779b1a1cb13a cpuidle: governors: menu: Avoid selecting states with too much latency - v6.17-rc3
> >> fa3fa55de0d6 cpuidle: governors: menu: Avoid using invalid recent intervals data - v6.17-rc2
> >> baseline reference: v6.17-rc1
> >>
> >> teo was included also. As far as I can recall its response has always been similar to rc3. At least, recently.
> >>
> >> Three graphs are attached:
> >> Sampling data once per 20 seconds, the test is started after the first idle sample,
> >> and at least one sample is taken after the system returns to idle after the test.
> >> The faster the test runs the better.
> >>
> >> Test computer:
> >> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> >> Distro: Ubuntu 24.04.3, server, no desktop GUI.
> >> CPU frequency scaling driver: intel_pstate
> >> HWP: disabled.
> >> CPU frequency scaling governor: performance
> >> Idle driver: intel_idle
> >> Idle governor: menu (except teo for one compare test run)
> >> Idle states: 4: name : description:
> >> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
> >> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
> >> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
> >> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
> >
> > OK, so since the exit latency of an idle state cannot exceed its target
> > residency, the appended change (on top of 6.18-rc2) should make the throughput
> > regression go away.
>
> Indeed, the patch you appended below did make the
> throughput regression go away.
>
> Thank you.
OK, this is not an unreasonable change, so let me add a changelog to
it and submit it.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select()
2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
@ 2025-08-13 10:26 ` Rafael J. Wysocki
2025-08-14 13:00 ` Christian Loehle
2025-09-11 13:37 ` Frederic Weisbecker
2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
2025-08-28 20:16 ` [PATCH v1] cpuidle: governors: teo: " Rafael J. Wysocki
3 siblings, 2 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-13 10:26 UTC (permalink / raw)
To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reduce the indentation level in the main loop of menu_select() by
rearranging some checks and assignments in it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpuidle/governors/menu.c | 72 ++++++++++++++++++++-------------------
1 file changed, 37 insertions(+), 35 deletions(-)
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -314,45 +314,47 @@
if (s->exit_latency_ns > latency_req)
break;
- if (s->target_residency_ns > predicted_ns) {
- /*
- * Use a physical idle state, not busy polling, unless
- * a timer is going to trigger soon enough.
- */
- if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
- s->target_residency_ns <= data->next_timer_ns) {
- predicted_ns = s->target_residency_ns;
- idx = i;
- break;
- }
- if (predicted_ns < TICK_NSEC)
- break;
-
- if (!tick_nohz_tick_stopped()) {
- /*
- * If the state selected so far is shallow,
- * waking up early won't hurt, so retain the
- * tick in that case and let the governor run
- * again in the next iteration of the loop.
- */
- predicted_ns = drv->states[idx].target_residency_ns;
- break;
- }
+ if (s->target_residency_ns <= predicted_ns) {
+ idx = i;
+ continue;
+ }
+
+ /*
+ * Use a physical idle state, not busy polling, unless a timer
+ * is going to trigger soon enough.
+ */
+ if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
+ s->target_residency_ns <= data->next_timer_ns) {
+ predicted_ns = s->target_residency_ns;
+ idx = i;
+ break;
+ }
+ if (predicted_ns < TICK_NSEC)
+ break;
+
+ if (!tick_nohz_tick_stopped()) {
/*
- * If the state selected so far is shallow and this
- * state's target residency matches the time till the
- * closest timer event, select this one to avoid getting
- * stuck in the shallow one for too long.
+ * If the state selected so far is shallow, waking up
+ * early won't hurt, so retain the tick in that case and
+ * let the governor run again in the next iteration of
+ * the idle loop.
*/
- if (drv->states[idx].target_residency_ns < TICK_NSEC &&
- s->target_residency_ns <= delta_tick)
- idx = i;
-
- return idx;
+ predicted_ns = drv->states[idx].target_residency_ns;
+ break;
}
- idx = i;
+ /*
+ * If the state selected so far is shallow and this state's
+ * target residency matches the time till the closest timer
+ * event, select this one to avoid getting stuck in the shallow
+ * one for too long.
+ */
+ if (drv->states[idx].target_residency_ns < TICK_NSEC &&
+ s->target_residency_ns <= delta_tick)
+ idx = i;
+
+ return idx;
}
if (idx == -1)
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select()
2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
@ 2025-08-14 13:00 ` Christian Loehle
2025-09-11 13:37 ` Frederic Weisbecker
1 sibling, 0 replies; 27+ messages in thread
From: Christian Loehle @ 2025-08-14 13:00 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra
On 8/13/25 11:26, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Reduce the indentation level in the main loop of menu_select() by
> rearranging some checks and assignments in it.
>
> No intentional functional impact.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> ---
> drivers/cpuidle/governors/menu.c | 72 ++++++++++++++++++++-------------------
> 1 file changed, 37 insertions(+), 35 deletions(-)
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -314,45 +314,47 @@
> if (s->exit_latency_ns > latency_req)
> break;
>
> - if (s->target_residency_ns > predicted_ns) {
> - /*
> - * Use a physical idle state, not busy polling, unless
> - * a timer is going to trigger soon enough.
> - */
> - if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> - s->target_residency_ns <= data->next_timer_ns) {
> - predicted_ns = s->target_residency_ns;
> - idx = i;
> - break;
> - }
> - if (predicted_ns < TICK_NSEC)
> - break;
> -
> - if (!tick_nohz_tick_stopped()) {
> - /*
> - * If the state selected so far is shallow,
> - * waking up early won't hurt, so retain the
> - * tick in that case and let the governor run
> - * again in the next iteration of the loop.
> - */
> - predicted_ns = drv->states[idx].target_residency_ns;
> - break;
> - }
> + if (s->target_residency_ns <= predicted_ns) {
> + idx = i;
> + continue;
> + }
> +
> + /*
> + * Use a physical idle state, not busy polling, unless a timer
> + * is going to trigger soon enough.
> + */
> + if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> + s->target_residency_ns <= data->next_timer_ns) {
> + predicted_ns = s->target_residency_ns;
> + idx = i;
> + break;
> + }
>
> + if (predicted_ns < TICK_NSEC)
> + break;
> +
> + if (!tick_nohz_tick_stopped()) {
> /*
> - * If the state selected so far is shallow and this
> - * state's target residency matches the time till the
> - * closest timer event, select this one to avoid getting
> - * stuck in the shallow one for too long.
> + * If the state selected so far is shallow, waking up
> + * early won't hurt, so retain the tick in that case and
> + * let the governor run again in the next iteration of
> + * the idle loop.
> */
> - if (drv->states[idx].target_residency_ns < TICK_NSEC &&
> - s->target_residency_ns <= delta_tick)
> - idx = i;
> -
> - return idx;
> + predicted_ns = drv->states[idx].target_residency_ns;
> + break;
> }
>
> - idx = i;
> + /*
> + * If the state selected so far is shallow and this state's
> + * target residency matches the time till the closest timer
> + * event, select this one to avoid getting stuck in the shallow
> + * one for too long.
> + */
> + if (drv->states[idx].target_residency_ns < TICK_NSEC &&
> + s->target_residency_ns <= delta_tick)
> + idx = i;
> +
> + return idx;
> }
>
> if (idx == -1)
>
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select()
2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
2025-08-14 13:00 ` Christian Loehle
@ 2025-09-11 13:37 ` Frederic Weisbecker
1 sibling, 0 replies; 27+ messages in thread
From: Frederic Weisbecker @ 2025-09-11 13:37 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM, LKML, Peter Zijlstra, Christian Loehle
On Wed, Aug 13, 2025 at 12:26:35PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Reduce the indentation level in the main loop of menu_select() by
> rearranging some checks and assignments in it.
>
> No intentional functional impact.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
@ 2025-08-13 10:29 ` Rafael J. Wysocki
2025-08-14 14:09 ` Christian Loehle
` (2 more replies)
2025-08-28 20:16 ` [PATCH v1] cpuidle: governors: teo: " Rafael J. Wysocki
3 siblings, 3 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-13 10:29 UTC (permalink / raw)
To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When the menu governor runs on a nohz_full CPU and there are no user
space timers in the workload on that CPU, it ends up selecting idle
states with target residency values above TICK_NSEC all the time due to
a tick_nohz_tick_stopped() check designed for a different use case.
Namely, on nohz_full CPUs the fact that the tick has been stopped does
not actually mean anything in particular, whereas in the other case it
indicates that previously the CPU was expected to be idle sufficiently
long for the tick to be stopped, so it is not unreasonable to expect
it to be idle beyond the tick period length again.
In some cases, this behavior causes latency in the workload to grow
undesirably. It may also cause the workload to consume more energy
than necessary if the CPU does not spend enough time in the selected
deep idle states.
Address this by amending the tick_nohz_tick_stopped() check in question
with a tick_nohz_full_cpu() one to avoid using the time till the next
timer event as the predicted_ns value all the time on nohz_full CPUs.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpuidle/governors/menu.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -293,8 +293,18 @@
* in a shallow idle state for a long time as a result of it. In that
* case, say we might mispredict and use the known time till the closest
* timer event for the idle state selection.
+ *
+ * However, on nohz_full CPUs the tick does not run as a rule and the
+ * time till the closest timer event may always be effectively infinite,
+ * so using it as a replacement for the predicted idle duration would
+ * effectively always cause the prediction results to be discarded and
+ * deep idle states to be selected all the time. That might introduce
+ * unwanted latency into the workload and cause more energy than
+ * necessary to be consumed if the discarded prediction results are
+ * actually accurate, so skip nohz_full CPUs here.
*/
- if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
+ if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
+ predicted_ns < TICK_NSEC)
predicted_ns = data->next_timer_ns;
/*
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
@ 2025-08-14 14:09 ` Christian Loehle
2025-08-18 17:41 ` Rafael J. Wysocki
2025-09-11 14:17 ` Frederic Weisbecker
2026-02-08 15:59 ` Ionut Nechita (Wind River)
2 siblings, 1 reply; 27+ messages in thread
From: Christian Loehle @ 2025-08-14 14:09 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra
On 8/13/25 11:29, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> When the menu governor runs on a nohz_full CPU and there are no user
> space timers in the workload on that CPU, it ends up selecting idle
> states with target residency values above TICK_NSEC all the time due to
> a tick_nohz_tick_stopped() check designed for a different use case.
> Namely, on nohz_full CPUs the fact that the tick has been stopped does
> not actually mean anything in particular, whereas in the other case it
> indicates that previously the CPU was expected to be idle sufficiently
> long for the tick to be stopped, so it is not unreasonable to expect
> it to be idle beyond the tick period length again.
>
> In some cases, this behavior causes latency in the workload to grow
> undesirably. It may also cause the workload to consume more energy
> than necessary if the CPU does not spend enough time in the selected
> deep idle states.
>
> Address this by amending the tick_nohz_tick_stopped() check in question
> with a tick_nohz_full_cpu() one to avoid using the time till the next
> timer event as the predicted_ns value all the time on nohz_full CPUs.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> drivers/cpuidle/governors/menu.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -293,8 +293,18 @@
> * in a shallow idle state for a long time as a result of it. In that
> * case, say we might mispredict and use the known time till the closest
> * timer event for the idle state selection.
> + *
> + * However, on nohz_full CPUs the tick does not run as a rule and the
> + * time till the closest timer event may always be effectively infinite,
> + * so using it as a replacement for the predicted idle duration would
> + * effectively always cause the prediction results to be discarded and
> + * deep idle states to be selected all the time. That might introduce
> + * unwanted latency into the workload and cause more energy than
> + * necessary to be consumed if the discarded prediction results are
> + * actually accurate, so skip nohz_full CPUs here.
> */
> - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> + predicted_ns < TICK_NSEC)
> predicted_ns = data->next_timer_ns;
>
> /*
>
>
>
OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
get_typical_interval() (which may suggest picking a shallow state based on
previous wakeup patterns) only then to never wake up again?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-08-14 14:09 ` Christian Loehle
@ 2025-08-18 17:41 ` Rafael J. Wysocki
2025-08-19 9:10 ` Christian Loehle
0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-18 17:41 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
Peter Zijlstra
On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/13/25 11:29, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > When the menu governor runs on a nohz_full CPU and there are no user
> > space timers in the workload on that CPU, it ends up selecting idle
> > states with target residency values above TICK_NSEC all the time due to
> > a tick_nohz_tick_stopped() check designed for a different use case.
> > Namely, on nohz_full CPUs the fact that the tick has been stopped does
> > not actually mean anything in particular, whereas in the other case it
> > indicates that previously the CPU was expected to be idle sufficiently
> > long for the tick to be stopped, so it is not unreasonable to expect
> > it to be idle beyond the tick period length again.
> >
> > In some cases, this behavior causes latency in the workload to grow
> > undesirably. It may also cause the workload to consume more energy
> > than necessary if the CPU does not spend enough time in the selected
> > deep idle states.
> >
> > Address this by amending the tick_nohz_tick_stopped() check in question
> > with a tick_nohz_full_cpu() one to avoid using the time till the next
> > timer event as the predicted_ns value all the time on nohz_full CPUs.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > drivers/cpuidle/governors/menu.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > --- a/drivers/cpuidle/governors/menu.c
> > +++ b/drivers/cpuidle/governors/menu.c
> > @@ -293,8 +293,18 @@
> > * in a shallow idle state for a long time as a result of it. In that
> > * case, say we might mispredict and use the known time till the closest
> > * timer event for the idle state selection.
> > + *
> > + * However, on nohz_full CPUs the tick does not run as a rule and the
> > + * time till the closest timer event may always be effectively infinite,
> > + * so using it as a replacement for the predicted idle duration would
> > + * effectively always cause the prediction results to be discarded and
> > + * deep idle states to be selected all the time. That might introduce
> > + * unwanted latency into the workload and cause more energy than
> > + * necessary to be consumed if the discarded prediction results are
> > + * actually accurate, so skip nohz_full CPUs here.
> > */
> > - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> > + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> > + predicted_ns < TICK_NSEC)
> > predicted_ns = data->next_timer_ns;
> >
> > /*
> >
> >
> >
>
> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
> get_typical_interval() (which may suggest picking a shallow state based on
> previous wakeup patterns) only then to never wake up again?
Yes, there is this risk, but the current behavior is more damaging IMV
because it (potentially) hurts both energy efficiency and performance.
It is also arguably easier for the user to remedy getting stuck in a
shallow idle state than to change governor's behavior (PM QoS is a bit
too blunt for this).
Moreover, configuring CPUs as nohz_full and leaving them in long idle
may not be the most efficient use of them.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-08-18 17:41 ` Rafael J. Wysocki
@ 2025-08-19 9:10 ` Christian Loehle
2025-08-19 11:56 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Christian Loehle @ 2025-08-19 9:10 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM, Frederic Weisbecker, LKML, Peter Zijlstra
On 8/18/25 18:41, Rafael J. Wysocki wrote:
> On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 8/13/25 11:29, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> When the menu governor runs on a nohz_full CPU and there are no user
>>> space timers in the workload on that CPU, it ends up selecting idle
>>> states with target residency values above TICK_NSEC all the time due to
>>> a tick_nohz_tick_stopped() check designed for a different use case.
>>> Namely, on nohz_full CPUs the fact that the tick has been stopped does
>>> not actually mean anything in particular, whereas in the other case it
>>> indicates that previously the CPU was expected to be idle sufficiently
>>> long for the tick to be stopped, so it is not unreasonable to expect
>>> it to be idle beyond the tick period length again.
>>>
>>> In some cases, this behavior causes latency in the workload to grow
>>> undesirably. It may also cause the workload to consume more energy
>>> than necessary if the CPU does not spend enough time in the selected
>>> deep idle states.
>>>
>>> Address this by amending the tick_nohz_tick_stopped() check in question
>>> with a tick_nohz_full_cpu() one to avoid using the time till the next
>>> timer event as the predicted_ns value all the time on nohz_full CPUs.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>> drivers/cpuidle/governors/menu.c | 12 +++++++++++-
>>> 1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/menu.c
>>> +++ b/drivers/cpuidle/governors/menu.c
>>> @@ -293,8 +293,18 @@
>>> * in a shallow idle state for a long time as a result of it. In that
>>> * case, say we might mispredict and use the known time till the closest
>>> * timer event for the idle state selection.
>>> + *
>>> + * However, on nohz_full CPUs the tick does not run as a rule and the
>>> + * time till the closest timer event may always be effectively infinite,
>>> + * so using it as a replacement for the predicted idle duration would
>>> + * effectively always cause the prediction results to be discarded and
>>> + * deep idle states to be selected all the time. That might introduce
>>> + * unwanted latency into the workload and cause more energy than
>>> + * necessary to be consumed if the discarded prediction results are
>>> + * actually accurate, so skip nohz_full CPUs here.
>>> */
>>> - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
>>> + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
>>> + predicted_ns < TICK_NSEC)
>>> predicted_ns = data->next_timer_ns;
>>>
>>> /*
>>>
>>>
>>>
>>
>> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
>> get_typical_interval() (which may suggest picking a shallow state based on
>> previous wakeup patterns) only then to never wake up again?
>
> Yes, there is this risk, but the current behavior is more damaging IMV
> because it (potentially) hurts both energy efficiency and performance.
>
> It is also arguably easier for the user to remedy getting stuck in a
> shallow idle state than to change governor's behavior (PM QoS is a bit
> too blunt for this).
>
> Moreover, configuring CPUs as nohz_full and leaving them in long idle
> may not be the most efficient use of them.
True; on the other hand, the setup cost for nohz_full is so high that you'd
expect additionally disabling idle states depending on the workload isn't too
much to ask for...
Anyway feel free to go ahead.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-08-19 9:10 ` Christian Loehle
@ 2025-08-19 11:56 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-19 11:56 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
Peter Zijlstra
On Tue, Aug 19, 2025 at 11:10 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/18/25 18:41, Rafael J. Wysocki wrote:
> > On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 8/13/25 11:29, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> When the menu governor runs on a nohz_full CPU and there are no user
> >>> space timers in the workload on that CPU, it ends up selecting idle
> >>> states with target residency values above TICK_NSEC all the time due to
> >>> a tick_nohz_tick_stopped() check designed for a different use case.
> >>> Namely, on nohz_full CPUs the fact that the tick has been stopped does
> >>> not actually mean anything in particular, whereas in the other case it
> >>> indicates that previously the CPU was expected to be idle sufficiently
> >>> long for the tick to be stopped, so it is not unreasonable to expect
> >>> it to be idle beyond the tick period length again.
> >>>
> >>> In some cases, this behavior causes latency in the workload to grow
> >>> undesirably. It may also cause the workload to consume more energy
> >>> than necessary if the CPU does not spend enough time in the selected
> >>> deep idle states.
> >>>
> >>> Address this by amending the tick_nohz_tick_stopped() check in question
> >>> with a tick_nohz_full_cpu() one to avoid using the time till the next
> >>> timer event as the predicted_ns value all the time on nohz_full CPUs.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>> drivers/cpuidle/governors/menu.c | 12 +++++++++++-
> >>> 1 file changed, 11 insertions(+), 1 deletion(-)
> >>>
> >>> --- a/drivers/cpuidle/governors/menu.c
> >>> +++ b/drivers/cpuidle/governors/menu.c
> >>> @@ -293,8 +293,18 @@
> >>> * in a shallow idle state for a long time as a result of it. In that
> >>> * case, say we might mispredict and use the known time till the closest
> >>> * timer event for the idle state selection.
> >>> + *
> >>> + * However, on nohz_full CPUs the tick does not run as a rule and the
> >>> + * time till the closest timer event may always be effectively infinite,
> >>> + * so using it as a replacement for the predicted idle duration would
> >>> + * effectively always cause the prediction results to be discarded and
> >>> + * deep idle states to be selected all the time. That might introduce
> >>> + * unwanted latency into the workload and cause more energy than
> >>> + * necessary to be consumed if the discarded prediction results are
> >>> + * actually accurate, so skip nohz_full CPUs here.
> >>> */
> >>> - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> >>> + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> >>> + predicted_ns < TICK_NSEC)
> >>> predicted_ns = data->next_timer_ns;
> >>>
> >>> /*
> >>>
> >>>
> >>>
> >>
> >> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
> >> get_typical_interval() (which may suggest picking a shallow state based on
> >> previous wakeup patterns) only then to never wake up again?
> >
> > Yes, there is this risk, but the current behavior is more damaging IMV
> > because it (potentially) hurts both energy efficiency and performance.
> >
> > It is also arguably easier for the user to remedy getting stuck in a
> > shallow idle state than to change governor's behavior (PM QoS is a bit
> > too blunt for this).
> >
> > Moreover, configuring CPUs as nohz_full and leaving them in long idle
> > may not be the most efficient use of them.
>
> True; on the other hand, the setup cost for nohz_full is so high that you'd
> expect additionally disabling idle states depending on the workload isn't too
> much to ask for...
Apparently, there are cases in which there is enough idle time to ask
for a deep idle state often enough, but as a rule the idle periods are
relatively short. In those cases, one would need to change the QoS
limit back and forth in anticipation of the "busier" and "calmer"
periods in the workload, which would be kind of equivalent to
implementing an idle governor in user space.
> Anyway feel free to go ahead.
Thank you!
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
2025-08-14 14:09 ` Christian Loehle
@ 2025-09-11 14:17 ` Frederic Weisbecker
2025-09-11 17:07 ` Rafael J. Wysocki
2026-02-08 15:59 ` Ionut Nechita (Wind River)
2 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2025-09-11 14:17 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM, LKML, Peter Zijlstra, Christian Loehle
On Wed, Aug 13, 2025 at 12:29:51PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> When the menu governor runs on a nohz_full CPU and there are no user
> space timers in the workload on that CPU, it ends up selecting idle
> states with target residency values above TICK_NSEC all the time due to
> a tick_nohz_tick_stopped() check designed for a different use case.
>
> Namely, on nohz_full CPUs the fact that the tick has been stopped does
> not actually mean anything in particular, whereas in the other case it
> indicates that previously the CPU was expected to be idle sufficiently
> long for the tick to be stopped, so it is not unreasonable to expect
> it to be idle beyond the tick period length again.
I understand what you mean but it may be hard to figure out for
reviewers. Can we rephrase it to something like:
When nohz_full is not running, the fact that the tick is stopped
indicates the CPU has been idle for sufficiently long so that
nohz has deferred it to the next timer callback. So it is
not unreasonable to expect the CPU to be idle beyond the tick
period length again.
However when nohz_full is running, the CPU may enter idle with the
tick already stopped. But this doesn't tell anything about the future
CPU's idleness.
>
> In some cases, this behavior causes latency in the workload to grow
> undesirably. It may also cause the workload to consume more energy
> than necessary if the CPU does not spend enough time in the selected
> deep idle states.
>
> Address this by amending the tick_nohz_tick_stopped() check in question
> with a tick_nohz_full_cpu() one to avoid using the time till the next
> timer event as the predicted_ns value all the time on nohz_full CPUs.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> drivers/cpuidle/governors/menu.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -293,8 +293,18 @@
> * in a shallow idle state for a long time as a result of it. In that
> * case, say we might mispredict and use the known time till the closest
> * timer event for the idle state selection.
> + *
> + * However, on nohz_full CPUs the tick does not run as a rule and the
> + * time till the closest timer event may always be effectively infinite,
> + * so using it as a replacement for the predicted idle duration would
> + * effectively always cause the prediction results to be discarded and
> + * deep idle states to be selected all the time. That might introduce
> + * unwanted latency into the workload and cause more energy than
> + * necessary to be consumed if the discarded prediction results are
> + * actually accurate, so skip nohz_full CPUs here.
> */
> - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> + predicted_ns < TICK_NSEC)
> predicted_ns = data->next_timer_ns;
So, when !tick_nohz_full_cpu(dev->cpu), what is the purpose of this tick stopped
special case?
Is it because the next dynamic tick is a better prediction than the typical
interval once the tick is stopped?
Does that mean we might become more "pessimistic" concerning the predicted idle
time for nohz_full CPUs?
I guess too shallow C-states are still better than too deep but there should be
a word about that introduced side effect (if any).
Thanks!
> /*
>
>
>
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-09-11 14:17 ` Frederic Weisbecker
@ 2025-09-11 17:07 ` Rafael J. Wysocki
2025-09-18 15:07 ` Frederic Weisbecker
0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-09-11 17:07 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra,
Christian Loehle
On Thu, Sep 11, 2025 at 4:17 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Wed, Aug 13, 2025 at 12:29:51PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > When the menu governor runs on a nohz_full CPU and there are no user
> > space timers in the workload on that CPU, it ends up selecting idle
> > states with target residency values above TICK_NSEC all the time due to
> > a tick_nohz_tick_stopped() check designed for a different use case.
> >
> > Namely, on nohz_full CPUs the fact that the tick has been stopped does
> > not actually mean anything in particular, whereas in the other case it
> > indicates that previously the CPU was expected to be idle sufficiently
> > long for the tick to be stopped, so it is not unreasonable to expect
> > it to be idle beyond the tick period length again.
>
> I understand what you mean, but it may be hard for reviewers to
> figure out. Can we rephrase it to something like:
>
> When nohz_full is not running, the fact that the tick is stopped
> indicates the CPU has been idle for sufficiently long so that
> nohz has deferred it to the next timer callback. So it is
> not unreasonable to expect the CPU to be idle beyond the tick
> period length again.
>
> However, when nohz_full is running, the CPU may enter idle with the
> tick already stopped. But this doesn't tell us anything about the
> CPU's future idleness.
Sure, thanks for the hint.
> >
> > In some cases, this behavior causes latency in the workload to grow
> > undesirably. It may also cause the workload to consume more energy
> > than necessary if the CPU does not spend enough time in the selected
> > deep idle states.
> >
> > Address this by amending the tick_nohz_tick_stopped() check in question
> > with a tick_nohz_full_cpu() one to avoid using the time till the next
> > timer event as the predicted_ns value all the time on nohz_full CPUs.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > drivers/cpuidle/governors/menu.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > --- a/drivers/cpuidle/governors/menu.c
> > +++ b/drivers/cpuidle/governors/menu.c
> > @@ -293,8 +293,18 @@
> > * in a shallow idle state for a long time as a result of it. In that
> > * case, say we might mispredict and use the known time till the closest
> > * timer event for the idle state selection.
> > + *
> > + * However, on nohz_full CPUs the tick does not run as a rule and the
> > + * time till the closest timer event may always be effectively infinite,
> > + * so using it as a replacement for the predicted idle duration would
> > + * effectively always cause the prediction results to be discarded and
> > + * deep idle states to be selected all the time. That might introduce
> > + * unwanted latency into the workload and cause more energy than
> > + * necessary to be consumed if the discarded prediction results are
> > + * actually accurate, so skip nohz_full CPUs here.
> > */
> > - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> > + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> > + predicted_ns < TICK_NSEC)
> > predicted_ns = data->next_timer_ns;
>
> So, when !tick_nohz_full_cpu(dev->cpu), what is the purpose of this tick stopped
> special case?
>
> Is it because the next dynamic tick is a better prediction than the typical
> interval once the tick is stopped?
When !tick_nohz_full_cpu(dev->cpu), the tick is a safety net against
getting stuck in a shallow idle state for too long. In that case, if
the tick is stopped, the safety net is not there and it is better to
use a deep state.
However, data->next_timer_ns is a lower limit for the idle state
target residency because this is when the next timer is going to
trigger.
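[Editorial note: the override being discussed here can be modeled as a tiny
standalone function. This is a simplified sketch, not the actual menu.c code;
the function name and the 1 ms TICK_NSEC value are illustrative assumptions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC 1000000ULL  /* illustrative tick period: 1 ms */

/*
 * Simplified model of the menu governor's tick-stopped override: if the
 * tick is already stopped on a CPU that is not nohz_full and the
 * prediction is shorter than the tick period, the known time till the
 * next timer replaces the prediction, because the tick safety net
 * against lingering in a shallow state is gone.
 */
static uint64_t menu_predicted_ns(bool tick_stopped, bool nohz_full_cpu,
                                  uint64_t predicted_ns,
                                  uint64_t next_timer_ns)
{
    if (tick_stopped && !nohz_full_cpu && predicted_ns < TICK_NSEC)
        return next_timer_ns;
    return predicted_ns;
}
```

On a nohz_full CPU the short prediction survives unchanged, which is exactly
the behavior change the patch introduces.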
> Does that mean we might become more "pessimistic" concerning the predicted idle
> time for nohz_full CPUs?
Yes, and not just we might, but we do unless the idle periods in the
workload are "long".
> I guess too shallow C-states are still better than too deep but there should be
> a word about that introduced side effect (if any).
Yeah, I agree.
That said, on a nohz_full CPU there is no safety net against getting
stuck in a shallow idle state because the tick is not present. That's
why currently the governors don't allow shallow states to be used on
nohz_full CPUs.
The lack of a safety net is generally not a problem when the CPU has
been isolated to run something doing real work all the time, with
possible idle periods in the workload, but there are people who
isolate CPUs for energy-saving reasons and don't run anything on them
on purpose. For those folks, the current behavior to select deep idle
states every time is actually desirable.
So there are two use cases that cannot be addressed at once and I'm
thinking about adding a control knob to allow the user to decide which
way to go.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-09-11 17:07 ` Rafael J. Wysocki
@ 2025-09-18 15:07 ` Frederic Weisbecker
2025-09-23 17:25 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Frederic Weisbecker @ 2025-09-18 15:07 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM, LKML, Peter Zijlstra, Christian Loehle
On Thu, Sep 11, 2025 at 07:07:42PM +0200, Rafael J. Wysocki wrote:
> On Thu, Sep 11, 2025 at 4:17 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > So, when !tick_nohz_full_cpu(dev->cpu), what is the purpose of this tick stopped
> > special case?
> >
> > Is it because the next dynamic tick is a better prediction than the typical
> > interval once the tick is stopped?
>
> When !tick_nohz_full_cpu(dev->cpu), the tick is a safety net against
> getting stuck in a shallow idle state for too long. In that case, if
> the tick is stopped, the safety net is not there and it is better to
> use a deep state.
Right.
> However, data->next_timer_ns is a lower limit for the idle state
> target residency because this is when the next timer is going to
> trigger.
Ok.
>
> > Does that mean we might become more "pessimistic" concerning the predicted idle
> > time for nohz_full CPUs?
>
> Yes, and not just we might, but we do unless the idle periods in the
> workload are "long".
Ok.
>
> > I guess too shallow C-states are still better than too deep but there should be
> > a word about that introduced side effect (if any).
>
> Yeah, I agree.
>
> That said, on a nohz_full CPU there is no safety net against getting
> stuck in a shallow idle state because the tick is not present. That's
> why currently the governors don't allow shallow states to be used on
> nohz_full CPUs.
>
> The lack of a safety net is generally not a problem when the CPU has
> been isolated to run something doing real work all the time, with
> possible idle periods in the workload, but there are people who
> isolate CPUs for energy-saving reasons and don't run anything on them
> on purpose. For those folks, the current behavior to select deep idle
> states every time is actually desirable.
So far I haven't heard from anybody using nohz_full for powersavings. If
you have, I'd be curious about it. Whether a task runs tickless or not, it
still runs and the CPU isn't sleeping. Also, CPU 0 stays periodic on
nohz_full, which alone is a problem for powersaving but also prevents a
whole package from entering a low power mode on NUMA systems.
Let's say it's not optimized toward powersaving...
> So there are two use cases that cannot be addressed at once and I'm
> thinking about adding a control knob to allow the user to decide which
> way to go.
I'm tempted to say we should focus on having not too deep states,
at the expense of having too shallow. Of course I'm not entirely
comfortable with the idea because nohz_full CPUs may be idle for a while
on some workloads. And everyone deserves a rest at some point after
a long day.
I guess force restarting the tick upon idle entry would probably be
bad for tiny idle round-trips?
As for such a knob, I'm not sure anybody would use it.
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-09-18 15:07 ` Frederic Weisbecker
@ 2025-09-23 17:25 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-09-23 17:25 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra,
Christian Loehle
On Thu, Sep 18, 2025 at 5:07 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Thu, Sep 11, 2025 at 07:07:42PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Sep 11, 2025 at 4:17 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > So, when !tick_nohz_full_cpu(dev->cpu), what is the purpose of this tick stopped
> > > special case?
> > >
> > > Is it because the next dynamic tick is a better prediction than the typical
> > > interval once the tick is stopped?
> >
> > When !tick_nohz_full_cpu(dev->cpu), the tick is a safety net against
> > getting stuck in a shallow idle state for too long. In that case, if
> > the tick is stopped, the safety net is not there and it is better to
> > use a deep state.
>
> Right.
>
> > However, data->next_timer_ns is a lower limit for the idle state
> > target residency because this is when the next timer is going to
> > trigger.
>
> Ok.
>
> >
> > > Does that mean we might become more "pessimistic" concerning the predicted idle
> > > time for nohz_full CPUs?
> >
> > Yes, and not just we might, but we do unless the idle periods in the
> > workload are "long".
>
> Ok.
>
> >
> > > I guess too shallow C-states are still better than too deep but there should be
> > > a word about that introduced side effect (if any).
> >
> > Yeah, I agree.
> >
> > That said, on a nohz_full CPU there is no safety net against getting
> > stuck in a shallow idle state because the tick is not present. That's
> > why currently the governors don't allow shallow states to be used on
> > nohz_full CPUs.
> >
> > The lack of a safety net is generally not a problem when the CPU has
> > been isolated to run something doing real work all the time, with
> > possible idle periods in the workload, but there are people who
> > isolate CPUs for energy-saving reasons and don't run anything on them
> > on purpose. For those folks, the current behavior to select deep idle
> > states every time is actually desirable.
>
> So far I haven't heard from anybody using nohz_full for powersavings. If
> you have I'd be curious about it.
There is a project called LPMD that does this:
https://github.com/intel/intel-lpmd
> Whether a task runs tickless or not, it
> still runs and the CPU isn't sleeping. Also CPU 0 stays periodic on nohz_full,
> which alone is a problem for powersaving but also prevents a whole package
> from entering low power mode on NUMA.
That's not a problem for the above because it uses "isolation" for
taking some specific CPUs out of use (CPU0 is never one of them
AFAICS).
Also, it does depend on idle governors always putting those CPUs into
deep idle states.
> Let's say it's not optimized toward powersaving...
Oh well ...
> > So there are two use cases that cannot be addressed at once and I'm
> > thinking about adding a control knob to allow the user to decide which
> > way to go.
>
> I'm tempted to say we should focus on having not too deep states,
> at the expense of having too shallow. Of course I'm not entirely
> comfortable with the idea because nohz_full CPUs may be idle for a while
> on some workloads. And everyone deserves a rest at some point after
> a long day.
Right.
> I guess force restarting the tick upon idle entry would probably be
> bad for tiny idle round-trips?
It wouldn't be exactly cheap in terms of latency I think.
> As for such a knob, I'm not sure anybody would use it.
Fair enough.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
2025-08-14 14:09 ` Christian Loehle
2025-09-11 14:17 ` Frederic Weisbecker
@ 2026-02-08 15:59 ` Ionut Nechita (Wind River)
2026-02-20 13:02 ` Rafael J. Wysocki
2 siblings, 1 reply; 27+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-02-08 15:59 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Ionut Nechita, Linux PM, Frederic Weisbecker, LKML,
Peter Zijlstra, Christian Loehle, Daniel Lezcano, Ionut Nechita,
Ionut Nechita
Hi Rafael,
I have a question regarding this patch: is it planned for upstream
integration, or is there a newer/improved version in the works?
I'm asking because I've been working on a related optimization in the
same code path. My patch [1] takes a different approach to the same
problem area -- instead of skipping the override entirely for nohz_full
CPUs, it changes:
predicted_ns = data->next_timer_ns;
to:
predicted_ns = min(predicted_ns, data->next_timer_ns);
The idea is to prevent selecting excessively deep C-states when the
prediction is short but the next timer is distant, which on platforms
like Sapphire Rapids with high exit latencies (150us+) can cause
significant latency spikes.
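[Editorial note: the behavioral difference between the two variants can be
sketched as standalone helpers. The names are hypothetical and the code is
simplified from the quoted one-line change, not taken from the kernel.]

```c
#include <assert.h>
#include <stdint.h>

/* Current code: the prediction is discarded outright. */
static uint64_t override_replace(uint64_t predicted_ns, uint64_t next_timer_ns)
{
    (void)predicted_ns;  /* the prediction does not matter here */
    return next_timer_ns;
}

/*
 * Proposed change: the next timer only caps the prediction, so a short
 * prediction (and with it a shallower, low-latency state) can survive
 * even when the next timer is far away.
 */
static uint64_t override_min(uint64_t predicted_ns, uint64_t next_timer_ns)
{
    return predicted_ns < next_timer_ns ? predicted_ns : next_timer_ns;
}
```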
I notice that our patches touch the same block but address slightly
different aspects:
- Your patch: prevents the override from firing on nohz_full CPUs
where tick_stopped is always true, avoiding systematically
discarding the prediction.
- My patch: when the override does fire, uses min() instead of
unconditional replacement to preserve information from the
prediction.
These two fixes could potentially be complementary. However, my patch
is still under investigation due to limited hardware availability for
collecting more data across different platforms.
I'd appreciate your thoughts on whether these approaches could be
combined, or if your patch already addresses the use cases I'm seeing.
Thanks,
Ionut
[1] https://lore.kernel.org/linux-pm/20260122080937.22347-4-sunlightlinux@gmail.com/
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
2026-02-08 15:59 ` Ionut Nechita (Wind River)
@ 2026-02-20 13:02 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2026-02-20 13:02 UTC (permalink / raw)
To: Ionut Nechita (Wind River)
Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
Peter Zijlstra, Christian Loehle, Daniel Lezcano, Ionut Nechita,
Ionut Nechita
Hi,
On Sun, Feb 8, 2026 at 5:00 PM Ionut Nechita (Wind River)
<ionut.nechita@windriver.com> wrote:
>
> Hi Rafael,
>
> I have a question regarding this patch: is it planned for upstream
> integration, or is there a newer/improved version in the works?
Not really.
It is too risky to apply at this point because some systems depend on
the current behavior.
> I'm asking because I've been working on a related optimization in the
> same code path. My patch [1] takes a different approach to the same
> problem area -- instead of skipping the override entirely for nohz_full
> CPUs, it changes:
>
> predicted_ns = data->next_timer_ns;
>
> to:
>
> predicted_ns = min(predicted_ns, data->next_timer_ns);
This is risky if the next timer is distant because it may cause the
CPU to get stuck in a shallow idle state effectively till the next
timer. That in turn may prevent the processor as a whole from
entering a package idle state etc.
> The idea is to prevent selecting excessively deep C-states when the
> prediction is short but the next timer is distant, which on platforms
> like Sapphire Rapids with high exit latencies (150us+) can cause
> significant latency spikes.
Well, this essentially is a tradeoff between latency and power.
> I notice that our patches touch the same block but address slightly
> different aspects:
>
> - Your patch: prevents the override from firing on nohz_full CPUs
> where tick_stopped is always true, avoiding systematically
> discarding the prediction.
> - My patch: when the override does fire, uses min() instead of
> unconditional replacement to preserve information from the
> prediction.
>
> These two fixes could potentially be complementary. However, my patch
> is still under investigation due to limited hardware availability for
> collecting more data across different platforms.
>
> I'd appreciate your thoughts on whether these approaches could be
> combined, or if your patch already addresses the use cases I'm seeing.
Actually, I don't think that any of them are suitable.
Please see this series:
https://lore.kernel.org/linux-pm/1953482.tdWV9SEqCh@rafael.j.wysocki/
which takes a somewhat different direction.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
` (2 preceding siblings ...)
2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
@ 2025-08-28 20:16 ` Rafael J. Wysocki
2025-08-29 19:37 ` Rafael J. Wysocki
3 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-28 20:16 UTC (permalink / raw)
To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This change follows an analogous modification of the menu governor [1].
Namely, when the governor runs on a nohz_full CPU and there are no user
space timers in the workload on that CPU, it ends up selecting idle
states with target residency values above TICK_NSEC, or the deepest
enabled idle state in the absence of any of those, all the time due to
a tick_nohz_tick_stopped() check designed for running on CPUs where the
tick is not permanently disabled. In that case, the fact that the tick
has been stopped means that the CPU was expected to be idle sufficiently
long previously, so it is not unreasonable to expect it to be idle
sufficiently long again, but this inference does not apply to nohz_full
CPUs.
In some cases, latency in the workload grows undesirably as a result of
selecting overly deep idle states, and the workload may also consume
more energy than necessary if the CPU does not spend enough time in the
selected deep idle state.
Address this by amending the tick_nohz_tick_stopped() check in question
with a tick_nohz_full_cpu() one to avoid effectively ignoring all
shallow idle states on nohz_full CPUs. While doing so introduces a risk
of getting stuck in a shallow idle state for a long time, that only
affects energy efficiency, but the current behavior potentially hurts
both energy efficiency and performance, which is arguably the priority
for nohz_full CPUs.
While at it, add a comment explaining the logic in teo_state_ok().
Link: https://lore.kernel.org/linux-pm/2244365.irdbgypaU6@rafael.j.wysocki/ [1]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpuidle/governors/teo.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -227,9 +227,17 @@
cpu_data->total += PULSE;
}
-static bool teo_state_ok(int i, struct cpuidle_driver *drv)
+static bool teo_state_ok(int i, struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
- return !tick_nohz_tick_stopped() ||
+ /*
+ * If the scheduler tick has been stopped already, avoid selecting idle
+ * states with target residency below the tick period length under the
+ * assumption that the CPU is likely to be idle sufficiently long for
+ * the tick to be stopped again (or the tick would not have been
+ * stopped previously in the first place). However, do not do that on
+ * nohz_full CPUs where the above assumption does not hold.
+ */
+ return !tick_nohz_tick_stopped() || tick_nohz_full_cpu(dev->cpu) ||
drv->states[i].target_residency_ns >= TICK_NSEC;
}
@@ -379,7 +387,7 @@
* shallow or disabled, in which case take the
* first enabled state that is deep enough.
*/
- if (teo_state_ok(i, drv) &&
+ if (teo_state_ok(i, drv, dev) &&
!dev->states_usage[i].disable) {
idx = i;
break;
@@ -391,7 +399,7 @@
if (dev->states_usage[i].disable)
continue;
- if (teo_state_ok(i, drv)) {
+ if (teo_state_ok(i, drv, dev)) {
/*
* The current state is deep enough, but still
* there may be a better one.
@@ -460,7 +468,7 @@
*/
if (drv->states[idx].target_residency_ns > duration_ns) {
i = teo_find_shallower_state(drv, dev, idx, duration_ns, false);
- if (teo_state_ok(i, drv))
+ if (teo_state_ok(i, drv, dev))
idx = i;
}
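[Editorial note: the amended predicate can be modeled in isolation. This is a
simplified sketch; the real teo_state_ok() derives these inputs from
tick_nohz_tick_stopped(), tick_nohz_full_cpu() and the driver's state table,
and the 1 ms TICK_NSEC value is an illustrative assumption.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC 1000000ULL  /* illustrative tick period: 1 ms */

/*
 * A candidate state is acceptable when the tick still runs (the safety
 * net against getting stuck in a shallow state is present), or the CPU
 * is nohz_full (a stopped tick carries no information there), or the
 * state is deep enough not to need the safety net at all.
 */
static bool state_ok(bool tick_stopped, bool nohz_full_cpu,
                     uint64_t target_residency_ns)
{
    return !tick_stopped || nohz_full_cpu ||
           target_residency_ns >= TICK_NSEC;
}
```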
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
2025-08-28 20:16 ` [PATCH v1] cpuidle: governors: teo: " Rafael J. Wysocki
@ 2025-08-29 19:37 ` Rafael J. Wysocki
2025-08-31 21:30 ` Christian Loehle
0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-08-29 19:37 UTC (permalink / raw)
To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle
On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> This change follows an analogous modification of the menu governor [1].
>
> Namely, when the governor runs on a nohz_full CPU and there are no user
> space timers in the workload on that CPU, it ends up selecting idle
> states with target residency values above TICK_NSEC, or the deepest
> enabled idle state in the absence of any of those, all the time due to
> a tick_nohz_tick_stopped() check designed for running on CPUs where the
> tick is not permanently disabled. In that case, the fact that the tick
> has been stopped means that the CPU was expected to be idle sufficiently
> long previously, so it is not unreasonable to expect it to be idle
> sufficiently long again, but this inference does not apply to nohz_full
> CPUs.
>
> In some cases, latency in the workload grows undesirably as a result of
> selecting overly deep idle states, and the workload may also consume
> more energy than necessary if the CPU does not spend enough time in the
> selected deep idle state.
>
> Address this by amending the tick_nohz_tick_stopped() check in question
> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
> shallow idle states on nohz_full CPUs. While doing so introduces a risk
> of getting stuck in a shallow idle state for a long time, that only
> affects energy efficiently, but the current behavior potentially hurts
> both energy efficiency and performance that is arguably the priority for
> nohz_full CPUs.
This change is likely to break the use case in which CPU isolation is
used for power management reasons, to prevent CPUs from running any
code and so to save energy.
In that case, going into the deepest state every time on nohz_full
CPUs is a feature, so it can't be changed unconditionally.
For this reason, I'm not going to apply this patch and I'm going to
drop the menu governor one below.
The only way I can see to allow everyone to do what they want/need
would be to add a control knob for adjusting the behavior of cpuidle
governors regarding the handling of nohz_full CPUs.
> While at it, add a comment explaining the logic in teo_state_ok().
>
> Link: https://lore.kernel.org/linux-pm/2244365.irdbgypaU6@rafael.j.wysocki/ [1]
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> drivers/cpuidle/governors/teo.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -227,9 +227,17 @@
> cpu_data->total += PULSE;
> }
>
> -static bool teo_state_ok(int i, struct cpuidle_driver *drv)
> +static bool teo_state_ok(int i, struct cpuidle_driver *drv, struct cpuidle_device *dev)
> {
> - return !tick_nohz_tick_stopped() ||
> + /*
> + * If the scheduler tick has been stopped already, avoid selecting idle
> + * states with target residency below the tick period length under the
> + * assumption that the CPU is likely to be idle sufficiently long for
> + * the tick to be stopped again (or the tick would not have been
> + * stopped previously in the first place). However, do not do that on
> + * nohz_full CPUs where the above assumption does not hold.
> + */
> + return !tick_nohz_tick_stopped() || tick_nohz_full_cpu(dev->cpu) ||
> drv->states[i].target_residency_ns >= TICK_NSEC;
> }
>
> @@ -379,7 +387,7 @@
> * shallow or disabled, in which case take the
> * first enabled state that is deep enough.
> */
> - if (teo_state_ok(i, drv) &&
> + if (teo_state_ok(i, drv, dev) &&
> !dev->states_usage[i].disable) {
> idx = i;
> break;
> @@ -391,7 +399,7 @@
> if (dev->states_usage[i].disable)
> continue;
>
> - if (teo_state_ok(i, drv)) {
> + if (teo_state_ok(i, drv, dev)) {
> /*
> * The current state is deep enough, but still
> * there may be a better one.
> @@ -460,7 +468,7 @@
> */
> if (drv->states[idx].target_residency_ns > duration_ns) {
> i = teo_find_shallower_state(drv, dev, idx, duration_ns, false);
> - if (teo_state_ok(i, drv))
> + if (teo_state_ok(i, drv, dev))
> idx = i;
> }
>
>
>
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
2025-08-29 19:37 ` Rafael J. Wysocki
@ 2025-08-31 21:30 ` Christian Loehle
2025-09-01 19:08 ` Rafael J. Wysocki
0 siblings, 1 reply; 27+ messages in thread
From: Christian Loehle @ 2025-08-31 21:30 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra
On 8/29/25 20:37, Rafael J. Wysocki wrote:
> On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> This change follows an analogous modification of the menu governor [1].
>>
>> Namely, when the governor runs on a nohz_full CPU and there are no user
>> space timers in the workload on that CPU, it ends up selecting idle
>> states with target residency values above TICK_NSEC, or the deepest
>> enabled idle state in the absence of any of those, all the time due to
>> a tick_nohz_tick_stopped() check designed for running on CPUs where the
>> tick is not permanently disabled. In that case, the fact that the tick
>> has been stopped means that the CPU was expected to be idle sufficiently
>> long previously, so it is not unreasonable to expect it to be idle
>> sufficiently long again, but this inference does not apply to nohz_full
>> CPUs.
>>
>> In some cases, latency in the workload grows undesirably as a result of
>> selecting overly deep idle states, and the workload may also consume
>> more energy than necessary if the CPU does not spend enough time in the
>> selected deep idle state.
>>
>> Address this by amending the tick_nohz_tick_stopped() check in question
>> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
>> shallow idle states on nohz_full CPUs. While doing so introduces a risk
>> of getting stuck in a shallow idle state for a long time, that only
>> affects energy efficiency, but the current behavior potentially hurts
>> both energy efficiency and performance, which is arguably the priority for
>> nohz_full CPUs.
>
> This change is likely to break the use case in which CPU isolation is
> used for power management reasons, to prevent CPUs from running any
> code and so to save energy.
>
> In that case, going into the deepest state every time on nohz_full
> CPUs is a feature, so it can't be changed unconditionally.
>
> For this reason, I'm not going to apply this patch and I'm going to
> drop the menu governor one below.
>
> The only way I can see to allow everyone to do what they want/need
> would be to add a control knob for adjusting the behavior of cpuidle
> governors regarding the handling of nohz_full CPUs.
But then what's the advantage instead of just using
/sys/devices/system/cpu/cpuX/power/latency
for the nohz_full CPUs (if you don't want the current 'over-eagerly
selecting deepest state on nohz_full')?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
2025-08-31 21:30 ` Christian Loehle
@ 2025-09-01 19:08 ` Rafael J. Wysocki
0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2025-09-01 19:08 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
Peter Zijlstra
On Sun, Aug 31, 2025 at 11:30 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/29/25 20:37, Rafael J. Wysocki wrote:
> > On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> This change follows an analogous modification of the menu governor [1].
> >>
> >> Namely, when the governor runs on a nohz_full CPU and there are no user
> >> space timers in the workload on that CPU, it ends up selecting idle
> >> states with target residency values above TICK_NSEC, or the deepest
> >> enabled idle state in the absence of any of those, all the time due to
> >> a tick_nohz_tick_stopped() check designed for running on CPUs where the
> >> tick is not permanently disabled. In that case, the fact that the tick
> >> has been stopped means that the CPU was expected to be idle sufficiently
> >> long previously, so it is not unreasonable to expect it to be idle
> >> sufficiently long again, but this inference does not apply to nohz_full
> >> CPUs.
> >>
> >> In some cases, latency in the workload grows undesirably as a result of
> >> selecting overly deep idle states, and the workload may also consume
> >> more energy than necessary if the CPU does not spend enough time in the
> >> selected deep idle state.
> >>
> >> Address this by amending the tick_nohz_tick_stopped() check in question
> >> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
> >> shallow idle states on nohz_full CPUs. While doing so introduces a risk
> >> of getting stuck in a shallow idle state for a long time, that only
> >> affects energy efficiency, whereas the current behavior potentially hurts
> >> both energy efficiency and performance, which is arguably the priority for
> >> nohz_full CPUs.
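The amended check described in the patch can be sketched in plain C. The two predicates below are stand-in stubs for the kernel's tick_nohz_tick_stopped() and tick_nohz_full_cpu() helpers, not the real APIs; this is only a model of the decision, assuming the condition is combined as described above:

```c
#include <stdbool.h>

/* Stand-in stubs for the kernel predicates discussed in the patch. */
static bool tick_stopped;      /* would be tick_nohz_tick_stopped() */
static bool cpu_is_nohz_full;  /* would be tick_nohz_full_cpu() */

/*
 * A stopped tick is taken as evidence that the CPU was expected to stay
 * idle sufficiently long, so shallow idle states may be skipped.  On
 * nohz_full CPUs the tick is stopped permanently, so that inference
 * does not hold and the check is suppressed there.
 */
static bool assume_long_idle(void)
{
	return tick_stopped && !cpu_is_nohz_full;
}
```

With this amendment, a nohz_full CPU no longer ignores all shallow idle states merely because its tick is (always) stopped.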
> >
> > This change is likely to break the use case in which CPU isolation is
> > used for power management reasons, to prevent CPUs from running any
> > code and so to save energy.
> >
> > In that case, going into the deepest state every time on nohz_full
> > CPUs is a feature, so it can't be changed unconditionally.
> >
> > For this reason, I'm not going to apply this patch and I'm going to
> > drop the menu governor one below.
> >
> > The only way to allow everyone to do what they want/need I can see
> > would be to add a control knob for adjusting the behavior of cpuidle
> > governors regarding the handling of nohz_full CPUs.
>
> But then what's the advantage over just using
> /sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us
> for the nohz_full CPUs (if you don't want the current over-eager
> selection of the deepest state on nohz_full)?
The difference is that a PM QoS limit prevents the CPU from entering
high-latency idle states at all (depending on the limit value),
whereas without it the governor can still request a deep idle state
when the observed wakeup pattern justifies it. If wakeup patterns in
the workload change over time, it may be quite cumbersome to update
the PM QoS limit every time to follow those changes (and the operator
may not even be aware of them).
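The contrast between a static PM QoS cap and the governor's dynamic choice can be illustrated with a toy model. The state exit latencies below are borrowed from the three-state example earlier in the thread (0, 5, and 200 us); this is not the kernel code, just a sketch of why a hard latency cap behaves differently:

```c
/* Illustrative idle-state table: exit latency in microseconds. */
static const unsigned int exit_latency_us[] = { 0, 5, 200 };
#define NR_STATES 3

/*
 * A PM QoS limit is a hard cap: states whose exit latency exceeds it
 * are never eligible, no matter how long the governor predicts the
 * CPU will stay idle.
 */
static int deepest_allowed(unsigned int latency_limit_us)
{
	int i, idx = 0;

	for (i = 0; i < NR_STATES; i++)
		if (exit_latency_us[i] <= latency_limit_us)
			idx = i;
	return idx;
}
```

With a 20 us limit, state 2 (200 us exit latency) is never eligible; with no effective limit, the governor may still select it whenever the predicted idle duration warrants, which is the dynamic behavior the reply above describes.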