linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup
@ 2025-08-13 10:21 Rafael J. Wysocki
  2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-13 10:21 UTC (permalink / raw)
  To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle

Hi All,

This series fixes a minor issue in the menu governor, related to violating
the PM QoS CPU latency limit in some rare cases, rearranges the code in
menu_select() to get rid of some excessive indentation, and adjusts the
handling of a corner case related to nohz_full CPUs.

Please refer to the changelogs of individual patches for details.

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
  2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
@ 2025-08-13 10:25 ` Rafael J. Wysocki
  2025-08-13 19:13   ` Christian Loehle
  2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-13 10:25 UTC (permalink / raw)
  To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Occasionally, the exit latency of the idle state selected by the menu
governor may exceed the PM QoS CPU wakeup latency limit.  Namely, if the
scheduler tick has been stopped already and predicted_ns is greater than
the tick period length, the governor may return an idle state whose exit
latency exceeds latency_req because that decision is made before
checking the current idle state's exit latency.

For instance, say that there are 3 idle states, 0, 1, and 2.  For idle
states 0 and 1, the exit latency is equal to the target residency and
the values are 0 and 5 us, respectively.  State 2 is deeper and has the
exit latency and target residency of 200 us and 2 ms (which is greater
than the tick period length), respectively.

Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
limit is 20 us.  After the first two iterations of the main loop in
menu_select(), idx becomes 1 and in the third iteration of it the target
residency of the current state (state 2) is greater than predicted_ns.
State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
so the check on whether or not the tick has been stopped is done.  Say
that the tick has been stopped already and there are no imminent timers
(that is, delta_tick is greater than the target residency of state 2).
In that case, idx becomes 2 and it is returned immediately, but the exit
latency of state 2 exceeds the latency limit.

Address this issue by modifying the code to compare the exit latency of
the current idle state (idle state i) with the latency limit before
comparing its target residecy with predicted_ns, which allows one
more exit_latency_ns check that becomes redundant to be dropped.

However, after the above change, latency_req cannot take the predicted_ns
value any more, which takes place after commit 38f83090f515 ("cpuidle:
menu: Remove iowait influence"), because it may cause a polling state
to be returned prematurely.

In the context of the previous example say that predicted_ns is 3000 and
the PM QoS latency limit is still 20 us.  Additionally, say that idle
state 0 is a polling one.  Moving the exit_latency_ns check before the
target_residency_ns one causes the loop to terminate in the second
iteration, before the target_residency_ns check, so idle state 0 will be
returned even though previously state 1 would be returned if there were
no imminent timers.

For this reason, remove the assignment of the predicted_ns value to
latency_req from the code.

Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |   29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,20 +287,15 @@
 		return 0;
 	}
 
-	if (tick_nohz_tick_stopped()) {
-		/*
-		 * If the tick is already stopped, the cost of possible short
-		 * idle duration misprediction is much higher, because the CPU
-		 * may be stuck in a shallow idle state for a long time as a
-		 * result of it.  In that case say we might mispredict and use
-		 * the known time till the closest timer event for the idle
-		 * state selection.
-		 */
-		if (predicted_ns < TICK_NSEC)
-			predicted_ns = data->next_timer_ns;
-	} else if (latency_req > predicted_ns) {
-		latency_req = predicted_ns;
-	}
+	/*
+	 * If the tick is already stopped, the cost of possible short idle
+	 * duration misprediction is much higher, because the CPU may be stuck
+	 * in a shallow idle state for a long time as a result of it.  In that
+	 * case, say we might mispredict and use the known time till the closest
+	 * timer event for the idle state selection.
+	 */
+	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
+		predicted_ns = data->next_timer_ns;
 
 	/*
 	 * Find the idle state with the lowest power while satisfying
@@ -316,13 +311,15 @@
 		if (idx == -1)
 			idx = i; /* first enabled state */
 
+		if (s->exit_latency_ns > latency_req)
+			break;
+
 		if (s->target_residency_ns > predicted_ns) {
 			/*
 			 * Use a physical idle state, not busy polling, unless
 			 * a timer is going to trigger soon enough.
 			 */
 			if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
-			    s->exit_latency_ns <= latency_req &&
 			    s->target_residency_ns <= data->next_timer_ns) {
 				predicted_ns = s->target_residency_ns;
 				idx = i;
@@ -354,8 +351,6 @@
 
 			return idx;
 		}
-		if (s->exit_latency_ns > latency_req)
-			break;
 
 		idx = i;
 	}


* [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select()
  2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
  2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
@ 2025-08-13 10:26 ` Rafael J. Wysocki
  2025-08-14 13:00   ` Christian Loehle
  2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
  2025-08-28 20:16 ` [PATCH v1] cpuidle: governors: teo: " Rafael J. Wysocki
  3 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-13 10:26 UTC (permalink / raw)
  To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reduce the indentation level in the main loop of menu_select() by
rearranging some checks and assignments in it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |   72 ++++++++++++++++++++-------------------
 1 file changed, 37 insertions(+), 35 deletions(-)

--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -314,45 +314,47 @@
 		if (s->exit_latency_ns > latency_req)
 			break;
 
-		if (s->target_residency_ns > predicted_ns) {
-			/*
-			 * Use a physical idle state, not busy polling, unless
-			 * a timer is going to trigger soon enough.
-			 */
-			if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
-			    s->target_residency_ns <= data->next_timer_ns) {
-				predicted_ns = s->target_residency_ns;
-				idx = i;
-				break;
-			}
-			if (predicted_ns < TICK_NSEC)
-				break;
-
-			if (!tick_nohz_tick_stopped()) {
-				/*
-				 * If the state selected so far is shallow,
-				 * waking up early won't hurt, so retain the
-				 * tick in that case and let the governor run
-				 * again in the next iteration of the loop.
-				 */
-				predicted_ns = drv->states[idx].target_residency_ns;
-				break;
-			}
+		if (s->target_residency_ns <= predicted_ns) {
+			idx = i;
+			continue;
+		}
+
+		/*
+		 * Use a physical idle state, not busy polling, unless a timer
+		 * is going to trigger soon enough.
+		 */
+		if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
+		    s->target_residency_ns <= data->next_timer_ns) {
+			predicted_ns = s->target_residency_ns;
+			idx = i;
+			break;
+		}
 
+		if (predicted_ns < TICK_NSEC)
+			break;
+
+		if (!tick_nohz_tick_stopped()) {
 			/*
-			 * If the state selected so far is shallow and this
-			 * state's target residency matches the time till the
-			 * closest timer event, select this one to avoid getting
-			 * stuck in the shallow one for too long.
+			 * If the state selected so far is shallow, waking up
+			 * early won't hurt, so retain the tick in that case and
+			 * let the governor run again in the next iteration of
+			 * the idle loop.
 			 */
-			if (drv->states[idx].target_residency_ns < TICK_NSEC &&
-			    s->target_residency_ns <= delta_tick)
-				idx = i;
-
-			return idx;
+			predicted_ns = drv->states[idx].target_residency_ns;
+			break;
 		}
 
-		idx = i;
+		/*
+		 * If the state selected so far is shallow and this state's
+		 * target residency matches the time till the closest timer
+		 * event, select this one to avoid getting stuck in the shallow
+		 * one for too long.
+		 */
+		if (drv->states[idx].target_residency_ns < TICK_NSEC &&
+		    s->target_residency_ns <= delta_tick)
+			idx = i;
+
+		return idx;
 	}
 
 	if (idx == -1)


* [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
  2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
  2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
  2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
@ 2025-08-13 10:29 ` Rafael J. Wysocki
  2025-08-14 14:09   ` Christian Loehle
  2025-08-28 20:16 ` [PATCH v1] cpuidle: governors: teo: " Rafael J. Wysocki
  3 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-13 10:29 UTC (permalink / raw)
  To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

When the menu governor runs on a nohz_full CPU and there are no user
space timers in the workload on that CPU, it ends up selecting idle
states with target residency values above TICK_NSEC all the time due to
a tick_nohz_tick_stopped() check designed for a different use case.
Namely, on nohz_full CPUs the fact that the tick has been stopped does
not actually mean anything in particular, whereas in the other case it
indicates that previously the CPU was expected to be idle sufficiently
long for the tick to be stopped, so it is not unreasonable to expect
it to be idle beyond the tick period length again.
In some cases, this behavior causes latency in the workload to grow
undesirably.  It may also cause the workload to consume more energy
than necessary if the CPU does not spend enough time in the selected
deep idle states.

Address this by amending the tick_nohz_tick_stopped() check in question
with a tick_nohz_full_cpu() one to avoid using the time till the next
timer event as the predicted_ns value all the time on nohz_full CPUs.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -293,8 +293,18 @@
 	 * in a shallow idle state for a long time as a result of it.  In that
 	 * case, say we might mispredict and use the known time till the closest
 	 * timer event for the idle state selection.
+	 *
+	 * However, on nohz_full CPUs the tick does not run as a rule and the
+	 * time till the closest timer event may always be effectively infinite,
+	 * so using it as a replacement for the predicted idle duration would
+	 * effectively always cause the prediction results to be discarded and
+	 * deep idle states to be selected all the time.  That might introduce
+	 * unwanted latency into the workload and cause more energy than
+	 * necessary to be consumed if the discarded prediction results are
+	 * actually accurate, so skip nohz_full CPUs here.
 	 */
-	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
+	if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
+	    predicted_ns < TICK_NSEC)
 		predicted_ns = data->next_timer_ns;
 
 	/*


* Re: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
  2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
@ 2025-08-13 19:13   ` Christian Loehle
  2025-08-18 17:08     ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Loehle @ 2025-08-13 19:13 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra

On 8/13/25 11:25, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Occasionally, the exit latency of the idle state selected by the menu
> governor may exceed the PM QoS CPU wakeup latency limit.  Namely, if the
> scheduler tick has been stopped already and predicted_ns is greater than
> the tick period length, the governor may return an idle state whose exit
> latency exceeds latency_req because that decision is made before
> checking the current idle state's exit latency.
> 
> For instance, say that there are 3 idle states, 0, 1, and 2.  For idle
> states 0 and 1, the exit latency is equal to the target residency and
> the values are 0 and 5 us, respectively.  State 2 is deeper and has the
> exit latency and target residency of 200 us and 2 ms (which is greater
> than the tick period length), respectively.
> 
> Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
> limit is 20 us.  After the first two iterations of the main loop in
> menu_select(), idx becomes 1 and in the third iteration of it the target
Can drop "of it" here?

> residency of the current state (state 2) is greater than predicted_ns.
> State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
> so the check on whether or not the tick has been stopped is done.  Say
> that the tick has been stopped already and there are no imminent timers
> (that is, delta_tick is greater than the target residency of state 2).
> In that case, idx becomes 2 and it is returned immediately, but the exit
> latency of state 2 exceeds the latency limit.
> 
> Address this issue by modifying the code to compare the exit latency of
> the current idle state (idle state i) with the latency limit before
> comparing its target residecy with predicted_ns, which allows one
residency

> more exit_latency_ns check that becomes redundant to be dropped.
> 
> However, after the above change, latency_req cannot take the predicted_ns
> value any more, which takes place after commit 38f83090f515 ("cpuidle:
> menu: Remove iowait influence"), because it may cause a polling state
> to be returned prematurely.
> 
> In the context of the previous example say that predicted_ns is 3000 and
> the PM QoS latency limit is still 20 us.  Additionally, say that idle
> state 0 is a polling one.  Moving the exit_latency_ns check before the
> target_residency_ns one causes the loop to terminate in the second
> iteration, before the target_residency_ns check, so idle state 0 will be
> returned even though previously state 1 would be returned if there were
> no imminent timers.
> 
> For this reason, remove the assignment of the predicted_ns value to
> latency_req from the code.
> 
> Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/menu.c |   29 ++++++++++++-----------------
>  1 file changed, 12 insertions(+), 17 deletions(-)
> 
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -287,20 +287,15 @@
>  		return 0;
>  	}
>  
> -	if (tick_nohz_tick_stopped()) {
> -		/*
> -		 * If the tick is already stopped, the cost of possible short
> -		 * idle duration misprediction is much higher, because the CPU
> -		 * may be stuck in a shallow idle state for a long time as a
> -		 * result of it.  In that case say we might mispredict and use
> -		 * the known time till the closest timer event for the idle
> -		 * state selection.
> -		 */
> -		if (predicted_ns < TICK_NSEC)
> -			predicted_ns = data->next_timer_ns;
> -	} else if (latency_req > predicted_ns) {
> -		latency_req = predicted_ns;
> -	}
> +	/*
> +	 * If the tick is already stopped, the cost of possible short idle
> +	 * duration misprediction is much higher, because the CPU may be stuck
> +	 * in a shallow idle state for a long time as a result of it.  In that
> +	 * case, say we might mispredict and use the known time till the closest
> +	 * timer event for the idle state selection.
> +	 */
> +	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> +		predicted_ns = data->next_timer_ns;
>  
>  	/*
>  	 * Find the idle state with the lowest power while satisfying
> @@ -316,13 +311,15 @@
>  		if (idx == -1)
>  			idx = i; /* first enabled state */
>  
> +		if (s->exit_latency_ns > latency_req)
> +			break;
> +
>  		if (s->target_residency_ns > predicted_ns) {
>  			/*
>  			 * Use a physical idle state, not busy polling, unless
>  			 * a timer is going to trigger soon enough.
>  			 */
>  			if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> -			    s->exit_latency_ns <= latency_req &&
>  			    s->target_residency_ns <= data->next_timer_ns) {
>  				predicted_ns = s->target_residency_ns;
>  				idx = i;
> @@ -354,8 +351,6 @@
>  
>  			return idx;
>  		}
> -		if (s->exit_latency_ns > latency_req)
> -			break;
>  
>  		idx = i;
>  	}
> 
> 
> 

Good catch!
Reviewed-by: Christian Loehle <christian.loehle@arm.com>


* Re: [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select()
  2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
@ 2025-08-14 13:00   ` Christian Loehle
  0 siblings, 0 replies; 15+ messages in thread
From: Christian Loehle @ 2025-08-14 13:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra

On 8/13/25 11:26, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Reduce the indentation level in the main loop of menu_select() by
> rearranging some checks and assignments in it.
> 
> No intentional functional impact.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


Reviewed-by: Christian Loehle <christian.loehle@arm.com>

> ---
>  drivers/cpuidle/governors/menu.c |   72 ++++++++++++++++++++-------------------
>  1 file changed, 37 insertions(+), 35 deletions(-)
> 
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -314,45 +314,47 @@
>  		if (s->exit_latency_ns > latency_req)
>  			break;
>  
> -		if (s->target_residency_ns > predicted_ns) {
> -			/*
> -			 * Use a physical idle state, not busy polling, unless
> -			 * a timer is going to trigger soon enough.
> -			 */
> -			if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> -			    s->target_residency_ns <= data->next_timer_ns) {
> -				predicted_ns = s->target_residency_ns;
> -				idx = i;
> -				break;
> -			}
> -			if (predicted_ns < TICK_NSEC)
> -				break;
> -
> -			if (!tick_nohz_tick_stopped()) {
> -				/*
> -				 * If the state selected so far is shallow,
> -				 * waking up early won't hurt, so retain the
> -				 * tick in that case and let the governor run
> -				 * again in the next iteration of the loop.
> -				 */
> -				predicted_ns = drv->states[idx].target_residency_ns;
> -				break;
> -			}
> +		if (s->target_residency_ns <= predicted_ns) {
> +			idx = i;
> +			continue;
> +		}
> +
> +		/*
> +		 * Use a physical idle state, not busy polling, unless a timer
> +		 * is going to trigger soon enough.
> +		 */
> +		if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> +		    s->target_residency_ns <= data->next_timer_ns) {
> +			predicted_ns = s->target_residency_ns;
> +			idx = i;
> +			break;
> +		}
>  
> +		if (predicted_ns < TICK_NSEC)
> +			break;
> +
> +		if (!tick_nohz_tick_stopped()) {
>  			/*
> -			 * If the state selected so far is shallow and this
> -			 * state's target residency matches the time till the
> -			 * closest timer event, select this one to avoid getting
> -			 * stuck in the shallow one for too long.
> +			 * If the state selected so far is shallow, waking up
> +			 * early won't hurt, so retain the tick in that case and
> +			 * let the governor run again in the next iteration of
> +			 * the idle loop.
>  			 */
> -			if (drv->states[idx].target_residency_ns < TICK_NSEC &&
> -			    s->target_residency_ns <= delta_tick)
> -				idx = i;
> -
> -			return idx;
> +			predicted_ns = drv->states[idx].target_residency_ns;
> +			break;
>  		}
>  
> -		idx = i;
> +		/*
> +		 * If the state selected so far is shallow and this state's
> +		 * target residency matches the time till the closest timer
> +		 * event, select this one to avoid getting stuck in the shallow
> +		 * one for too long.
> +		 */
> +		if (drv->states[idx].target_residency_ns < TICK_NSEC &&
> +		    s->target_residency_ns <= delta_tick)
> +			idx = i;
> +
> +		return idx;
>  	}
>  
>  	if (idx == -1)
> 
> 
> 


* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
  2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
@ 2025-08-14 14:09   ` Christian Loehle
  2025-08-18 17:41     ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Loehle @ 2025-08-14 14:09 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra

On 8/13/25 11:29, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> When the menu governor runs on a nohz_full CPU and there are no user
> space timers in the workload on that CPU, it ends up selecting idle
> states with target residency values above TICK_NSEC all the time due to
> a tick_nohz_tick_stopped() check designed for a different use case.
> Namely, on nohz_full CPUs the fact that the tick has been stopped does
> not actually mean anything in particular, whereas in the other case it
> indicates that previously the CPU was expected to be idle sufficiently
> long for the tick to be stopped, so it is not unreasonable to expect
> it to be idle beyond the tick period length again.
>   
> In some cases, this behavior causes latency in the workload to grow
> undesirably.  It may also cause the workload to consume more energy
> than necessary if the CPU does not spend enough time in the selected
> deep idle states.
> 
> Address this by amending the tick_nohz_tick_stopped() check in question
> with a tick_nohz_full_cpu() one to avoid using the time till the next
> timer event as the predicted_ns value all the time on nohz_full CPUs.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/menu.c |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -293,8 +293,18 @@
>  	 * in a shallow idle state for a long time as a result of it.  In that
>  	 * case, say we might mispredict and use the known time till the closest
>  	 * timer event for the idle state selection.
> +	 *
> +	 * However, on nohz_full CPUs the tick does not run as a rule and the
> +	 * time till the closest timer event may always be effectively infinite,
> +	 * so using it as a replacement for the predicted idle duration would
> +	 * effectively always cause the prediction results to be discarded and
> +	 * deep idle states to be selected all the time.  That might introduce
> +	 * unwanted latency into the workload and cause more energy than
> +	 * necessary to be consumed if the discarded prediction results are
> +	 * actually accurate, so skip nohz_full CPUs here.
>  	 */
> -	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> +	if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> +	    predicted_ns < TICK_NSEC)
>  		predicted_ns = data->next_timer_ns;
>  
>  	/*
> 
> 
> 

OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
get_typical_interval() (which may suggest picking a shallow state based on
previous wakeup patterns) only then to never wake up again?


* Re: [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency
  2025-08-13 19:13   ` Christian Loehle
@ 2025-08-18 17:08     ` Rafael J. Wysocki
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-18 17:08 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
	Peter Zijlstra

On Wed, Aug 13, 2025 at 9:13 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/13/25 11:25, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Occasionally, the exit latency of the idle state selected by the menu
> > governor may exceed the PM QoS CPU wakeup latency limit.  Namely, if the
> > scheduler tick has been stopped already and predicted_ns is greater than
> > the tick period length, the governor may return an idle state whose exit
> > latency exceeds latency_req because that decision is made before
> > checking the current idle state's exit latency.
> >
> > For instance, say that there are 3 idle states, 0, 1, and 2.  For idle
> > states 0 and 1, the exit latency is equal to the target residency and
> > the values are 0 and 5 us, respectively.  State 2 is deeper and has the
> > exit latency and target residency of 200 us and 2 ms (which is greater
> > than the tick period length), respectively.
> >
> > Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
> > limit is 20 us.  After the first two iterations of the main loop in
> > menu_select(), idx becomes 1 and in the third iteration of it the target
> Can drop "of it" here?

But it also doesn't really hurt I think.

> > residency of the current state (state 2) is greater than predicted_ns.
> > State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
> > so the check on whether or not the tick has been stopped is done.  Say
> > that the tick has been stopped already and there are no imminent timers
> > (that is, delta_tick is greater than the target residency of state 2).
> > In that case, idx becomes 2 and it is returned immediately, but the exit
> > latency of state 2 exceeds the latency limit.
> >
> > Address this issue by modifying the code to compare the exit latency of
> > the current idle state (idle state i) with the latency limit before
> > comparing its target residecy with predicted_ns, which allows one
> residency

Fixed while applying.

> > more exit_latency_ns check that becomes redundant to be dropped.
> >
> > However, after the above change, latency_req cannot take the predicted_ns
> > value any more, which takes place after commit 38f83090f515 ("cpuidle:
> > menu: Remove iowait influence"), because it may cause a polling state
> > to be returned prematurely.
> >
> > In the context of the previous example say that predicted_ns is 3000 and
> > the PM QoS latency limit is still 20 us.  Additionally, say that idle
> > state 0 is a polling one.  Moving the exit_latency_ns check before the
> > target_residency_ns one causes the loop to terminate in the second
> > iteration, before the target_residency_ns check, so idle state 0 will be
> > returned even though previously state 1 would be returned if there were
> > no imminent timers.
> >
> > For this reason, remove the assignment of the predicted_ns value to
> > latency_req from the code.
> >
> > Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively")
> > Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/cpuidle/governors/menu.c |   29 ++++++++++++-----------------
> >  1 file changed, 12 insertions(+), 17 deletions(-)
> >
> > --- a/drivers/cpuidle/governors/menu.c
> > +++ b/drivers/cpuidle/governors/menu.c
> > @@ -287,20 +287,15 @@
> >               return 0;
> >       }
> >
> > -     if (tick_nohz_tick_stopped()) {
> > -             /*
> > -              * If the tick is already stopped, the cost of possible short
> > -              * idle duration misprediction is much higher, because the CPU
> > -              * may be stuck in a shallow idle state for a long time as a
> > -              * result of it.  In that case say we might mispredict and use
> > -              * the known time till the closest timer event for the idle
> > -              * state selection.
> > -              */
> > -             if (predicted_ns < TICK_NSEC)
> > -                     predicted_ns = data->next_timer_ns;
> > -     } else if (latency_req > predicted_ns) {
> > -             latency_req = predicted_ns;
> > -     }
> > +     /*
> > +      * If the tick is already stopped, the cost of possible short idle
> > +      * duration misprediction is much higher, because the CPU may be stuck
> > +      * in a shallow idle state for a long time as a result of it.  In that
> > +      * case, say we might mispredict and use the known time till the closest
> > +      * timer event for the idle state selection.
> > +      */
> > +     if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> > +             predicted_ns = data->next_timer_ns;
> >
> >       /*
> >        * Find the idle state with the lowest power while satisfying
> > @@ -316,13 +311,15 @@
> >               if (idx == -1)
> >                       idx = i; /* first enabled state */
> >
> > +             if (s->exit_latency_ns > latency_req)
> > +                     break;
> > +
> >               if (s->target_residency_ns > predicted_ns) {
> >                       /*
> >                        * Use a physical idle state, not busy polling, unless
> >                        * a timer is going to trigger soon enough.
> >                        */
> >                       if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> > -                         s->exit_latency_ns <= latency_req &&
> >                           s->target_residency_ns <= data->next_timer_ns) {
> >                               predicted_ns = s->target_residency_ns;
> >                               idx = i;
> > @@ -354,8 +351,6 @@
> >
> >                       return idx;
> >               }
> > -             if (s->exit_latency_ns > latency_req)
> > -                     break;
> >
> >               idx = i;
> >       }
> >
> >
> >
>
> Good catch!
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>

Thank you!


* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
  2025-08-14 14:09   ` Christian Loehle
@ 2025-08-18 17:41     ` Rafael J. Wysocki
  2025-08-19  9:10       ` Christian Loehle
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-18 17:41 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
	Peter Zijlstra

On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/13/25 11:29, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > When the menu governor runs on a nohz_full CPU and there are no user
> > space timers in the workload on that CPU, it ends up selecting idle
> > states with target residency values above TICK_NSEC all the time due to
> > a tick_nohz_tick_stopped() check designed for a different use case.
> > Namely, on nohz_full CPUs the fact that the tick has been stopped does
> > not actually mean anything in particular, whereas in the other case it
> > indicates that previously the CPU was expected to be idle sufficiently
> > long for the tick to be stopped, so it is not unreasonable to expect
> > it to be idle beyond the tick period length again.
> >
> > In some cases, this behavior causes latency in the workload to grow
> > undesirably.  It may also cause the workload to consume more energy
> > than necessary if the CPU does not spend enough time in the selected
> > deep idle states.
> >
> > Address this by amending the tick_nohz_tick_stopped() check in question
> > with a tick_nohz_full_cpu() one to avoid using the time till the next
> > timer event as the predicted_ns value all the time on nohz_full CPUs.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/cpuidle/governors/menu.c |   12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > --- a/drivers/cpuidle/governors/menu.c
> > +++ b/drivers/cpuidle/governors/menu.c
> > @@ -293,8 +293,18 @@
> >        * in a shallow idle state for a long time as a result of it.  In that
> >        * case, say we might mispredict and use the known time till the closest
> >        * timer event for the idle state selection.
> > +      *
> > +      * However, on nohz_full CPUs the tick does not run as a rule and the
> > +      * time till the closest timer event may always be effectively infinite,
> > +      * so using it as a replacement for the predicted idle duration would
> > +      * effectively always cause the prediction results to be discarded and
> > +      * deep idle states to be selected all the time.  That might introduce
> > +      * unwanted latency into the workload and cause more energy than
> > +      * necessary to be consumed if the discarded prediction results are
> > +      * actually accurate, so skip nohz_full CPUs here.
> >        */
> > -     if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> > +     if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> > +         predicted_ns < TICK_NSEC)
> >               predicted_ns = data->next_timer_ns;
> >
> >       /*
> >
> >
> >
>
> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
> get_typical_interval() (which may suggest picking a shallow state based on
> previous wakeup patterns) only then to never wake up again?

Yes, there is this risk, but the current behavior is more damaging IMV
because it (potentially) hurts both energy efficiency and performance.

It is also arguably easier for the user to remedy getting stuck in a
shallow idle state than to change the governor's behavior (PM QoS is a bit
too blunt for this).

Moreover, configuring CPUs as nohz_full and leaving them in long idle
may not be the most efficient use of them.
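
[Editorial note: the amended condition in the hunk above reduces to a small pure function, sketched here for illustration. The parameter names and the TICK_NSEC value are illustrative stand-ins, not the kernel's definitions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC 4000000ULL	/* illustrative 4 ms tick period (HZ=250) */

/*
 * Model of the patched predicted_ns adjustment: with the tick stopped,
 * fall back to the known time till the next timer event to guard
 * against short-idle mispredictions -- unless this is a nohz_full CPU,
 * where a stopped tick says nothing about the expected idle duration.
 */
uint64_t adjust_predicted_ns(uint64_t predicted_ns, uint64_t next_timer_ns,
			     bool tick_stopped, bool nohz_full_cpu)
{
	if (tick_stopped && !nohz_full_cpu && predicted_ns < TICK_NSEC)
		return next_timer_ns;

	return predicted_ns;
}
```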


* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
  2025-08-18 17:41     ` Rafael J. Wysocki
@ 2025-08-19  9:10       ` Christian Loehle
  2025-08-19 11:56         ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Loehle @ 2025-08-19  9:10 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux PM, Frederic Weisbecker, LKML, Peter Zijlstra

On 8/18/25 18:41, Rafael J. Wysocki wrote:
> On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 8/13/25 11:29, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> When the menu governor runs on a nohz_full CPU and there are no user
>>> space timers in the workload on that CPU, it ends up selecting idle
>>> states with target residency values above TICK_NSEC all the time due to
>>> a tick_nohz_tick_stopped() check designed for a different use case.
>>> Namely, on nohz_full CPUs the fact that the tick has been stopped does
>>> not actually mean anything in particular, whereas in the other case it
>>> indicates that previously the CPU was expected to be idle sufficiently
>>> long for the tick to be stopped, so it is not unreasonable to expect
>>> it to be idle beyond the tick period length again.
>>>
>>> In some cases, this behavior causes latency in the workload to grow
>>> undesirably.  It may also cause the workload to consume more energy
>>> than necessary if the CPU does not spend enough time in the selected
>>> deep idle states.
>>>
>>> Address this by amending the tick_nohz_tick_stopped() check in question
>>> with a tick_nohz_full_cpu() one to avoid using the time till the next
>>> timer event as the predicted_ns value all the time on nohz_full CPUs.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>  drivers/cpuidle/governors/menu.c |   12 +++++++++++-
>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/menu.c
>>> +++ b/drivers/cpuidle/governors/menu.c
>>> @@ -293,8 +293,18 @@
>>>        * in a shallow idle state for a long time as a result of it.  In that
>>>        * case, say we might mispredict and use the known time till the closest
>>>        * timer event for the idle state selection.
>>> +      *
>>> +      * However, on nohz_full CPUs the tick does not run as a rule and the
>>> +      * time till the closest timer event may always be effectively infinite,
>>> +      * so using it as a replacement for the predicted idle duration would
>>> +      * effectively always cause the prediction results to be discarded and
>>> +      * deep idle states to be selected all the time.  That might introduce
>>> +      * unwanted latency into the workload and cause more energy than
>>> +      * necessary to be consumed if the discarded prediction results are
>>> +      * actually accurate, so skip nohz_full CPUs here.
>>>        */
>>> -     if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
>>> +     if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
>>> +         predicted_ns < TICK_NSEC)
>>>               predicted_ns = data->next_timer_ns;
>>>
>>>       /*
>>>
>>>
>>>
>>
>> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
>> get_typical_interval() (which may suggest picking a shallow state based on
>> previous wakeup patterns) only then to never wake up again?
> 
> Yes, there is this risk, but the current behavior is more damaging IMV
> because it (potentially) hurts both energy efficiency and performance.
> 
> It is also arguably easier for the user to remedy getting stuck in a
> shallow idle state than to change the governor's behavior (PM QoS is a bit
> too blunt for this).
> 
> Moreover, configuring CPUs as nohz_full and leaving them in long idle
> may not be the most efficient use of them.

True; on the other hand, the setup cost for nohz_full is so high that you'd
expect additionally disabling idle states depending on the workload isn't too
much to ask for...
Anyway feel free to go ahead.


* Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
  2025-08-19  9:10       ` Christian Loehle
@ 2025-08-19 11:56         ` Rafael J. Wysocki
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-19 11:56 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
	Peter Zijlstra

On Tue, Aug 19, 2025 at 11:10 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/18/25 18:41, Rafael J. Wysocki wrote:
> > On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 8/13/25 11:29, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> When the menu governor runs on a nohz_full CPU and there are no user
> >>> space timers in the workload on that CPU, it ends up selecting idle
> >>> states with target residency values above TICK_NSEC all the time due to
> >>> a tick_nohz_tick_stopped() check designed for a different use case.
> >>> Namely, on nohz_full CPUs the fact that the tick has been stopped does
> >>> not actually mean anything in particular, whereas in the other case it
> >>> indicates that previously the CPU was expected to be idle sufficiently
> >>> long for the tick to be stopped, so it is not unreasonable to expect
> >>> it to be idle beyond the tick period length again.
> >>>
> >>> In some cases, this behavior causes latency in the workload to grow
> >>> undesirably.  It may also cause the workload to consume more energy
> >>> than necessary if the CPU does not spend enough time in the selected
> >>> deep idle states.
> >>>
> >>> Address this by amending the tick_nohz_tick_stopped() check in question
> >>> with a tick_nohz_full_cpu() one to avoid using the time till the next
> >>> timer event as the predicted_ns value all the time on nohz_full CPUs.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>  drivers/cpuidle/governors/menu.c |   12 +++++++++++-
> >>>  1 file changed, 11 insertions(+), 1 deletion(-)
> >>>
> >>> --- a/drivers/cpuidle/governors/menu.c
> >>> +++ b/drivers/cpuidle/governors/menu.c
> >>> @@ -293,8 +293,18 @@
> >>>        * in a shallow idle state for a long time as a result of it.  In that
> >>>        * case, say we might mispredict and use the known time till the closest
> >>>        * timer event for the idle state selection.
> >>> +      *
> >>> +      * However, on nohz_full CPUs the tick does not run as a rule and the
> >>> +      * time till the closest timer event may always be effectively infinite,
> >>> +      * so using it as a replacement for the predicted idle duration would
> >>> +      * effectively always cause the prediction results to be discarded and
> >>> +      * deep idle states to be selected all the time.  That might introduce
> >>> +      * unwanted latency into the workload and cause more energy than
> >>> +      * necessary to be consumed if the discarded prediction results are
> >>> +      * actually accurate, so skip nohz_full CPUs here.
> >>>        */
> >>> -     if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> >>> +     if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
> >>> +         predicted_ns < TICK_NSEC)
> >>>               predicted_ns = data->next_timer_ns;
> >>>
> >>>       /*
> >>>
> >>>
> >>>
> >>
> >> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
> >> get_typical_interval() (which may suggest picking a shallow state based on
> >> previous wakeup patterns) only then to never wake up again?
> >
> > Yes, there is this risk, but the current behavior is more damaging IMV
> > because it (potentially) hurts both energy efficiency and performance.
> >
> > It is also arguably easier for the user to remedy getting stuck in a
> > shallow idle state than to change the governor's behavior (PM QoS is a bit
> > too blunt for this).
> >
> > Moreover, configuring CPUs as nohz_full and leaving them in long idle
> > may not be the most efficient use of them.
>
> True; on the other hand, the setup cost for nohz_full is so high that you'd
> expect additionally disabling idle states depending on the workload isn't too
> much to ask for...

Apparently, there are cases in which there is enough idle time to ask
for a deep idle state often enough, but as a rule the idle periods are
relatively short.  In those cases, one would need to change the QoS
limit back and forth in anticipation of the "busier" and "calmer"
periods in the workload, which would be kind of equivalent to
implementing an idle governor in user space.

> Anyway feel free to go ahead.

Thank you!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
  2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
@ 2025-08-28 20:16 ` Rafael J. Wysocki
  2025-08-29 19:37   ` Rafael J. Wysocki
  3 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-28 20:16 UTC (permalink / raw)
  To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

This change follows an analogous modification of the menu governor [1].

Namely, when the governor runs on a nohz_full CPU and there are no user
space timers in the workload on that CPU, it ends up selecting idle
states with target residency values above TICK_NSEC, or the deepest
enabled idle state in the absence of any of those, all the time due to
a tick_nohz_tick_stopped() check designed for running on CPUs where the
tick is not permanently disabled.  In that case, the fact that the tick
has been stopped means that the CPU was expected to be idle sufficiently
long previously, so it is not unreasonable to expect it to be idle
sufficiently long again, but this inference does not apply to nohz_full
CPUs.

In some cases, latency in the workload grows undesirably as a result of
selecting overly deep idle states, and the workload may also consume
more energy than necessary if the CPU does not spend enough time in the
selected deep idle state.

Address this by amending the tick_nohz_tick_stopped() check in question
with a tick_nohz_full_cpu() one to avoid effectively ignoring all
shallow idle states on nohz_full CPUs.  While doing so introduces a risk
of getting stuck in a shallow idle state for a long time, that only
affects energy efficiency, whereas the current behavior potentially hurts
both energy efficiency and performance, which is arguably the priority for
nohz_full CPUs.

While at it, add a comment explaining the logic in teo_state_ok().

Link: https://lore.kernel.org/linux-pm/2244365.irdbgypaU6@rafael.j.wysocki/ [1]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -227,9 +227,17 @@
 	cpu_data->total += PULSE;
 }
 
-static bool teo_state_ok(int i, struct cpuidle_driver *drv)
+static bool teo_state_ok(int i, struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
-	return !tick_nohz_tick_stopped() ||
+	/*
+	 * If the scheduler tick has been stopped already, avoid selecting idle
+	 * states with target residency below the tick period length under the
+	 * assumption that the CPU is likely to be idle sufficiently long for
+	 * the tick to be stopped again (or the tick would not have been
+	 * stopped previously in the first place).  However, do not do that on
+	 * nohz_full CPUs where the above assumption does not hold.
+	 */
+	return !tick_nohz_tick_stopped() || tick_nohz_full_cpu(dev->cpu) ||
 		drv->states[i].target_residency_ns >= TICK_NSEC;
 }
 
@@ -379,7 +387,7 @@
 				 * shallow or disabled, in which case take the
 				 * first enabled state that is deep enough.
 				 */
-				if (teo_state_ok(i, drv) &&
+				if (teo_state_ok(i, drv, dev) &&
 				    !dev->states_usage[i].disable) {
 					idx = i;
 					break;
@@ -391,7 +399,7 @@
 			if (dev->states_usage[i].disable)
 				continue;
 
-			if (teo_state_ok(i, drv)) {
+			if (teo_state_ok(i, drv, dev)) {
 				/*
 				 * The current state is deep enough, but still
 				 * there may be a better one.
@@ -460,7 +468,7 @@
 	 */
 	if (drv->states[idx].target_residency_ns > duration_ns) {
 		i = teo_find_shallower_state(drv, dev, idx, duration_ns, false);
-		if (teo_state_ok(i, drv))
+		if (teo_state_ok(i, drv, dev))
 			idx = i;
 	}
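
[Editorial note: the teo counterpart boils down to the same three-way condition. Sketched below as a pure function for illustration only; the names and the TICK_NSEC value are stand-ins, not the kernel's definitions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC 4000000ULL	/* illustrative 4 ms tick period (HZ=250) */

/*
 * Model of the patched teo_state_ok() predicate: a state is acceptable
 * if the tick is still running, or this is a nohz_full CPU (a stopped
 * tick carries no information there), or the state is deep enough that
 * being stuck in it beyond the tick period is not a concern.
 */
bool state_ok(uint64_t target_residency_ns, bool tick_stopped,
	      bool nohz_full_cpu)
{
	return !tick_stopped || nohz_full_cpu ||
	       target_residency_ns >= TICK_NSEC;
}
```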
 





* Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
  2025-08-28 20:16 ` [PATCH v1] cpuidle: governors: teo: " Rafael J. Wysocki
@ 2025-08-29 19:37   ` Rafael J. Wysocki
  2025-08-31 21:30     ` Christian Loehle
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-08-29 19:37 UTC (permalink / raw)
  To: Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Christian Loehle

On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> This change follows an analogous modification of the menu governor [1].
>
> Namely, when the governor runs on a nohz_full CPU and there are no user
> space timers in the workload on that CPU, it ends up selecting idle
> states with target residency values above TICK_NSEC, or the deepest
> enabled idle state in the absence of any of those, all the time due to
> a tick_nohz_tick_stopped() check designed for running on CPUs where the
> tick is not permanently disabled.  In that case, the fact that the tick
> has been stopped means that the CPU was expected to be idle sufficiently
> long previously, so it is not unreasonable to expect it to be idle
> sufficiently long again, but this inference does not apply to nohz_full
> CPUs.
>
> In some cases, latency in the workload grows undesirably as a result of
> selecting overly deep idle states, and the workload may also consume
> more energy than necessary if the CPU does not spend enough time in the
> selected deep idle state.
>
> Address this by amending the tick_nohz_tick_stopped() check in question
> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
> shallow idle states on nohz_full CPUs.  While doing so introduces a risk
> of getting stuck in a shallow idle state for a long time, that only
> affects energy efficiency, whereas the current behavior potentially hurts
> both energy efficiency and performance, which is arguably the priority for
> nohz_full CPUs.

This change is likely to break the use case in which CPU isolation is
used for power management reasons, to prevent CPUs from running any
code and so to save energy.

In that case, going into the deepest state every time on nohz_full
CPUs is a feature, so it can't be changed unconditionally.

For this reason, I'm not going to apply this patch and I'm going to
drop the menu governor one below.

The only way to allow everyone to do what they want/need I can see
would be to add a control knob for adjusting the behavior of cpuidle
governors regarding the handling of nohz_full CPUs.

> While at it, add a comment explaining the logic in teo_state_ok().
>
> Link: https://lore.kernel.org/linux-pm/2244365.irdbgypaU6@rafael.j.wysocki/ [1]
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/teo.c |   18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -227,9 +227,17 @@
>         cpu_data->total += PULSE;
>  }
>
> -static bool teo_state_ok(int i, struct cpuidle_driver *drv)
> +static bool teo_state_ok(int i, struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  {
> -       return !tick_nohz_tick_stopped() ||
> +       /*
> +        * If the scheduler tick has been stopped already, avoid selecting idle
> +        * states with target residency below the tick period length under the
> +        * assumption that the CPU is likely to be idle sufficiently long for
> +        * the tick to be stopped again (or the tick would not have been
> +        * stopped previously in the first place).  However, do not do that on
> +        * nohz_full CPUs where the above assumption does not hold.
> +        */
> +       return !tick_nohz_tick_stopped() || tick_nohz_full_cpu(dev->cpu) ||
>                 drv->states[i].target_residency_ns >= TICK_NSEC;
>  }
>
> @@ -379,7 +387,7 @@
>                                  * shallow or disabled, in which case take the
>                                  * first enabled state that is deep enough.
>                                  */
> -                               if (teo_state_ok(i, drv) &&
> +                               if (teo_state_ok(i, drv, dev) &&
>                                     !dev->states_usage[i].disable) {
>                                         idx = i;
>                                         break;
> @@ -391,7 +399,7 @@
>                         if (dev->states_usage[i].disable)
>                                 continue;
>
> -                       if (teo_state_ok(i, drv)) {
> +                       if (teo_state_ok(i, drv, dev)) {
>                                 /*
>                                  * The current state is deep enough, but still
>                                  * there may be a better one.
> @@ -460,7 +468,7 @@
>          */
>         if (drv->states[idx].target_residency_ns > duration_ns) {
>                 i = teo_find_shallower_state(drv, dev, idx, duration_ns, false);
> -               if (teo_state_ok(i, drv))
> +               if (teo_state_ok(i, drv, dev))
>                         idx = i;
>         }
>
>
>
>
>


* Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
  2025-08-29 19:37   ` Rafael J. Wysocki
@ 2025-08-31 21:30     ` Christian Loehle
  2025-09-01 19:08       ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Loehle @ 2025-08-31 21:30 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM; +Cc: Frederic Weisbecker, LKML, Peter Zijlstra

On 8/29/25 20:37, Rafael J. Wysocki wrote:
> On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> This change follows an analogous modification of the menu governor [1].
>>
>> Namely, when the governor runs on a nohz_full CPU and there are no user
>> space timers in the workload on that CPU, it ends up selecting idle
>> states with target residency values above TICK_NSEC, or the deepest
>> enabled idle state in the absence of any of those, all the time due to
>> a tick_nohz_tick_stopped() check designed for running on CPUs where the
>> tick is not permanently disabled.  In that case, the fact that the tick
>> has been stopped means that the CPU was expected to be idle sufficiently
>> long previously, so it is not unreasonable to expect it to be idle
>> sufficiently long again, but this inference does not apply to nohz_full
>> CPUs.
>>
>> In some cases, latency in the workload grows undesirably as a result of
>> selecting overly deep idle states, and the workload may also consume
>> more energy than necessary if the CPU does not spend enough time in the
>> selected deep idle state.
>>
>> Address this by amending the tick_nohz_tick_stopped() check in question
>> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
>> shallow idle states on nohz_full CPUs.  While doing so introduces a risk
>> of getting stuck in a shallow idle state for a long time, that only
>> affects energy efficiency, whereas the current behavior potentially hurts
>> both energy efficiency and performance, which is arguably the priority for
>> nohz_full CPUs.
> 
> This change is likely to break the use case in which CPU isolation is
> used for power management reasons, to prevent CPUs from running any
> code and so to save energy.
> 
> In that case, going into the deepest state every time on nohz_full
> CPUs is a feature, so it can't be changed unconditionally.
> 
> For this reason, I'm not going to apply this patch and I'm going to
> drop the menu governor one below.
> 
> The only way to allow everyone to do what they want/need I can see
> would be to add a control knob for adjusting the behavior of cpuidle
> governors regarding the handling of nohz_full CPUs.

But then what's the advantage over just using
/sys/devices/system/cpu/cpuX/power/latency
for the nohz_full CPUs (if you don't want the current 'over-eagerly
selecting deepest state on nohz_full')?


* Re: [PATCH v1] cpuidle: governors: teo: Special-case nohz_full CPUs
  2025-08-31 21:30     ` Christian Loehle
@ 2025-09-01 19:08       ` Rafael J. Wysocki
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2025-09-01 19:08 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker, LKML,
	Peter Zijlstra

On Sun, Aug 31, 2025 at 11:30 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/29/25 20:37, Rafael J. Wysocki wrote:
> > On Thu, Aug 28, 2025 at 10:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> This change follows an analogous modification of the menu governor [1].
> >>
> >> Namely, when the governor runs on a nohz_full CPU and there are no user
> >> space timers in the workload on that CPU, it ends up selecting idle
> >> states with target residency values above TICK_NSEC, or the deepest
> >> enabled idle state in the absence of any of those, all the time due to
> >> a tick_nohz_tick_stopped() check designed for running on CPUs where the
> >> tick is not permanently disabled.  In that case, the fact that the tick
> >> has been stopped means that the CPU was expected to be idle sufficiently
> >> long previously, so it is not unreasonable to expect it to be idle
> >> sufficiently long again, but this inference does not apply to nohz_full
> >> CPUs.
> >>
> >> In some cases, latency in the workload grows undesirably as a result of
> >> selecting overly deep idle states, and the workload may also consume
> >> more energy than necessary if the CPU does not spend enough time in the
> >> selected deep idle state.
> >>
> >> Address this by amending the tick_nohz_tick_stopped() check in question
> >> with a tick_nohz_full_cpu() one to avoid effectively ignoring all
> >> shallow idle states on nohz_full CPUs.  While doing so introduces a risk
> >> of getting stuck in a shallow idle state for a long time, that only
> >> affects energy efficiency, whereas the current behavior potentially hurts
> >> both energy efficiency and performance, which is arguably the priority for
> >> nohz_full CPUs.
> >
> > This change is likely to break the use case in which CPU isolation is
> > used for power management reasons, to prevent CPUs from running any
> > code and so to save energy.
> >
> > In that case, going into the deepest state every time on nohz_full
> > CPUs is a feature, so it can't be changed unconditionally.
> >
> > For this reason, I'm not going to apply this patch and I'm going to
> > drop the menu governor one below.
> >
> > The only way to allow everyone to do what they want/need I can see
> > would be to add a control knob for adjusting the behavior of cpuidle
> > governors regarding the handling of nohz_full CPUs.
>
> But then what's the advantage over just using
> /sys/devices/system/cpu/cpuX/power/latency
> for the nohz_full CPUs (if you don't want the current 'over-eagerly
> selecting deepest state on nohz_full')?

The difference is that PM QoS prevents the CPU from entering
high-latency idle states at all (depending on the limit value),
whereas the governor will sometimes ask for a deep idle state
(depending on the actual wakeup pattern).  If wakeup patterns in the
workload change over time, it may be quite cumbersome to have to
update PM QoS every time to follow those changes (and the operator may
not even be aware of them).


Thread overview: 15+ messages
2025-08-13 10:21 [PATCH v1 0/3] cpuidle: governors: menu: A fix, a corner case adjustment and a cleanup Rafael J. Wysocki
2025-08-13 10:25 ` [PATCH v1 1/3] cpuidle: governors: menu: Avoid selecting states with too much latency Rafael J. Wysocki
2025-08-13 19:13   ` Christian Loehle
2025-08-18 17:08     ` Rafael J. Wysocki
2025-08-13 10:26 ` [PATCH v1 2/3] cpuidle: governors: menu: Rearrange main loop in menu_select() Rafael J. Wysocki
2025-08-14 13:00   ` Christian Loehle
2025-08-13 10:29 ` [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs Rafael J. Wysocki
2025-08-14 14:09   ` Christian Loehle
2025-08-18 17:41     ` Rafael J. Wysocki
2025-08-19  9:10       ` Christian Loehle
2025-08-19 11:56         ` Rafael J. Wysocki
2025-08-28 20:16 ` [PATCH v1] cpuidle: governors: teo: " Rafael J. Wysocki
2025-08-29 19:37   ` Rafael J. Wysocki
2025-08-31 21:30     ` Christian Loehle
2025-09-01 19:08       ` Rafael J. Wysocki
