public inbox for stable@vger.kernel.org
* [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped
       [not found] <20260122080937.22347-2-sunlightlinux@gmail.com>
@ 2026-01-22  8:09 ` Ionut Nechita (Sunlight Linux)
  2026-01-22 11:19   ` David Laight
  0 siblings, 1 reply; 2+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-22  8:09 UTC (permalink / raw)
  To: rafael
  Cc: daniel.lezcano, christian.loehle, linux-pm, linux-kernel,
	yumpusamongus, Ionut Nechita, stable

From: Ionut Nechita <ionut_n2001@yahoo.com>

When the tick is already stopped and the predicted idle duration is short
(< TICK_NSEC), the current code discards the prediction and uses
next_timer_ns directly. This can lead to selecting excessively deep
C-states when the actual idle duration is much shorter than the next timer
event.

On modern Intel server platforms (Sapphire Rapids and newer), deep package
C-states can have exit latencies of 150-190us due to:
- Tile-based architecture with per-tile power gating
- DDR5 and CXL power management overhead
- Complex mesh interconnect resynchronization

When a network packet arrives after 500us but the governor selected a deep
C-state (PC6) based on a 10ms timer, the high exit latency (150us+)
dominates the response time.

Use the minimum of predicted_ns and next_timer_ns instead of using
next_timer_ns directly. This avoids selecting unnecessarily deep states
when the prediction is short but the next timer is distant, while still
being conservative enough to prevent getting stuck in shallow states for
extended periods.

Testing on Sapphire Rapids with qperf tcp_lat shows:
- Before: 151us average latency (frequent PC6 entry)
- After: ~30us average latency (avoids PC6 on short predictions)
- Improvement: 5x latency reduction

The fix is platform-agnostic and benefits other platforms with high
C-state exit latencies. Testing on systems with large C-state gaps
(e.g., C2 at 36us → C3 at 700us with 350us latency) shows similar
improvements in avoiding deep state selection for short idle periods.

Power efficiency testing shows minimal impact (<1% difference in package
power consumption during mixed workloads), well within measurement noise.

Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
---
 drivers/cpuidle/governors/menu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 64d6f7a1c776..199eac2a1849 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,12 +287,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	/*
 	 * If the tick is already stopped, the cost of possible short idle
 	 * duration misprediction is much higher, because the CPU may be stuck
-	 * in a shallow idle state for a long time as a result of it.  In that
-	 * case, say we might mispredict and use the known time till the closest
-	 * timer event for the idle state selection.
+	 * in a shallow idle state for a long time as a result of it.
+	 *
+	 * Instead of using next_timer_ns directly (which could be very large,
+	 * e.g., 10ms), use the minimum of the prediction and the timer. This
+	 * prevents selecting excessively deep C-states when the prediction
+	 * suggests a short idle period, while still clamping to next_timer_ns
+	 * to avoid unnecessarily shallow states.
 	 */
 	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
-		predicted_ns = data->next_timer_ns;
+		predicted_ns = min(predicted_ns, data->next_timer_ns);
 
 	/*
 	 * Find the idle state with the lowest power while satisfying
-- 
2.52.0



* Re: [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped
  2026-01-22  8:09 ` [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped Ionut Nechita (Sunlight Linux)
@ 2026-01-22 11:19   ` David Laight
  0 siblings, 0 replies; 2+ messages in thread
From: David Laight @ 2026-01-22 11:19 UTC (permalink / raw)
  To: Ionut Nechita (Sunlight Linux)
  Cc: rafael, daniel.lezcano, christian.loehle, linux-pm, linux-kernel,
	yumpusamongus, Ionut Nechita, stable

On Thu, 22 Jan 2026 10:09:39 +0200
"Ionut Nechita (Sunlight Linux)" <sunlightlinux@gmail.com> wrote:

> From: Ionut Nechita <ionut_n2001@yahoo.com>
> 
> When the tick is already stopped and the predicted idle duration is short
> (< TICK_NSEC), the original code uses next_timer_ns directly. This can
> lead to selecting excessively deep C-states when the actual idle duration
> is much shorter than the next timer event.
> 
> On modern Intel server platforms (Sapphire Rapids and newer), deep package
> C-states can have exit latencies of 150-190us due to:
> - Tile-based architecture with per-tile power gating
> - DDR5 and CXL power management overhead
> - Complex mesh interconnect resynchronization
> 
> When a network packet arrives after 500us but the governor selected a deep
> C-state (PC6) based on a 10ms timer, the high exit latency (150us+)
> dominates the response time.
....

We had to disable the deep sleep states on much older Intel i7 CPUs.
The problem was that we needed to wake up multiple CPUs, and they tended
to get woken in turn - so it was far too long before they were all running.
I suspect that pretty much anything that cares about latency has always
needed to disable them.

	David

