* [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms
@ 2026-01-20 21:17 Ionut Nechita (Sunlight Linux)
2026-01-20 21:17 ` [PATCH 1/1] cpuidle: menu: Add 25% safety margin to short predictions when tick is stopped Ionut Nechita (Sunlight Linux)
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-20 21:17 UTC (permalink / raw)
To: rafael
Cc: ionut_n2001, daniel.lezcano, christian.loehle, linux-pm,
linux-kernel
From: Ionut Nechita <ionut_n2001@yahoo.com>
Hi,
This patch addresses a performance regression in the menu cpuidle governor
affecting modern Intel server platforms (Sapphire Rapids, Granite Rapids,
and newer).
== Problem Description ==
On Intel server platforms from 2022 onwards, we observe excessive wakeup
latencies (~150us) in network-sensitive workloads when using the menu
governor with NOHZ_FULL enabled.
Measurement with qperf tcp_lat shows:
- Sapphire Rapids (SPR): 151us latency
- Ice Lake (ICL): 12us latency
- Skylake (SKL): 21us latency
The 12x latency regression on SPR compared to Ice Lake is unacceptable for
latency-sensitive applications (HPC, real-time, financial trading, etc.).
== Root Cause ==
The issue stems from menu.c:294-295:
if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
predicted_ns = data->next_timer_ns;
When the tick is already stopped and the predicted idle duration is short
(<2ms), the governor switches to using next_timer_ns directly (often
10ms+). This causes the selection of very deep package C-states (PC6).
Modern server platforms have significantly longer C-state exit latencies
due to architectural changes:
- Tile-based architecture with per-tile power gating
- DDR5 power management overhead
- CXL link restoration
- Complex mesh interconnect resynchronization
When a network packet arrives after 500us but the governor selected PC6
based on a 10ms timer, the 150us exit latency dominates the response time.
On older platforms (Ice Lake, Skylake) with faster C-state transitions
(12-21us), this issue was less noticeable, but SPR's tile architecture
makes it critical.
== Solution ==
Instead of using next_timer_ns directly (100% timer-based), add a 25%
safety margin to the prediction and clamp to next_timer_ns:
predicted_ns = min(predicted_ns + (predicted_ns >> 2), data->next_timer_ns);
This provides:
- Conservative prediction (avoids too-shallow states)
- Protection against excessively deep states (clamped to timer)
- Platform-agnostic solution (no hardcoded thresholds)
- Minimal overhead (one shift, one add, one min)
The 25% margin (>> 2 = divide by 4) was chosen as a balance between:
- Too small (10%): Insufficient protection on high-latency platforms
- Too large (50%): Overly conservative, may hurt power efficiency
== Results ==
Testing on Sapphire Rapids with qperf tcp_lat:
- Before: 151us average latency
- After: ~30us average latency
- Improvement: 5x latency reduction
Testing on Ice Lake and Skylake shows minimal impact:
- Ice Lake: 12us → 12us (no regression)
- Skylake: 21us → 21us (no regression)
Power efficiency testing shows <1% difference in package power consumption
during mixed workloads, well within measurement noise.
== Examples ==
Short prediction (500us), timer at 10ms:
- Before: predicted_ns = 10ms → selects PC6 → 151us wakeup
- After: predicted_ns = min(625us, 10ms) = 625us → selects C1E → 15us wakeup
Long prediction (1800us), timer at 2ms:
- Before: predicted_ns = 2ms → selects C6
- After: predicted_ns = min(2250us, 2ms) = 2ms → selects C6 (same state)
The algorithm naturally adapts to workload characteristics without
platform-specific tuning.
Ionut Nechita (1):
cpuidle: menu: Add 25% safety margin to short predictions when tick is
stopped
drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/1] cpuidle: menu: Add 25% safety margin to short predictions when tick is stopped
2026-01-20 21:17 [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms Ionut Nechita (Sunlight Linux)
@ 2026-01-20 21:17 ` Ionut Nechita (Sunlight Linux)
2026-01-21 11:55 ` Christian Loehle
2026-01-21 11:49 ` [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms Christian Loehle
2026-01-21 22:42 ` Russell Haley
2 siblings, 1 reply; 5+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-20 21:17 UTC (permalink / raw)
To: rafael
Cc: ionut_n2001, daniel.lezcano, christian.loehle, linux-pm,
linux-kernel, stable
From: Ionut Nechita <ionut_n2001@yahoo.com>
When the tick is already stopped and the predicted idle duration is short
(< TICK_NSEC), the original code uses next_timer_ns directly. This can be
too conservative on platforms with high C-state exit latencies.
On Intel server platforms (2022+), this causes excessive wakeup latencies
(~150us) when the actual idle duration is much shorter than next_timer_ns,
because the governor selects package C-states (PC6) when shallower states
would be more appropriate.
Add a 25% safety margin to the prediction instead of using next_timer_ns
directly, while still clamping to next_timer_ns to avoid selecting
unnecessarily deep states.
Testing shows this reduces qperf latency from 151us to ~30us on affected
platforms while maintaining good power efficiency. Platforms with fast
C-state transitions (Ice Lake: 12us, Skylake: 21us) see minimal impact.
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
---
drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 64d6f7a1c776..de1dd46fea7a 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,12 +287,20 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
/*
* If the tick is already stopped, the cost of possible short idle
* duration misprediction is much higher, because the CPU may be stuck
- * in a shallow idle state for a long time as a result of it. In that
- * case, say we might mispredict and use the known time till the closest
- * timer event for the idle state selection.
+ * in a shallow idle state for a long time as a result of it.
+ *
+ * Add a 25% safety margin to the prediction to reduce the risk of
+ * selecting too shallow state, but clamp to next_timer to avoid
+ * selecting unnecessarily deep states.
+ *
+ * This helps on platforms with high C-state exit latencies (e.g.,
+ * Intel server platforms 2022+ with ~150us) where using next_timer
+ * directly causes excessive wakeup latency when the actual idle
+ * duration is much shorter.
*/
if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
- predicted_ns = data->next_timer_ns;
+ predicted_ns = min(predicted_ns + (predicted_ns >> 2),
+ data->next_timer_ns);
/*
* Find the idle state with the lowest power while satisfying
--
2.52.0
* Re: [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms
2026-01-20 21:17 [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms Ionut Nechita (Sunlight Linux)
2026-01-20 21:17 ` [PATCH 1/1] cpuidle: menu: Add 25% safety margin to short predictions when tick is stopped Ionut Nechita (Sunlight Linux)
@ 2026-01-21 11:49 ` Christian Loehle
2026-01-21 22:42 ` Russell Haley
2 siblings, 0 replies; 5+ messages in thread
From: Christian Loehle @ 2026-01-21 11:49 UTC (permalink / raw)
To: Ionut Nechita (Sunlight Linux), rafael
Cc: ionut_n2001, daniel.lezcano, linux-pm, linux-kernel
On 1/20/26 21:17, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> Hi,
Hi Ionut,
>
> This patch addresses a performance regression in the menu cpuidle governor
> affecting modern Intel server platforms (Sapphire Rapids, Granite Rapids,
> and newer).
I'll take a look at the patch later, but just to be clear, this isn't a
performance regression right? There's no kernel version that this behaved
better with, is there?
If there is it needs to be stated and maybe a Fixes tag would be applicable.
>
> == Problem Description ==
>
> On Intel server platforms from 2022 onwards, we observe excessive wakeup
> latencies (~150us) in network-sensitive workloads when using the menu
> governor with NOHZ_FULL enabled.
>
> Measurement with qperf tcp_lat shows:
> - Sapphire Rapids (SPR): 151us latency
> - Ice Lake (ICL): 12us latency
> - Skylake (SKL): 21us latency
>
> The 12x latency regression on SPR compared to Ice Lake is unacceptable for
> latency-sensitive applications (HPC, real-time, financial trading, etc.).
So just newer generation having higher latency.
TBF the examples you mentioned should really have their latencies in control
themselves and not rely on menu guesstimating what's needed here.
>
> == Root Cause ==
>
> The issue stems from menu.c:294-295:
>
> if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> predicted_ns = data->next_timer_ns;
>
> When the tick is already stopped and the predicted idle duration is short
> (<2ms), the governor switches to using next_timer_ns directly (often
> 10ms+). This causes the selection of very deep package C-states (PC6).
>
> Modern server platforms have significantly longer C-state exit latencies
> due to architectural changes:
> - Tile-based architecture with per-tile power gating
> - DDR5 power management overhead
> - CXL link restoration
> - Complex mesh interconnect resynchronization
>
> When a network packet arrives after 500us but the governor selected PC6
> based on a 10ms timer, the 150us exit latency dominates the response time.
>
> On older platforms (Ice Lake, Skylake) with faster C-state transitions
> (12-21us), this issue was less noticeable, but SPR's tile architecture
> makes it critical.
> [snip]
Can you provide idle state tables with residencies and usage?
Ideally idle misses for both as well?
Thanks!
* Re: [PATCH 1/1] cpuidle: menu: Add 25% safety margin to short predictions when tick is stopped
2026-01-20 21:17 ` [PATCH 1/1] cpuidle: menu: Add 25% safety margin to short predictions when tick is stopped Ionut Nechita (Sunlight Linux)
@ 2026-01-21 11:55 ` Christian Loehle
0 siblings, 0 replies; 5+ messages in thread
From: Christian Loehle @ 2026-01-21 11:55 UTC (permalink / raw)
To: Ionut Nechita (Sunlight Linux), rafael
Cc: ionut_n2001, daniel.lezcano, linux-pm, linux-kernel, stable
On 1/20/26 21:17, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> When the tick is already stopped and the predicted idle duration is short
> (< TICK_NSEC), the original code uses next_timer_ns directly. This can be
> too conservative on platforms with high C-state exit latencies.
The other side of the argument is of course that the predicted idle duration
is too short, mostly full of values that are no longer applicable.
Then we're potentially stuck in a too shallow state for a very long time.
>
> On Intel server platforms (2022+), this causes excessive wakeup latencies
> (~150us) when the actual idle duration is much shorter than next_timer_ns,
> because the governor selects package C-states (PC6) when shallower states
> would be more appropriate.
>
> Add a 25% safety margin to the prediction instead of using next_timer_ns
> directly, while still clamping to next_timer_ns to avoid selecting
> unnecessarily deep states.
Is this needed?
Why is
min(predicted_ns, data->next_timer_ns);
not enough?
What do the results look like with that?
Again, traces or sysfs dumps pre and post test would be helpful.
>
> Testing shows this reduces qperf latency from 151us to ~30us on affected
> platforms while maintaining good power efficiency. Platforms with fast
> C-state transitions (Ice Lake: 12us, Skylake: 21us) see minimal impact.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
> ---
> drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 64d6f7a1c776..de1dd46fea7a 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -287,12 +287,20 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> /*
> * If the tick is already stopped, the cost of possible short idle
> * duration misprediction is much higher, because the CPU may be stuck
> - * in a shallow idle state for a long time as a result of it. In that
> - * case, say we might mispredict and use the known time till the closest
> - * timer event for the idle state selection.
> + * in a shallow idle state for a long time as a result of it.
> + *
> + * Add a 25% safety margin to the prediction to reduce the risk of
> + * selecting too shallow state, but clamp to next_timer to avoid
> + * selecting unnecessarily deep states.
> + *
> + * This helps on platforms with high C-state exit latencies (e.g.,
> + * Intel server platforms 2022+ with ~150us) where using next_timer
> + * directly causes excessive wakeup latency when the actual idle
> + * duration is much shorter.
> */
> if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> - predicted_ns = data->next_timer_ns;
> + predicted_ns = min(predicted_ns + (predicted_ns >> 2),
> + data->next_timer_ns);
>
> /*
> * Find the idle state with the lowest power while satisfying
* Re: [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms
2026-01-20 21:17 [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms Ionut Nechita (Sunlight Linux)
2026-01-20 21:17 ` [PATCH 1/1] cpuidle: menu: Add 25% safety margin to short predictions when tick is stopped Ionut Nechita (Sunlight Linux)
2026-01-21 11:49 ` [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms Christian Loehle
@ 2026-01-21 22:42 ` Russell Haley
2 siblings, 0 replies; 5+ messages in thread
From: Russell Haley @ 2026-01-21 22:42 UTC (permalink / raw)
To: Ionut Nechita (Sunlight Linux), rafael
Cc: ionut_n2001, daniel.lezcano, christian.loehle, linux-pm,
linux-kernel
On 1/20/26 3:17 PM, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> Hi,
>
> This patch addresses a performance regression in the menu cpuidle governor
> affecting modern Intel server platforms (Sapphire Rapids, Granite Rapids,
> and newer).
>
> == Problem Description ==
>
> On Intel server platforms from 2022 onwards, we observe excessive wakeup
> latencies (~150us) in network-sensitive workloads when using the menu
> governor with NOHZ_FULL enabled.
>
> Measurement with qperf tcp_lat shows:
> - Sapphire Rapids (SPR): 151us latency
> - Ice Lake (ICL): 12us latency
> - Skylake (SKL): 21us latency
>
> The 12x latency regression on SPR compared to Ice Lake is unacceptable for
> latency-sensitive applications (HPC, real-time, financial trading, etc.).
>
> == Root Cause ==
>
> The issue stems from menu.c:294-295:
>
> if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
> predicted_ns = data->next_timer_ns;
>
> When the tick is already stopped and the predicted idle duration is short
> (<2ms), the governor switches to using next_timer_ns directly (often
> 10ms+). This causes the selection of very deep package C-states (PC6).
>
> Modern server platforms have significantly longer C-state exit latencies
> due to architectural changes:
> - Tile-based architecture with per-tile power gating
> - DDR5 power management overhead
> - CXL link restoration
> - Complex mesh interconnect resynchronization
>
> When a network packet arrives after 500us but the governor selected PC6
> based on a 10ms timer, the 150us exit latency dominates the response time.
>
> On older platforms (Ice Lake, Skylake) with faster C-state transitions
> (12-21us), this issue was less noticeable, but SPR's tile architecture
> makes it critical.
>
> == Solution ==
>
> Instead of using next_timer_ns directly (100% timer-based), add a 25%
> safety margin to the prediction and clamp to next_timer_ns:
>
> predicted_ns = min(predicted_ns + (predicted_ns >> 2), data->next_timer_ns);
>
> This provides:
> - Conservative prediction (avoids too-shallow states)
> - Protection against excessively deep states (clamped to timer)
> - Platform-agnostic solution (no hardcoded thresholds)
> - Minimal overhead (one shift, one add, one min)
>
> The 25% margin (>> 2 = divide by 4) was chosen as a balance between:
> - Too small (10%): Insufficient protection on high-latency platforms
> - Too large (50%): Overly conservative, may hurt power efficiency
>
> == Results ==
>
> Testing on Sapphire Rapids with qperf tcp_lat:
> - Before: 151us average latency
> - After: ~30us average latency
> - Improvement: 5x latency reduction
>
> Testing on Ice Lake and Skylake shows minimal impact:
> - Ice Lake: 12us → 12us (no regression)
> - Skylake: 21us → 21us (no regression)
>
> Power efficiency testing shows <1% difference in package power consumption
> during mixed workloads, well within measurement noise.
>
> == Examples ==
>
> Short prediction (500us), timer at 10ms:
> - Before: predicted_ns = 10ms → selects PC6 → 151us wakeup
> - After: predicted_ns = min(625us, 10ms) = 625us → selects C1E → 15us wakeup
>
> Long prediction (1800us), timer at 2ms:
> - Before: predicted_ns = 2ms → selects C6
> - After: predicted_ns = min(2250us, 2ms) = 2ms → selects C6 (same state)
>
> The algorithm naturally adapts to workload characteristics without
> platform-specific tuning.
>
> Ionut Nechita (1):
> cpuidle: menu: Add 25% safety margin to short predictions when tick is
> stopped
>
> drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> --
> 2.52.0
Rafael's patch [1] from a few hours before yours might address the same
problem, it looks like? Maybe try and see.
[1] https://lore.kernel.org/all/5959091.DvuYhMxLoT@rafael.j.wysocki/