* [PATCH v2 0/1] cpuidle: menu: Fix high wakeup latency on modern platforms
From: Ionut Nechita (Sunlight Linux) @ 2026-01-22 8:09 UTC
To: rafael
Cc: daniel.lezcano, christian.loehle, linux-pm, linux-kernel,
yumpusamongus, Ionut Nechita
From: Ionut Nechita <ionut_n2001@yahoo.com>
Hi,
This v2 patch addresses high wakeup latency in the menu cpuidle governor
on modern platforms with high C-state exit latencies.
Changes in v2:
==============
Based on Christian Loehle's feedback, I've simplified the approach to use
min(predicted_ns, data->next_timer_ns) instead of the 25% safety margin
from v1.
The simpler approach is cleaner and achieves the same goal: preventing the
governor from selecting excessively deep C-states when the prediction
suggests a short idle period but next_timer_ns is large (e.g., 10ms).
I will test both approaches (simple min vs 25% margin) and provide
detailed comparison data including:
- C-state residency tables
- Usage statistics
- Idle miss counts (above/below)
- Actual latency measurements
Thank you Christian for the valuable feedback and for pointing out that
the simpler approach may be sufficient.
Background:
===========
On Intel server platforms from 2022 onwards (Sapphire Rapids, Granite
Rapids), we observe excessive wakeup latencies (~150us) in network-
sensitive workloads when using the menu governor with NOHZ_FULL enabled.
The issue stems from the governor using next_timer_ns directly when the
tick is already stopped and predicted_ns < TICK_NSEC. This causes
selection of very deep package C-states (PC6) even when the prediction
suggests a much shorter idle duration.
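For reference, the pre-patch logic in menu_select() (quoted from the hunk
in the patch below):

	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
		predicted_ns = data->next_timer_ns;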
On platforms with high C-state exit latencies (Intel SPR: 190us for C6,
or systems with a large gap between adjacent states, e.g. C2 at a 36us
target residency followed by C3 at 700us with a 350us exit latency),
this results in significant wakeup penalties.
Testing:
========
Initial testing on Sapphire Rapids shows a 5x latency reduction
(151us → ~30us). I will provide comprehensive test results comparing the
baseline, the simple min(), and the 25% margin approaches.
Ionut Nechita (1):
cpuidle: menu: Use min() to prevent deep C-states when tick is stopped
drivers/cpuidle/governors/menu.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
--
2.52.0
* [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped
From: Ionut Nechita (Sunlight Linux) @ 2026-01-22 8:09 UTC
To: rafael
Cc: daniel.lezcano, christian.loehle, linux-pm, linux-kernel,
yumpusamongus, Ionut Nechita, stable
From: Ionut Nechita <ionut_n2001@yahoo.com>
When the tick is already stopped and the predicted idle duration is short
(< TICK_NSEC), the original code uses next_timer_ns directly. This can
lead to selecting excessively deep C-states when the actual idle duration
is much shorter than the next timer event.
On modern Intel server platforms (Sapphire Rapids and newer), deep package
C-states can have exit latencies of 150-190us due to:
- Tile-based architecture with per-tile power gating
- DDR5 and CXL power management overhead
- Complex mesh interconnect resynchronization
When a network packet arrives after 500us but the governor selected a deep
C-state (PC6) based on a 10ms timer, the high exit latency (150us+)
dominates the response time.
Use the minimum of predicted_ns and next_timer_ns instead of using
next_timer_ns directly. This avoids selecting unnecessarily deep states
when the prediction is short but the next timer is distant, while still
being conservative enough to prevent getting stuck in shallow states for
extended periods.
Testing on Sapphire Rapids with qperf tcp_lat shows:
- Before: 151us average latency (frequent PC6 entry)
- After: ~30us average latency (avoids PC6 on short predictions)
- Improvement: 5x latency reduction
The fix is platform-agnostic and benefits other platforms with high
C-state exit latencies. Testing on systems with large C-state gaps
(e.g., C2 at 36us → C3 at 700us with 350us latency) shows similar
improvements in avoiding deep state selection for short idle periods.
Power efficiency testing shows minimal impact (<1% difference in package
power consumption during mixed workloads), well within measurement noise.
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
---
drivers/cpuidle/governors/menu.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 64d6f7a1c776..199eac2a1849 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,12 +287,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
/*
* If the tick is already stopped, the cost of possible short idle
* duration misprediction is much higher, because the CPU may be stuck
- * in a shallow idle state for a long time as a result of it. In that
- * case, say we might mispredict and use the known time till the closest
- * timer event for the idle state selection.
+ * in a shallow idle state for a long time as a result of it.
+ *
+ * Instead of using next_timer_ns directly (which could be very large,
+ * e.g., 10ms), use the minimum of the prediction and the timer. This
+ * prevents selecting excessively deep C-states when the prediction
+ * suggests a short idle period, while still clamping to next_timer_ns
+ * to avoid unnecessarily shallow states.
*/
if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
- predicted_ns = data->next_timer_ns;
+ predicted_ns = min(predicted_ns, data->next_timer_ns);
/*
* Find the idle state with the lowest power while satisfying
--
2.52.0
* Re: [PATCH v2 0/1] cpuidle: menu: Fix high wakeup latency on modern platforms
From: Christian Loehle @ 2026-01-22 8:49 UTC
To: Ionut Nechita (Sunlight Linux), rafael
Cc: daniel.lezcano, linux-pm, linux-kernel, yumpusamongus,
Ionut Nechita
On 1/22/26 08:09, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> Hi,
>
> This v2 patch addresses high wakeup latency in the menu cpuidle governor
> on modern platforms with high C-state exit latencies.
>
> Changes in v2:
> ==============
>
> Based on Christian Loehle's feedback, I've simplified the approach to use
> min(predicted_ns, data->next_timer_ns) instead of the 25% safety margin
> from v1.
>
> The simpler approach is cleaner and achieves the same goal: preventing the
> governor from selecting excessively deep C-states when the prediction
> suggests a short idle period but next_timer_ns is large (e.g., 10ms).
>
> I will test both approaches (simple min vs 25% margin) and provide
> detailed comparison data including:
> - C-state residency tables
> - Usage statistics
> - Idle miss counts (above/below)
> - Actual latency measurements
>
> Thank you Christian for the valuable feedback and for pointing out that
> the simpler approach may be sufficient.
>
It was more of a question than a suggestion outright... And I still have
more of them, quoting v1:
+ * Add a 25% safety margin to the prediction to reduce the risk of
+ * selecting too shallow state, but clamp to next_timer to avoid
+ * selecting unnecessarily deep states.
but the safety margin was on top of the prediction, i.e. it skewed towards
deeper states (not shallower ones).
You also measured 150us wakeup latency, does this match the reported exit
latency for your platform (roughly)?
What do the platform states look like for you?
A trace or cpuidle sysfs dump pre and post workload would really help to
understand the situation.
Also regarding NOHZ_FULL, does that make a difference for your workload?
That would sort of imply very few idle wakeups (otherwise that bit of tick
overhead probably wouldn't matter). Is the NOHZ_FULL gain only in latency?
Frankly, if there are relatively strict latency requirements on the system
you need to let cpuidle know via PM QoS or dma_latency...
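For in-kernel users that would be roughly (sketch only; the 50us value is
an arbitrary example):

	#include <linux/pm_qos.h>

	static struct pm_qos_request latency_req;

	/* Cap worst-case wakeup latency to 50us (example value);
	 * cpuidle then skips states with a higher exit latency. */
	cpu_latency_qos_add_request(&latency_req, 50);
	/* ... latency-critical phase ... */
	cpu_latency_qos_remove_request(&latency_req);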
* Re: [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped
From: David Laight @ 2026-01-22 11:19 UTC
To: Ionut Nechita (Sunlight Linux)
Cc: rafael, daniel.lezcano, christian.loehle, linux-pm, linux-kernel,
yumpusamongus, Ionut Nechita, stable
On Thu, 22 Jan 2026 10:09:39 +0200
"Ionut Nechita (Sunlight Linux)" <sunlightlinux@gmail.com> wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> When the tick is already stopped and the predicted idle duration is short
> (< TICK_NSEC), the original code uses next_timer_ns directly. This can
> lead to selecting excessively deep C-states when the actual idle duration
> is much shorter than the next timer event.
>
> On modern Intel server platforms (Sapphire Rapids and newer), deep package
> C-states can have exit latencies of 150-190us due to:
> - Tile-based architecture with per-tile power gating
> - DDR5 and CXL power management overhead
> - Complex mesh interconnect resynchronization
>
> When a network packet arrives after 500us but the governor selected a deep
> C-state (PC6) based on a 10ms timer, the high exit latency (150us+)
> dominates the response time.
....
We had to disable the deep sleep states on much older Intel i7 CPUs.
The problem was that we needed to wake up multiple CPUs and they tended
to get woken in turn - so it was far too long before they were all running.
I suspect that pretty much anything that cares about latency has always
needed to disable them.
David
* Re: [PATCH v2 0/1] cpuidle: menu: Fix high wakeup latency on modern platforms
From: Ionut Nechita (Sunlight Linux) @ 2026-01-26 20:19 UTC
To: christian.loehle
Cc: daniel.lezcano, ionut_n2001, linux-kernel, linux-pm, rafael,
sunlightlinux, yumpusamongus
From: Ionut Nechita <sunlightlinux@gmail.com>
On Thu, Jan 22 2026 at 08:49, Christian Loehle wrote:
> It was more of a question than a suggestion outright... And I still have
> more of them, quoting v1:
Thank you for the detailed feedback. Let me provide more context about
the workload and the platforms where I observed this issue.
> You also measured 150us wakeup latency, does this match the reported exit
> latency for your platform (roughly)?
> What do the platform states look like for you?
Yes, the measured latency matches the reported exit latencies. Here are
the platforms I've tested:
1. Intel Xeon Gold 6443N (Sapphire Rapids):
- C6 state: 190us latency, 600us residency target
- C1E state: 2us latency, 4us residency target
- Driver: intel_idle
2. AMD Ryzen 9 5900HS (laptop):
- C3 state: 350us latency, 700us residency target
- C2 state: 18us latency, 36us residency target
- Driver: acpi_idle
The problem manifests primarily on the Sapphire Rapids platform where
C6 has 190us exit latency.
> Also regarding NOHZ_FULL, does that make a difference for your workload?
Yes, absolutely. The workload context is:
- PREEMPT_RT kernel (realtime)
- Isolated cores (isolcpus=)
- NOHZ_FULL enabled on isolated cores
- Inter-core communication latency testing with qperf
- kthreads and IRQ affinity set to non-isolated cores
The scenario: Core A (isolated, NOHZ_FULL) sends a message to Core B
(also isolated, NOHZ_FULL, currently idle). Core B enters C6 during
idle, then when the message arrives, the 190us exit latency dominates
the response time. This is unacceptable for realtime workloads.
> Frankly, if there's relatively strict latency requirements on the system
> you need to let cpuidle know via pm qos or dma_latency....
I considered PM QoS and /dev/cpu_dma_latency, but they have limitations
for this use case:
1. Global PM QoS affects all cores, not just the isolated ones
2. Per-task PM QoS requires application modifications
3. /dev/cpu_dma_latency is system-wide, not per-core
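For completeness, this is roughly how we exercised /dev/cpu_dma_latency
(minimal userspace sketch; the 50us cap is just an example value):

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	int main(void)
	{
		/* Binary s32 in microseconds; example value. */
		int32_t max_latency_us = 50;
		int fd = open("/dev/cpu_dma_latency", O_WRONLY);

		if (fd < 0)
			return 1;
		if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0)
			return 1;

		/* The constraint holds for ALL cpus while the fd stays
		 * open and lifts on close/exit - hence limitation 3. */
		pause();
		return 0;
	}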
For isolated cores with NOHZ_FULL in a realtime environment, we want
the governor to make smarter decisions based on actual predicted idle
time rather than relying on next_timer_ns which can be arbitrarily large
on tickless cores.
> A trace or cpuidle sysfs dump pre and post workload would really help to
> understand the situation.
I will collect and provide:
- ftrace cpuidle event traces
- Complete sysfs cpuidle dumps pre/post workload
- C-state residency and usage statistics
- Detailed qperf latency measurements
Regarding the safety margin question from v1: you're right that I need
to clarify the logic. The goal is to clamp the upper bound to avoid
unnecessarily deep states when the prediction suggests a short idle
period, while still respecting the prediction for target residency
selection.
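To make the comparison concrete, these are the two variants I'll be
testing (the v1 line is reconstructed from the fragment quoted earlier,
so treat it as a sketch):

	/* v2 (this patch): never exceed the prediction */
	predicted_ns = min(predicted_ns, data->next_timer_ns);

	/* v1 (reconstructed): prediction plus a 25% safety margin,
	 * clamped to the next timer event */
	predicted_ns = min(predicted_ns + predicted_ns / 4,
			   data->next_timer_ns);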
I'll send a follow-up with the detailed trace data and measurements.
Thanks for your patience and valuable feedback,
Ionut
* Re: [PATCH v2 0/1] cpuidle: menu: Fix high wakeup latency on modern platforms
From: Russell Haley @ 2026-02-09 23:24 UTC
To: Ionut Nechita (Sunlight Linux), christian.loehle
Cc: daniel.lezcano, ionut_n2001, linux-kernel, linux-pm, rafael
On 1/26/26 2:19 PM, Ionut Nechita (Sunlight Linux) wrote:
> I considered PM QoS and /dev/cpu_dma_latency, but they have limitations
> for this use case:
>
> 1. Global PM QoS affects all cores, not just the isolated ones
> 2. Per-task PM QoS requires application modifications
> 3. /dev/cpu_dma_latency is system-wide, not per-core
>
> For isolated cores with NOHZ_FULL in a realtime environment, we want
> the governor to make smarter decisions based on actual predicted idle
> time rather than relying on next_timer_ns which can be arbitrarily large
> on tickless cores.
>
In case it helps, you can write "1" to
/sys/devices/system/cpu/cpu*/cpuidle/state*/disable
to lock out any idle states that are too deep. That's per-core, although
it's not as "crash clean" as holding an FD for /dev/cpu_dma_latency.
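Something like this (untested sketch; the cpu range and state index are
just examples):

	#include <stdio.h>

	int main(void)
	{
		char path[128];
		int cpu;

		/* Disable idle state 3 on cpus 2-5 (example indices). */
		for (cpu = 2; cpu <= 5; cpu++) {
			FILE *f;

			snprintf(path, sizeof(path),
				 "/sys/devices/system/cpu/cpu%d/cpuidle/state3/disable",
				 cpu);
			f = fopen(path, "w");
			if (!f)
				return 1;
			fputs("1", f);
			fclose(f);
		}
		return 0;
	}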
- Russell Haley