[PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms

From: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@gmail.com>
To: rafael@kernel.org
Cc: ionut_n2001@yahoo.com, daniel.lezcano@linaro.org,
	christian.loehle@arm.com, linux-pm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms
Date: Tue, 20 Jan 2026 23:17:24 +0200
Message-ID: <20260120211725.124349-1-sunlightlinux@gmail.com>

From: Ionut Nechita <ionut_n2001@yahoo.com>

Hi,

This patch addresses a performance regression in the menu cpuidle governor
affecting modern Intel server platforms (Sapphire Rapids, Granite Rapids,
and newer).

== Problem Description ==

On Intel server platforms from 2022 onwards, we observe excessive wakeup
latencies (~150us) in network-sensitive workloads when using the menu
governor with NOHZ_FULL enabled.

Measurement with qperf tcp_lat shows:
- Sapphire Rapids (SPR):    151us latency
- Ice Lake (ICL):             12us latency
- Skylake (SKL):              21us latency

The roughly 12x higher latency on SPR compared to Ice Lake is unacceptable
for latency-sensitive applications (HPC, real-time, financial trading, etc.).

== Root Cause ==

The issue stems from this check in drivers/cpuidle/governors/menu.c
(lines 294-295):

    if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
        predicted_ns = data->next_timer_ns;

When the tick is already stopped and the predicted idle duration is short
(below TICK_NSEC, <2ms), the governor discards the prediction and uses
next_timer_ns directly (often 10ms or more). This leads it to select very
deep package C-states (PC6).
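
As a rough illustration of the check quoted above, here is a user-space
model of it. This is a sketch, not the kernel source: the function name
and the tick_stopped and tick_nsec parameters are hypothetical stand-ins
for tick_nohz_tick_stopped() and TICK_NSEC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * User-space model of the existing check (illustrative only).
 * tick_stopped stands in for tick_nohz_tick_stopped(), tick_nsec for
 * TICK_NSEC; both are assumptions made for this sketch.
 */
uint64_t override_with_timer(uint64_t predicted_ns, uint64_t next_timer_ns,
                             bool tick_stopped, uint64_t tick_nsec)
{
    assert(tick_nsec > 0);

    /* Short prediction with the tick stopped: the prediction is dropped. */
    if (tick_stopped && predicted_ns < tick_nsec)
        return next_timer_ns;

    /* Otherwise the prediction is kept as-is. */
    return predicted_ns;
}
```

With a 500us prediction, a 10ms timer, and the tick stopped, this model
returns the full 10ms, which is the value the governor then uses for
state selection.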

Modern server platforms have significantly longer C-state exit latencies
due to architectural changes:
- Tile-based architecture with per-tile power gating
- DDR5 power management overhead
- CXL link restoration
- Complex mesh interconnect resynchronization

When a network packet arrives after 500us but the governor selected PC6
based on a 10ms timer, the 150us exit latency dominates the response time.

On older platforms (Ice Lake, Skylake), where the measured latency stays
at 12-21us, this issue was less noticeable, but SPR's tile architecture
makes it critical.

== Solution ==

Instead of using next_timer_ns directly (100% timer-based), add a 25%
safety margin to the prediction and clamp to next_timer_ns:

    predicted_ns = min(predicted_ns + (predicted_ns >> 2), data->next_timer_ns);

This provides:
- Conservative prediction (avoids too-shallow states)
- Protection against excessively deep states (clamped to timer)
- Platform-agnostic solution (no platform-specific thresholds)
- Minimal overhead (one shift, one add, one min)

The 25% margin (predicted_ns >> 2 is predicted_ns / 4) was chosen as a
balance between:
- Too small (10%): Insufficient protection on high-latency platforms
- Too large (50%): Overly conservative, may hurt power efficiency
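
The proposed calculation can be modelled in user space as below. This is
a minimal sketch under the assumption that both values are unsigned
64-bit nanosecond counts; the function name is hypothetical and the
overflow assertion is an addition for the sketch, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of: min(predicted_ns + (predicted_ns >> 2), next_timer_ns)
 * (illustrative only; add_margin_and_clamp is a hypothetical name).
 */
uint64_t add_margin_and_clamp(uint64_t predicted_ns, uint64_t next_timer_ns)
{
    uint64_t margin = predicted_ns >> 2;    /* 25% of the prediction */
    uint64_t padded;

    /* Guard against wraparound in this sketch. */
    assert(predicted_ns <= UINT64_MAX - margin);
    padded = predicted_ns + margin;

    /* Never predict past the next timer event. */
    return padded < next_timer_ns ? padded : next_timer_ns;
}
```

For a 500us prediction and a 10ms timer this yields 625us; for a 1800us
prediction and a 2ms timer the padded value (2250us) is clamped to 2ms.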

== Results ==

Testing on Sapphire Rapids with qperf tcp_lat:
- Before: 151us average latency
- After:   ~30us average latency
- Improvement: 5x latency reduction

Testing on Ice Lake and Skylake shows minimal impact:
- Ice Lake: 12us → 12us (no regression)
- Skylake: 21us → 21us (no regression)

Power efficiency testing shows <1% difference in package power consumption
during mixed workloads, well within measurement noise.

== Examples ==

Short prediction (500us), timer at 10ms:
- Before: predicted_ns = 10ms → selects PC6 → 151us wakeup
- After:  predicted_ns = min(625us, 10ms) = 625us → selects C1E → 15us wakeup

Long prediction (1800us), timer at 2ms:
- Before: predicted_ns = 2ms → selects C6
- After:  predicted_ns = min(2250us, 2ms) = 2ms → selects C6 (same state)

The algorithm naturally adapts to workload characteristics without
platform-specific tuning.
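
To show why the predicted value matters for state selection, here is a
simplified sketch of a residency-based pick: the deepest state whose
target residency fits the predicted idle time and whose exit latency fits
a latency limit. The state table, its numbers, and all names are
hypothetical and are not taken from any real idle driver; the actual
governor logic has more cases than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct idle_state {
    const char *name;
    uint64_t exit_latency_ns;
    uint64_t target_residency_ns;
};

/* Hypothetical table, ordered shallow to deep; values are illustrative. */
const struct idle_state demo_states[] = {
    { "shallow",   2000,    2000 },
    { "medium",   20000,  100000 },
    { "deep",    150000, 1000000 },
};

/* Return the index of the deepest state that fits both constraints. */
size_t pick_state(const struct idle_state *states, size_t n,
                  uint64_t predicted_ns, uint64_t latency_limit_ns)
{
    size_t best = 0;    /* state 0 is always allowed in this sketch */

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (states[i].target_residency_ns > predicted_ns)
            break;      /* would not stay idle long enough */
        if (states[i].exit_latency_ns > latency_limit_ns)
            break;      /* wakeup would be too slow */
        best = i;
    }
    return best;
}
```

With this table, a 10ms prediction picks the "deep" state, while a 625us
prediction stops at "medium" because the deep state's target residency
no longer fits.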

Ionut Nechita (1):
  cpuidle: menu: Add 25% safety margin to short predictions when tick is
    stopped

 drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

--
2.52.0
