public inbox for linux-pm@vger.kernel.org
From: Russell Haley <yumpusamongus@gmail.com>
To: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@gmail.com>,
	rafael@kernel.org
Cc: ionut_n2001@yahoo.com, daniel.lezcano@linaro.org,
	christian.loehle@arm.com, linux-pm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms
Date: Wed, 21 Jan 2026 16:42:41 -0600
Message-ID: <a716c51c-05ab-429a-9be1-915a401a2197@gmail.com>
In-Reply-To: <20260120211725.124349-1-sunlightlinux@gmail.com>

On 1/20/26 3:17 PM, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
> 
> Hi,
> 
> This patch addresses a performance regression in the menu cpuidle governor
> affecting modern Intel server platforms (Sapphire Rapids, Granite Rapids,
> and newer).
> 
> == Problem Description ==
> 
> On Intel server platforms from 2022 onwards, we observe excessive wakeup
> latencies (~150us) in network-sensitive workloads when using the menu
> governor with NOHZ_FULL enabled.
> 
> Measurement with qperf tcp_lat shows:
> - Sapphire Rapids (SPR):    151us latency
> - Ice Lake (ICL):             12us latency
> - Skylake (SKL):              21us latency
> 
> The 12x latency regression on SPR compared to Ice Lake is unacceptable for
> latency-sensitive applications (HPC, real-time, financial trading, etc.).
> 
> == Root Cause ==
> 
> The issue stems from menu.c:294-295:
> 
>     if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
>         predicted_ns = data->next_timer_ns;
> 
> When the tick is already stopped and the predicted idle duration is shorter
> than one tick period (TICK_NSEC), the governor switches to using
> next_timer_ns directly (often 10ms+). This causes the selection of very deep
> package C-states (PC6).
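
For anyone following along, here is a minimal user-space sketch of the check quoted above. It is not the kernel code: the function name stopped_tick_override() and the tick_ns parameter are made up for illustration (in the kernel the tick length comes from TICK_NSEC, which depends on CONFIG_HZ), and the values suggested afterwards are only the ones from the cover letter.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two quoted lines from menu.c.
 * tick_ns plays the role of TICK_NSEC; it is a parameter here only so
 * that the sketch does not assume a particular CONFIG_HZ.
 */
static uint64_t stopped_tick_override(bool tick_stopped,
                                      uint64_t predicted_ns,
                                      uint64_t next_timer_ns,
                                      uint64_t tick_ns)
{
    /* Tick already stopped and prediction shorter than one tick:
     * the prediction is replaced by the time until the next timer. */
    if (tick_stopped && predicted_ns < tick_ns)
        return next_timer_ns;

    /* Otherwise the prediction is used as-is. */
    return predicted_ns;
}
```

Assuming a 4 ms tick (HZ=250), calling stopped_tick_override(true, 500000, 10000000, 4000000) returns 10000000, i.e. a 500us prediction becomes 10ms, which is the case the cover letter describes.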
> 
> Modern server platforms have significantly longer C-state exit latencies
> due to architectural changes:
> - Tile-based architecture with per-tile power gating
> - DDR5 power management overhead
> - CXL link restoration
> - Complex mesh interconnect resynchronization
> 
> When a network packet arrives after 500us but the governor selected PC6
> based on a 10ms timer, the 150us exit latency dominates the response time.
> 
> On older platforms (Ice Lake, Skylake) with faster C-state transitions
> (12-21us), this issue was less noticeable, but SPR's tile architecture
> makes it critical.
> 
> == Solution ==
> 
> Instead of using next_timer_ns directly (100% timer-based), add a 25%
> safety margin to the prediction and clamp to next_timer_ns:
> 
>     predicted_ns = min(predicted_ns + (predicted_ns >> 2), data->next_timer_ns);
> 
> This provides:
> - Conservative prediction (avoids too-shallow states)
> - Protection against excessively deep states (clamped to timer)
> - Platform-agnostic solution (no hardcoded thresholds)
> - Minimal overhead (one shift, one add, one min)
> 
> The 25% margin (>> 2 = divide by 4) was chosen as a balance between:
> - Too small (10%): Insufficient protection on high-latency platforms
> - Too large (50%): Overly conservative, may hurt power efficiency
> 
> == Results ==
> 
> Testing on Sapphire Rapids with qperf tcp_lat:
> - Before: 151us average latency
> - After:   ~30us average latency
> - Improvement: 5x latency reduction
> 
> Testing on Ice Lake and Skylake shows minimal impact:
> - Ice Lake: 12us → 12us (no regression)
> - Skylake: 21us → 21us (no regression)
> 
> Power efficiency testing shows <1% difference in package power consumption
> during mixed workloads, well within measurement noise.
> 
> == Examples ==
> 
> Short prediction (500us), timer at 10ms:
> - Before: predicted_ns = 10ms → selects PC6 → 151us wakeup
> - After:  predicted_ns = min(625us, 10ms) = 625us → selects C1E → 15us wakeup
> 
> Long prediction (1800us), timer at 2ms:
> - Before: predicted_ns = 2ms → selects C6
> - After:  predicted_ns = min(2250us, 2ms) = 2ms → selects C6 (same state)
> 
> The algorithm naturally adapts to workload characteristics without
> platform-specific tuning.
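
The arithmetic in the two examples above can be checked with a small user-space sketch. This is only an illustration of the formula from the cover letter, not the patch itself; margin_clamped_ns() and min_u64() are hypothetical names, and the kernel's min() macro is replaced by a plain helper.

```c
#include <stdint.h>

/* Plain replacement for the kernel's min() macro (assumption for user space). */
static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Proposed computation from the cover letter: add 25% (predicted >> 2)
 * to the prediction, then clamp the result to the next timer event.
 */
static uint64_t margin_clamped_ns(uint64_t predicted_ns, uint64_t next_timer_ns)
{
    return min_u64(predicted_ns + (predicted_ns >> 2), next_timer_ns);
}
```

With the numbers from the examples, margin_clamped_ns(500000, 10000000) gives 625000 (625us) and margin_clamped_ns(1800000, 2000000) gives 2000000 (clamped to the 2ms timer). Which idle state those values map to depends on the platform's state table, which this sketch does not model.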
> 
> Ionut Nechita (1):
>   cpuidle: menu: Add 25% safety margin to short predictions when tick is
>     stopped
> 
>  drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> --
> 2.52.0

Rafael's patch [1], posted a few hours before yours, looks like it might
address the same problem. Maybe try it and see.

[1] https://lore.kernel.org/all/5959091.DvuYhMxLoT@rafael.j.wysocki/




Thread overview: 5+ messages
2026-01-20 21:17 [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms Ionut Nechita (Sunlight Linux)
2026-01-20 21:17 ` [PATCH 1/1] cpuidle: menu: Add 25% safety margin to short predictions when tick is stopped Ionut Nechita (Sunlight Linux)
2026-01-21 11:55   ` Christian Loehle
2026-01-21 11:49 ` [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms Christian Loehle
2026-01-21 22:42 ` Russell Haley [this message]
