From: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@gmail.com>
To: rafael@kernel.org
Cc: daniel.lezcano@linaro.org, christian.loehle@arm.com,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	yumpusamongus@gmail.com, Ionut Nechita <ionut_n2001@yahoo.com>,
	stable@vger.kernel.org
Subject: [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped
Date: Thu, 22 Jan 2026 10:09:39 +0200
Message-ID: <20260122080937.22347-4-sunlightlinux@gmail.com>
In-Reply-To: <20260122080937.22347-2-sunlightlinux@gmail.com>

From: Ionut Nechita <ionut_n2001@yahoo.com>

When the tick is already stopped and the predicted idle duration is short
(< TICK_NSEC), the current code discards the prediction and uses
next_timer_ns instead. This can lead to selecting excessively deep
C-states when the actual idle duration is much shorter than the time
until the next timer event.

On modern Intel server platforms (Sapphire Rapids and newer), deep package
C-states can have exit latencies of 150-190us due to:
- Tile-based architecture with per-tile power gating
- DDR5 and CXL power management overhead
- Complex mesh interconnect resynchronization

When a network packet arrives after 500us but the governor selected a deep
C-state (PC6) based on a 10ms timer, the high exit latency (150us+)
dominates the response time.
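
To make the scenario concrete, here is a minimal sketch of how a
residency-based selection reacts to the two durations above. The state
table, its residency values and the pick_state() helper are hypothetical
and chosen only to fall in the latency range quoted above; this is not
the menu governor's actual selection loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle-state description; all values are illustrative. */
struct idle_state {
	const char *name;
	uint64_t exit_latency_ns;
	uint64_t target_residency_ns;
};

/* Ordered from shallowest to deepest, as cpuidle drivers list them. */
static const struct idle_state states[] = {
	{ "C1",    1000ULL,   1000ULL },
	{ "C1E",   4000ULL,   4000ULL },
	{ "C6",  170000ULL, 600000ULL },
};

/*
 * Return the index of the deepest state whose target residency still
 * fits into the expected idle duration.
 */
static size_t pick_state(uint64_t predicted_ns)
{
	size_t i, idx = 0;

	for (i = 1; i < sizeof(states) / sizeof(states[0]); i++) {
		if (states[i].target_residency_ns > predicted_ns)
			break;
		idx = i;
	}
	return idx;
}
```

With this table, pick_state(10000000) (the 10ms timer) selects the
deepest entry with its 170us exit latency, while pick_state(500000)
(the 500us wakeup) stays in the shallower C1E entry.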

Use the minimum of predicted_ns and next_timer_ns instead of using
next_timer_ns directly. This avoids selecting unnecessarily deep states
when the prediction is short but the next timer is distant, while never
assuming an idle period longer than the time until the next timer event.
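
The difference between the two behaviours can be sketched in isolation
as follows. This is a standalone illustration, not kernel code:
old_estimate(), new_estimate(), min_u64() and TICK_NSEC_EXAMPLE are
hypothetical names, and the tick length assumes a 250 Hz tick (the real
TICK_NSEC depends on CONFIG_HZ).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tick period for a 250 Hz kernel; illustrative only. */
#define TICK_NSEC_EXAMPLE 4000000ULL

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Behaviour before the patch: a short prediction is replaced by the timer. */
static uint64_t old_estimate(bool tick_stopped, uint64_t predicted_ns,
			     uint64_t next_timer_ns)
{
	if (tick_stopped && predicted_ns < TICK_NSEC_EXAMPLE)
		return next_timer_ns;
	return predicted_ns;
}

/* Behaviour after the patch: keep the smaller of prediction and timer. */
static uint64_t new_estimate(bool tick_stopped, uint64_t predicted_ns,
			     uint64_t next_timer_ns)
{
	if (tick_stopped && predicted_ns < TICK_NSEC_EXAMPLE)
		return min_u64(predicted_ns, next_timer_ns);
	return predicted_ns;
}
```

For a 500us prediction and a 10ms timer with the tick stopped, the old
helper returns the 10ms timer value and the new one returns the 500us
prediction; if the timer is closer than the prediction, the new helper
returns the timer value.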

Testing on Sapphire Rapids with qperf tcp_lat shows:
- Before: 151us average latency (frequent PC6 entry)
- After: ~30us average latency (avoids PC6 on short predictions)
- Improvement: 5x latency reduction

The fix is platform-agnostic and benefits other platforms with high
C-state exit latencies. Testing on systems with large C-state gaps
(e.g., C2 at 36us → C3 at 700us with 350us latency) shows similar
improvements in avoiding deep state selection for short idle periods.

Power efficiency testing shows minimal impact (<1% difference in package
power consumption during mixed workloads), well within measurement noise.

Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
---
 drivers/cpuidle/governors/menu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 64d6f7a1c776..199eac2a1849 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,12 +287,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	/*
 	 * If the tick is already stopped, the cost of possible short idle
 	 * duration misprediction is much higher, because the CPU may be stuck
-	 * in a shallow idle state for a long time as a result of it.  In that
-	 * case, say we might mispredict and use the known time till the closest
-	 * timer event for the idle state selection.
+	 * in a shallow idle state for a long time as a result of it.
+	 *
+	 * Instead of using next_timer_ns directly (which could be very large,
+	 * e.g., 10ms), use the minimum of the prediction and the timer. This
+	 * prevents selecting excessively deep C-states when the prediction
+	 * suggests a short idle period, while never assuming an idle period
+	 * longer than the time till the closest timer event.
 	 */
 	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
-		predicted_ns = data->next_timer_ns;
+		predicted_ns = min(predicted_ns, data->next_timer_ns);
 
 	/*
 	 * Find the idle state with the lowest power while satisfying
-- 
2.52.0
