* [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals
@ 2025-04-03 19:16 Rafael J. Wysocki
2025-04-03 19:16 ` [PATCH v1 1/2] cpuidle: teo: Move candidate state lookup to separate function Rafael J. Wysocki
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-03 19:16 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy,
Doug Smythies, Aboorva Devarajan
Hi Everyone,
This series is intended to address an issue with overly aggressive selection
of idle state 0 (the polling state) in teo on x86 in some cases when timer
wakeups dominate the CPU wakeup pattern.
In those cases, timer wakeups are not taken into account when they are
within the LATENCY_THRESHOLD_NS range and the idle state selection may
be based entirely on non-timer wakeups which may be rare. This causes
the prediction accuracy to be low and too much energy may be used as
a result.
The first patch is preparatory and it is not expected to make any
functional difference.
The second patch causes teo to take timer wakeups into account if it
is about to skip the tick_nohz_get_sleep_length() invocation, so they
get a chance to influence the idle state selection.
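The condition under which teo skips the sleep-length query can be sketched as a simplified model (the constant value and function name below are illustrative, not the kernel's; the real logic lives in teo_select() in drivers/cpuidle/governors/teo.c):

```python
# A minimal sketch, under simplifying assumptions, of the skip condition
# described above. LATENCY_THRESHOLD_NS here is an illustrative stand-in,
# not the kernel's actual constant.
LATENCY_THRESHOLD_NS = 5_000

def will_skip_sleep_length(short_idles: int, total: int, latency_req: int) -> bool:
    # The sleep-length query is skipped when short idle intervals dominate
    # recent history, or when the latency limit already forces shallow states.
    return 2 * short_idles >= total or latency_req < LATENCY_THRESHOLD_NS
```

When this predicate holds, tick_nohz_get_sleep_length() is not called, which is exactly the path on which the second patch makes timer wakeups count.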
I have been using this series on my systems for several weeks and observed
a significant reduction of the polling state selection rate in multiple
workloads.
Thanks!
^ permalink raw reply [flat|nested] 14+ messages in thread

* [PATCH v1 1/2] cpuidle: teo: Move candidate state lookup to separate function
2025-04-03 19:16 [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
@ 2025-04-03 19:16 ` Rafael J. Wysocki
2025-04-03 19:18 ` [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
` (3 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-03 19:16 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy,
Doug Smythies, Aboorva Devarajan
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the code looking up a new candidate idle state in teo, after
deciding that the initial candidate (the deepest enabled idle state) is
likely too deep, into a separate function in preparation for subsequent
changes.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpuidle/governors/teo.c | 120 +++++++++++++++++++++-------------------
1 file changed, 63 insertions(+), 57 deletions(-)
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -259,6 +259,67 @@
return state_idx;
}
+static int teo_get_candidate(struct cpuidle_driver *drv,
+ struct cpuidle_device *dev,
+ struct teo_cpu *cpu_data,
+ int idx, unsigned int idx_intercepts)
+{
+ int first_suitable_idx = idx;
+ unsigned int intercepts = 0;
+ int i;
+
+ /*
+ * Look for the deepest idle state whose target residency had
+ * not exceeded the idle duration in over a half of the relevant
+ * cases in the past.
+ *
+ * Take the possible duration limitation present if the tick
+ * has been stopped already into account.
+ */
+ for (i = idx - 1; i >= 0; i--) {
+ intercepts += cpu_data->state_bins[i].intercepts;
+ if (2 * intercepts > idx_intercepts) {
+ /*
+ * Use the current state unless it is too
+ * shallow or disabled, in which case take the
+ * first enabled state that is deep enough.
+ */
+ if (teo_state_ok(i, drv) && !dev->states_usage[i].disable) {
+ idx = i;
+ break;
+ }
+
+ idx = first_suitable_idx;
+ break;
+ }
+
+ if (dev->states_usage[i].disable)
+ continue;
+
+ if (teo_state_ok(i, drv)) {
+ /*
+ * The current state is deep enough, but still
+ * there may be a better one.
+ */
+ first_suitable_idx = i;
+ continue;
+ }
+
+ /*
+ * The current state is too shallow, so if no suitable
+ * states other than the initial candidate have been
+ * found, give up (the remaining states to check are
+ * shallower still), but otherwise the first suitable
+ * state other than the initial candidate may turn out
+ * to be preferable.
+ */
+ if (first_suitable_idx == idx)
+ break;
+ }
+
+ return idx;
+}
+
/**
* teo_select - Selects the next idle state to enter.
* @drv: cpuidle driver containing state data.
@@ -355,63 +416,8 @@
* all of the deeper states, a shallower idle state is likely to be a
* better choice.
*/
- if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
- int first_suitable_idx = idx;
-
- /*
- * Look for the deepest idle state whose target residency had
- * not exceeded the idle duration in over a half of the relevant
- * cases in the past.
- *
- * Take the possible duration limitation present if the tick
- * has been stopped already into account.
- */
- intercept_sum = 0;
-
- for (i = idx - 1; i >= 0; i--) {
- struct teo_bin *bin = &cpu_data->state_bins[i];
-
- intercept_sum += bin->intercepts;
-
- if (2 * intercept_sum > idx_intercept_sum) {
- /*
- * Use the current state unless it is too
- * shallow or disabled, in which case take the
- * first enabled state that is deep enough.
- */
- if (teo_state_ok(i, drv) &&
- !dev->states_usage[i].disable) {
- idx = i;
- break;
- }
- idx = first_suitable_idx;
- break;
- }
-
- if (dev->states_usage[i].disable)
- continue;
-
- if (teo_state_ok(i, drv)) {
- /*
- * The current state is deep enough, but still
- * there may be a better one.
- */
- first_suitable_idx = i;
- continue;
- }
-
- /*
- * The current state is too shallow, so if no suitable
- * states other than the initial candidate have been
- * found, give up (the remaining states to check are
- * shallower still), but otherwise the first suitable
- * state other than the initial candidate may turn out
- * to be preferable.
- */
- if (first_suitable_idx == idx)
- break;
- }
- }
+ if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum)
+ idx = teo_get_candidate(drv, dev, cpu_data, idx, idx_intercept_sum);
/*
* If there is a latency constraint, it may be necessary to select an
* [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-03 19:16 [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
2025-04-03 19:16 ` [PATCH v1 1/2] cpuidle: teo: Move candidate state lookup to separate function Rafael J. Wysocki
@ 2025-04-03 19:18 ` Rafael J. Wysocki
2025-04-16 15:00 ` Christian Loehle
2025-04-09 6:52 ` [PATCH v1 0/2] " Artem Bityutskiy
` (2 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-03 19:18 UTC (permalink / raw)
To: Linux PM
Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy,
Doug Smythies, Aboorva Devarajan
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make teo take all recent wakeups (both timer and non-timer) into
account when looking for a new candidate idle state in the cases
when the majority of recent idle intervals are within the
LATENCY_THRESHOLD_NS range or the latency limit is within the
LATENCY_THRESHOLD_NS range.
Since the tick_nohz_get_sleep_length() invocation is likely to be
skipped in those cases, timer wakeups should arguably be taken into
account somehow in case they are significant, while the current code
mostly looks at non-timer wakeups under the assumption that frequent
timer wakeups are unlikely in the given idle duration range, which
may or may not be accurate.
The most natural way to do that is to add the "hits" metric to the
sums used during the new candidate idle state lookup which effectively
means the above.
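The lookup-loop change can be illustrated with a hedged model (bin bookkeeping, disabled states, and latency limits are omitted; the function name mirrors teo_get_candidate() but this is an illustration only, not the kernel code):

```python
# Hedged model of the lookup loop: intercepts-only vs. counting all events.
def get_candidate(intercepts, hits, idx, idx_events, count_all_events):
    events = 0
    for i in range(idx - 1, -1, -1):       # walk toward shallower states
        events += intercepts[i]
        if count_all_events:
            events += hits[i]              # the change made by this patch
        if 2 * events > idx_events:        # majority of the relevant events
            return i
    return idx

# Timer wakeups ("hits") concentrated in the shallow bins can move the
# candidate shallower once they are counted:
shallow = get_candidate([1, 4, 0], [9, 0, 0], 3, 5, False)       # -> 1
shallower = get_candidate([1, 4, 0], [9, 0, 0], 3, 5 + 9, True)  # -> 0
```

In the second call the threshold is the intercepts sum plus the hits sum, matching how the caller passes idx_intercept_sum + idx_hit_sum in the patch.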
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/cpuidle/governors/teo.c | 99 ++++++++++++++++++++++------------------
1 file changed, 55 insertions(+), 44 deletions(-)
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -261,11 +261,12 @@
static int teo_get_candidate(struct cpuidle_driver *drv,
struct cpuidle_device *dev,
- struct teo_cpu *cpu_data,
- int idx, unsigned int idx_intercepts)
+ struct teo_cpu *cpu_data, int constraint_idx,
+ int idx, unsigned int idx_events,
+ bool count_all_events)
{
int first_suitable_idx = idx;
- unsigned int intercepts = 0;
+ unsigned int events = 0;
int i;
/*
@@ -277,8 +278,11 @@
* has been stopped already into account.
*/
for (i = idx - 1; i >= 0; i--) {
- intercepts += cpu_data->state_bins[i].intercepts;
- if (2 * intercepts > idx_intercepts) {
+ events += cpu_data->state_bins[i].intercepts;
+ if (count_all_events)
+ events += cpu_data->state_bins[i].hits;
+
+ if (2 * events > idx_events) {
/*
* Use the current state unless it is too
* shallow or disabled, in which case take the
@@ -316,6 +320,12 @@
if (first_suitable_idx == idx)
break;
}
+ /*
+ * If there is a latency constraint, it may be necessary to select an
+ * idle state shallower than the current candidate one.
+ */
+ if (idx > constraint_idx)
+ return constraint_idx;
return idx;
}
@@ -410,49 +420,50 @@
}
/*
- * If the sum of the intercepts metric for all of the idle states
- * shallower than the current candidate one (idx) is greater than the
- * sum of the intercepts and hits metrics for the candidate state and
- * all of the deeper states, a shallower idle state is likely to be a
- * better choice.
- */
- if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum)
- idx = teo_get_candidate(drv, dev, cpu_data, idx, idx_intercept_sum);
-
- /*
- * If there is a latency constraint, it may be necessary to select an
- * idle state shallower than the current candidate one.
- */
- if (idx > constraint_idx)
- idx = constraint_idx;
-
- /*
- * If either the candidate state is state 0 or its target residency is
- * low enough, there is basically nothing more to do, but if the sleep
- * length is not updated, the subsequent wakeup will be counted as an
- * "intercept" which may be problematic in the cases when timer wakeups
- * are dominant. Namely, it may effectively prevent deeper idle states
- * from being selected at one point even if no imminent timers are
- * scheduled.
- *
- * However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
- * CPU are unlikely (user space has a default 50 us slack value for
- * hrtimers and there are relatively few timers with a lower deadline
- * value in the kernel), and even if they did happen, the potential
- * benefit from using a deep idle state in that case would be
- * questionable anyway for latency reasons. Thus if the measured idle
- * duration falls into that range in the majority of cases, assume
- * non-timer wakeups to be dominant and skip updating the sleep length
- * to reduce latency.
+ * If the measured idle duration has fallen into the
+ * RESIDENCY_THRESHOLD_NS range in the majority of recent cases, it is
+ * likely to fall into that range next time, so it is better to avoid
+ * adding latency to the idle state selection path. Accordingly, aim
+ * for skipping the sleep length update in that case.
*
* Also, if the latency constraint is sufficiently low, it will force
* shallow idle states regardless of the wakeup type, so the sleep
- * length need not be known in that case.
+ * length need not be known in that case either.
*/
- if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
- (2 * cpu_data->short_idles >= cpu_data->total ||
- latency_req < LATENCY_THRESHOLD_NS))
- goto out_tick;
+ if (2 * cpu_data->short_idles >= cpu_data->total ||
+ latency_req < LATENCY_THRESHOLD_NS) {
+ /*
+ * Look for a new candidate idle state and use all events (both
+ * "intercepts" and "hits") because the sleep length update is
+ * likely to be skipped and timer wakeups need to be taken into
+ * account in a different way in case they are significant.
+ */
+ idx = teo_get_candidate(drv, dev, cpu_data, idx, constraint_idx,
+ idx_intercept_sum + idx_hit_sum, true);
+ /*
+ * If the new candidate state is state 0 or its target residency
+ * is low enough, return it right away without stopping the
+ * scheduler tick.
+ */
+ if (!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS)
+ goto out_tick;
+ } else if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
+ /*
+ * Look for a new candidate state because the current one is
+ * likely too deep, but use the "intercepts" metric only because
+ * the sleep length is going to be determined later and for now
+ * it is only necessary to find a state that will be suitable
+ * in case the CPU is "intercepted".
+ */
+ idx = teo_get_candidate(drv, dev, cpu_data, idx, constraint_idx,
+ idx_intercept_sum, false);
+ } else if (idx > constraint_idx) {
+ /*
+ * The current candidate state is too deep for the latency
+ * constraint at hand, so change it to a suitable one.
+ */
+ idx = constraint_idx;
+ }
duration_ns = tick_nohz_get_sleep_length(&delta_tick);
cpu_data->sleep_length_ns = duration_ns;
* Re: [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-03 19:16 [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
2025-04-03 19:16 ` [PATCH v1 1/2] cpuidle: teo: Move candidate state lookup to separate function Rafael J. Wysocki
2025-04-03 19:18 ` [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
@ 2025-04-09 6:52 ` Artem Bityutskiy
2025-04-09 12:06 ` Rafael J. Wysocki
2025-04-09 14:36 ` Doug Smythies
2025-04-14 7:15 ` Aboorva Devarajan
4 siblings, 1 reply; 14+ messages in thread
From: Artem Bityutskiy @ 2025-04-09 6:52 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Daniel Lezcano, Christian Loehle, Doug Smythies,
Aboorva Devarajan
On Thu, 2025-04-03 at 21:16 +0200, Rafael J. Wysocki wrote:
> Hi Everyone,
>
> This series is intended to address an issue with overly aggressive selection
> of idle state 0 (the polling state) in teo on x86 in some cases when timer
> wakeups dominate the CPU wakeup pattern.
Hi Rafael, I ran SPECjbb2015 with and without these 2 patches on Granite Rapids
Xeon (GNR).
Expectation: no measurable difference, because there is almost no POLL in case
of SPECjbb2015 on GNR.
Result: no measurable difference.
Conclusion: these 2 patches do not introduce a regression as measured by
SPECjbb2015 on GNR.
"No regression" is also a useful piece of information, so reporting.
Thanks, Artem.
* Re: [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-09 6:52 ` [PATCH v1 0/2] " Artem Bityutskiy
@ 2025-04-09 12:06 ` Rafael J. Wysocki
0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-09 12:06 UTC (permalink / raw)
To: Artem Bityutskiy
Cc: Rafael J. Wysocki, Linux PM, LKML, Daniel Lezcano,
Christian Loehle, Doug Smythies, Aboorva Devarajan
On Wed, Apr 9, 2025 at 8:52 AM Artem Bityutskiy
<artem.bityutskiy@linux.intel.com> wrote:
>
> On Thu, 2025-04-03 at 21:16 +0200, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This series is intended to address an issue with overly aggressive selection
> > of idle state 0 (the polling state) in teo on x86 in some cases when timer
> > wakeups dominate the CPU wakeup pattern.
>
> Hi Rafael, I ran SPECjbb2015 with and without these 2 patches on Granite Rapids
> Xeon (GNR).
>
> Expectation: no measurable difference, because there is almost no POLL in case
> of SPECjbb2015 on GNR.
>
> Result: no measurable difference.
>
> Conclusion: these 2 patches do not introduce a regression as measured by
> SPECjbb2015 on GNR.
>
> "No regression" is also a useful piece of information, so reporting.
Thank you!
* RE: [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-03 19:16 [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
` (2 preceding siblings ...)
2025-04-09 6:52 ` [PATCH v1 0/2] " Artem Bityutskiy
@ 2025-04-09 14:36 ` Doug Smythies
2025-04-09 14:41 ` Rafael J. Wysocki
2025-04-14 7:15 ` Aboorva Devarajan
4 siblings, 1 reply; 14+ messages in thread
From: Doug Smythies @ 2025-04-09 14:36 UTC (permalink / raw)
To: 'Rafael J. Wysocki'
Cc: 'LKML', 'Daniel Lezcano',
'Christian Loehle', 'Artem Bityutskiy',
'Aboorva Devarajan', 'Linux PM', Doug Smythies
On 2025.04.03 12:16 Rafael J. Wysocki wrote:
> Hi Everyone,
Hi Rafael,
> This series is intended to address an issue with overly aggressive selection
> of idle state 0 (the polling state) in teo on x86 in some cases when timer
> wakeups dominate the CPU wakeup pattern.
>
> In those cases, timer wakeups are not taken into account when they are
> within the LATENCY_THRESHOLD_NS range and the idle state selection may
> be based entirely on non-timer wakeups which may be rare. This causes
> the prediction accuracy to be low and too much energy may be used as
> a result.
>
> The first patch is preparatory and it is not expected to make any
> functional difference.
>
> The second patch causes teo to take timer wakeups into account if it
> is about to skip the tick_nohz_get_sleep_length() invocation, so they
> get a chance to influence the idle state selection.
>
> I have been using this series on my systems for several weeks and observed
> a significant reduction of the polling state selection rate in multiple
> workloads.
I ran many tests on this patch set.
In general, there is nothing significant to report.
There seemed to be a little less power use for the adrestia test and it took a little longer to execute, but the average wakeup latency was the same.
I am still having noise and repeatability issues with my main periodic tests, where CPU load is swept from low to high at several work/sleep frequencies.
But I didn't observe anything significant.
In order to use more shallow idle states with a periodic workflow, I launched 2000 threads with each at 113 Hertz work/sleep frequency and almost no work to do for each work packet.
The patched version used between 1 and 1.5 watts less processor package power, at around 85 watts.
The patched version spent about 3.5% in idle state 0 versus about 5% for the unpatched version.
The patched version spent about 31.8% in idle state 1 versus about 30.2% for the unpatched version.
Test computer:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Distro: Ubuntu 24.04.1, server, no desktop GUI.
CPU frequency scaling driver: intel_pstate
HWP: disabled.
CPU frequency scaling governor: performance
Idle driver: intel_idle
Idle governor: teo
Idle states: 4: name : description:
state0/name:POLL desc:CPUIDLE CORE POLL IDLE
state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
... Doug
* Re: [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-09 14:36 ` Doug Smythies
@ 2025-04-09 14:41 ` Rafael J. Wysocki
0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-09 14:41 UTC (permalink / raw)
To: Doug Smythies
Cc: Rafael J. Wysocki, LKML, Daniel Lezcano, Christian Loehle,
Artem Bityutskiy, Aboorva Devarajan, Linux PM
On Wed, Apr 9, 2025 at 4:36 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2025.04.03 12:16 Rafael J. Wysocki wrote:
>
> > Hi Everyone,
>
> Hi Rafael,
>
> > This series is intended to address an issue with overly aggressive selection
> > of idle state 0 (the polling state) in teo on x86 in some cases when timer
> > wakeups dominate the CPU wakeup pattern.
> >
> > In those cases, timer wakeups are not taken into account when they are
> > within the LATENCY_THRESHOLD_NS range and the idle state selection may
> > be based entirely on non-timer wakeups which may be rare. This causes
> > the prediction accuracy to be low and too much energy may be used as
> > a result.
> >
> > The first patch is preparatory and it is not expected to make any
> > functional difference.
> >
> > The second patch causes teo to take timer wakeups into account if it
> > is about to skip the tick_nohz_get_sleep_length() invocation, so they
> > get a chance to influence the idle state selection.
> >
> > I have been using this series on my systems for several weeks and observed
> > a significant reduction of the polling state selection rate in multiple
> > workloads.
>
> I ran many tests on this patch set.
> In general, there is nothing significant to report.
>
> There seemed to be a little less power use for the adrestia test and it took a little longer to execute, but the average wakeup latency was the same.
>
> I am still having noise and repeatability issues with my main periodic tests, where CPU load is swept from low to high at several work/sleep frequencies.
> But I didn't observe anything significant.
>
> In order to use more shallow idle states with a periodic workflow, I launched 2000 threads with each at 113 Hertz work/sleep frequency and almost no work to do for each work packet.
> The patched version used between 1 and 1.5 watts less processor package power, at around 85 watts.
> The patched version spent about 3.5% in idle state 0 versus about 5% for the unpatched version.
> The patched version spent about 31.8% in idle state 1 versus about 30.2% for the unpatched version.
>
> Test computer:
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> Distro: Ubuntu 24.04.1, server, no desktop GUI.
> CPU frequency scaling driver: intel_pstate
> HWP: disabled.
> CPU frequency scaling governor: performance
> Idle driver: intel_idle
> Idle governor: teo
> Idle states: 4: name : description:
> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
>
> ... Doug
Thank you!
* Re: [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-03 19:16 [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
` (3 preceding siblings ...)
2025-04-09 14:36 ` Doug Smythies
@ 2025-04-14 7:15 ` Aboorva Devarajan
4 siblings, 0 replies; 14+ messages in thread
From: Aboorva Devarajan @ 2025-04-14 7:15 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy,
Doug Smythies
On Thu, 2025-04-03 at 21:16 +0200, Rafael J. Wysocki wrote:
> Hi Everyone,
>
> This series is intended to address an issue with overly aggressive selection
> of idle state 0 (the polling state) in teo on x86 in some cases when timer
> wakeups dominate the CPU wakeup pattern.
>
> In those cases, timer wakeups are not taken into account when they are
> within the LATENCY_THRESHOLD_NS range and the idle state selection may
> be based entirely on non-timer wakeups which may be rare. This causes
> the prediction accuracy to be low and too much energy may be used as
> a result.
>
> The first patch is preparatory and it is not expected to make any
> functional difference.
>
> The second patch causes teo to take timer wakeups into account if it
> is about to skip the tick_nohz_get_sleep_length() invocation, so they
> get a chance to influence the idle state selection.
>
> I have been using this series on my systems for several weeks and observed
> a significant reduction of the polling state selection rate in multiple
> workloads.
>
> Thanks!
>
>
Hi Rafael,
I'm running some tests and going through the patch.
I haven't noticed any deviations so far, will post the results shortly.
Thanks,
Aboorva
* Re: [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-03 19:18 ` [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
@ 2025-04-16 15:00 ` Christian Loehle
2025-04-16 15:28 ` Rafael J. Wysocki
0 siblings, 1 reply; 14+ messages in thread
From: Christian Loehle @ 2025-04-16 15:00 UTC (permalink / raw)
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Daniel Lezcano, Artem Bityutskiy, Doug Smythies,
Aboorva Devarajan
On 4/3/25 20:18, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Make teo take all recent wakeups (both timer and non-timer) into
> account when looking for a new candidate idle state in the cases
> when the majority of recent idle intervals are within the
> LATENCY_THRESHOLD_NS range or the latency limit is within the
> LATENCY_THRESHOLD_NS range.
>
> Since the tick_nohz_get_sleep_length() invocation is likely to be
> skipped in those cases, timer wakeups should arguably be taken into
> account somehow in case they are significant while the current code
> mostly looks at non-timer wakeups under the assumption that frequent
> timer wakeups are unlikely in the given idle duration range which
> may or may not be accurate.
>
> The most natural way to do that is to add the "hits" metric to the
> sums used during the new candidate idle state lookup which effectively
> means the above.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Hi Rafael,
I might be missing something so bear with me.
Quoting the cover-letter too:
"In those cases, timer wakeups are not taken into account when they are
within the LATENCY_THRESHOLD_NS range and the idle state selection may
be based entirely on non-timer wakeups which may be rare. This causes
the prediction accuracy to be low and too much energy may be used as
a result.
The first patch is preparatory and it is not expected to make any
functional difference.
The second patch causes teo to take timer wakeups into account if it
is about to skip the tick_nohz_get_sleep_length() invocation, so they
get a chance to influence the idle state selection."
If the timer wakeups are < LATENCY_THRESHOLD_NS we will not do
cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
but
cpu_data->sleep_length_ns = KTIME_MAX;
therefore
idx_timer = drv->state_count - 1
idx_duration = some state with residency < LATENCY_THRESHOLD_NS
For any reasonable system therefore idx_timer != idx_duration
(i.e. there's an idle state deeper than LATENCY_THRESHOLD_NS).
So hits will never be incremented?
How would adding hits then help this case?
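The accounting step behind this question can be sketched roughly as follows (a simplified model; real teo updates decayed per-bin metrics in teo_update(), so names and structure here are illustrative):

```python
# Rough sketch, under simplifying assumptions, of the wakeup accounting
# being discussed: a wakeup counts as a "hit" only when the state matching
# the sleep length also matches the state matching the measured duration.
def account(bins, idx_timer, idx_duration):
    if idx_timer == idx_duration:
        bins[idx_timer]["hits"] += 1
    else:
        bins[idx_duration]["intercepts"] += 1

# With sleep_length_ns forced to KTIME_MAX, idx_timer is the deepest state,
# so short measured durations keep landing in "intercepts":
bins = [{"hits": 0, "intercepts": 0} for _ in range(4)]
account(bins, idx_timer=3, idx_duration=0)
account(bins, idx_timer=3, idx_duration=0)
# bins[0]["intercepts"] is now 2; no "hits" were recorded for state 0
```

This is the scenario the question points at: on such a system the shallow bins accumulate intercepts rather than hits.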
> ---
> drivers/cpuidle/governors/teo.c | 99 ++++++++++++++++++++++------------------
> 1 file changed, 55 insertions(+), 44 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -261,11 +261,12 @@
>
> static int teo_get_candidate(struct cpuidle_driver *drv,
> struct cpuidle_device *dev,
> - struct teo_cpu *cpu_data,
> - int idx, unsigned int idx_intercepts)
> + struct teo_cpu *cpu_data, int constraint_idx,
> + int idx, unsigned int idx_events,
> + bool count_all_events)
> {
> int first_suitable_idx = idx;
> - unsigned int intercepts = 0;
> + unsigned int events = 0;
> int i;
>
> /*
> @@ -277,8 +278,11 @@
> * has been stopped already into account.
> */
> for (i = idx - 1; i >= 0; i--) {
> - intercepts += cpu_data->state_bins[i].intercepts;
> - if (2 * intercepts > idx_intercepts) {
> + events += cpu_data->state_bins[i].intercepts;
> + if (count_all_events)
> + events += cpu_data->state_bins[i].hits;
> +
> + if (2 * events > idx_events) {
> /*
> * Use the current state unless it is too
> * shallow or disabled, in which case take the
> @@ -316,6 +320,12 @@
> if (first_suitable_idx == idx)
> break;
> }
> + /*
> + * If there is a latency constraint, it may be necessary to select an
> + * idle state shallower than the current candidate one.
> + */
> + if (idx > constraint_idx)
> + return constraint_idx;
>
> return idx;
> }
> @@ -410,49 +420,50 @@
> }
>
> /*
> - * If the sum of the intercepts metric for all of the idle states
> - * shallower than the current candidate one (idx) is greater than the
> - * sum of the intercepts and hits metrics for the candidate state and
> - * all of the deeper states, a shallower idle state is likely to be a
> - * better choice.
> - */
> - if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum)
> - idx = teo_get_candidate(drv, dev, cpu_data, idx, idx_intercept_sum);
> -
> - /*
> - * If there is a latency constraint, it may be necessary to select an
> - * idle state shallower than the current candidate one.
> - */
> - if (idx > constraint_idx)
> - idx = constraint_idx;
> -
> - /*
> - * If either the candidate state is state 0 or its target residency is
> - * low enough, there is basically nothing more to do, but if the sleep
> - * length is not updated, the subsequent wakeup will be counted as an
> - * "intercept" which may be problematic in the cases when timer wakeups
> - * are dominant. Namely, it may effectively prevent deeper idle states
> - * from being selected at one point even if no imminent timers are
> - * scheduled.
> - *
> - * However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
> - * CPU are unlikely (user space has a default 50 us slack value for
> - * hrtimers and there are relatively few timers with a lower deadline
> - * value in the kernel), and even if they did happen, the potential
> - * benefit from using a deep idle state in that case would be
> - * questionable anyway for latency reasons. Thus if the measured idle
> - * duration falls into that range in the majority of cases, assume
> - * non-timer wakeups to be dominant and skip updating the sleep length
> - * to reduce latency.
> + * If the measured idle duration has fallen into the
> + * RESIDENCY_THRESHOLD_NS range in the majority of recent cases, it is
> + * likely to fall into that range next time, so it is better to avoid
> + * adding latency to the idle state selection path. Accordingly, aim
> + * for skipping the sleep length update in that case.
> *
> * Also, if the latency constraint is sufficiently low, it will force
> * shallow idle states regardless of the wakeup type, so the sleep
> - * length need not be known in that case.
> + * length need not be known in that case either.
> */
> - if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
> - (2 * cpu_data->short_idles >= cpu_data->total ||
> - latency_req < LATENCY_THRESHOLD_NS))
> - goto out_tick;
> + if (2 * cpu_data->short_idles >= cpu_data->total ||
> + latency_req < LATENCY_THRESHOLD_NS) {
> + /*
> + * Look for a new candidate idle state and use all events (both
> + * "intercepts" and "hits") because the sleep length update is
> + * likely to be skipped and timer wakeups need to be taken into
> + * account in a different way in case they are significant.
> + */
> + idx = teo_get_candidate(drv, dev, cpu_data, idx, constraint_idx,
> + idx_intercept_sum + idx_hit_sum, true);
> + /*
> + * If the new candidate state is state 0 or its target residency
> + * is low enough, return it right away without stopping the
> + * scheduler tick.
> + */
> + if (!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS)
> + goto out_tick;
> + } else if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
> + /*
> + * Look for a new candidate state because the current one is
> + * likely too deep, but use the "intercepts" metric only because
> + * the sleep length is going to be determined later and for now
> + * it is only necessary to find a state that will be suitable
> + * in case the CPU is "intercepted".
> + */
> + idx = teo_get_candidate(drv, dev, cpu_data, idx, constraint_idx,
> + idx_intercept_sum, false);
> + } else if (idx > constraint_idx) {
> + /*
> + * The current candidate state is too deep for the latency
> + * constraint at hand, so change it to a suitable one.
> + */
> + idx = constraint_idx;
> + }
>
> duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> cpu_data->sleep_length_ns = duration_ns;
>
>
>
* Re: [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-16 15:00 ` Christian Loehle
@ 2025-04-16 15:28 ` Rafael J. Wysocki
2025-04-17 11:58 ` Christian Loehle
0 siblings, 1 reply; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-16 15:28 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Linux PM, LKML, Daniel Lezcano,
Artem Bityutskiy, Doug Smythies, Aboorva Devarajan
On Wed, Apr 16, 2025 at 5:00 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/3/25 20:18, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Make teo take all recent wakeups (both timer and non-timer) into
> > account when looking for a new candidate idle state in the cases
> > when the majority of recent idle intervals are within the
> > LATENCY_THRESHOLD_NS range or the latency limit is within the
> > LATENCY_THRESHOLD_NS range.
> >
> > Since the tick_nohz_get_sleep_length() invocation is likely to be
> > skipped in those cases, timer wakeups should arguably be taken into
> > account somehow in case they are significant while the current code
> > mostly looks at non-timer wakeups under the assumption that frequent
> > timer wakeups are unlikely in the given idle duration range which
> > may or may not be accurate.
> >
> > The most natural way to do that is to add the "hits" metric to the
> > sums used during the new candidate idle state lookup which effectively
> > means the above.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Hi Rafael,
> I might be missing something, so bear with me.
> Quoting the cover-letter too:
> "In those cases, timer wakeups are not taken into account when they are
> within the LATENCY_THRESHOLD_NS range and the idle state selection may
> be based entirely on non-timer wakeups which may be rare. This causes
> the prediction accuracy to be low and too much energy may be used as
> a result.
>
> The first patch is preparatory and it is not expected to make any
> functional difference.
>
> The second patch causes teo to take timer wakeups into account if it
> is about to skip the tick_nohz_get_sleep_length() invocation, so they
> get a chance to influence the idle state selection."
>
> If the timer wakeups are < LATENCY_THRESHOLD_NS we will not do
>
> cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
>
> but
>
> cpu_data->sleep_length_ns = KTIME_MAX;
>
> therefore
> idx_timer = drv->state_count - 1
> idx_duration = some state with residency < LATENCY_THRESHOLD_NS
>
> For any reasonable system therefore idx_timer != idx_duration
> (i.e. there's an idle state deeper than LATENCY_THRESHOLD_NS).
> So hits will never be incremented?
Why never?
First of all, you need to get into the "2 * cpu_data->short_idles >=
cpu_data->total" case somehow and this may be through timer wakeups.
> How would adding hits then help this case?
They may be dominant when this condition triggers for the first time.
* Re: [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-16 15:28 ` Rafael J. Wysocki
@ 2025-04-17 11:58 ` Christian Loehle
2025-04-17 15:21 ` Rafael J. Wysocki
0 siblings, 1 reply; 14+ messages in thread
From: Christian Loehle @ 2025-04-17 11:58 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, Linux PM, LKML, Daniel Lezcano,
Artem Bityutskiy, Doug Smythies, Aboorva Devarajan
On 4/16/25 16:28, Rafael J. Wysocki wrote:
> On Wed, Apr 16, 2025 at 5:00 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>> [...]
>> If the timer wakeups are < LATENCY_THRESHOLD_NS we will not do
>>
>> cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
>>
>> but
>>
>> cpu_data->sleep_length_ns = KTIME_MAX;
>>
>> therefore
>> idx_timer = drv->state_count - 1
>> idx_duration = some state with residency < LATENCY_THRESHOLD_NS
>>
>> For any reasonable system therefore idx_timer != idx_duration
>> (i.e. there's an idle state deeper than LATENCY_THRESHOLD_NS).
>> So hits will never be incremented?
>
> Why never?
>
> First of all, you need to get into the "2 * cpu_data->short_idles >=
> cpu_data->total" case somehow and this may be through timer wakeups.
Okay, maybe I had too static a scenario in mind here.
Let me think it through one more time.
>
>> How would adding hits then help this case?
>
> They may be dominant when this condition triggers for the first time.
I see.
Anything in particular this would help a lot with?
There's no noticeable behavior change in my usual tests, which is
expected, given we have only WFI in LATENCY_THRESHOLD_NS.
I did fake a WFI2 with residency=5 latency=1, teo-m is mainline, teo
is with series applied:
device gov iter iops idles idle_misses idle_miss_ratio belows aboves WFI WFI2
------- ----- ----- ------ -------- ------------ ---------------- -------- ------- -------- --------
nvme0n1 teo 0 80223 8601862 1079609 0.126 918363 161246 205096 4080894
nvme0n1 teo 1 78522 8488322 1054171 0.124 890420 163751 208664 4020130
nvme0n1 teo 2 77901 8375258 1031275 0.123 878083 153192 194500 3977655
nvme0n1 teo 3 77517 8344681 1023423 0.123 869548 153875 195262 3961675
nvme0n1 teo 4 77934 8356760 1027556 0.123 876438 151118 191848 3971578
nvme0n1 teo 5 77864 8371566 1033686 0.123 877745 155941 197903 3972844
nvme0n1 teo 6 78057 8417326 1040512 0.124 881420 159092 201922 3991785
nvme0n1 teo 7 78214 8490292 1050379 0.124 884528 165851 210860 4019102
nvme0n1 teo 8 78100 8357664 1034487 0.124 882781 151706 192728 3971505
nvme0n1 teo 9 76895 8316098 1014695 0.122 861950 152745 193680 3948573
nvme0n1 teo-m 0 76729 8261670 1032158 0.125 845247 186911 237147 3877992
nvme0n1 teo-m 1 77763 8344526 1053266 0.126 867094 186172 237526 3919320
nvme0n1 teo-m 2 76717 8285070 1034706 0.125 848385 186321 236956 3889534
nvme0n1 teo-m 3 76920 8270834 1030223 0.125 847490 182733 232081 3887525
nvme0n1 teo-m 4 77198 8329578 1044724 0.125 855438 189286 240947 3908194
nvme0n1 teo-m 5 77361 8338772 1046903 0.126 857291 189612 241577 3912576
nvme0n1 teo-m 6 76827 8346204 1037520 0.124 846008 191512 243167 3914194
nvme0n1 teo-m 7 77931 8367212 1053337 0.126 866549 186788 237852 3930510
nvme0n1 teo-m 8 77870 8358306 1056011 0.126 867167 188844 240602 3923417
nvme0n1 teo-m 9 77405 8338356 1046012 0.125 856605 189407 240694 3913012
The difference is small, but it's there even though this isn't
a timer-heavy workload at all.
* Re: [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-17 11:58 ` Christian Loehle
@ 2025-04-17 15:21 ` Rafael J. Wysocki
2025-04-17 17:18 ` Christian Loehle
0 siblings, 1 reply; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-17 15:21 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM, LKML,
Daniel Lezcano, Artem Bityutskiy, Doug Smythies,
Aboorva Devarajan
On Thu, Apr 17, 2025 at 1:58 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> >> [...]
> >> If the timer wakeups are < LATENCY_THRESHOLD_NS we will not do
> >>
> >> cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
> >>
> >> but
> >>
> >> cpu_data->sleep_length_ns = KTIME_MAX;
> >>
> >> therefore
> >> idx_timer = drv->state_count - 1
> >> idx_duration = some state with residency < LATENCY_THRESHOLD_NS
> >>
> >> For any reasonable system therefore idx_timer != idx_duration
> >> (i.e. there's an idle state deeper than LATENCY_THRESHOLD_NS).
> >> So hits will never be incremented?
> >
> > Why never?
> >
> > First of all, you need to get into the "2 * cpu_data->short_idles >=
> > cpu_data->total" case somehow and this may be through timer wakeups.
>
> Okay, maybe I had too static a scenario in mind here.
> Let me think it through one more time.
Well, this is subtle and your question is actually a good one.
> >
> >> How would adding hits then help this case?
> >
> > They may be dominant when this condition triggers for the first time.
>
> I see.
>
> Anything in particular this would help a lot with?
So I've been trying to reproduce my own results using essentially my
linux-next branch (6.15-rc2 with some material on top) as the
baseline, and so far I've been unable to do that. There's no
significant difference from these patches, or at least they don't help
as much as I thought they would.
> There's no noticeable behavior change in my usual tests, which is
> expected, given we have only WFI in LATENCY_THRESHOLD_NS.
>
> I did fake a WFI2 with residency=5 latency=1, teo-m is mainline, teo
> is with series applied:
>
> [table trimmed; quoted in full above]
>
> The difference is small, but it's there even though this isn't
> a timer-heavy workload at all.
This is interesting, so thanks for doing it, but the goal really was
to help with the polling state usage on x86 and that doesn't appear to
be happening, so I'm going to drop these patches at least for now.
* Re: [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-17 15:21 ` Rafael J. Wysocki
@ 2025-04-17 17:18 ` Christian Loehle
2025-04-17 19:05 ` Rafael J. Wysocki
0 siblings, 1 reply; 14+ messages in thread
From: Christian Loehle @ 2025-04-17 17:18 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, Linux PM, LKML, Daniel Lezcano,
Artem Bityutskiy, Doug Smythies, Aboorva Devarajan
On 4/17/25 16:21, Rafael J. Wysocki wrote:
> [...]
> This is interesting, so thanks for doing it, but the goal really was
> to help with the polling state usage on x86 and that doesn't appear to
> be happening, so I'm going to drop these patches at least for now.
Alright, well my testing on x86 is limited, but I assume you are
referring to systems where we do have
state0 latency=0 residency=0 polling
state1 latency=1 residency=1
In theory teo shouldn't be super aggressive on state0 then with the
intercept logic, unless the idle durations are recorded as <1us.
I wonder what goes wrong; are there any traces or workloads you would
recommend looking at?
* Re: [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals
2025-04-17 17:18 ` Christian Loehle
@ 2025-04-17 19:05 ` Rafael J. Wysocki
0 siblings, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-04-17 19:05 UTC (permalink / raw)
To: Christian Loehle
Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM, LKML,
Daniel Lezcano, Artem Bityutskiy, Doug Smythies,
Aboorva Devarajan
On Thu, Apr 17, 2025 at 7:18 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> [...]
>
> Alright, well my testing on x86 is limited, but I assume you are
> referring to systems where we do have
> state0 latency=0 residency=0 polling
> state1 latency=1 residency=1
> In theory teo shouldn't be super aggressive on state0 then with the
> intercept logic, unless the idle durations are recorded as <1us.
> I wonder what goes wrong; are there any traces or workloads you would
> recommend looking at?
I've observed state0 being selected too often and being too shallow
90% or so of the time.
I don't have anything showing this specifically in a dramatic fashion,
and yes, state1 is usually selected much more often than state0.
end of thread, other threads: [~2025-04-17 19:05 UTC | newest]
Thread overview: 14+ messages
2025-04-03 19:16 [PATCH v1 0/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
2025-04-03 19:16 ` [PATCH v1 1/2] cpuidle: teo: Move candidate state lookup to separate function Rafael J. Wysocki
2025-04-03 19:18 ` [PATCH v1 2/2] cpuidle: teo: Refine handling of short idle intervals Rafael J. Wysocki
2025-04-16 15:00 ` Christian Loehle
2025-04-16 15:28 ` Rafael J. Wysocki
2025-04-17 11:58 ` Christian Loehle
2025-04-17 15:21 ` Rafael J. Wysocki
2025-04-17 17:18 ` Christian Loehle
2025-04-17 19:05 ` Rafael J. Wysocki
2025-04-09 6:52 ` [PATCH v1 0/2] " Artem Bityutskiy
2025-04-09 12:06 ` Rafael J. Wysocki
2025-04-09 14:36 ` Doug Smythies
2025-04-09 14:41 ` Rafael J. Wysocki
2025-04-14 7:15 ` Aboorva Devarajan