* [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic
@ 2024-06-11 11:24 Christian Loehle
2024-06-11 11:24 ` [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness" Christian Loehle
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Christian Loehle @ 2024-06-11 11:24 UTC (permalink / raw)
To: linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba, dietmar.eggemann,
Christian Loehle
Hi all,
so my investigation into teo led to the following fixes.
1/3:
As discussed, the utilization threshold is too high: while
there are benefits in certain workloads, there are quite a few
regressions, too, so revert the util-awareness patch.
This in itself leads to regressions, but part of them can be offset
by the later patches.
See
https://lore.kernel.org/lkml/CAKfTPtA6ZzRR-zMN7sodOW+N_P+GqwNv4tGR+aMB5VXRT2b5bg@mail.gmail.com/
2/3:
Remove the 'recent' intercept logic, see my findings in:
https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/
I haven't found a way to salvage this properly, so I removed it.
The regular intercept seems to decay fast enough to not need this, but
we could change it if that turns out that we need this to be faster in
ramp-up and decaying.
3/3:
The rest of the intercept logic had issues, too.
See the commit.
Happy for anyone to take a look and test as well.
Some numbers for context, comparing:
- IO workload (intercept heavy).
- Timer workload with very low utilization (checks selection of the deepest state)
- hackbench (high utilization)
- Geekbench 5 on Pixel6 (high utilization)
Tests 1 to 3 are on RK3399 with CONFIG_HZ=100.
target_residencies: 1, 900, 2000
1. IO workload, 5 runs, results sorted, in read IOPS.
fio --minimal --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 | cut -d \; -f 8;
teo fixed v2:
/dev/nvme0n1
[4599, 4658, 4692, 4694, 4720]
/dev/mmcblk2
[5700, 5730, 5735, 5747, 5977]
/dev/mmcblk1
[2052, 2054, 2066, 2067, 2073]
teo mainline:
/dev/nvme0n1
[3793, 3825, 3846, 3865, 3964]
/dev/mmcblk2
[3831, 4110, 4154, 4203, 4228]
/dev/mmcblk1
[1559, 1564, 1596, 1611, 1618]
menu:
/dev/nvme0n1
[2571, 2630, 2804, 2813, 2917]
/dev/mmcblk2
[4181, 4260, 5062, 5260, 5329]
/dev/mmcblk1
[1567, 1581, 1585, 1603, 1769]
2. Timer workload (through IO for my convenience 😉 )
Results in read IOPS, fio same as above.
echo "0 2097152 zero" | dmsetup create dm-zeros
echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow
(Each IO is delayed by a 50ms timer, so the CPU should mostly be in state2; 5s total)
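One way to reproduce this setup and the residency/miss numbers below is sketched here; the exact fio target and the trace post-processing are my assumptions, not necessarily verbatim what was run:
fio --minimal --time_based --name=fiotest --filename=/dev/mapper/dm-slow --runtime=5 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 | cut -d \; -f 8
trace-cmd record -e power:cpu_idle -e power:cpu_idle_miss -- sleep 5
trace-cmd report   # the idle_state residency and cpu_idle_miss tables below are aggregated from this output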
teo fixed v2:
idle_state time
2.0 4.807025
-1.0 0.219766
0.0 0.072007
1.0 0.169570
3199 cpu_idle total
38 cpu_idle_miss
31 cpu_idle_miss above
7 cpu_idle_miss below
teo mainline:
idle_state time
1.0 4.897942
-1.0 0.095375
0.0 0.253581
3221 cpu_idle total
1269 cpu_idle_miss
22 cpu_idle_miss above
1247 cpu_idle_miss below
menu:
idle_state time
2.0 4.295546
-1.0 0.234164
1.0 0.356344
0.0 0.401507
3421 cpu_idle total
129 cpu_idle_miss
52 cpu_idle_miss above
77 cpu_idle_miss below
Residencies:
teo mainline isn't in state2 at all; teo fixed is in state2 more than menu, but
both are in state2 the vast majority of the time, as expected.
tldr: overall teo fixed spends more time in state2 while having
fewer idle_miss than menu.
teo mainline was just way too aggressive at selecting shallow states.
3. Hackbench, 5 runs
for i in $(seq 0 4); do hackbench -l 100 -g 100 ; sleep 1; done
teo fixed v2:
Time: 4.937
Time: 4.898
Time: 4.871
Time: 4.833
Time: 4.898
teo mainline:
Time: 4.945
Time: 5.021
Time: 4.927
Time: 4.923
Time: 5.137
menu:
Time: 4.964
Time: 4.847
Time: 4.914
Time: 4.841
Time: 4.800
tldr: all comparable, teo mainline slightly worse
4. Geekbench 5 (multi-core) on Pixel6
(Device is cooled for each iteration separately)
teo mainline:
3113, 3068, 3079
mean 3086.66
teo revert util-awareness:
2947, 2941, 2952
mean 2946.66 (-4.54%)
teo fixed v2:
3032, 3066, 3019
mean 3039 (-1.54%)
Changes since v1:
- Removed all non-fixes.
- Do a full revert of Util-awareness instead of increasing thresholds.
- Address Dietmar's comments.
https://lore.kernel.org/linux-kernel/20240606090050.327614-2-christian.loehle@arm.com/T/
Kind Regards,
Christian
Christian Loehle (3):
Revert: "cpuidle: teo: Introduce util-awareness"
cpuidle: teo: Remove recent intercepts metric
cpuidle: teo: Don't count non-existent intercepts
drivers/cpuidle/governors/teo.c | 189 +++++---------------------------
1 file changed, 27 insertions(+), 162 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness"
2024-06-11 11:24 [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Christian Loehle
@ 2024-06-11 11:24 ` Christian Loehle
2024-06-11 13:37 ` Vincent Guittot
` (2 more replies)
2024-06-11 11:24 ` [PATCHv2 2/3] cpuidle: teo: Remove recent intercepts metric Christian Loehle
` (2 subsequent siblings)
3 siblings, 3 replies; 16+ messages in thread
From: Christian Loehle @ 2024-06-11 11:24 UTC (permalink / raw)
To: linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba, dietmar.eggemann,
Christian Loehle
This reverts commit 9ce0f7c4bc64d820b02a1c53f7e8dba9539f942b.
Util-awareness was reported to be too aggressive in selecting shallower
states. Additionally a single threshold was found to not be suitable
for reasoning about sleep length as, for all practical purposes,
almost arbitrary sleep lengths are still possible for any load value.
Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness")
Reported-by: Qais Yousef <qyousef@layalina.io>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
drivers/cpuidle/governors/teo.c | 100 --------------------------------
1 file changed, 100 deletions(-)
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 7244f71c59c5..d8554c20cf10 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -104,56 +104,16 @@
* select the given idle state instead of the candidate one.
*
* 3. By default, select the candidate state.
- *
- * Util-awareness mechanism:
- *
- * The idea behind the util-awareness extension is that there are two distinct
- * scenarios for the CPU which should result in two different approaches to idle
- * state selection - utilized and not utilized.
- *
- * In this case, 'utilized' means that the average runqueue util of the CPU is
- * above a certain threshold.
- *
- * When the CPU is utilized while going into idle, more likely than not it will
- * be woken up to do more work soon and so a shallower idle state should be
- * selected to minimise latency and maximise performance. When the CPU is not
- * being utilized, the usual metrics-based approach to selecting the deepest
- * available idle state should be preferred to take advantage of the power
- * saving.
- *
- * In order to achieve this, the governor uses a utilization threshold.
- * The threshold is computed per-CPU as a percentage of the CPU's capacity
- * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
- * seems to be getting the best results.
- *
- * Before selecting the next idle state, the governor compares the current CPU
- * util to the precomputed util threshold. If it's below, it defaults to the
- * TEO metrics mechanism. If it's above, the closest shallower idle state will
- * be selected instead, as long as is not a polling state.
*/
#include <linux/cpuidle.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
-#include <linux/sched.h>
#include <linux/sched/clock.h>
-#include <linux/sched/topology.h>
#include <linux/tick.h>
#include "gov.h"
-/*
- * The number of bits to shift the CPU's capacity by in order to determine
- * the utilized threshold.
- *
- * 6 was chosen based on testing as the number that achieved the best balance
- * of power and performance on average.
- *
- * The resulting threshold is high enough to not be triggered by background
- * noise and low enough to react quickly when activity starts to ramp up.
- */
-#define UTIL_THRESHOLD_SHIFT 6
-
/*
* The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
* is used for decreasing metrics on a regular basis.
@@ -188,7 +148,6 @@ struct teo_bin {
* @next_recent_idx: Index of the next @recent_idx entry to update.
* @recent_idx: Indices of bins corresponding to recent "intercepts".
* @tick_hits: Number of "hits" after TICK_NSEC.
- * @util_threshold: Threshold above which the CPU is considered utilized
*/
struct teo_cpu {
s64 time_span_ns;
@@ -198,28 +157,10 @@ struct teo_cpu {
int next_recent_idx;
int recent_idx[NR_RECENT];
unsigned int tick_hits;
- unsigned long util_threshold;
};
static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
-/**
- * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
- * @cpu: Target CPU
- * @cpu_data: Governor CPU data for the target CPU
- */
-#ifdef CONFIG_SMP
-static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
-{
- return sched_cpu_util(cpu) > cpu_data->util_threshold;
-}
-#else
-static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
-{
- return false;
-}
-#endif
-
/**
* teo_update - Update CPU metrics after wakeup.
* @drv: cpuidle driver containing state data.
@@ -386,7 +327,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
int constraint_idx = 0;
int idx0 = 0, idx = -1;
bool alt_intercepts, alt_recent;
- bool cpu_utilized;
s64 duration_ns;
int i;
@@ -411,32 +351,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (!dev->states_usage[0].disable)
idx = 0;
- cpu_utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
- /*
- * If the CPU is being utilized over the threshold and there are only 2
- * states to choose from, the metrics need not be considered, so choose
- * the shallowest non-polling state and exit.
- */
- if (drv->state_count < 3 && cpu_utilized) {
- /*
- * If state 0 is enabled and it is not a polling one, select it
- * right away unless the scheduler tick has been stopped, in
- * which case care needs to be taken to leave the CPU in a deep
- * enough state in case it is not woken up any time soon after
- * all. If state 1 is disabled, though, state 0 must be used
- * anyway.
- */
- if ((!idx && !(drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
- teo_state_ok(0, drv)) || dev->states_usage[1].disable) {
- idx = 0;
- goto out_tick;
- }
- /* Assume that state 1 is not a polling one and use it. */
- idx = 1;
- duration_ns = drv->states[1].target_residency_ns;
- goto end;
- }
-
/* Compute the sums of metrics for early wakeup pattern detection. */
for (i = 1; i < drv->state_count; i++) {
struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
@@ -560,18 +474,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (idx > constraint_idx)
idx = constraint_idx;
- /*
- * If the CPU is being utilized over the threshold, choose a shallower
- * non-polling state to improve latency, unless the scheduler tick has
- * been stopped already and the shallower state's target residency is
- * not sufficiently large.
- */
- if (cpu_utilized) {
- i = teo_find_shallower_state(drv, dev, idx, KTIME_MAX, true);
- if (teo_state_ok(i, drv))
- idx = i;
- }
-
/*
* Skip the timers check if state 0 is the current candidate one,
* because an immediate non-timer wakeup is expected in that case.
@@ -667,11 +569,9 @@ static int teo_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
- unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
int i;
memset(cpu_data, 0, sizeof(*cpu_data));
- cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
for (i = 0; i < NR_RECENT; i++)
cpu_data->recent_idx[i] = -1;
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCHv2 2/3] cpuidle: teo: Remove recent intercepts metric
2024-06-11 11:24 [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Christian Loehle
2024-06-11 11:24 ` [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness" Christian Loehle
@ 2024-06-11 11:24 ` Christian Loehle
2024-06-26 10:44 ` Dietmar Eggemann
2024-06-11 11:24 ` [PATCHv2 3/3] cpuidle: teo: Don't count non-existent intercepts Christian Loehle
2024-06-17 0:20 ` [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Doug Smythies
3 siblings, 1 reply; 16+ messages in thread
From: Christian Loehle @ 2024-06-11 11:24 UTC (permalink / raw)
To: linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba, dietmar.eggemann,
Christian Loehle
The logic for recent intercepts didn't work: there is an underflow
of the 'recent' value that can be observed during boot already, which
teo usually doesn't recover from, making the entire logic pointless.
Furthermore, the recent intercepts were never reset, so they weren't
actually very 'recent'.
Underflowed 'recent' values lead to teo always acting as if the
timer-based expected sleep length were too high and therefore
unnecessarily selecting shallower states.
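To illustrate the failure mode (plain C semantics only, not the exact
teo code path that triggers the unmatched decrement):

/* 'recent' is an unsigned int, so one unmatched decrement wraps around. */
#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned int recent = 0;	/* a bin's "recent" counter */

	recent--;			/* underflow: wraps to UINT_MAX */

	/* any "recent_sum > NR_RECENT / 2" style check now passes forever */
	printf("recent = %u (UINT_MAX = %u)\n", recent, UINT_MAX);
	return 0;
}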
Experiments show that the remaining 'intercept' logic is enough to
react quickly to scenarios in which teo cannot rely on the timer-based
expected sleep length.
See also here:
https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/
Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
drivers/cpuidle/governors/teo.c | 73 ++++++---------------------------
1 file changed, 12 insertions(+), 61 deletions(-)
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index d8554c20cf10..cc7df59f488d 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -59,10 +59,6 @@
* shallower than the one whose bin is fallen into by the sleep length (these
* situations are referred to as "intercepts" below).
*
- * In addition to the metrics described above, the governor counts recent
- * intercepts (that is, intercepts that have occurred during the last
- * %NR_RECENT invocations of it for the given CPU) for each bin.
- *
* In order to select an idle state for a CPU, the governor takes the following
* steps (modulo the possible latency constraint that must be taken into account
* too):
@@ -81,20 +77,15 @@
* idle long enough to avoid being intercepted if the sleep length had been
* equal to the current one).
*
- * - The sum of the numbers of recent intercepts for all of the idle states
- * shallower than the candidate one.
- *
- * 2. If the second sum is greater than the first one or the third sum is
- * greater than %NR_RECENT / 2, the CPU is likely to wake up early, so look
- * for an alternative idle state to select.
+ * 2. If the second sum is greater than the first one the CPU is likely to wake
+ * up early, so look for an alternative idle state to select.
*
* - Traverse the idle states shallower than the candidate one in the
* descending order.
*
- * - For each of them compute the sum of the "intercepts" metrics and the sum
- * of the numbers of recent intercepts over all of the idle states between
- * it and the candidate one (including the former and excluding the
- * latter).
+ * - For each of them compute the sum of the "intercepts" metrics over all
+ * of the idle states between it and the candidate one (including the
+ * former and excluding the latter).
*
* - If each of these sums that needs to be taken into account (because the
* check related to it has indicated that the CPU is likely to wake up
@@ -121,22 +112,14 @@
#define PULSE 1024
#define DECAY_SHIFT 3
-/*
- * Number of the most recent idle duration values to take into consideration for
- * the detection of recent early wakeup patterns.
- */
-#define NR_RECENT 9
-
/**
* struct teo_bin - Metrics used by the TEO cpuidle governor.
* @intercepts: The "intercepts" metric.
* @hits: The "hits" metric.
- * @recent: The number of recent "intercepts".
*/
struct teo_bin {
unsigned int intercepts;
unsigned int hits;
- unsigned int recent;
};
/**
@@ -145,8 +128,6 @@ struct teo_bin {
* @sleep_length_ns: Time till the closest timer event (at the selection time).
* @state_bins: Idle state data bins for this CPU.
* @total: Grand total of the "intercepts" and "hits" metrics for all bins.
- * @next_recent_idx: Index of the next @recent_idx entry to update.
- * @recent_idx: Indices of bins corresponding to recent "intercepts".
* @tick_hits: Number of "hits" after TICK_NSEC.
*/
struct teo_cpu {
@@ -154,8 +135,6 @@ struct teo_cpu {
s64 sleep_length_ns;
struct teo_bin state_bins[CPUIDLE_STATE_MAX];
unsigned int total;
- int next_recent_idx;
- int recent_idx[NR_RECENT];
unsigned int tick_hits;
};
@@ -227,13 +206,6 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
}
}
- i = cpu_data->next_recent_idx++;
- if (cpu_data->next_recent_idx >= NR_RECENT)
- cpu_data->next_recent_idx = 0;
-
- if (cpu_data->recent_idx[i] >= 0)
- cpu_data->state_bins[cpu_data->recent_idx[i]].recent--;
-
/*
* If the deepest state's target residency is below the tick length,
* make a record of it to help teo_select() decide whether or not
@@ -260,14 +232,10 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
* Otherwise, update the "intercepts" metric for the bin fallen into by
* the measured idle duration.
*/
- if (idx_timer == idx_duration) {
+ if (idx_timer == idx_duration)
cpu_data->state_bins[idx_timer].hits += PULSE;
- cpu_data->recent_idx[i] = -1;
- } else {
+ else
cpu_data->state_bins[idx_duration].intercepts += PULSE;
- cpu_data->state_bins[idx_duration].recent++;
- cpu_data->recent_idx[i] = idx_duration;
- }
end:
cpu_data->total += PULSE;
@@ -320,8 +288,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
unsigned int tick_intercept_sum = 0;
unsigned int idx_intercept_sum = 0;
unsigned int intercept_sum = 0;
- unsigned int idx_recent_sum = 0;
- unsigned int recent_sum = 0;
unsigned int idx_hit_sum = 0;
unsigned int hit_sum = 0;
int constraint_idx = 0;
@@ -362,7 +328,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
*/
intercept_sum += prev_bin->intercepts;
hit_sum += prev_bin->hits;
- recent_sum += prev_bin->recent;
if (dev->states_usage[i].disable)
continue;
@@ -378,7 +343,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
/* Save the sums for the current state. */
idx_intercept_sum = intercept_sum;
idx_hit_sum = hit_sum;
- idx_recent_sum = recent_sum;
}
/* Avoid unnecessary overhead. */
@@ -403,37 +367,28 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* If the sum of the intercepts metric for all of the idle states
* shallower than the current candidate one (idx) is greater than the
* sum of the intercepts and hits metrics for the candidate state and
- * all of the deeper states, or the sum of the numbers of recent
- * intercepts over all of the states shallower than the candidate one
- * is greater than a half of the number of recent events taken into
- * account, a shallower idle state is likely to be a better choice.
+ * all of the deeper states a shallower idle state is likely to be a
+ * better choice.
*/
- alt_intercepts = 2 * idx_intercept_sum > cpu_data->total - idx_hit_sum;
- alt_recent = idx_recent_sum > NR_RECENT / 2;
- if (alt_recent || alt_intercepts) {
+ if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
int first_suitable_idx = idx;
/*
* Look for the deepest idle state whose target residency had
* not exceeded the idle duration in over a half of the relevant
- * cases (both with respect to intercepts overall and with
- * respect to the recent intercepts only) in the past.
+ * cases in the past.
*
* Take the possible duration limitation present if the tick
* has been stopped already into account.
*/
intercept_sum = 0;
- recent_sum = 0;
for (i = idx - 1; i >= 0; i--) {
struct teo_bin *bin = &cpu_data->state_bins[i];
intercept_sum += bin->intercepts;
- recent_sum += bin->recent;
- if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
- (!alt_intercepts ||
- 2 * intercept_sum > idx_intercept_sum)) {
+ if (2 * intercept_sum > idx_intercept_sum) {
/*
* Use the current state unless it is too
* shallow or disabled, in which case take the
@@ -569,13 +524,9 @@ static int teo_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
- int i;
memset(cpu_data, 0, sizeof(*cpu_data));
- for (i = 0; i < NR_RECENT; i++)
- cpu_data->recent_idx[i] = -1;
-
return 0;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCHv2 3/3] cpuidle: teo: Don't count non-existent intercepts
2024-06-11 11:24 [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Christian Loehle
2024-06-11 11:24 ` [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness" Christian Loehle
2024-06-11 11:24 ` [PATCHv2 2/3] cpuidle: teo: Remove recent intercepts metric Christian Loehle
@ 2024-06-11 11:24 ` Christian Loehle
2024-06-26 13:09 ` Dietmar Eggemann
2024-06-17 0:20 ` [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Doug Smythies
3 siblings, 1 reply; 16+ messages in thread
From: Christian Loehle @ 2024-06-11 11:24 UTC (permalink / raw)
To: linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba, dietmar.eggemann,
Christian Loehle
Since commit 6da8f9ba5a87 ("cpuidle: teo:
Skip tick_nohz_get_sleep_length() call in some cases"), teo no longer
queries the sleep length when bailing out early, leaving the expected
sleep_length_ns value at KTIME_MAX.
This led to state0 accumulating lots of 'intercepts', because the
actually measured sleep length was < KTIME_MAX, and therefore to teo
taking too long to recover from intercept-heavy scenarios. So count
KTIME_MAX as a hit instead (we have to count it as something, otherwise
we are stuck).
Fundamentally we can only do one of the two:
1. Skip the sleep_length_ns query when we think an intercept is likely.
2. Have accurate data on whether sleep_length_ns was actually
intercepted when we believed it would be.
This patch chooses the latter, as I've found the additional time it
takes to query the sleep length to be negligible, while the variants of
option 1 (count all unknowns as misses or count all unknowns as hits)
had significant regressions (counting them as misses led to lots of
too-shallow idle state selections, counting them as hits gave terrible
performance in intercept-heavy workloads).
Fixes: 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
drivers/cpuidle/governors/teo.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index cc7df59f488d..1e4b40474f49 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -231,8 +231,13 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
* length, this is a "hit", so update the "hits" metric for that bin.
* Otherwise, update the "intercepts" metric for the bin fallen into by
* the measured idle duration.
+ * If teo_select() bailed out early, we have to count this as a hit as
+ * we don't know what the true sleep length would've been. Otherwise
+ * we accumulate lots of intercepts at the shallower state (especially
+ * state0) even though there weren't any intercepts due to us
+ * anticipating one.
*/
- if (idx_timer == idx_duration)
+ if (idx_timer == idx_duration || cpu_data->sleep_length_ns == KTIME_MAX)
cpu_data->state_bins[idx_timer].hits += PULSE;
else
cpu_data->state_bins[idx_duration].intercepts += PULSE;
@@ -292,7 +297,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
unsigned int hit_sum = 0;
int constraint_idx = 0;
int idx0 = 0, idx = -1;
- bool alt_intercepts, alt_recent;
+ int prev_intercept_idx;
s64 duration_ns;
int i;
@@ -370,6 +375,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* all of the deeper states a shallower idle state is likely to be a
* better choice.
*/
+ prev_intercept_idx = idx;
if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
int first_suitable_idx = idx;
@@ -421,6 +427,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
first_suitable_idx = i;
}
}
+ if (!idx && prev_intercept_idx) {
+ /*
+ * We have to query the sleep length here otherwise we don't
+ * know after wakeup if our guess was correct.
+ */
+ duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+ cpu_data->sleep_length_ns = duration_ns;
+ }
/*
* If there is a latency constraint, it may be necessary to select an
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness"
2024-06-11 11:24 ` [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness" Christian Loehle
@ 2024-06-11 13:37 ` Vincent Guittot
2024-06-19 17:49 ` Qais Yousef
2024-06-21 13:22 ` [PATCHv3 " Christian Loehle
2 siblings, 0 replies; 16+ messages in thread
From: Vincent Guittot @ 2024-06-11 13:37 UTC (permalink / raw)
To: Christian Loehle
Cc: linux-pm, linux-kernel, rafael, qyousef, peterz, daniel.lezcano,
ulf.hansson, anna-maria, kajetan.puchalski, lukasz.luba,
dietmar.eggemann
On Tue, 11 Jun 2024 at 13:24, Christian Loehle <christian.loehle@arm.com> wrote:
>
> This reverts commit 9ce0f7c4bc64d820b02a1c53f7e8dba9539f942b.
>
> Util-awareness was reported to be too aggressive in selecting shallower
> states. Additionally a single threshold was found to not be suitable
> for reasoning about sleep length as, for all practical purposes,
> almost arbitrary sleep lengths are still possible for any load value.
>
> Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness")
> Reported-by: Qais Yousef <qyousef@layalina.io>
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
The spurious wakeups that I reported on my rb5, are gone with this patchset
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
> drivers/cpuidle/governors/teo.c | 100 --------------------------------
> 1 file changed, 100 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index 7244f71c59c5..d8554c20cf10 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -104,56 +104,16 @@
> * select the given idle state instead of the candidate one.
> *
> * 3. By default, select the candidate state.
> - *
> - * Util-awareness mechanism:
> - *
> - * The idea behind the util-awareness extension is that there are two distinct
> - * scenarios for the CPU which should result in two different approaches to idle
> - * state selection - utilized and not utilized.
> - *
> - * In this case, 'utilized' means that the average runqueue util of the CPU is
> - * above a certain threshold.
> - *
> - * When the CPU is utilized while going into idle, more likely than not it will
> - * be woken up to do more work soon and so a shallower idle state should be
> - * selected to minimise latency and maximise performance. When the CPU is not
> - * being utilized, the usual metrics-based approach to selecting the deepest
> - * available idle state should be preferred to take advantage of the power
> - * saving.
> - *
> - * In order to achieve this, the governor uses a utilization threshold.
> - * The threshold is computed per-CPU as a percentage of the CPU's capacity
> - * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> - * seems to be getting the best results.
> - *
> - * Before selecting the next idle state, the governor compares the current CPU
> - * util to the precomputed util threshold. If it's below, it defaults to the
> - * TEO metrics mechanism. If it's above, the closest shallower idle state will
> - * be selected instead, as long as is not a polling state.
> */
>
> #include <linux/cpuidle.h>
> #include <linux/jiffies.h>
> #include <linux/kernel.h>
> -#include <linux/sched.h>
> #include <linux/sched/clock.h>
> -#include <linux/sched/topology.h>
> #include <linux/tick.h>
>
> #include "gov.h"
>
> -/*
> - * The number of bits to shift the CPU's capacity by in order to determine
> - * the utilized threshold.
> - *
> - * 6 was chosen based on testing as the number that achieved the best balance
> - * of power and performance on average.
> - *
> - * The resulting threshold is high enough to not be triggered by background
> - * noise and low enough to react quickly when activity starts to ramp up.
> - */
> -#define UTIL_THRESHOLD_SHIFT 6
> -
> /*
> * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> * is used for decreasing metrics on a regular basis.
> @@ -188,7 +148,6 @@ struct teo_bin {
> * @next_recent_idx: Index of the next @recent_idx entry to update.
> * @recent_idx: Indices of bins corresponding to recent "intercepts".
> * @tick_hits: Number of "hits" after TICK_NSEC.
> - * @util_threshold: Threshold above which the CPU is considered utilized
> */
> struct teo_cpu {
> s64 time_span_ns;
> @@ -198,28 +157,10 @@ struct teo_cpu {
> int next_recent_idx;
> int recent_idx[NR_RECENT];
> unsigned int tick_hits;
> - unsigned long util_threshold;
> };
>
> static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> -/**
> - * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
> - * @cpu: Target CPU
> - * @cpu_data: Governor CPU data for the target CPU
> - */
> -#ifdef CONFIG_SMP
> -static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> -{
> - return sched_cpu_util(cpu) > cpu_data->util_threshold;
> -}
> -#else
> -static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> -{
> - return false;
> -}
> -#endif
> -
> /**
> * teo_update - Update CPU metrics after wakeup.
> * @drv: cpuidle driver containing state data.
> @@ -386,7 +327,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> int constraint_idx = 0;
> int idx0 = 0, idx = -1;
> bool alt_intercepts, alt_recent;
> - bool cpu_utilized;
> s64 duration_ns;
> int i;
>
> @@ -411,32 +351,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> if (!dev->states_usage[0].disable)
> idx = 0;
>
> - cpu_utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> - /*
> - * If the CPU is being utilized over the threshold and there are only 2
> - * states to choose from, the metrics need not be considered, so choose
> - * the shallowest non-polling state and exit.
> - */
> - if (drv->state_count < 3 && cpu_utilized) {
> - /*
> - * If state 0 is enabled and it is not a polling one, select it
> - * right away unless the scheduler tick has been stopped, in
> - * which case care needs to be taken to leave the CPU in a deep
> - * enough state in case it is not woken up any time soon after
> - * all. If state 1 is disabled, though, state 0 must be used
> - * anyway.
> - */
> - if ((!idx && !(drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
> - teo_state_ok(0, drv)) || dev->states_usage[1].disable) {
> - idx = 0;
> - goto out_tick;
> - }
> - /* Assume that state 1 is not a polling one and use it. */
> - idx = 1;
> - duration_ns = drv->states[1].target_residency_ns;
> - goto end;
> - }
> -
> /* Compute the sums of metrics for early wakeup pattern detection. */
> for (i = 1; i < drv->state_count; i++) {
> struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
> @@ -560,18 +474,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> if (idx > constraint_idx)
> idx = constraint_idx;
>
> - /*
> - * If the CPU is being utilized over the threshold, choose a shallower
> - * non-polling state to improve latency, unless the scheduler tick has
> - * been stopped already and the shallower state's target residency is
> - * not sufficiently large.
> - */
> - if (cpu_utilized) {
> - i = teo_find_shallower_state(drv, dev, idx, KTIME_MAX, true);
> - if (teo_state_ok(i, drv))
> - idx = i;
> - }
> -
> /*
> * Skip the timers check if state 0 is the current candidate one,
> * because an immediate non-timer wakeup is expected in that case.
> @@ -667,11 +569,9 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> struct cpuidle_device *dev)
> {
> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> - unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> int i;
>
> memset(cpu_data, 0, sizeof(*cpu_data));
> - cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
> for (i = 0; i < NR_RECENT; i++)
> cpu_data->recent_idx[i] = -1;
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic
2024-06-11 11:24 [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Christian Loehle
` (2 preceding siblings ...)
2024-06-11 11:24 ` [PATCHv2 3/3] cpuidle: teo: Don't count non-existent intercepts Christian Loehle
@ 2024-06-17 0:20 ` Doug Smythies
2024-06-18 11:17 ` Christian Loehle
3 siblings, 1 reply; 16+ messages in thread
From: Doug Smythies @ 2024-06-17 0:20 UTC (permalink / raw)
To: 'Christian Loehle', rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba, dietmar.eggemann,
linux-pm, linux-kernel, Doug Smythies
On 2024.06.11 04:24 Christian Loehle wrote:
...
> Happy for anyone to take a look and test as well.
...
I tested the patch set.
I do a set of tests adopted over some years now.
Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
One interesting observation is that everything seems to run much slower than the last time I did this, last August, Kernel 6.5-rc4.
Test system:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 thread per core, 12 CPUs)
CPU Frequency scaling driver: intel_pstate
HWP (HardWare Pstate) control: Disabled
CPU frequency scaling governor: Performance
Idle states: 4: name : description:
state0/name:POLL desc:CPUIDLE CORE POLL IDLE
state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
Idle driver: intel_idle
Idle governor: as per individual test
Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
Legend:
teo: unmodified 6.10-rc2
menu:
ladder:
cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
System is extremely idle, other than the test work.
Test 1: 2 core ping pong sweep:
Pass a token between 2 CPUs on 2 different cores.
Do a variable amount of work at each stop.
Purpose: To utilize the shallowest idle states
and observe the transition from using more of 1
idle state to another.
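My actual test program is not shown here; the gist of it is something like
this sketch (CPU numbers and the loop count are placeholders, the amount of
work per stop is the swept variable):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static int pipe_ab[2], pipe_ba[2];
static unsigned long work = 10000;	/* work per stop; swept across the test */

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* pid 0: calling thread */
}

static void do_work(unsigned long loops)
{
	for (volatile unsigned long i = 0; i < loops; i++)
		;
}

static void *partner(void *arg)
{
	char token;

	(void)arg;
	pin_to_cpu(2);			/* assumed to be on a different core than CPU 0 */
	while (read(pipe_ab[0], &token, 1) == 1) {
		do_work(work);
		write(pipe_ba[1], &token, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t thread;
	char token = 'x';
	int i;

	pipe(pipe_ab);
	pipe(pipe_ba);
	pthread_create(&thread, NULL, partner, NULL);

	pin_to_cpu(0);
	for (i = 0; i < 100000; i++) {	/* pass the token back and forth */
		write(pipe_ab[1], &token, 1);
		read(pipe_ba[0], &token, 1);
		do_work(work);
	}
	return 0;
}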
Results relative to teo (negative is better):
menu ladder clv2 cl
average -2.09% 11.11% 2.88% 1.81%
max 10.63% 33.83% 9.45% 10.13%
min -11.58% 6.25% -3.61% -3.34%
While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/
Test 2: 6 core ping pong sweep:
Pass a token between 6 CPUs on 6 different cores.
Do a variable amount of work at each stop.
Purpose: To utilize the midrange idle states
and observe the transitions between use of
idle states.
Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
transitioning between much less power and slower performance and much more power and higher performance.
On either side of this area, the differences between all idle governors are negligible.
Only data from before this area (from results 1 to 95) was included in the results below.
Results relative to teo (negative is better):
menu ladder cl clv2
average 0.16% 4.32% 2.54% 2.64%
max 0.92% 14.32% 8.78% 8.50%
min -0.44% 0.27% 0.09% 0.05%
One large clv2 difference seems to be excessive use of the deepest idle state,
with a corresponding 100% hit rate on the "Idle State 3 was too deep" metric.
Example (20 second sample time):
teo: Idle state 3 entries: 600, 74.33% were too deep, or 451. Processor power was 38.0 watts.
clv2: Idle state 3 entries: 4,375,243, 100.00% were too deep, or 4,375,243. Processor power was 40.6 watts.
clv2 loop times were about 8% worse than teo.
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/
Test 3: sleeping ebizzy - 128 threads.
Purpose: This test has given interesting results in the past.
The test varies the sleep interval between record lookups.
The result is varying usage of idle states.
Results: relative to teo (negative is better):
menu clv2 ladder cl
average 0.06% 0.38% 0.81% 0.35%
max 2.53% 3.20% 5.00% 2.87%
min -2.13% -1.66% -3.30% -2.13%
No strong conclusion here, from just the data.
However, clv2 seems to use a bit more processor power, on average.
Further details:
Test4: adrestia wakeup latency tests. 500 threads.
Purpose: The test was reported in 2023.09 by the kernel test robot and looked
interesting and gave interesting results, so I added it to the tests I run.
Results:
teo:wakeup cost (periodic, 20us): 3130nSec reference
clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%
No strong conclusion here, from just the data.
However, clv2 seems to use a bit more processor power, on average.
teo: 69.72 watts
clv2: 72.91 watts +4.6%
Note 1: The first 5 minutes of power readings for each test were discarded to allow for thermal stabilization.
Note 2: More work is required for this test, because the teo one actually took longer to execute, due to more outlier results than the other tests.
There were several other tests run but are not written up herein.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic
2024-06-17 0:20 ` [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Doug Smythies
@ 2024-06-18 11:17 ` Christian Loehle
2024-06-18 17:24 ` Doug Smythies
0 siblings, 1 reply; 16+ messages in thread
From: Christian Loehle @ 2024-06-18 11:17 UTC (permalink / raw)
To: Doug Smythies
Cc: rafael, vincent.guittot, qyousef, peterz, daniel.lezcano,
ulf.hansson, anna-maria, kajetan.puchalski, lukasz.luba,
dietmar.eggemann, linux-pm, linux-kernel
On Sun, Jun 16, 2024 at 05:20:43PM -0700, Doug Smythies wrote:
> On 2024.06.11 04:24 Christian Loehle wrote:
>
> ...
> > Happy for anyone to take a look and test as well.
> ...
>
> I tested the patch set.
> I do a set of tests adopted over some years now.
> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
> One interesting observation is that everything seems to run much slower than the last time I did this, last August, Kernel 6.5-rc4.
>
Thank you very much Doug, that is helpful indeed!
> Test system:
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 thread per core, 12 CPUs)
> CPU Frequency scaling driver: intel_pstate
> HWP (HardWare Pstate) control: Disabled
> CPU frequency scaling governor: Performance
> Idle states: 4: name : description:
> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
What are the target residencies and exit latencies?
> Ilde driver: intel_idle
> Idle governor: as per individual test
> Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
> Legend:
> teo: unmodified 6.10-rc2
> menu:
> ladder:
> cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
> clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
> System is extremely idle, other than the test work.
If you don't mind spinning up another one, I'd be very curious about
results from just the Util-awareness revert (i.e. v2 1/3).
If not I'll try to reproduce your tests.
>
> Test 1: 2 core ping pong sweep:
>
> Pass a token between 2 CPUs on 2 different cores.
> Do a variable amount of work at each stop.
Hard to interpret the results here, as state residencies would be the
most useful metric, but from the results I assume that residencies are
calculated over all possible CPUs, so 4/6 CPUs are pretty much idle
the entire time, resulting in >75% state3 residency overall.
>
> Purpose: To utilize the shallowest idle states
> and observe the transition from using more of 1
> idle state to another.
>
> Results relative to teo (negative is better):
> menu ladder clv2 cl
> average -2.09% 11.11% 2.88% 1.81%
> max 10.63% 33.83% 9.45% 10.13%
> min -11.58% 6.25% -3.61% -3.34%
>
> While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
>
> Further details:
> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/
>
> Test 2: 6 core ping pong sweep:
>
> Pass a token between 6 CPUs on 6 different cores.
> Do a variable amount of work at each stop.
>
My first guess would've been that this is the perfect workload for the
very low utilization threshold, but even teo has >40% state3 residency
consistently here.
> Purpose: To utilize the midrange idle states
> and observe the transitions between use of
> idle states.
>
> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
> transitioning between much less power and slower performance and much more power and higher performance.
> On either side of this area, the differences between all idle governors are negligible.
> Only data from before this area (from results 1 t0 95) was included in the below results.
>
I see and agree with your interpretation. The difference in power between
all tested governors seems to be negligible during that window. Interestingly
the idle state residencies seem to be very different, e.g. ladder
being mostly in the deepest state3. Maybe total package power is too coarse
to show the differences for this test.
> Results relative to teo (negative is better):
> menu ladder cl clv2
> average 0.16% 4.32% 2.54% 2.64%
> max 0.92% 14.32% 8.78% 8.50%
> min -0.44% 0.27% 0.09% 0.05%
>
> One large clv2 difference seems to be excessive use of the deepest idle state,
> with corresponding 100% hit rate on the "Idle State 3 was to deep" metric.
> Example (20 second sample time):
>
> teo: Idle state 3 entries: 600, 74.33% were to deep or 451. Processor power was 38.0 watts.
> clv2: Idle state 3 entries: 4,375,243, 100.00% were to deep or 4,375,243. Processor power was 40.6 watts.
> clv2 loop times were about 8% worse than teo.
>
Some of the idle state 3 residencies seem to be >100% at the end here,
not sure what's up with that.
> Further details:
> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/
>
> Test 3: sleeping ebizzy - 128 threads.
>
> Purpose: This test has given interesting results in the past.
> The test varies the sleep interval between record lookups.
> The result is varying usage of idle states.
>
> Results: relative to teo (negative is better):
> menu clv2 ladder cl
> average 0.06% 0.38% 0.81% 0.35%
> max 2.53% 3.20% 5.00% 2.87%
> min -2.13% -1.66% -3.30% -2.13%
>
> No strong conclusion here, from just the data.
> However, clv2 seems to use a bit more processor power, on average.
Not sure about that; judging from the residencies, ladder and teo should be
decisive losers in terms of power. While teo seems to be getting worse in
power later in the test, that doesn't quite reflect the difference in
states, e.g. clv2 finishing with 65% state2 residency while teo has 40%,
so I'll try to get per-CPU power measurements on this one.
Interestingly, ladder is a clear winner if anything; if that is a reliable
result it could indicate a too aggressive tick stop in the other
governors, but cl isn't that much better than clv2 here, even
though it stops the tick less aggressively.
>
> Further details:
Link is missing, but I found
http://smythies.com/~doug/linux/idle/teo-util3/ebizzy/
from browsing your page.
>
> Test4: adrestia wakeup latency tests. 500 threads.
>
> Purpose: The test was reported in 2023.09 by the kernel test robot and looked
> both interesting and gave interesting results, so I added it to the tests I run.
http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
So interestingly we can see what I would call the misbehavior of teo
here, with teo skipping state2 and state3 entirely. You would expect
a power regression here, but it doesn't translate into package power
anyway.
>
> Results:
> teo:wakeup cost (periodic, 20us): 3130nSec reference
> clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
> cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
> menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
> ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%
Is this measured as wakeup latency?
I can't find much info about the test setup here, do you mind sharing
something on it?
>
> No strong conclusion here, from just the data.
> However, clv2 seems to use a bit more processor power, on average.
> teo: 69.72 watts
> clv2: 72.91 watts +4.6%
> Note 1: The first 5 minutes of the test powers were discarded to allow for thermal stabilization.
> Note 2: More work is required for this test, because the teo one actually took longer to execute, due to more outlier results than the other tests.
>
> There were several other tests run but are not written up herein.
>
Because results are on par for all? Or inconclusive / not reproducible?
Some final words:
I was hoping to get rid of Util-awareness with the fixed intercept logic,
and my tests showed that this isn't unreasonable.
Here we do see a case where there is some regression vs Util-awareness.
The intercept logic currently decays quite aggressively; maybe
that could be tuned to improve teo behavior.
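For reference, the decay works roughly like this (PULSE/DECAY_SHIFT as
defined in teo.c; this is just the arithmetic, not the full teo_update()
path):

#include <stdio.h>

#define PULSE		1024
#define DECAY_SHIFT	3

int main(void)
{
	unsigned int intercepts = 8 * PULSE;	/* a bin after a burst of intercepts */
	int wakeup;

	/* once the intercepts stop, every wakeup decays the bin by 1/8 */
	for (wakeup = 1; wakeup <= 20; wakeup++) {
		intercepts -= intercepts >> DECAY_SHIFT;
		printf("wakeup %2d: intercepts = %u\n", wakeup, intercepts);
	}
	return 0;
}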
Kind Regards,
Christian
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic
2024-06-18 11:17 ` Christian Loehle
@ 2024-06-18 17:24 ` Doug Smythies
2024-06-20 11:19 ` Christian Loehle
0 siblings, 1 reply; 16+ messages in thread
From: Doug Smythies @ 2024-06-18 17:24 UTC (permalink / raw)
To: 'Christian Loehle'
Cc: rafael, vincent.guittot, qyousef, peterz, daniel.lezcano,
ulf.hansson, anna-maria, kajetan.puchalski, lukasz.luba,
dietmar.eggemann, linux-pm, linux-kernel, Doug Smythies
Hi Christian,
Thank you for your reply.
On 2024.06.18 03:54 Christian Loehle wrote:
> On Sun, Jun 16, 2024 at 05:20:43PM -0700, Doug Smythies wrote:
>> On 2024.06.11 04:24 Christian Loehle wrote:
>>
>> ...
>> > Happy for anyone to take a look and test as well.
>> ...
>>
>> I tested the patch set.
>> I do a set of tests adopted over some years now.
>> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
>> One interesting observation is that everything seems to run much slower than the last time I did this, last August, Kernel 6.5-rc4.
>>
>
> Thank you very much Doug, that is helpful indeed!
>
>> Test system:
>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 thread per core, 12 CPUs)
>> CPU Frequency scaling driver: intel_pstate
>> HWP (HardWare Pstate) control: Disabled
>> CPU frequency scaling governor: Performance
>> Idle states: 4: name : description:
>> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
>> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
>> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
>> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
>
> What are target residencies and exit latencies?
Of course. Here:
/sys/devices/system/cpu/cpu1/cpuidle/state0/residency:0
/sys/devices/system/cpu/cpu1/cpuidle/state1/residency:1
/sys/devices/system/cpu/cpu1/cpuidle/state2/residency:360
/sys/devices/system/cpu/cpu1/cpuidle/state3/residency:3102
/sys/devices/system/cpu/cpu1/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu1/cpuidle/state1/latency:1
/sys/devices/system/cpu/cpu1/cpuidle/state2/latency:120
/sys/devices/system/cpu/cpu1/cpuidle/state3/latency:1034
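For anyone wanting to dump the same, something along these lines works:
grep . /sys/devices/system/cpu/cpu1/cpuidle/state*/residency /sys/devices/system/cpu/cpu1/cpuidle/state*/latency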
>> Ilde driver: intel_idle
>> Idle governor: as per individual test
>> Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
>> Legend:
>> teo: unmodified 6.10-rc2
>> menu:
>> ladder:
>> cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
>> clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
>> System is extremely idle, other than the test work.
>
> If you don't mind spinning up another one, I'd be very curious about
> results from just the Util-awareness revert (i.e. v2 1/3).
> If not I'll try to reproduce your tests.
I will, but not today.
I have never been a fan of Util-awareness.
>> Test 1: 2 core ping pong sweep:
>>
>> Pass a token between 2 CPUs on 2 different cores.
>> Do a variable amount of work at each stop.
>
> Hard to interpret the results here, as state residencies would be the
> most useful one, but from the results I assume that residencies are
> calculated over all possible CPUs, so 4/6 CPUs are pretty much idle
> the entire time, resulting in >75% state3 residency overall.
It would be 10 of 12 CPUs idle, and 4 of 6 cores.
But fair enough, the residency stats are being dominated by the idle CPUs.
I usually look at the usage in conjunction with the residency percentages.
At 10 minutes (20 second sample period):
teo entered idle state 3 a total of 517 times; clv2 entered it 1,979,541 times.
At 20 minutes:
teo had entered idle state 3 a total of 525 times; clv2 3,011,263 times.
Anyway, I could hack something to just use data from the 2 CPUs involved.
>> Purpose: To utilize the shallowest idle states
>> and observe the transition from using more of 1
>> idle state to another.
>>
>> Results relative to teo (negative is better):
>> menu ladder clv2 cl
>> average -2.09% 11.11% 2.88% 1.81%
>> max 10.63% 33.83% 9.45% 10.13%
>> min -11.58% 6.25% -3.61% -3.34%
>>
>> While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
>>
>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/
>>
>> Test 2: 6 core ping pong sweep:
>>
>> Pass a token between 6 CPUs on 6 different cores.
>> Do a variable amount of work at each stop.
>>
>
> My first guess would've been that this is the perfect workload for the
> very low utilization threshold, but even teo has >40% state3 residency
> consistently here.
There are still 6 idle CPUs.
I'll try a 12-CPU sweep test (using each core twice),
but I think I settled on 6 because it focused on what I wanted for results.
>> Purpose: To utilize the midrange idle states
>> and observe the transitions between use of
>> idle states.
>>
>> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
>> transitioning between much less power and slower performance and much more power and higher performance.
>> On either side of this area, the differences between all idle governors are negligible.
>> Only data from before this area (from results 1 t0 95) was included in the below results.
>
> I see and agree with your interpretation. Difference in power between
> all tested seems to be negligible during that window. Interestingly
> the residencies of idle states seem to be very different, like ladder
> being mostly in deepest state3. Maybe total package power is too coarse
> to show the differences for this test.
>
>> Results relative to teo (negative is better):
>> menu ladder cl clv2
>> average 0.16% 4.32% 2.54% 2.64%
>> max 0.92% 14.32% 8.78% 8.50%
>> min -0.44% 0.27% 0.09% 0.05%
>>
>> One large clv2 difference seems to be excessive use of the deepest idle state,
>> with corresponding 100% hit rate on the "Idle State 3 was to deep" metric.
>> Example (20 second sample time):
>>
>> teo: Idle state 3 entries: 600, 74.33% were to deep or 451. Processor power was 38.0 watts.
>> clv2: Idle state 3 entries: 4,375,243, 100.00% were to deep or 4,375,243. Processor power was 40.6 watts.
>> clv2 loop times were about 8% worse than teo.
>
> Some of the idle state 3 residencies seem to be >100% at the end here,
> not sure what's up with that.
The test is over and the system is completely idle.
And yes, there are 4 calculations that come out > 100%, the worst being 100.71%,
with a total sum over all idle states of 100.79%.
I can look into it if you want, but I have never expected the numbers to be that accurate.
>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/
>>
>> Test 3: sleeping ebizzy - 128 threads.
>>
>> Purpose: This test has given interesting results in the past.
>> The test varies the sleep interval between record lookups.
>> The result is varying usage of idle states.
>>
>> Results: relative to teo (negative is better):
>> menu clv2 ladder cl
>> average 0.06% 0.38% 0.81% 0.35%
>> max 2.53% 3.20% 5.00% 2.87%
>> min -2.13% -1.66% -3.30% -2.13%
>>
>> No strong conclusion here, from just the data.
>> However, clv2 seems to use a bit more processor power, on average.
>
> Not sure about that, from the residencies ladder and teo should be
> decisive losers in terms of power. While later in the test teo seems
> to be getting worse in power it doesn't quite reflect the difference
> in states.
> E.g. clv2 finishing with 65% state2 residency while teo has 40%, but
> I'll try to get per-CPU power measurements on this one.
> Interestingly ladder is a clear winner if anything, if that is reliable
> as a result that could indicate a too aggressive tick stop from the
> other governors, but cl isn't that much better than clv2 here, even
> though it stops the tick less aggressively.
I agree with what you are saying.
It is a shorter test at only 25 minutes.
It might be worth trying the test again with more strict attention to
stabilizing the system thermally before each test.
The processor power will vary by a few watts for the exact same load
as a function of processor package temperature and coolant (my system is
water cooled) temperature and can take 20 to 30 minutes to settle.
Reference:
http://smythies.com/~doug/linux/idle/teo-util3/temperature/thermal-stabalization-time.png
>>
>> Further details:
>
> Link is missing, but I found
> http://smythies.com/~doug/linux/idle/teo-util3/ebizzy/
> from browsing your page.
Yes, I accidentally hit "Send" on my original email before it was actually finished.
But, then I was tired and thought "close enough".
>> Test4: adrestia wakeup latency tests. 500 threads.
>>
>> Purpose: The test was reported in 2023.09 by the kernel test robot and looked
>> both interesting and gave interesting results, so I added it to the tests I run.
>
> http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
> So interestingly we can see, what I would call, the misbehavior of teo
> here, with teo skipping state2 and state3 entirely. You would expect
> a power regression here, but it doesn't translate into package power
> anyway.
>
>>
>> Results:
>> teo:wakeup cost (periodic, 20us): 3130nSec reference
>> clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
>> cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
>> menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
>> ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%
>
> Is this measured as wakeup latency?
> I can't find much info about the test setup here, do you mind sharing
> something on it?
I admit to being vague on this one, and I'll need some time to review.
The notes I left for myself last September are here:
http://smythies.com/~doug/linux/idle/teo-util2/adrestia/README.txt
>> No strong conclusion here, from just the data.
>> However, clv2 seems to use a bit more processor power, on average.
>> teo: 69.72 watts
>> clv2: 72.91 watts +4.6%
>> Note 1: The first 5 minutes of the test powers were discarded to allow for thermal stabilization.
which might not have been long enough, see the thermal notes above.
>> Note 2: More work is required for this test, because the teo one actually took longer to execute, due to more outlier results than the other tests.
>> There were several other tests run but are not written up herein.
>>
> Because results are on par for all? Or inconclusive / not reproducible?
Yes, because nothing of significance was observed or the test was more or less a repeat of an already covered test.
Initially, I had a mistake in my baseline teo tests, and a couple of the not written up tests have still not been re-run with the
proper baseline.
> Some final words:
> I was hoping to get rid of Util-awareness with the fixed intercept logic
> and my test showed that this isn't unreasonable.
> Here we do see a case where there is some regression vs Util-awareness.
> The intercept logic is currently decaying quite aggressively, maybe
> that could be tuned to improve teo behavior.
>
> Kind Regards,
> Christian
... Doug
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness"
2024-06-11 11:24 ` [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness" Christian Loehle
2024-06-11 13:37 ` Vincent Guittot
@ 2024-06-19 17:49 ` Qais Yousef
2024-06-21 13:22 ` [PATCHv3 " Christian Loehle
2 siblings, 0 replies; 16+ messages in thread
From: Qais Yousef @ 2024-06-19 17:49 UTC (permalink / raw)
To: Christian Loehle
Cc: linux-pm, linux-kernel, rafael, vincent.guittot, peterz,
daniel.lezcano, ulf.hansson, anna-maria, kajetan.puchalski,
lukasz.luba, dietmar.eggemann
On 06/11/24 12:24, Christian Loehle wrote:
> This reverts commit 9ce0f7c4bc64d820b02a1c53f7e8dba9539f942b.
>
> Util-awareness was reported to be too aggressive in selecting shallower
> states. Additionally a single threshold was found to not be suitable
> for reasoning about sleep length as, for all practical purposes,
> almost arbitrary sleep lengths are still possible for any load value.
>
> Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness")
> Reported-by: Qais Yousef <qyousef@layalina.io>
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Thanks!
--
Qais Yousef
> drivers/cpuidle/governors/teo.c | 100 --------------------------------
> 1 file changed, 100 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index 7244f71c59c5..d8554c20cf10 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -104,56 +104,16 @@
> * select the given idle state instead of the candidate one.
> *
> * 3. By default, select the candidate state.
> - *
> - * Util-awareness mechanism:
> - *
> - * The idea behind the util-awareness extension is that there are two distinct
> - * scenarios for the CPU which should result in two different approaches to idle
> - * state selection - utilized and not utilized.
> - *
> - * In this case, 'utilized' means that the average runqueue util of the CPU is
> - * above a certain threshold.
> - *
> - * When the CPU is utilized while going into idle, more likely than not it will
> - * be woken up to do more work soon and so a shallower idle state should be
> - * selected to minimise latency and maximise performance. When the CPU is not
> - * being utilized, the usual metrics-based approach to selecting the deepest
> - * available idle state should be preferred to take advantage of the power
> - * saving.
> - *
> - * In order to achieve this, the governor uses a utilization threshold.
> - * The threshold is computed per-CPU as a percentage of the CPU's capacity
> - * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> - * seems to be getting the best results.
> - *
> - * Before selecting the next idle state, the governor compares the current CPU
> - * util to the precomputed util threshold. If it's below, it defaults to the
> - * TEO metrics mechanism. If it's above, the closest shallower idle state will
> - * be selected instead, as long as is not a polling state.
> */
>
> #include <linux/cpuidle.h>
> #include <linux/jiffies.h>
> #include <linux/kernel.h>
> -#include <linux/sched.h>
> #include <linux/sched/clock.h>
> -#include <linux/sched/topology.h>
> #include <linux/tick.h>
>
> #include "gov.h"
>
> -/*
> - * The number of bits to shift the CPU's capacity by in order to determine
> - * the utilized threshold.
> - *
> - * 6 was chosen based on testing as the number that achieved the best balance
> - * of power and performance on average.
> - *
> - * The resulting threshold is high enough to not be triggered by background
> - * noise and low enough to react quickly when activity starts to ramp up.
> - */
> -#define UTIL_THRESHOLD_SHIFT 6
> -
> /*
> * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> * is used for decreasing metrics on a regular basis.
> @@ -188,7 +148,6 @@ struct teo_bin {
> * @next_recent_idx: Index of the next @recent_idx entry to update.
> * @recent_idx: Indices of bins corresponding to recent "intercepts".
> * @tick_hits: Number of "hits" after TICK_NSEC.
> - * @util_threshold: Threshold above which the CPU is considered utilized
> */
> struct teo_cpu {
> s64 time_span_ns;
> @@ -198,28 +157,10 @@ struct teo_cpu {
> int next_recent_idx;
> int recent_idx[NR_RECENT];
> unsigned int tick_hits;
> - unsigned long util_threshold;
> };
>
> static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> -/**
> - * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
> - * @cpu: Target CPU
> - * @cpu_data: Governor CPU data for the target CPU
> - */
> -#ifdef CONFIG_SMP
> -static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> -{
> - return sched_cpu_util(cpu) > cpu_data->util_threshold;
> -}
> -#else
> -static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
> -{
> - return false;
> -}
> -#endif
> -
> /**
> * teo_update - Update CPU metrics after wakeup.
> * @drv: cpuidle driver containing state data.
> @@ -386,7 +327,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> int constraint_idx = 0;
> int idx0 = 0, idx = -1;
> bool alt_intercepts, alt_recent;
> - bool cpu_utilized;
> s64 duration_ns;
> int i;
>
> @@ -411,32 +351,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> if (!dev->states_usage[0].disable)
> idx = 0;
>
> - cpu_utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> - /*
> - * If the CPU is being utilized over the threshold and there are only 2
> - * states to choose from, the metrics need not be considered, so choose
> - * the shallowest non-polling state and exit.
> - */
> - if (drv->state_count < 3 && cpu_utilized) {
> - /*
> - * If state 0 is enabled and it is not a polling one, select it
> - * right away unless the scheduler tick has been stopped, in
> - * which case care needs to be taken to leave the CPU in a deep
> - * enough state in case it is not woken up any time soon after
> - * all. If state 1 is disabled, though, state 0 must be used
> - * anyway.
> - */
> - if ((!idx && !(drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
> - teo_state_ok(0, drv)) || dev->states_usage[1].disable) {
> - idx = 0;
> - goto out_tick;
> - }
> - /* Assume that state 1 is not a polling one and use it. */
> - idx = 1;
> - duration_ns = drv->states[1].target_residency_ns;
> - goto end;
> - }
> -
> /* Compute the sums of metrics for early wakeup pattern detection. */
> for (i = 1; i < drv->state_count; i++) {
> struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
> @@ -560,18 +474,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> if (idx > constraint_idx)
> idx = constraint_idx;
>
> - /*
> - * If the CPU is being utilized over the threshold, choose a shallower
> - * non-polling state to improve latency, unless the scheduler tick has
> - * been stopped already and the shallower state's target residency is
> - * not sufficiently large.
> - */
> - if (cpu_utilized) {
> - i = teo_find_shallower_state(drv, dev, idx, KTIME_MAX, true);
> - if (teo_state_ok(i, drv))
> - idx = i;
> - }
> -
> /*
> * Skip the timers check if state 0 is the current candidate one,
> * because an immediate non-timer wakeup is expected in that case.
> @@ -667,11 +569,9 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> struct cpuidle_device *dev)
> {
> struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> - unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> int i;
>
> memset(cpu_data, 0, sizeof(*cpu_data));
> - cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
> for (i = 0; i < NR_RECENT; i++)
> cpu_data->recent_idx[i] = -1;
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic
2024-06-18 17:24 ` Doug Smythies
@ 2024-06-20 11:19 ` Christian Loehle
2024-06-25 16:35 ` Doug Smythies
0 siblings, 1 reply; 16+ messages in thread
From: Christian Loehle @ 2024-06-20 11:19 UTC (permalink / raw)
To: Doug Smythies
Cc: rafael, vincent.guittot, qyousef, peterz, daniel.lezcano,
ulf.hansson, anna-maria, kajetan.puchalski, lukasz.luba,
dietmar.eggemann, linux-pm, linux-kernel
On Tue, Jun 18, 2024 at 10:24:46AM -0700, Doug Smythies wrote:
> Hi Christian,
>
> Thank you for your reply.
Thank you for taking the time!
>
> On 2024.06.18 03:54 Christian Loehle wrote:
> > On Sun, Jun 16, 2024 at 05:20:43PM -0700, Doug Smythies wrote:
> >> On 2024.06.11 04:24 Christian Loehle wrote:
> >>
> >> ...
> >> > Happy for anyone to take a look and test as well.
> >> ...
> >>
> >> I tested the patch set.
> >> I do a set of tests adopted over some years now.
> >> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
> >> One interesting observation is that everything seems to run much slower than the last time I did this, last August, Kernel 6.5-rc4.
> >>
> >
> > Thank you very much Doug, that is helpful indeed!
> >
> >> Test system:
> >> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 thread per core, 12 CPUs)
> >> CPU Frequency scaling driver: intel_pstate
> >> HWP (HardWare Pstate) control: Disabled
> >> CPU frequency scaling governor: Performance
> >> Idle states: 4: name : description:
> >> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
> >> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
> >> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
> >> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
> >
> > What are target residencies and exit latencies?
>
> Of course. Here:
>
> /sys/devices/system/cpu/cpu1/cpuidle/state0/residency:0
> /sys/devices/system/cpu/cpu1/cpuidle/state1/residency:1
> /sys/devices/system/cpu/cpu1/cpuidle/state2/residency:360
> /sys/devices/system/cpu/cpu1/cpuidle/state3/residency:3102
>
> /sys/devices/system/cpu/cpu1/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu1/cpuidle/state1/latency:1
> /sys/devices/system/cpu/cpu1/cpuidle/state2/latency:120
> /sys/devices/system/cpu/cpu1/cpuidle/state3/latency:1034
Thanks,
what am I missing here that these are two different sets of states?
>
> >> Idle driver: intel_idle
> >> Idle governor: as per individual test
> >> Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
> >> Legend:
> >> teo: unmodified 6.10-rc2
> >> menu:
> >> ladder:
> >> cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
> >> clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
> >> System is extremely idle, other than the test work.
> >
> > If you don't mind spinning up another one, I'd be very curious about
> > results from just the Util-awareness revert (i.e. v2 1/3).
> > If not I'll try to reproduce your tests.
>
> I will, but not today.
Thank you.
> I have never been a fan of Util-awareness.
Well if you want to elaborate on that I guess now is the time and
here is the place. ;)
>
> >> Test 1: 2 core ping pong sweep:
> >>
> >> Pass a token between 2 CPUs on 2 different cores.
> >> Do a variable amount of work at each stop.
> >
> > Hard to interpret the results here, as state residencies would be the
> > most useful one, but from the results I assume that residencies are
> > calculated over all possible CPUs, so 4/6 CPUs are pretty much idle
> > the entire time, resulting in >75% state3 residency overall.
>
> It would be 10 of 12 CPUs are idle and 4 of 6 cores.
Of course, my bad.
> But fair enough, the residency stats are being dominated by the idle CPUs.
> I usually look at the usage in conjunction with the residency percentages.
> At 10 minutes (20 second sample period):
> teo entered idle state 3 517 times ; clv2 was 1,979,541 times
> At 20 minutes:
> teo entered idle state 3 525 times ; clv2 was 3,011,263 times
> Anyway, I could hack something to just use data from the 2 CPUs involved.
Your method works, just a bit awkward, I guess I'm spoiled in that
regard :)
(Shameless plug:
https://tooling.sites.arm.com/lisa/latest/trace_analysis.html#lisa.analysis.idle.IdleAnalysis.plot_cpu_idle_state_residency
)
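If you do hack something up, even a minimal readout along these lines would give per-CPU numbers (sysfs layout as in your listing above; the CPU numbers and state count are made-up examples, not a claim about your setup):

#include <stdio.h>

static long long read_ll(const char *path)
{
	long long v = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lld", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}

int main(void)
{
	const int cpus[] = { 2, 4 };	/* the two CPUs under test (example) */
	char path[128];

	for (unsigned int c = 0; c < sizeof(cpus) / sizeof(cpus[0]); c++) {
		for (int s = 0; s < 4; s++) {	/* 4 idle states on this machine */
			long long t, u;

			snprintf(path, sizeof(path),
				 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/time",
				 cpus[c], s);
			t = read_ll(path);
			snprintf(path, sizeof(path),
				 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/usage",
				 cpus[c], s);
			u = read_ll(path);
			/* sample twice around a test window and diff to get residency */
			printf("cpu%d state%d: time=%lld us, usage=%lld\n",
			       cpus[c], s, t, u);
		}
	}
	return 0;
}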
>
> >> Purpose: To utilize the shallowest idle states
> >> and observe the transition from using more of 1
> >> idle state to another.
> >>
> >> Results relative to teo (negative is better):
> >> menu ladder clv2 cl
> >> average -2.09% 11.11% 2.88% 1.81%
> >> max 10.63% 33.83% 9.45% 10.13%
> >> min -11.58% 6.25% -3.61% -3.34%
> >>
> >> While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
> >>
> >> Further details:
> >> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
> >> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
> >> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/
> >>
> >> Test 2: 6 core ping pong sweep:
> >>
> >> Pass a token between 6 CPUs on 6 different cores.
> >> Do a variable amount of work at each stop.
> >>
> >
> > My first guess would've been that this is the perfect workload for the
> > very low utilization threshold, but even teo has >40% state3 residency
> > consistently here.
>
> There are still 6 idle CPUs.
> I'll try a 12 CPUs using each core twice type sweep test,
> but I think I settled on 6 because it focused on what I wanted for results.
I see, again, my bad.
>
> >> Purpose: To utilize the midrange idle states
> >> and observe the transitions between use of
> >> idle states.
> >>
> >> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
> >> transitioning between much less power and slower performance and much more power and higher performance.
> >> On either side of this area, the differences between all idle governors are negligible.
>> Only data from before this area (from results 1 to 95) was included in the below results.
> >
> > I see and agree with your interpretation. Difference in power between
> > all tested seems to be negligible during that window. Interestingly
> > the residencies of idle states seem to be very different, like ladder
> > being mostly in deepest state3. Maybe total package power is too coarse
> > to show the differences for this test.
> >
> >> Results relative to teo (negative is better):
> >> menu ladder cl clv2
> >> average 0.16% 4.32% 2.54% 2.64%
> >> max 0.92% 14.32% 8.78% 8.50%
> >> min -0.44% 0.27% 0.09% 0.05%
> >>
> >> One large clv2 difference seems to be excessive use of the deepest idle state,
> >> with corresponding 100% hit rate on the "Idle State 3 was too deep" metric.
> >> Example (20 second sample time):
> >>
> >> teo: Idle state 3 entries: 600, 74.33% were too deep, or 451. Processor power was 38.0 watts.
> >> clv2: Idle state 3 entries: 4,375,243, 100.00% were too deep, or 4,375,243. Processor power was 40.6 watts.
> >> clv2 loop times were about 8% worse than teo.
> >
> > Some of the idle state 3 residencies seem to be >100% at the end here,
> > not sure what's up with that.
>
> The test is over and the system is completely idle.
> And yes, there are 4 calculations that come out > 100%, the worst being 100.71%,
> with a total sum over all idle states of 100.79%.
> I can look into it if you want but have never expected the numbers to be that accurate.
Hopefully it's just some weird rounding thing, it just looks strange.
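If the percentages come from sampling the per-state cpuidle 'time' counters at the window boundaries (just a guess at the method, not a claim about your scripts), the counters only being credited when a CPU exits the state would already explain a small overshoot once the machine goes fully idle. A toy calculation, with the delta picked to land on your 100.71%:

#include <stdio.h>

int main(void)
{
	/* hypothetical 20 s sample window on a 12-CPU machine */
	const double window_us = 20e6;
	const int nr_cpus = 12;
	/*
	 * Summed delta of the per-CPU state3 'time' counters over that
	 * window. 'time' is only credited when a CPU leaves the state, so
	 * a long idle period ending just inside this window carries time
	 * that was really spent in the previous window.
	 */
	const double state3_delta_us = 241.7e6;

	/* prints 100.71% */
	printf("state3 residency: %.2f%%\n",
	       100.0 * state3_delta_us / (window_us * nr_cpus));
	return 0;
}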
>
> >> Further details:
> >> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
> >> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
> >> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
> >> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/
> >>
> >> Test 3: sleeping ebizzy - 128 threads.
> >>
> >> Purpose: This test has given interesting results in the past.
> >> The test varies the sleep interval between record lookups.
> >> The result is varying usage of idle states.
> >>
> >> Results: relative to teo (negative is better):
> >> menu clv2 ladder cl
> >> average 0.06% 0.38% 0.81% 0.35%
> >> max 2.53% 3.20% 5.00% 2.87%
> >> min -2.13% -1.66% -3.30% -2.13%
> >>
> >> No strong conclusion here, from just the data.
> >> However, clv2 seems to use a bit more processor power, on average.
> >
> > Not sure about that, from the residencies ladder and teo should be
> > decisive losers in terms of power. While later in the test teo seems
> > to be getting worse in power it doesn't quite reflect the difference
> > in states.
> > E.g. clv2 finishing with 65% state2 residency while teo has 40%, but
> > I'll try to get per-CPU power measurements on this one.
> > Interestingly ladder is a clear winner if anything, if that is reliable
> > as a result that could indicate a too aggressive tick stop from the
> > other governors, but cl isn't that much better than clv2 here, even
> > though it stops the tick less aggressively.
>
> I agree with what you are saying.
> It is a shorter test at only 25 minutes.
> It might be worth trying the test again with more strict attention to
> stabilizing the system thermally before each test.
> The processor power will vary by a few watts for the exact same load
> as a function of processor package temperature and coolant (my system is
> water cooled) temperature and can take 20 to 30 minutes to settle.
>
> Reference:
> http://smythies.com/~doug/linux/idle/teo-util3/temperature/thermal-stabalization-time.png
>
> >>
> >> Further details:
> >
> > Link is missing, but I found
> > http://smythies.com/~doug/linux/idle/teo-util3/ebizzy/
> > from browsing your page.
>
> Yes, I accidentally hit "Send" on my original email before it was actually finished.
> But, then I was tired and thought "close enough".
>
> >> Test4: adrestia wakeup latency tests. 500 threads.
> >>
> >> Purpose: The test was reported in 2023.09 by the kernel test robot and looked
> >> both interesting and gave interesting results, so I added it to the tests I run.
> >
> > http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
> > So interestingly we can see, what I would call, the misbehavior of teo
> > here, with teo skipping state2 and state3 entirely. You would expect
> > a power regression here, but it doesn't translate into package power
> > anyway.
> >
> >>
> >> Results:
> >> teo:wakeup cost (periodic, 20us): 3130nSec reference
> >> clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
> >> cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
> >> menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
> >> ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%
> >
> > Is this measured as wakeup latency?
> > I can't find much info about the test setup here, do you mind sharing
> > something on it?
>
> I admit to being vague on this one, and I'll need some time to review.
> The notes I left for myself last September are here:
> http://smythies.com/~doug/linux/idle/teo-util2/adrestia/README.txt
Thanks!
>
> >> No strong conclusion here, from just the data.
> >> However, clv2 seems to use a bit more processor power, on average.
> >> teo: 69.72 watts
> >> clv2: 72.91 watts +4.6%
> >> Note 1: The first 5 minutes of the test powers were discarded to allow for thermal stabilization.
>
> which might not have been long enough, see the thermal notes above.
>
> >> Note 2: More work is required for this test, because the teo one actually took longer to execute, due to more outlier results than the other tests.
>
> >> There were several other tests run but are not written up herein.
> >>
> > Because results are on par for all? Or inconclusive / not reproducible?
>
> Yes, because nothing of significance was observed or the test was more or less a repeat of an already covered test.
> Initially, I had a mistake in my baseline teo tests, and a couple of the not written up tests have still not been re-run with the
> proper baseline.
Thank you for testing, that's what I hoped.
Kind Regards,
Christian
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCHv3 1/3] Revert: "cpuidle: teo: Introduce util-awareness"
2024-06-11 11:24 ` [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness" Christian Loehle
2024-06-11 13:37 ` Vincent Guittot
2024-06-19 17:49 ` Qais Yousef
@ 2024-06-21 13:22 ` Christian Loehle
2 siblings, 0 replies; 16+ messages in thread
From: Christian Loehle @ 2024-06-21 13:22 UTC (permalink / raw)
To: linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba, dietmar.eggemann
This reverts commit 9ce0f7c4bc64d820b02a1c53f7e8dba9539f942b.
Util-awareness was reported to be too aggressive in selecting shallower
states. Additionally a single threshold was found to not be suitable
for reasoning about sleep length as, for all practical purposes,
almost arbitrary sleep lengths are still possible for any load value.
Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness")
Reported-by: Qais Yousef <qyousef@layalina.io>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
v3: Restore copyright header too and pick up tags
drivers/cpuidle/governors/teo.c | 105 --------------------------------
1 file changed, 105 deletions(-)
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 7244f71c59c5..b6a8ba6772e1 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,13 +2,8 @@
/*
* Timer events oriented CPU idle governor
*
- * TEO governor:
* Copyright (C) 2018 - 2021 Intel Corporation
* Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- *
- * Util-awareness mechanism:
- * Copyright (C) 2022 Arm Ltd.
- * Author: Kajetan Puchalski <kajetan.puchalski@arm.com>
*/
/**
@@ -104,56 +99,16 @@
* select the given idle state instead of the candidate one.
*
* 3. By default, select the candidate state.
- *
- * Util-awareness mechanism:
- *
- * The idea behind the util-awareness extension is that there are two distinct
- * scenarios for the CPU which should result in two different approaches to idle
- * state selection - utilized and not utilized.
- *
- * In this case, 'utilized' means that the average runqueue util of the CPU is
- * above a certain threshold.
- *
- * When the CPU is utilized while going into idle, more likely than not it will
- * be woken up to do more work soon and so a shallower idle state should be
- * selected to minimise latency and maximise performance. When the CPU is not
- * being utilized, the usual metrics-based approach to selecting the deepest
- * available idle state should be preferred to take advantage of the power
- * saving.
- *
- * In order to achieve this, the governor uses a utilization threshold.
- * The threshold is computed per-CPU as a percentage of the CPU's capacity
- * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
- * seems to be getting the best results.
- *
- * Before selecting the next idle state, the governor compares the current CPU
- * util to the precomputed util threshold. If it's below, it defaults to the
- * TEO metrics mechanism. If it's above, the closest shallower idle state will
- * be selected instead, as long as is not a polling state.
*/
#include <linux/cpuidle.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
-#include <linux/sched.h>
#include <linux/sched/clock.h>
-#include <linux/sched/topology.h>
#include <linux/tick.h>
#include "gov.h"
-/*
- * The number of bits to shift the CPU's capacity by in order to determine
- * the utilized threshold.
- *
- * 6 was chosen based on testing as the number that achieved the best balance
- * of power and performance on average.
- *
- * The resulting threshold is high enough to not be triggered by background
- * noise and low enough to react quickly when activity starts to ramp up.
- */
-#define UTIL_THRESHOLD_SHIFT 6
-
/*
* The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
* is used for decreasing metrics on a regular basis.
@@ -188,7 +143,6 @@ struct teo_bin {
* @next_recent_idx: Index of the next @recent_idx entry to update.
* @recent_idx: Indices of bins corresponding to recent "intercepts".
* @tick_hits: Number of "hits" after TICK_NSEC.
- * @util_threshold: Threshold above which the CPU is considered utilized
*/
struct teo_cpu {
s64 time_span_ns;
@@ -198,28 +152,10 @@ struct teo_cpu {
int next_recent_idx;
int recent_idx[NR_RECENT];
unsigned int tick_hits;
- unsigned long util_threshold;
};
static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
-/**
- * teo_cpu_is_utilized - Check if the CPU's util is above the threshold
- * @cpu: Target CPU
- * @cpu_data: Governor CPU data for the target CPU
- */
-#ifdef CONFIG_SMP
-static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
-{
- return sched_cpu_util(cpu) > cpu_data->util_threshold;
-}
-#else
-static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
-{
- return false;
-}
-#endif
-
/**
* teo_update - Update CPU metrics after wakeup.
* @drv: cpuidle driver containing state data.
@@ -386,7 +322,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
int constraint_idx = 0;
int idx0 = 0, idx = -1;
bool alt_intercepts, alt_recent;
- bool cpu_utilized;
s64 duration_ns;
int i;
@@ -411,32 +346,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (!dev->states_usage[0].disable)
idx = 0;
- cpu_utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
- /*
- * If the CPU is being utilized over the threshold and there are only 2
- * states to choose from, the metrics need not be considered, so choose
- * the shallowest non-polling state and exit.
- */
- if (drv->state_count < 3 && cpu_utilized) {
- /*
- * If state 0 is enabled and it is not a polling one, select it
- * right away unless the scheduler tick has been stopped, in
- * which case care needs to be taken to leave the CPU in a deep
- * enough state in case it is not woken up any time soon after
- * all. If state 1 is disabled, though, state 0 must be used
- * anyway.
- */
- if ((!idx && !(drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
- teo_state_ok(0, drv)) || dev->states_usage[1].disable) {
- idx = 0;
- goto out_tick;
- }
- /* Assume that state 1 is not a polling one and use it. */
- idx = 1;
- duration_ns = drv->states[1].target_residency_ns;
- goto end;
- }
-
/* Compute the sums of metrics for early wakeup pattern detection. */
for (i = 1; i < drv->state_count; i++) {
struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
@@ -560,18 +469,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (idx > constraint_idx)
idx = constraint_idx;
- /*
- * If the CPU is being utilized over the threshold, choose a shallower
- * non-polling state to improve latency, unless the scheduler tick has
- * been stopped already and the shallower state's target residency is
- * not sufficiently large.
- */
- if (cpu_utilized) {
- i = teo_find_shallower_state(drv, dev, idx, KTIME_MAX, true);
- if (teo_state_ok(i, drv))
- idx = i;
- }
-
/*
* Skip the timers check if state 0 is the current candidate one,
* because an immediate non-timer wakeup is expected in that case.
@@ -667,11 +564,9 @@ static int teo_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
- unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
int i;
memset(cpu_data, 0, sizeof(*cpu_data));
- cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
for (i = 0; i < NR_RECENT; i++)
cpu_data->recent_idx[i] = -1;
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* RE: [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic
2024-06-20 11:19 ` Christian Loehle
@ 2024-06-25 16:35 ` Doug Smythies
0 siblings, 0 replies; 16+ messages in thread
From: Doug Smythies @ 2024-06-25 16:35 UTC (permalink / raw)
To: 'Christian Loehle'
Cc: rafael, vincent.guittot, qyousef, peterz, daniel.lezcano,
ulf.hansson, anna-maria, kajetan.puchalski, lukasz.luba,
dietmar.eggemann, linux-pm, linux-kernel, Doug Smythies
Hi Christian,
It took a while.
On 2024.06.20 04:19 Christian Loehle wrote:
> On Tue, Jun 18, 2024 at 10:24:46AM -0700, Doug Smythies wrote:
>> Hi Christian,
>>
>> Thank you for your reply.
>
> Thank you for taking the time!
>
>>
>> On 2024.06.18 03:54 Christian Loehle wrote:
>>> On Sun, Jun 16, 2024 at 05:20:43PM -0700, Doug Smythies wrote:
>>>> On 2024.06.11 04:24 Christian Loehle wrote:
>>>>
>>>> ...
>>>> > Happy for anyone to take a look and test as well.
>>>> ...
>>>>
>>>> I tested the patch set.
>>>> I do a set of tests adopted over some years now.
>>>> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
>>>> One interesting observation is that everything seems to run much slower than the last time I did this, last August, Kernel 6.5-rc4.
>>>>
>>>
>>> Thank you very much Doug, that is helpful indeed!
>>>
>>>> Test system:
>>>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 thread per core, 12 CPUs)
>>>> CPU Frequency scaling driver: intel_pstate
>>>> HWP (HardWare Pstate) control: Disabled
>>>> CPU frequency scaling governor: Performance
>>>> Idle states: 4: name : description:
>>>> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
>>>> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
>>>> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
>>>> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
>>>
>>> What are target residencies and exit latencies?
>>
>> Of course. Here:
>>
>> /sys/devices/system/cpu/cpu1/cpuidle/state0/residency:0
>> /sys/devices/system/cpu/cpu1/cpuidle/state1/residency:1
>> /sys/devices/system/cpu/cpu1/cpuidle/state2/residency:360
>> /sys/devices/system/cpu/cpu1/cpuidle/state3/residency:3102
>>
>> /sys/devices/system/cpu/cpu1/cpuidle/state0/latency:0
>> /sys/devices/system/cpu/cpu1/cpuidle/state1/latency:1
>> /sys/devices/system/cpu/cpu1/cpuidle/state2/latency:120
>> /sys/devices/system/cpu/cpu1/cpuidle/state3/latency:1034
>
> Thanks,
> what am I missing here that these are two different sets of states?
I don't know what you are missing. Those are not two different sets of states.
Maybe I am missing something?
>>>> Idle driver: intel_idle
>>>> Idle governor: as per individual test
>>>> Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
>>>> Legend:
>>>> teo: unmodified 6.10-rc2
>>>> menu:
>>>> ladder:
>>>> cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
>>>> clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
no-util: Kernel 6.10-rc2 + Christian Loehle [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness"
>>>> System is extremely idle, other than the test work.
>>>
>>> If you don't mind spinning up another one, I'd be very curious about
>>> results from just the Util-awareness revert (i.e. v2 1/3).
>>> If not I'll try to reproduce your tests.
>>
>> I will, but not today.
Most, if not all, links have been replaced adding "no-util" data.
Summary: there is negligible difference between "teo" and "no-util".
Isn't that what is expected for a system with 4 idle states?
Note 1: I forgot to change the date on several of the graphs.
> Thank you.
>
>> I have never been a fan of Util-awareness.
>
> Well if you want to elaborate on that I guess now is the time and
> here is the place. ;)
Most of my concerns with the original versions were fixed,
which is why it now has little to no effect on a system with 4 idle states.
Beyond that, I haven't had the time to review all of my old tests and findings.
>>>> Test 1: 2 core ping pong sweep:
>>>>
>>>> Pass a token between 2 CPUs on 2 different cores.
>>>> Do a variable amount of work at each stop.
>>>
>>> Hard to interpret the results here, as state residencies would be the
>>> most useful one, but from the results I assume that residencies are
>>> calculated over all possible CPUs, so 4/6 CPUs are pretty much idle
>>> the entire time, resulting in >75% state3 residency overall.
>>
>> It would be 10 of 12 CPUs are idle and 4 of 6 cores.
>
> Of course, my bad.
>
>> But fair enough, the residency stats are being dominated by the idle CPUs.
>> I usually look at the usage in conjunction with the residency percentages.
>> At 10 minutes (20 second sample period):
>> teo entered idle state 3 517 times ; clv2 was 1,979,541 times
>> At 20 minutes:
>> teo entered idle state 3 525 times ; clv2 was 3,011,263 times
>> Anyway, I could hack something to just use data from the 2 CPUs involved.
>
> Your method works, just a bit awkward, I guess I'm spoiled in that
> regard :)
> (Shameless plug:
> https://tooling.sites.arm.com/lisa/latest/trace_analysis.html#lisa.analysis.idle.IdleAnalysis.plot_cpu_idle_state_residency
> )
Very interesting. If I ever get more time, I'll try it.
>>>> Purpose: To utilize the shallowest idle states
>>>> and observe the transition from using more of 1
>>>> idle state to another.
>>>>
>>>> Results relative to teo (negative is better):
        menu      ladder    clv2      cl        no-util
ave     -2.09%    11.11%    2.88%     1.81%     0.32%
max     10.63%    33.83%    9.45%     10.13%    8.00%
min     -11.58%   6.25%     -3.61%    -3.34%    -1.06%
Note 1: Old data re-stated with all the ">>>" stuff removed.
Note 2: The max +8.00% for no-util is misleading, as it was just a slight difference in a transition point.
>>>> While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
>>>>
>>>> Further details:
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/
>>>>
>>>> Test 2: 6 core ping pong sweep:
>>>>
>>>> Pass a token between 6 CPUs on 6 different cores.
>>>> Do a variable amount of work at each stop.
>>>>
>>>
>>> My first guess would've been that this is the perfect workload for the
>>> very low utilization threshold, but even teo has >40% state3 residency
>>> consistently here.
>>
>> There are still 6 idle CPUs.
>> I'll try a 12 CPUs using each core twice type sweep test,
>> but I think I settled on 6 because it focused on what I wanted for results.
>
> I see, again, my bad.
I had a 12 CPU type test script already and have used it in the past. Anyway:
Results relative to teo (negative is better):
        no-util   menu      clv2
ave     0.07%     0.77%     1.41%
max     0.85%     2.78%     11.45%
min     -1.30%    -0.62%    0.00%
Note 1: only test runs 1 to 120 are included, eliminating the bi-stable uncertainty region
of the higher test runs.
Note 2: This test does show differences between teo and no-util in idle state usage in
the bi-stable region. I do not know if it is repeatable.
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/12-cpu-pp-data.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/12-cpu-pp-data-detail-a.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/12-cpu-pp-relative.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/perf/
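For anyone wanting to reproduce something similar, a rough sketch of the token-passing idea behind these ping-pong tests (illustrative only, not my actual test program; CPU numbers are placeholders, pick two on different cores):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PASSES	100000

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

static void do_work(volatile unsigned long loops)
{
	while (loops--)
		;	/* the "variable amount of work at each stop" */
}

int main(int argc, char **argv)
{
	unsigned long loops = argc > 1 ? strtoul(argv[1], NULL, 0) : 1000;
	int ab[2], ba[2];
	char token = 't';
	struct timespec t0, t1;

	pipe(ab);
	pipe(ba);

	if (fork() == 0) {		/* child pinned to one core (placeholder CPU 2) */
		pin_to_cpu(2);
		for (int i = 0; i < PASSES; i++) {
			read(ab[0], &token, 1);
			do_work(loops);
			write(ba[1], &token, 1);
		}
		return 0;
	}

	pin_to_cpu(4);			/* parent pinned to another core (placeholder CPU 4) */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < PASSES; i++) {
		write(ab[1], &token, 1);
		read(ba[0], &token, 1);
		do_work(loops);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("avg loop time: %.0f ns\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 + t1.tv_nsec - t0.tv_nsec) / PASSES);
	return 0;
}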
>>>> Purpose: To utilize the midrange idle states
>>>> and observe the transitions between use of
>>>> idle states.
>>>>
>>>> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
>>>> transitioning between much less power and slower performance and much more power and higher performance.
>>>> On either side of this area, the differences between all idle governors are negligible.
>>>> Only data from before this area (from results 1 to 95) was included in the below results.
>>>
>>> I see and agree with your interpretation. Difference in power between
>>> all tested seems to be negligible during that window. Interestingly
>>> the residencies of idle states seem to be very different, like ladder
>>> being mostly in deepest state3. Maybe total package power is too coarse
>>> to show the differences for this test.
>>>
>>>> Results relative to teo (negative is better):
        menu      ladder    cl        clv2      no-util
ave     0.16%     4.32%     2.54%     2.64%     0.25%
max     0.92%     14.32%    8.78%     8.50%     14.96%
min     -0.44%    0.27%     0.09%     0.05%     -0.54%
Note 1: Old data re-stated with all the ">>>" stuff removed.
Note 2: The max 14.96% for no-util was during the test start.
It is not always repeatable. See the dwell test results way further down below.
>>>> One large clv2 difference seems to be excessive use of the deepest idle state,
>>>> with corresponding 100% hit rate on the "Idle State 3 was too deep" metric.
>>>> Example (20 second sample time):
>>>>
>>>> teo: Idle state 3 entries: 600, 74.33% were too deep, or 451. Processor power was 38.0 watts.
>>>> clv2: Idle state 3 entries: 4,375,243, 100.00% were too deep, or 4,375,243. Processor power was 40.6 watts.
>>>> clv2 loop times were about 8% worse than teo.
>>>
>>> Some of the idle state 3 residencies seem to be >100% at the end here,
>>> not sure what's up with that.
>>
>> The test is over and the system is completely idle.
>> And yes, there are 4 calculations that come out > 100%, the worst being 100.71%,
>> with a total sum over all idle states of 100.79%.
>> I can look into it if you want but have never expected the numbers to be that accurate.
>
> Hopefully it's just some weird rounding thing, it just looks strange.
>
>>
>>>> Further details:
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/
>>>>
>>>> Test 3: sleeping ebizzy - 128 threads.
>>>>
>>>> Purpose: This test has given interesting results in the past.
>>>> The test varies the sleep interval between record lookups.
>>>> The result is varying usage of idle states.
>>>>
>>>> Results: relative to teo (negative is better):
        menu      clv2      ladder    cl        no-util
ave     0.06%     0.38%     0.81%     0.35%     -0.03%
max     2.53%     3.20%     5.00%     2.87%     0.79%
min     -2.13%    -1.66%    -3.30%    -2.13%    -1.19%
Note 1: Old data re-stated with all the ">>>" stuff removed.
>>>> No strong conclusion here, from just the data.
>>>> However, clv2 seems to use a bit more processor power, on average.
>>>
>>> Not sure about that, from the residencies ladder and teo should be
>>> decisive losers in terms of power. While later in the test teo seems
>>> to be getting worse in power it doesn't quite reflect the difference
>>> in states.
>>> E.g. clv2 finishing with 65% state2 residency while teo has 40%, but
>>> I'll try to get per-CPU power measurements on this one.
>>> Interestingly ladder is a clear winner if anything, if that is reliable
>>> as a result that could indicate a too aggressive tick stop from the
>>> other governors, but cl isn't that much better than clv2 here, even
>>> though it stops the tick less aggressively.
>>
>> I agree with what you are saying.
>> It is a shorter test at only 25 minutes.
>> It might be worth trying the test again with more strict attention to
>> stabilizing the system thermally before each test.
>> The processor power will vary by a few watts for the exact same load
>> as a function of processor package temperature and coolant (my system is
>> water cooled) temperature and can take 20 to 30 minutes to settle.
>>
>> Reference:
>> http://smythies.com/~doug/linux/idle/teo-util3/temperature/thermal-stabalization-time.png
>>
>>>>
>>>> Further details:
>>>
>>> Link is missing, but I found
>>> http://smythies.com/~doug/linux/idle/teo-util3/ebizzy/
>>> from browsing your page.
>>
>> Yes, I accidentally hit "Send" on my original email before it was actually finished.
>> But, then I was tired and thought "close enough".
>>
>>>> Test4: adrestia wakeup latency tests. 500 threads.
>>>>
>>>> Purpose: The test was reported in 2023.09 by the kernel test robot and both
>>>> looked interesting and gave interesting results, so I added it to the tests I run.
>>>
>>> http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
>>> So interestingly we can see, what I would call, the misbehavior of teo
>>> here, with teo skipping state2 and state3 entirely. You would expect
>>> a power regression here, but it doesn't translate into package power
>>> anyway.
>>>
>>>>
>>>> Results:
teo:wakeup cost (periodic, 20us): 3130nSec reference
clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%
no-util: wakeup cost (periodic, 20us): 3062nSec -2.17%
The really informative graph is this one:
http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/histogram-detail-a.png
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/histogram-detail-b.png
http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
>>>
>>> Is this measured as wakeup latency?
>>> I can't find much info about the test setup here, do you mind sharing
>>> something on it?
>>
>> I admit to being vague on this one, and I'll need some time to review.
>> The notes I left for myself last September are here:
>> http://smythies.com/~doug/linux/idle/teo-util2/adrestia/README.txt
Those notes have been updated but are still not very good.
There is a bunch of system overhead in the "wakeup cost".
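Conceptually the measurement boils down to something like the sketch below (this is not the adrestia source, just an illustration of a periodic wakeup-cost loop; the period and iteration count are arbitrary). The reported figure includes timer, idle-exit and scheduling overhead, which is presumably the system overhead mentioned above.

#include <stdio.h>
#include <time.h>

#define PERIOD_NS	20000		/* 20 us */
#define ITERATIONS	100000

static long long ts_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	const struct timespec req = { .tv_sec = 0, .tv_nsec = PERIOD_NS };
	long long total_overshoot = 0;

	for (int i = 0; i < ITERATIONS; i++) {
		long long before = ts_ns();

		/* sleep for the period, then record how late we actually ran */
		clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
		total_overshoot += ts_ns() - before - PERIOD_NS;
	}
	printf("average wakeup cost: %lld ns\n", total_overshoot / ITERATIONS);
	return 0;
}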
>
> Thanks!
>
>>
>>>> No strong conclusion here, from just the data.
>>>> However, clv2 seems to use a bit more processor power, on average.
>>>> teo: 69.72 watts
>>>> clv2: 72.91 watts +4.6%
>>>> Note 1: The first 5 minutes of the test powers were discarded to allow for thermal stabilization.
>>
>> which might not have been long enough, see the thermal notes above.
>>
>>>> Note 2: More work is required for this test, because the teo one actually took longer to execute, due to more outlier results than the other tests.
>>
>>>> There were several other tests run but are not written up herein.
>>>>
>>> Because results are on par for all? Or inconclusive / not reproducible?
>>
>> Yes, because nothing of significance was observed or the test was more or less a repeat of an already covered test.
>> Initially, I had a mistake in my baseline teo tests, and a couple of the not written up tests have still not been re-run with the
>> proper baseline.
>
> Thank you for testing, that's what I hoped.
>
> Kind Regards,
> Christian
Results from a 6 core ping pong dwell test:
Note: This is the same spot as the first data point from the above 6 core sweep test.
It is important to note that the no-util result was not about +15% as above.
averages:
teo: 11.91786092 reference.
clv2: 12.94927586 +8.65%
cl: 12.89657797 +8.22%
menu: 11.85430331 -0.54%
ladder: 13.93532619 +17.08%
no-util: 11.93479453 +0.14%
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/6-5000000-0/perf/
... Doug
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCHv2 2/3] cpuidle: teo: Remove recent intercepts metric
2024-06-11 11:24 ` [PATCHv2 2/3] cpuidle: teo: Remove recent intercepts metric Christian Loehle
@ 2024-06-26 10:44 ` Dietmar Eggemann
2024-06-26 12:41 ` Christian Loehle
0 siblings, 1 reply; 16+ messages in thread
From: Dietmar Eggemann @ 2024-06-26 10:44 UTC (permalink / raw)
To: Christian Loehle, linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba
On 11/06/2024 13:24, Christian Loehle wrote:
> The logic for recent intercepts didn't work: there is an underflow
> of the 'recent' value that can be observed during boot already, which
> teo usually doesn't recover from, making the entire logic pointless.
> Furthermore, the recent intercepts were also never reset, thus not
> actually being very 'recent'.
>
> Underflowed 'recent' values lead to teo always acting as if we were
> in a scenario where the expected sleep length based on timers is too
> high, and therefore unnecessarily selecting shallower states.
>
> Experiments show that the remaining 'intercept' logic is enough to
> quickly react to scenarios in which teo cannot rely on the timer
> expected sleep length.
>
> See also here:
> https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/
So the fixes proposed in teo_update() there:
(1) add '&& cpu_data->state_bins[cpu_data->recent_idx[i]].recent' to the
if condition to decrement 'cpu_data->state_bins[].recent' or
(2) Set 'cpu_data->state_bins[].recent' = 0 instead of decrement
didn't work?
> Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment")
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
> drivers/cpuidle/governors/teo.c | 73 ++++++---------------------------
> 1 file changed, 12 insertions(+), 61 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index d8554c20cf10..cc7df59f488d 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -59,10 +59,6 @@
> * shallower than the one whose bin is fallen into by the sleep length (these
> * situations are referred to as "intercepts" below).
> *
> - * In addition to the metrics described above, the governor counts recent
> - * intercepts (that is, intercepts that have occurred during the last
> - * %NR_RECENT invocations of it for the given CPU) for each bin.
> - *
> * In order to select an idle state for a CPU, the governor takes the following
> * steps (modulo the possible latency constraint that must be taken into account
> * too):
> @@ -81,20 +77,15 @@
66 * 1. Find the deepest CPU idle state whose target residency does not exceed
67 * the current sleep length (the candidate idle state) and compute 3 sums as
s/3/2 ?
[...]
> @@ -320,8 +288,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> unsigned int tick_intercept_sum = 0;
> unsigned int idx_intercept_sum = 0;
> unsigned int intercept_sum = 0;
> - unsigned int idx_recent_sum = 0;
> - unsigned int recent_sum = 0;
> unsigned int idx_hit_sum = 0;
> unsigned int hit_sum = 0;
> int constraint_idx = 0;
Looks like 'bool alt_intercepts, alt_recent' have to go here already and
not in 3/3.
[...]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCHv2 2/3] cpuidle: teo: Remove recent intercepts metric
2024-06-26 10:44 ` Dietmar Eggemann
@ 2024-06-26 12:41 ` Christian Loehle
0 siblings, 0 replies; 16+ messages in thread
From: Christian Loehle @ 2024-06-26 12:41 UTC (permalink / raw)
To: Dietmar Eggemann, linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba
On 6/26/24 11:44, Dietmar Eggemann wrote:
> On 11/06/2024 13:24, Christian Loehle wrote:
>> The logic for recent intercepts didn't work: there is an underflow
>> of the 'recent' value that can be observed during boot already, which
>> teo usually doesn't recover from, making the entire logic pointless.
>> Furthermore, the recent intercepts were also never reset, thus not
>> actually being very 'recent'.
>>
>> Underflowed 'recent' values lead to teo always acting as if we were
>> in a scenario where the expected sleep length based on timers is too
>> high, and therefore unnecessarily selecting shallower states.
>>
>> Experiments show that the remaining 'intercept' logic is enough to
>> quickly react to scenarios in which teo cannot rely on the timer
>> expected sleep length.
>>
>> See also here:
>> https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/
>
> So the fixes proposed in teo_update() there:
>
> (1) add '&& cpu_data->state_bins[cpu_data->recent_idx[i]].recent' to the
> if condition to decrement 'cpu_data->state_bins[].recent' or
That fixes the underflow for sure, but doesn't make much sense either.
You still have a value called 'recent' that isn't actually reset regularly.
So an intercept-heavy workload can still affect the system
minutes/hours/days later. There is no reset nor EMA, so if anything, the
regular intercept counters would be the 'recent' one.
See also below.
>
> (2) Set 'cpu_data->state_bins[].recent' = 0 instead of decrement
That also 'works' in the sense that it doesn't end up containing
obviously wrong values. It is basically 1 bit of information about
recent intercepts; I didn't see much practical use for that, given that
the EMA of intercepts (see patch 3) is quite aggressive (i.e. 'recent').
Whether an additional intercept metric that is more 'recent' (and one
that is less 'recent') is ever needed I don't know and would defer to
Rafael here.
I don't see the need for one after this series is applied, though.
The two suggestions above are 'fixes'; this v2 2/3 is now more of a
revert, you could say. (I didn't tag it as such explicitly because the
commit did more than just add the 'recent' tracking, so it's not
literally a revert.)
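To make the failure mode concrete, here is a simplified, self-contained model of the ring/counter bookkeeping (illustrative only, not the actual teo.c code; sizes and names are made up):

#include <stdio.h>

#define NR_RECENT	9	/* ring size, made up for the example */
#define NR_BINS		4

struct bin {
	unsigned int recent;	/* unsigned: decrementing past 0 wraps around */
};

static struct bin bins[NR_BINS];
static int recent_idx[NR_RECENT];	/* which bin each ring slot charged, -1 = none */
static int next_recent;

static void record_wakeup(int intercept_bin)
{
	int slot = next_recent++;

	if (next_recent >= NR_RECENT)
		next_recent = 0;

	/* age out whatever this ring slot recorded NR_RECENT wakeups ago */
	if (recent_idx[slot] >= 0)
		bins[recent_idx[slot]].recent--;

	if (intercept_bin >= 0) {
		recent_idx[slot] = intercept_bin;
		bins[intercept_bin].recent++;
	} else {
		recent_idx[slot] = -1;
	}
}

int main(void)
{
	for (int i = 0; i < NR_RECENT; i++)
		recent_idx[i] = -1;	/* all ring slots start empty */

	/*
	 * Force the ring and the counters out of sync: a slot still points
	 * at bin 1 while the counter no longer reflects that charge. The
	 * next aging decrement then wraps the unsigned counter to ~4 billion
	 * and the "recent intercepts" sums dominate every selection, which
	 * is the kind of underflow discussed above. Fix (1) would guard the
	 * decrement with a non-zero check, fix (2) would zero the counters
	 * instead of decrementing; the patch removes the metric entirely.
	 */
	recent_idx[0] = 1;
	bins[1].recent = 0;
	record_wakeup(-1);
	printf("bins[1].recent = %u\n", bins[1].recent);	/* 4294967295 */
	return 0;
}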
>
> didn't work?
>
>> Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment")
>> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
>> ---
>> drivers/cpuidle/governors/teo.c | 73 ++++++---------------------------
>> 1 file changed, 12 insertions(+), 61 deletions(-)
>>
>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
>> index d8554c20cf10..cc7df59f488d 100644
>> --- a/drivers/cpuidle/governors/teo.c
>> +++ b/drivers/cpuidle/governors/teo.c
>> @@ -59,10 +59,6 @@
>> * shallower than the one whose bin is fallen into by the sleep length (these
>> * situations are referred to as "intercepts" below).
>> *
>> - * In addition to the metrics described above, the governor counts recent
>> - * intercepts (that is, intercepts that have occurred during the last
>> - * %NR_RECENT invocations of it for the given CPU) for each bin.
>> - *
>> * In order to select an idle state for a CPU, the governor takes the following
>> * steps (modulo the possible latency constraint that must be taken into account
>> * too):
>> @@ -81,20 +77,15 @@
>
> 66 * 1. Find the deepest CPU idle state whose target residency does not exceed
> 67 * the current sleep length (the candidate idle state) and compute 3 sums as
>
> s/3/2 ?
Thanks, good catch!
>
> [...]
>
>> @@ -320,8 +288,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> unsigned int tick_intercept_sum = 0;
>> unsigned int idx_intercept_sum = 0;
>> unsigned int intercept_sum = 0;
>> - unsigned int idx_recent_sum = 0;
>> - unsigned int recent_sum = 0;
>> unsigned int idx_hit_sum = 0;
>> unsigned int hit_sum = 0;
>> int constraint_idx = 0;
>
> Looks like 'bool alt_intercepts, alt_recent' have to go here already and
> not in 3/3.
Correct, will pick that up.
Thank you for reviewing!
>
> [...]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCHv2 3/3] cpuidle: teo: Don't count non-existent intercepts
2024-06-11 11:24 ` [PATCHv2 3/3] cpuidle: teo: Don't count non-existent intercepts Christian Loehle
@ 2024-06-26 13:09 ` Dietmar Eggemann
2024-06-28 9:33 ` Christian Loehle
0 siblings, 1 reply; 16+ messages in thread
From: Dietmar Eggemann @ 2024-06-26 13:09 UTC (permalink / raw)
To: Christian Loehle, linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba
On 11/06/2024 13:24, Christian Loehle wrote:
> When bailing out early, teo will not query the sleep length anymore
> since commit 6da8f9ba5a87 ("cpuidle: teo:
> Skip tick_nohz_get_sleep_length() call in some cases") with an
> expected sleep_length_ns value of KTIME_MAX.
> This led to state0 accumulating lots of 'intercepts', because the
> actually measured sleep length was < KTIME_MAX, and therefore to teo
> taking too long to recover from intercept-heavy scenarios; so count
> KTIME_MAX as a hit (we have to count them as something, otherwise we
> are stuck).
>
> Fundamentally we can only do one of the two:
> 1. Skip sleep_length_ns query when we think intercept is likely
Isn't this what we do right now? Skip tick_nohz_get_sleep_length() for
state0 and set cpu_data->sleep_length_ns to KTIME_MAX in teo_select()
and then count this as an 'intercept' in teo_update().
> 2. Have accurate data if sleep_length_ns is actually intercepted when
> we believe it is currently intercepted.
>
> This patch chooses the latter as I've found the additional time it
> takes to query the sleep length to be negligible and the variants of
> option 1 (count all unknowns as misses or count all unknowns as hits)
> had significant regressions (as misses had lots of too shallow idle
> state selections and as hits had terrible performance in
> intercept-heavy workloads).
IMHO, you do 2 things here:
(1) Set 'cpu_data->sleep_length_ns != KTIME_MAX' for '!idx &&
prev_intercept_idx' in teo_select().
(2) Force an update with 'cpu_data->sleep_length_ns == KTIME_MAX' to be
counted as a 'hit' rather an 'intercept' in teo_update().
Can't really see how this matches the explanatory text exactly.
>
> Fixes: 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
> drivers/cpuidle/governors/teo.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index cc7df59f488d..1e4b40474f49 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -231,8 +231,13 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> * length, this is a "hit", so update the "hits" metric for that bin.
> * Otherwise, update the "intercepts" metric for the bin fallen into by
> * the measured idle duration.
> + * If teo_select() bailed out early, we have to count this as a hit as
> + * we don't know what the true sleep length would've been. Otherwise
> + * we accumulate lots of intercepts at the shallower state (especially
> + * state0) even though there weren't any intercepts due to us
> + * anticipating one.
> */
> - if (idx_timer == idx_duration)
> + if (idx_timer == idx_duration || cpu_data->sleep_length_ns == KTIME_MAX)
> cpu_data->state_bins[idx_timer].hits += PULSE;
> else
> cpu_data->state_bins[idx_duration].intercepts += PULSE;
> @@ -292,7 +297,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> unsigned int hit_sum = 0;
> int constraint_idx = 0;
> int idx0 = 0, idx = -1;
> - bool alt_intercepts, alt_recent;
> + int prev_intercept_idx;
> s64 duration_ns;
> int i;
>
> @@ -370,6 +375,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> * all of the deeper states a shallower idle state is likely to be a
> * better choice.
> */
> + prev_intercept_idx = idx;
> if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
> int first_suitable_idx = idx;
>
> @@ -421,6 +427,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> first_suitable_idx = i;
> }
> }
> + if (!idx && prev_intercept_idx) {
> + /*
> + * We have to query the sleep length here otherwise we don't
> + * know after wakeup if our guess was correct.
> + */
> + duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> + cpu_data->sleep_length_ns = duration_ns;
> + }
>
> /*
> * If there is a latency constraint, it may be necessary to select an
* Re: [PATCHv2 3/3] cpuidle: teo: Don't count non-existent intercepts
2024-06-26 13:09 ` Dietmar Eggemann
@ 2024-06-28 9:33 ` Christian Loehle
0 siblings, 0 replies; 16+ messages in thread
From: Christian Loehle @ 2024-06-28 9:33 UTC (permalink / raw)
To: Dietmar Eggemann, linux-pm, linux-kernel, rafael
Cc: vincent.guittot, qyousef, peterz, daniel.lezcano, ulf.hansson,
anna-maria, kajetan.puchalski, lukasz.luba
On 6/26/24 14:09, Dietmar Eggemann wrote:
> On 11/06/2024 13:24, Christian Loehle wrote:
>> When bailing out early, teo will not query the sleep length anymore
>> since commit 6da8f9ba5a87 ("cpuidle: teo:
>> Skip tick_nohz_get_sleep_length() call in some cases") with an
>> expected sleep_length_ns value of KTIME_MAX.
>> This led to state0 accumulating lots of 'intercepts' because the
>> actually measured sleep length was < KTIME_MAX, and therefore to teo
>> taking too long to recover from intercept-heavy scenarios; so count
>> KTIME_MAX as a hit (we have to count them as something, otherwise we
>> are stuck).
>>
>> Fundamentally we can only do one of the two:
>> 1. Skip sleep_length_ns query when we think intercept is likely
>
> Isn't this what we do right now? Skip tick_nohz_get_sleep_length() for
> state0 and set cpu_data->sleep_length_ns to KTIME_MAX in teo_select()
> and then count this as an 'intercept' in teo_update().
Yes it is; I'll state that more explicitly in the future. It was
kind of implied by "we can only do one of the two" and "This patch
chooses the latter".
>
>> 2. Have accurate data if sleep_length_ns is actually intercepted when
>> we believe it is currently intercepted.
>>
>> This patch chooses the latter, as I've found the additional time it
>> takes to query the sleep length to be negligible and the variants of
>> option 1 (count all unknowns as misses or count all unknowns as hits)
>> had significant regressions (counting them as misses led to lots of
>> too-shallow idle state selections and counting them as hits led to
>> terrible performance in intercept-heavy workloads).
>
> IMHO, you do 2 things here:
>
> (1) Set 'cpu_data->sleep_length_ns != KTIME_MAX' for '!idx &&
> prev_intercept_idx' in teo_select().
This is just the technical way of saying "We query the sleep length
even though teo is in an intercept scenario (and don't use the sleep
length for determining the current idle state selection)".
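Roughly, as a simplified sketch of what the patch does in teo_select()
(not the verbatim hunk):

	/* The intercept path chose a shallow state, but the timer
	 * distance is still recorded so that teo_update() can later
	 * tell whether the wakeup really was an intercept. */
	if (!idx && prev_intercept_idx)
		cpu_data->sleep_length_ns =
				tick_nohz_get_sleep_length(&delta_tick);

The queried value is only stored for that bookkeeping; the state picked
for this idle entry stays whatever the intercept logic chose.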
>
> (2) Force an update with 'cpu_data->sleep_length_ns == KTIME_MAX' to be
> counted as a 'hit' rather than an 'intercept' in teo_update().
I assume this is the one you take issue with, and you're correct:
this isn't quite obvious from the text.
Furthermore, with (1) in place sleep_length_ns won't be left at
KTIME_MAX for intercept bail-outs anyway, so (2) doesn't matter much
for the resulting behavior. I still think (2) is the semantically
correct thing to do, but I'll drop it from the fixes series and add
it to the remaining 'improvements' pile.
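So with only (1) kept, the update path stays as it is in mainline; as
a sketch:

	/* teo_update(): for the intercept bail-outs sleep_length_ns
	 * now holds the real timer distance, so a wakeup is only
	 * counted as an intercept if it genuinely came in earlier than
	 * the timer would have fired. */
	if (idx_timer == idx_duration)
		cpu_data->state_bins[idx_timer].hits += PULSE;
	else
		cpu_data->state_bins[idx_duration].intercepts += PULSE;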
Thanks!
>
> Can't really see how this matches the explanatory text exactly.
> [SNIP]
end of thread, other threads: [~2024-06-28 9:33 UTC | newest]
Thread overview: 16+ messages
2024-06-11 11:24 [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Christian Loehle
2024-06-11 11:24 ` [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness" Christian Loehle
2024-06-11 13:37 ` Vincent Guittot
2024-06-19 17:49 ` Qais Yousef
2024-06-21 13:22 ` [PATCHv3 " Christian Loehle
2024-06-11 11:24 ` [PATCHv2 2/3] cpuidle: teo: Remove recent intercepts metric Christian Loehle
2024-06-26 10:44 ` Dietmar Eggemann
2024-06-26 12:41 ` Christian Loehle
2024-06-11 11:24 ` [PATCHv2 3/3] cpuidle: teo: Don't count non-existent intercepts Christian Loehle
2024-06-26 13:09 ` Dietmar Eggemann
2024-06-28 9:33 ` Christian Loehle
2024-06-17 0:20 ` [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic Doug Smythies
2024-06-18 11:17 ` Christian Loehle
2024-06-18 17:24 ` Doug Smythies
2024-06-20 11:19 ` Christian Loehle
2024-06-25 16:35 ` Doug Smythies