linux-pm.vger.kernel.org archive mirror
* [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions
@ 2023-07-28 14:55 Peter Zijlstra
  2023-07-28 14:55 ` [RFC][PATCH 1/3] cpuidle: Inject tick boundary state Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-28 14:55 UTC (permalink / raw)
  To: anna-maria, rafael, tglx, frederic, gautham.shenoy
  Cc: linux-kernel, peterz, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

Hi,

Wanted to send this yesterday, but my home server died and took everything
down :/

These patches are lightly tested but appear to behave as expected.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-28 14:55 [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Peter Zijlstra
@ 2023-07-28 14:55 ` Peter Zijlstra
  2023-07-28 15:36   ` Rafael J. Wysocki
  2023-07-28 14:55 ` [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-28 14:55 UTC (permalink / raw)
  To: anna-maria, rafael, tglx, frederic, gautham.shenoy
  Cc: linux-kernel, peterz, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

In order to facilitate governors that track history in idle-state
buckets (TEO) making a useful decision about NOHZ, make sure we have a
bucket that counts tick-and-longer.

In order to be inclusive of the tick itself -- after all, if we do not
disable NOHZ we'll sleep for a full tick -- the actual boundary should
be just short of a full tick.
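
(As an illustration, assuming HZ=250: TICK_NSEC is 4,000,000 ns, so the
boundary used below, TICK_NSEC - TICK_NSEC/32, works out to
4,000,000 - 125,000 = 3,875,000 ns, i.e. roughly 3% short of the tick
period.)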

IOW, when registering the idle-states, add one that is always
disabled, just to have a bucket.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/cpuidle/cpuidle.h |    2 +
 drivers/cpuidle/driver.c  |   48 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/cpuidle.h   |    2 -
 3 files changed, 50 insertions(+), 2 deletions(-)

--- a/drivers/cpuidle/cpuidle.h
+++ b/drivers/cpuidle/cpuidle.h
@@ -72,4 +72,6 @@ static inline void cpuidle_coupled_unreg
 }
 #endif
 
+#define SHORT_TICK_NSEC (TICK_NSEC - TICK_NSEC/32)
+
 #endif /* __DRIVER_CPUIDLE_H */
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -147,13 +147,37 @@ static void cpuidle_setup_broadcast_time
 		tick_broadcast_disable();
 }
 
+static int tick_enter(struct cpuidle_device *dev,
+		      struct cpuidle_driver *drv,
+		      int index)
+{
+	return -ENODEV;
+}
+
+static void __cpuidle_state_init_tick(struct cpuidle_state *s)
+{
+	strcpy(s->name, "TICK");
+	strcpy(s->desc, "(no-op)");
+
+	s->target_residency_ns = SHORT_TICK_NSEC;
+	s->target_residency = div_u64(SHORT_TICK_NSEC, NSEC_PER_USEC);
+
+	s->exit_latency_ns = 0;
+	s->exit_latency = 0;
+
+	s->flags |= CPUIDLE_FLAG_UNUSABLE;
+
+	s->enter = tick_enter;
+	s->enter_s2idle = tick_enter;
+}
+
 /**
  * __cpuidle_driver_init - initialize the driver's internal data
  * @drv: a valid pointer to a struct cpuidle_driver
  */
 static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 {
-	int i;
+	int tick = 0, i;
 
 	/*
 	 * Use all possible CPUs as the default, because if the kernel boots
@@ -163,6 +187,9 @@ static void __cpuidle_driver_init(struct
 	if (!drv->cpumask)
 		drv->cpumask = (struct cpumask *)cpu_possible_mask;
 
+	if (WARN_ON_ONCE(drv->state_count >= CPUIDLE_STATE_MAX-2))
+		tick = 1;
+
 	for (i = 0; i < drv->state_count; i++) {
 		struct cpuidle_state *s = &drv->states[i];
 
@@ -192,6 +219,25 @@ static void __cpuidle_driver_init(struct
 			s->exit_latency_ns =  0;
 		else
 			s->exit_latency = div_u64(s->exit_latency_ns, NSEC_PER_USEC);
+
+		if (!tick && s->target_residency_ns >= SHORT_TICK_NSEC) {
+			tick = 1;
+
+			if (s->target_residency_ns == SHORT_TICK_NSEC)
+				continue;
+
+			memmove(&drv->states[i+1], &drv->states[i],
+				sizeof(struct cpuidle_state) * (CPUIDLE_STATE_MAX - i - 1));
+			__cpuidle_state_init_tick(s);
+			drv->state_count++;
+			i++;
+		}
+	}
+
+	if (!tick) {
+		struct cpuidle_state *s = &drv->states[i];
+		__cpuidle_state_init_tick(s);
+		drv->state_count++;
 	}
 }
 
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -16,7 +16,7 @@
 #include <linux/hrtimer.h>
 #include <linux/context_tracking.h>
 
-#define CPUIDLE_STATE_MAX	10
+#define CPUIDLE_STATE_MAX	16
 #define CPUIDLE_NAME_LEN	16
 #define CPUIDLE_DESC_LEN	32
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management
  2023-07-28 14:55 [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Peter Zijlstra
  2023-07-28 14:55 ` [RFC][PATCH 1/3] cpuidle: Inject tick boundary state Peter Zijlstra
@ 2023-07-28 14:55 ` Peter Zijlstra
  2023-07-28 16:56   ` Rafael J. Wysocki
  2023-07-28 14:55 ` [RFC][PATCH 3/3] cpuidle,teo: Improve state selection Peter Zijlstra
  2023-07-31 11:46 ` [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Anna-Maria Behnsen
  3 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-28 14:55 UTC (permalink / raw)
  To: anna-maria, rafael, tglx, frederic, gautham.shenoy
  Cc: linux-kernel, peterz, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

With cpuidle having added a TICK bucket, TEO will account all TICK and
longer idles there. This means we can now make an informed decision
about stopping the tick. If the sum of 'hits + intercepts' of all states
below the TICK bucket is more than 50% of the total, it is most likely
we'll not reach the tick this time around either, so stopping the tick
doesn't make sense.
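
(Purely illustrative numbers: if, after decay, the bins below the TICK
bucket hold 600 out of a total of 1000 'hits + intercepts', then
2*600 > 1000 and the tick is left running; if they only hold 400, the
tick is stopped as before.)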

If we don't stop the tick, don't bother calling
tick_nohz_get_sleep_length() and assume the duration is no longer than a
tick (this could be improved to still look at the time till the currently
pending timers).

Since we have this extra state, remove the state_count based early
decisions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/cpuidle/governors/teo.c |   97 ++++++++++++++--------------------------
 1 file changed, 34 insertions(+), 63 deletions(-)

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -139,6 +139,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/topology.h>
 #include <linux/tick.h>
+#include "../cpuidle.h"
 
 /*
  * The number of bits to shift the CPU's capacity by in order to determine
@@ -197,7 +198,6 @@ struct teo_cpu {
 	int next_recent_idx;
 	int recent_idx[NR_RECENT];
 	unsigned long util_threshold;
-	bool utilized;
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr
 
 		cpu_data->total += bin->hits + bin->intercepts;
 
-		if (target_residency_ns <= cpu_data->sleep_length_ns) {
+		if (target_residency_ns <= cpu_data->sleep_length_ns)
 			idx_timer = i;
-			if (target_residency_ns <= measured_ns)
-				idx_duration = i;
-		}
+
+		if (target_residency_ns <= measured_ns)
+			idx_duration = i;
 	}
 
 	i = cpu_data->next_recent_idx++;
@@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri
 	unsigned int recent_sum = 0;
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
+	unsigned int tick_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
 	bool alt_intercepts, alt_recent;
 	ktime_t delta_tick;
-	s64 duration_ns;
+	s64 duration_ns = TICK_NSEC;
 	int i;
 
 	if (dev->last_state_idx >= 0) {
@@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri
 
 	cpu_data->time_span_ns = local_clock();
 
-	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
-	cpu_data->sleep_length_ns = duration_ns;
+	/* Should we stop the tick? */
+	for (i = 1; i < drv->state_count; i++) {
+		struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
+		struct cpuidle_state *s = &drv->states[i];
 
-	/* Check if there is any choice in the first place. */
-	if (drv->state_count < 2) {
-		idx = 0;
-		goto end;
-	}
-	if (!dev->states_usage[0].disable) {
-		idx = 0;
-		if (drv->states[1].target_residency_ns > duration_ns)
-			goto end;
-	}
+		tick_sum += prev_bin->intercepts;
+		tick_sum += prev_bin->hits;
 
-	cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
-	/*
-	 * If the CPU is being utilized over the threshold and there are only 2
-	 * states to choose from, the metrics need not be considered, so choose
-	 * the shallowest non-polling state and exit.
-	 */
-	if (drv->state_count < 3 && cpu_data->utilized) {
-		for (i = 0; i < drv->state_count; ++i) {
-			if (!dev->states_usage[i].disable &&
-			    !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
-				idx = i;
-				goto end;
-			}
-		}
+		if (s->target_residency_ns >= SHORT_TICK_NSEC)
+			break;
 	}
 
+	if (2*tick_sum > cpu_data->total)
+		*stop_tick = false;
+
+	/* If we do stop the tick, ask for the next timer. */
+	if (*stop_tick)
+		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+	cpu_data->sleep_length_ns = duration_ns;
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri
 		idx_recent_sum = recent_sum;
 	}
 
-	/* Avoid unnecessary overhead. */
-	if (idx < 0) {
-		idx = 0; /* No states enabled, must use 0. */
-		goto end;
-	} else if (idx == idx0) {
-		goto end;
-	}
+	/* No states enabled, must use 0 */
+	if (idx < 0)
+		return 0;
+
+	/* No point looking for something shallower than the first enabled state */
+	if (idx == idx0)
+		return idx;
 
 	/*
 	 * If the sum of the intercepts metric for all of the idle states
@@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri
 	 * If the CPU is being utilized over the threshold, choose a shallower
 	 * non-polling state to improve latency
 	 */
-	if (cpu_data->utilized)
+	if (teo_cpu_is_utilized(dev->cpu, cpu_data))
 		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
 
-end:
-	/*
-	 * Don't stop the tick if the selected state is a polling one or if the
-	 * expected idle duration is shorter than the tick period length.
-	 */
-	if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
-	    duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
-		*stop_tick = false;
-
-		/*
-		 * The tick is not going to be stopped, so if the target
-		 * residency of the state to be returned is not within the time
-		 * till the closest timer including the tick, try to correct
-		 * that.
-		 */
-		if (idx > idx0 &&
-		    drv->states[idx].target_residency_ns > delta_tick)
-			idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
-	}
-
 	return idx;
 }
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH 3/3] cpuidle,teo: Improve state selection
  2023-07-28 14:55 [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Peter Zijlstra
  2023-07-28 14:55 ` [RFC][PATCH 1/3] cpuidle: Inject tick boundary state Peter Zijlstra
  2023-07-28 14:55 ` [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management Peter Zijlstra
@ 2023-07-28 14:55 ` Peter Zijlstra
  2023-07-28 17:07   ` Rafael J. Wysocki
  2023-07-31 11:46 ` [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Anna-Maria Behnsen
  3 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-28 14:55 UTC (permalink / raw)
  To: anna-maria, rafael, tglx, frederic, gautham.shenoy
  Cc: linux-kernel, peterz, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

When selecting a state, stop when history tells us 66% of recent idles
were at or below our current state.
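
(Illustrative numbers: with cpu_data->total at 900 the threshold is
2*900/3 = 600, so the upward walk over the states stops as soon as the
shallower states already account for more than 600 of the
'intercepts + hits'.)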

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/cpuidle/governors/teo.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -363,6 +363,7 @@ static int teo_select(struct cpuidle_dri
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
 	unsigned int tick_sum = 0;
+	unsigned int thresh_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
 	bool alt_intercepts, alt_recent;
@@ -397,6 +398,8 @@ static int teo_select(struct cpuidle_dri
 		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
 	cpu_data->sleep_length_ns = duration_ns;
 
+	thresh_sum = 2 * cpu_data->total / 3; /* 66% */
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -427,6 +430,9 @@ static int teo_select(struct cpuidle_dri
 		if (s->target_residency_ns > duration_ns)
 			break;
 
+		if (intercept_sum + hit_sum > thresh_sum)
+			break;
+
 		idx = i;
 
 		if (s->exit_latency_ns <= latency_req)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-28 14:55 ` [RFC][PATCH 1/3] cpuidle: Inject tick boundary state Peter Zijlstra
@ 2023-07-28 15:36   ` Rafael J. Wysocki
  2023-07-29  8:44     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-28 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: anna-maria, rafael, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> In order to facilitate governors that track history in idle-state
> buckets (TEO) making a useful decision about NOHZ, make sure we have a
> bucket that counts tick-and-longer.
>
> In order to be inclusive of the tick itself -- after all, if we do not
> disable NOHZ we'll sleep for a full tick, the actual boundary should
> be just short of a full tick.
>
> IOW, when registering the idle-states, add one that is always
> disabled, just to have a bucket.

This extra bucket can be created in the governor itself, can't it?

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  drivers/cpuidle/cpuidle.h |    2 +
>  drivers/cpuidle/driver.c  |   48 +++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/cpuidle.h   |    2 -
>  3 files changed, 50 insertions(+), 2 deletions(-)
>
> --- a/drivers/cpuidle/cpuidle.h
> +++ b/drivers/cpuidle/cpuidle.h
> @@ -72,4 +72,6 @@ static inline void cpuidle_coupled_unreg
>  }
>  #endif
>
> +#define SHORT_TICK_NSEC (TICK_NSEC - TICK_NSEC/32)
> +
>  #endif /* __DRIVER_CPUIDLE_H */
> --- a/drivers/cpuidle/driver.c
> +++ b/drivers/cpuidle/driver.c
> @@ -147,13 +147,37 @@ static void cpuidle_setup_broadcast_time
>                 tick_broadcast_disable();
>  }
>
> +static int tick_enter(struct cpuidle_device *dev,
> +                     struct cpuidle_driver *drv,
> +                     int index)
> +{
> +       return -ENODEV;
> +}
> +
> +static void __cpuidle_state_init_tick(struct cpuidle_state *s)
> +{
> +       strcpy(s->name, "TICK");
> +       strcpy(s->desc, "(no-op)");
> +
> +       s->target_residency_ns = SHORT_TICK_NSEC;
> +       s->target_residency = div_u64(SHORT_TICK_NSEC, NSEC_PER_USEC);
> +
> +       s->exit_latency_ns = 0;
> +       s->exit_latency = 0;
> +
> +       s->flags |= CPUIDLE_FLAG_UNUSABLE;
> +
> +       s->enter = tick_enter;
> +       s->enter_s2idle = tick_enter;
> +}
> +
>  /**
>   * __cpuidle_driver_init - initialize the driver's internal data
>   * @drv: a valid pointer to a struct cpuidle_driver
>   */
>  static void __cpuidle_driver_init(struct cpuidle_driver *drv)
>  {
> -       int i;
> +       int tick = 0, i;
>
>         /*
>          * Use all possible CPUs as the default, because if the kernel boots
> @@ -163,6 +187,9 @@ static void __cpuidle_driver_init(struct
>         if (!drv->cpumask)
>                 drv->cpumask = (struct cpumask *)cpu_possible_mask;
>
> +       if (WARN_ON_ONCE(drv->state_count >= CPUIDLE_STATE_MAX-2))
> +               tick = 1;
> +
>         for (i = 0; i < drv->state_count; i++) {
>                 struct cpuidle_state *s = &drv->states[i];
>
> @@ -192,6 +219,25 @@ static void __cpuidle_driver_init(struct
>                         s->exit_latency_ns =  0;
>                 else
>                         s->exit_latency = div_u64(s->exit_latency_ns, NSEC_PER_USEC);
> +
> +               if (!tick && s->target_residency_ns >= SHORT_TICK_NSEC) {
> +                       tick = 1;
> +
> +                       if (s->target_residency_ns == SHORT_TICK_NSEC)
> +                               continue;
> +
> +                       memmove(&drv->states[i+1], &drv->states[i],
> +                               sizeof(struct cpuidle_state) * (CPUIDLE_STATE_MAX - i - 1));
> +                       __cpuidle_state_init_tick(s);
> +                       drv->state_count++;
> +                       i++;
> +               }
> +       }
> +
> +       if (!tick) {
> +               struct cpuidle_state *s = &drv->states[i];
> +               __cpuidle_state_init_tick(s);
> +               drv->state_count++;
>         }
>  }
>
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -16,7 +16,7 @@
>  #include <linux/hrtimer.h>
>  #include <linux/context_tracking.h>
>
> -#define CPUIDLE_STATE_MAX      10
> +#define CPUIDLE_STATE_MAX      16
>  #define CPUIDLE_NAME_LEN       16
>  #define CPUIDLE_DESC_LEN       32
>
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management
  2023-07-28 14:55 ` [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management Peter Zijlstra
@ 2023-07-28 16:56   ` Rafael J. Wysocki
  2023-07-28 22:01     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-28 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: anna-maria, rafael, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> With cpuidle having added a TICK bucket, TEO will account all TICK and
> longer idles there. This means we can now make an informed decision
> about stopping the tick. If the sum of 'hit+intercepts' of all states
> below the TICK bucket is more than 50%, it is most likely we'll not
> reach the tick this time around either, so stopping the tick doesn't
> make sense.
>
> If we don't stop the tick, don't bother calling
> tick_nohz_get_sleep_length() and assume duration is no longer than a
> tick (could be improved to still look at the current pending time and
> timers).
>
> Since we have this extra state, remove the state_count based early
> decisions.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  drivers/cpuidle/governors/teo.c |   97 ++++++++++++++--------------------------
>  1 file changed, 34 insertions(+), 63 deletions(-)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -139,6 +139,7 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/topology.h>
>  #include <linux/tick.h>
> +#include "../cpuidle.h"
>
>  /*
>   * The number of bits to shift the CPU's capacity by in order to determine
> @@ -197,7 +198,6 @@ struct teo_cpu {
>         int next_recent_idx;
>         int recent_idx[NR_RECENT];
>         unsigned long util_threshold;
> -       bool utilized;
>  };
>
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr
>
>                 cpu_data->total += bin->hits + bin->intercepts;
>
> -               if (target_residency_ns <= cpu_data->sleep_length_ns) {
> +               if (target_residency_ns <= cpu_data->sleep_length_ns)
>                         idx_timer = i;
> -                       if (target_residency_ns <= measured_ns)
> -                               idx_duration = i;
> -               }
> +
> +               if (target_residency_ns <= measured_ns)
> +                       idx_duration = i;

I'm not quite sure what happens here.

>         }
>
>         i = cpu_data->next_recent_idx++;
> @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri
>         unsigned int recent_sum = 0;
>         unsigned int idx_hit_sum = 0;
>         unsigned int hit_sum = 0;
> +       unsigned int tick_sum = 0;
>         int constraint_idx = 0;
>         int idx0 = 0, idx = -1;
>         bool alt_intercepts, alt_recent;
>         ktime_t delta_tick;
> -       s64 duration_ns;
> +       s64 duration_ns = TICK_NSEC;
>         int i;
>
>         if (dev->last_state_idx >= 0) {
> @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri
>
>         cpu_data->time_span_ns = local_clock();
>
> -       duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> -       cpu_data->sleep_length_ns = duration_ns;
> +       /* Should we stop the tick? */

Who's we?  I'd prefer something like "Should the tick be stopped?"
here (analogously below).

> +       for (i = 1; i < drv->state_count; i++) {
> +               struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
> +               struct cpuidle_state *s = &drv->states[i];
>
> -       /* Check if there is any choice in the first place. */
> -       if (drv->state_count < 2) {
> -               idx = 0;
> -               goto end;
> -       }
> -       if (!dev->states_usage[0].disable) {
> -               idx = 0;
> -               if (drv->states[1].target_residency_ns > duration_ns)
> -                       goto end;
> -       }
> +               tick_sum += prev_bin->intercepts;
> +               tick_sum += prev_bin->hits;
>
> -       cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> -       /*
> -        * If the CPU is being utilized over the threshold and there are only 2
> -        * states to choose from, the metrics need not be considered, so choose
> -        * the shallowest non-polling state and exit.
> -        */
> -       if (drv->state_count < 3 && cpu_data->utilized) {
> -               for (i = 0; i < drv->state_count; ++i) {
> -                       if (!dev->states_usage[i].disable &&
> -                           !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> -                               idx = i;
> -                               goto end;
> -                       }
> -               }
> +               if (s->target_residency_ns >= SHORT_TICK_NSEC)
> +                       break;
>         }
>
> +       if (2*tick_sum > cpu_data->total)
> +               *stop_tick = false;

This means "if over 50% of all the events fall into the buckets below
the tick period length, don't stop the tick".  Fair enough, but this
covers long-term only and what about the most recent events?  I think
that they need to be taken into account here too.

> +
> +       /* If we do stop the tick, ask for the next timer. */
> +       if (*stop_tick)
> +               duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> +       cpu_data->sleep_length_ns = duration_ns;

If the decision is made to retain the tick and the time to the closest
tick event is very small, it would be better to refine the state
selection so as to avoid returning a state with the target residency
above that time (which essentially is wasting energy).  That's what
delta_tick above is for, but now tick_nohz_get_sleep_length() is never
called when the tick is not going to be stopped.

Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC
every time the tick is not stopped will not really work on systems
where there are real idle states with target residencies beyond
TICK_NSEC.

> +
>         /*
>          * Find the deepest idle state whose target residency does not exceed
>          * the current sleep length and the deepest idle state not deeper than
> @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri
>                 idx_recent_sum = recent_sum;
>         }
>
> -       /* Avoid unnecessary overhead. */
> -       if (idx < 0) {
> -               idx = 0; /* No states enabled, must use 0. */
> -               goto end;
> -       } else if (idx == idx0) {
> -               goto end;
> -       }
> +       /* No states enabled, must use 0 */
> +       if (idx < 0)
> +               return 0;
> +
> +       /* No point looking for something shallower than the first enabled state */
> +       if (idx == idx0)
> +               return idx;
>
>         /*
>          * If the sum of the intercepts metric for all of the idle states
> @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri
>          * If the CPU is being utilized over the threshold, choose a shallower
>          * non-polling state to improve latency
>          */
> -       if (cpu_data->utilized)
> +       if (teo_cpu_is_utilized(dev->cpu, cpu_data))
>                 idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
>
> -end:
> -       /*
> -        * Don't stop the tick if the selected state is a polling one or if the
> -        * expected idle duration is shorter than the tick period length.
> -        */
> -       if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
> -           duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
> -               *stop_tick = false;
> -
> -               /*
> -                * The tick is not going to be stopped, so if the target
> -                * residency of the state to be returned is not within the time
> -                * till the closest timer including the tick, try to correct
> -                * that.
> -                */
> -               if (idx > idx0 &&
> -                   drv->states[idx].target_residency_ns > delta_tick)
> -                       idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
> -       }
> -
>         return idx;
>  }

Overall, I think that the problem with calling
tick_nohz_get_sleep_length() is limited to the cases when the CPU is
almost fully loaded, so the overall amount of idle time on it is tiny.
I would rather use a special path for those cases and I would register
all of the wakeups as "intercepts" in those cases.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 3/3] cpuidle,teo: Improve state selection
  2023-07-28 14:55 ` [RFC][PATCH 3/3] cpuidle,teo: Improve state selection Peter Zijlstra
@ 2023-07-28 17:07   ` Rafael J. Wysocki
  2023-07-29  8:46     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-28 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: anna-maria, rafael, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> When selecting a state, stop when history tells us 66% of recent idles
> were at or below our current state.

This is not really about "recent idles" AFAICS.  It stops the
selection when 66% of the total sum of the "hits" and "intercepts"
signals comes from the current state and the states below it.  This
covers the entire history, while the "recent" ones are counted
separately.

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  drivers/cpuidle/governors/teo.c |    6 ++++++
>  1 file changed, 6 insertions(+)
>
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -363,6 +363,7 @@ static int teo_select(struct cpuidle_dri
>         unsigned int idx_hit_sum = 0;
>         unsigned int hit_sum = 0;
>         unsigned int tick_sum = 0;
> +       unsigned int thresh_sum = 0;
>         int constraint_idx = 0;
>         int idx0 = 0, idx = -1;
>         bool alt_intercepts, alt_recent;
> @@ -397,6 +398,8 @@ static int teo_select(struct cpuidle_dri
>                 duration_ns = tick_nohz_get_sleep_length(&delta_tick);
>         cpu_data->sleep_length_ns = duration_ns;
>
> +       thresh_sum = 2 * cpu_data->total / 3; /* 66% */
> +
>         /*
>          * Find the deepest idle state whose target residency does not exceed
>          * the current sleep length and the deepest idle state not deeper than
> @@ -427,6 +430,9 @@ static int teo_select(struct cpuidle_dri
>                 if (s->target_residency_ns > duration_ns)
>                         break;
>
> +               if (intercept_sum + hit_sum > thresh_sum)
> +                       break;
> +
>                 idx = i;
>
>                 if (s->exit_latency_ns <= latency_req)
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management
  2023-07-28 16:56   ` Rafael J. Wysocki
@ 2023-07-28 22:01     ` Peter Zijlstra
  2023-07-31 10:17       ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-28 22:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Fri, Jul 28, 2023 at 06:56:24PM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr
> >
> >                 cpu_data->total += bin->hits + bin->intercepts;
> >
> > -               if (target_residency_ns <= cpu_data->sleep_length_ns) {
> > +               if (target_residency_ns <= cpu_data->sleep_length_ns)
> >                         idx_timer = i;
> > -                       if (target_residency_ns <= measured_ns)
> > -                               idx_duration = i;
> > -               }
> > +
> > +               if (target_residency_ns <= measured_ns)
> > +                       idx_duration = i;
> 
> I'm not quite sure what happens here.

Oh, I couldn't convince myself that measured_ns <= sleep_length_ns. If
measured was longer we still want the higher index.

But yeah, I forgot I had that hunk in.

> >         }
> >
> >         i = cpu_data->next_recent_idx++;
> > @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri
> >         unsigned int recent_sum = 0;
> >         unsigned int idx_hit_sum = 0;
> >         unsigned int hit_sum = 0;
> > +       unsigned int tick_sum = 0;
> >         int constraint_idx = 0;
> >         int idx0 = 0, idx = -1;
> >         bool alt_intercepts, alt_recent;
> >         ktime_t delta_tick;
> > -       s64 duration_ns;
> > +       s64 duration_ns = TICK_NSEC;
> >         int i;
> >
> >         if (dev->last_state_idx >= 0) {
> > @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri
> >
> >         cpu_data->time_span_ns = local_clock();
> >
> > -       duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > -       cpu_data->sleep_length_ns = duration_ns;
> > +       /* Should we stop the tick? */
> 
> Who's we?  I'd prefer something like "Should the tick be stopped?"
> here (analogously below).

Sure.

> > +       for (i = 1; i < drv->state_count; i++) {
> > +               struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
> > +               struct cpuidle_state *s = &drv->states[i];
> >
> > -       /* Check if there is any choice in the first place. */
> > -       if (drv->state_count < 2) {
> > -               idx = 0;
> > -               goto end;
> > -       }
> > -       if (!dev->states_usage[0].disable) {
> > -               idx = 0;
> > -               if (drv->states[1].target_residency_ns > duration_ns)
> > -                       goto end;
> > -       }
> > +               tick_sum += prev_bin->intercepts;
> > +               tick_sum += prev_bin->hits;
> >
> > -       cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> > -       /*
> > -        * If the CPU is being utilized over the threshold and there are only 2
> > -        * states to choose from, the metrics need not be considered, so choose
> > -        * the shallowest non-polling state and exit.
> > -        */
> > -       if (drv->state_count < 3 && cpu_data->utilized) {
> > -               for (i = 0; i < drv->state_count; ++i) {
> > -                       if (!dev->states_usage[i].disable &&
> > -                           !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> > -                               idx = i;
> > -                               goto end;
> > -                       }
> > -               }
> > +               if (s->target_residency_ns >= SHORT_TICK_NSEC)
> > +                       break;
> >         }
> >
> > +       if (2*tick_sum > cpu_data->total)
> > +               *stop_tick = false;
> 
> This means "if over 50% of all the events fall into the buckets below
> the tick period length, don't stop the tick".  Fair enough, but this
> covers long-term only and what about the most recent events?  I think
> that they need to be taken into account here too.

From looking at a few traces this 'long' term is around 8-10 samples.
Which I figured was quick enough.

Note that DECAY_SHIFT is 3, while the pulse is 10 bits, so 3-4 cycles
will drain most of the history when there's a distinct phase shift.
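
(If I'm reading teo_update() right, each update does roughly
bin -= bin >> DECAY_SHIFT, i.e. scales every bin by 7/8, and then adds
PULSE = 2^10 = 1024 to the bin that matched; a bin that is hit every
time thus saturates around PULSE * 8 = 8192, and one that stops being
hit keeps (7/8)^n of its weight after n idle exits.)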

That said, I did look at the recent thing and those seem geared towards
the intercepts, while I think hits+intercepts makes more sense here.
Given it adjusted quickly enough I didn't put more time in it.

> > +
> > +       /* If we do stop the tick, ask for the next timer. */
> > +       if (*stop_tick)
> > +               duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > +       cpu_data->sleep_length_ns = duration_ns;
> 
> If the decision is made to retain the tick and the time to the closest
> tick event is very small, it would be better to refine the state
> selection so as to avoid returning a state with the target residency
> above that time (which essentially is wasting energy).  That's what
> delta_tick above is for, but now tick_nohz_get_sleep_length() is never
> called when the tick is not going to be stopped.

Right, so I did ponder using something like
ktime_sub(tick_nohz_get_next_hrtimer(), now) instead of TICK_NSEC to get
a more accurate measure, but I didn't do so yet.

> Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC
> every time the tick is not stopped will not really work on systems
> where there are real idle states with target residencies beyond
> TICK_NSEC.

It does work; you really don't want to select such a state if the tick
is still active -- you'll never get your residency. Such a state should
really only be used when the tick is off.
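
(Concretely, assuming HZ=250: the tick fires every 4 ms, so a state with,
say, a 10 ms target residency can basically never meet its residency
while the tick keeps running.)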

> > +
> >         /*
> >          * Find the deepest idle state whose target residency does not exceed
> >          * the current sleep length and the deepest idle state not deeper than
> > @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri
> >                 idx_recent_sum = recent_sum;
> >         }
> >
> > -       /* Avoid unnecessary overhead. */
> > -       if (idx < 0) {
> > -               idx = 0; /* No states enabled, must use 0. */
> > -               goto end;
> > -       } else if (idx == idx0) {
> > -               goto end;
> > -       }
> > +       /* No states enabled, must use 0 */
> > +       if (idx < 0)
> > +               return 0;
> > +
> > +       /* No point looking for something shallower than the first enabled state */
> > +       if (idx == idx0)
> > +               return idx;
> >
> >         /*
> >          * If the sum of the intercepts metric for all of the idle states
> > @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri
> >          * If the CPU is being utilized over the threshold, choose a shallower
> >          * non-polling state to improve latency
> >          */
> > -       if (cpu_data->utilized)
> > +       if (teo_cpu_is_utilized(dev->cpu, cpu_data))
> >                 idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> >
> > -end:
> > -       /*
> > -        * Don't stop the tick if the selected state is a polling one or if the
> > -        * expected idle duration is shorter than the tick period length.
> > -        */
> > -       if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
> > -           duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
> > -               *stop_tick = false;
> > -
> > -               /*
> > -                * The tick is not going to be stopped, so if the target
> > -                * residency of the state to be returned is not within the time
> > -                * till the closest timer including the tick, try to correct
> > -                * that.
> > -                */
> > -               if (idx > idx0 &&
> > -                   drv->states[idx].target_residency_ns > delta_tick)
> > -                       idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
> > -       }
> > -
> >         return idx;
> >  }
> 
> Overall, I think that the problem with calling
> tick_nohz_get_sleep_length() is limited to the cases when the CPU is
> almost fully loaded, so the overall amount of idle time on it is tiny.
> I would rather use a special pah for those cases and I would register
> all of the wakeups as "intercepts" in those cases.

I'm not sure what you're proposing. If we track the tick+ bucket -- as
we must in order to say anything useful about it -- then we can decide the
tick state before calling tick_nohz_get_sleep_length() (as I do here).

The timer-pull rework from Anna-Maria unfortunately makes the
tick_nohz_get_sleep_length() thing excessively expensive and it really
doesn't make sense to call it when we retain the tick.

It's all a bit of a chicken-and-egg situation: cpuidle wants to know when
the next timer is, but telling when that is requires knowing whether the
tick stays. We need to break that cycle somehow -- I propose not calling
it when we know we'll keep the tick.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-28 15:36   ` Rafael J. Wysocki
@ 2023-07-29  8:44     ` Peter Zijlstra
  2023-07-31  8:01       ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-29  8:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > In order to facilitate governors that track history in idle-state
> > buckets (TEO) making a useful decision about NOHZ, make sure we have a
> > bucket that counts tick-and-longer.
> >
> > In order to be inclusive of the tick itself -- after all, if we do not
> > disable NOHZ we'll sleep for a full tick, the actual boundary should
> > be just short of a full tick.
> >
> > IOW, when registering the idle-states, add one that is always
> > disabled, just to have a bucket.
> 
> This extra bucket can be created in the governor itself, can't it?

I couldn't find a nice spot for the governor to add idle-states.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 3/3] cpuidle,teo: Improve state selection
  2023-07-28 17:07   ` Rafael J. Wysocki
@ 2023-07-29  8:46     ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-29  8:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Fri, Jul 28, 2023 at 07:07:02PM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > When selecting a state, stop when history tells us 66% of recent idles
> > were at or below our current state.
> 
> This is not really about "recent idles" AFAICS.  It stops the
> selection when 66% of the total sum of the "hits" and "intercepts"
> signals comes from the current state and the states below it.  This
> covers the entire history, while the "recent" ones are counted
> separately.

Ah, bad wording perhaps. I found the 'recent' list that teo keeps very
hard to use; also, the regular state isn't that long-lived, as per the
reply on the previous patch.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-29  8:44     ` Peter Zijlstra
@ 2023-07-31  8:01       ` Rafael J. Wysocki
  2023-07-31  9:09         ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31  8:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, anna-maria, tglx, frederic, gautham.shenoy,
	linux-kernel, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

On Sat, Jul 29, 2023 at 10:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > In order to facilitate governors that track history in idle-state
> > > buckets (TEO) making a useful decision about NOHZ, make sure we have a
> > > bucket that counts tick-and-longer.
> > >
> > > In order to be inclusive of the tick itself -- after all, if we do not
> > > disable NOHZ we'll sleep for a full tick, the actual boundary should
> > > be just short of a full tick.
> > >
> > > IOW, when registering the idle-states, add one that is always
> > > disabled, just to have a bucket.
> >
> > This extra bucket can be created in the governor itself, can't it?
>
> I couldn't find a nice spot for the governor to add idle-states.

Well, I've thought this through and recalled a couple of things and my
conclusion is that the decision whether or not to stop the tick really
depends on the idle state choice.

There are three cases:

1. The selected idle state is shallow (that is, its target residency
is below the tick period length), but it is not the deepest one.
2. The selected idle state is shallow, but it is the deepest one (or
at least the deepest enabled one).
3. The selected idle state is deep (that is, its target residency is
above the tick period length).

In case 1, the tick should not be stopped so as to prevent the CPU
from spending too much time in a suboptimal idle state.

In case 3, the tick needs to be stopped, because otherwise the target
residency of the selected state would not be met.

Case 2 is somewhat a mixed bag, but generally speaking stopping the
tick doesn't hurt if the selected idle state is the deepest one,
because in that case the governor kind of expects a significant exit
latency anyway.  If it is not the deepest one (which is disabled),
it's better to let the tick tick.
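
In other words, roughly (illustrative pseudo-code only, with idx being
the state picked by the governor):

	if (drv->states[idx].target_residency_ns > TICK_NSEC) {
		/* case 3: deep state, the tick must be stopped */
		*stop_tick = true;
	} else if (idx == drv->state_count - 1) {
		/* case 2: shallow, but the deepest state there is */
		*stop_tick = true;
	} else {
		/* case 1, or case 2 with deeper states merely disabled */
		*stop_tick = false;
	}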

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31  8:01       ` Rafael J. Wysocki
@ 2023-07-31  9:09         ` Peter Zijlstra
  2023-07-31 10:35           ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-31  9:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, Jul 31, 2023 at 10:01:53AM +0200, Rafael J. Wysocki wrote:
> On Sat, Jul 29, 2023 at 10:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> > > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > In order to facilitate governors that track history in idle-state
> > > > buckets (TEO) making a useful decision about NOHZ, make sure we have a
> > > > bucket that counts tick-and-longer.
> > > >
> > > > In order to be inclusive of the tick itself -- after all, if we do not
> > > > disable NOHZ we'll sleep for a full tick, the actual boundary should
> > > > be just short of a full tick.
> > > >
> > > > IOW, when registering the idle-states, add one that is always
> > > > disabled, just to have a bucket.
> > >
> > > This extra bucket can be created in the governor itself, can't it?
> >
> > I couldn't find a nice spot for the governor to add idle-states.
> 
> Well, I've thought this through and recalled a couple of things and my
> conclusion is that the decision whether or not to stop the tick really
> depends on the idle state choice.
> 
> There are three cases:
> 
> 1. The selected idle state is shallow (that is, its target residency
> is below the tick period length), but it is not the deepest one.
> 2. The selected idle state is shallow, but it is the deepest one (or
> at least the deepest enabled one).
> 3. The selected idle state is deep (that is, its target residency is
> above the tick length period).
> 
> In case 1, the tick should not be stopped so as to prevent the CPU
> from spending too much time in a suboptimal idle state.
> 
> In case 3, the tick needs to be stopped, because otherwise the target
> residency of the selected state would not be met.
> 
> Case 2 is somewhat a mixed bag, but generally speaking stopping the
> tick doesn't hurt if the selected idle state is the deepest one,
> because in that case the governor kind of expects a significant exit
> latency anyway.  If it is not the deepest one (which is disabled),
> it's better to let the tick tick.

So I agree with 1.

I do not agree with 2. Disabling the tick is costly, doubly so with the
timer-pull thing, but even today. Simply disabling it because we picked
the deepest idle state, irrespective of the expected duration, is wrong
as it will incur this significant cost.

With 3 there is the question of how we get the expected sleep duration;
this is especially important with timer-pull, where we have this
chicken-and-egg thing.

Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
disabled and cpuidle wants to use tick_nohz_get_sleep_length() to
determine whether to disable the tick. This cycle needs to be broken for
timer-pull.

Hence my proposal to introduce the extra tick state, that allows fixing
both 2 and 3.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management
  2023-07-28 22:01     ` Peter Zijlstra
@ 2023-07-31 10:17       ` Rafael J. Wysocki
  2023-07-31 12:02         ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, anna-maria, tglx, frederic, gautham.shenoy,
	linux-kernel, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

On Sat, Jul 29, 2023 at 12:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 28, 2023 at 06:56:24PM +0200, Rafael J. Wysocki wrote:
> > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr
> > >
> > >                 cpu_data->total += bin->hits + bin->intercepts;
> > >
> > > -               if (target_residency_ns <= cpu_data->sleep_length_ns) {
> > > +               if (target_residency_ns <= cpu_data->sleep_length_ns)
> > >                         idx_timer = i;
> > > -                       if (target_residency_ns <= measured_ns)
> > > -                               idx_duration = i;
> > > -               }
> > > +
> > > +               if (target_residency_ns <= measured_ns)
> > > +                       idx_duration = i;
> >
> > I'm not quite sure what happens here.
>
> Oh, I couldn't convince myself that measured_ns <= sleep_length_ns. If
> measured was longer we still want the higher index.
>
> But yeah, I forgots I had that hunk in.
>
> > >         }
> > >
> > >         i = cpu_data->next_recent_idx++;
> > > @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri
> > >         unsigned int recent_sum = 0;
> > >         unsigned int idx_hit_sum = 0;
> > >         unsigned int hit_sum = 0;
> > > +       unsigned int tick_sum = 0;
> > >         int constraint_idx = 0;
> > >         int idx0 = 0, idx = -1;
> > >         bool alt_intercepts, alt_recent;
> > >         ktime_t delta_tick;
> > > -       s64 duration_ns;
> > > +       s64 duration_ns = TICK_NSEC;
> > >         int i;
> > >
> > >         if (dev->last_state_idx >= 0) {
> > > @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri
> > >
> > >         cpu_data->time_span_ns = local_clock();
> > >
> > > -       duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > > -       cpu_data->sleep_length_ns = duration_ns;
> > > +       /* Should we stop the tick? */
> >
> > Who's we?  I'd prefer something like "Should the tick be stopped?"
> > here (analogously below).
>
> Sure.
>
> > > +       for (i = 1; i < drv->state_count; i++) {
> > > +               struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
> > > +               struct cpuidle_state *s = &drv->states[i];
> > >
> > > -       /* Check if there is any choice in the first place. */
> > > -       if (drv->state_count < 2) {
> > > -               idx = 0;
> > > -               goto end;
> > > -       }
> > > -       if (!dev->states_usage[0].disable) {
> > > -               idx = 0;
> > > -               if (drv->states[1].target_residency_ns > duration_ns)
> > > -                       goto end;
> > > -       }
> > > +               tick_sum += prev_bin->intercepts;
> > > +               tick_sum += prev_bin->hits;
> > >
> > > -       cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> > > -       /*
> > > -        * If the CPU is being utilized over the threshold and there are only 2
> > > -        * states to choose from, the metrics need not be considered, so choose
> > > -        * the shallowest non-polling state and exit.
> > > -        */
> > > -       if (drv->state_count < 3 && cpu_data->utilized) {
> > > -               for (i = 0; i < drv->state_count; ++i) {
> > > -                       if (!dev->states_usage[i].disable &&
> > > -                           !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> > > -                               idx = i;
> > > -                               goto end;
> > > -                       }
> > > -               }
> > > +               if (s->target_residency_ns >= SHORT_TICK_NSEC)
> > > +                       break;
> > >         }
> > >
> > > +       if (2*tick_sum > cpu_data->total)
> > > +               *stop_tick = false;
> >
> > This means "if over 50% of all the events fall into the buckets below
> > the tick period length, don't stop the tick".  Fair enough, but this
> > covers long-term only and what about the most recent events?  I think
> > that they need to be taken into account here too.
>
> From looking at a few traces this 'long' term is around 8-10 samples.
> Which I figured was quick enough.
>
> Note that DECAY_SHIFT is 3, while the pulse is 10 bits, so 3-4 cycles
> will drain most of the history when there's a distinct phase shift.
>
> That said; I did look at the recent thing and those seem geared towards
> the intercepts, while I think hits+intercepts makes more sense here.
> Given it adjusted quickly enough I didn't put more time in it.
>
> > > +
> > > +       /* If we do stop the tick, ask for the next timer. */
> > > +       if (*stop_tick)
> > > +               duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > > +       cpu_data->sleep_length_ns = duration_ns;
> >
> > If the decision is made to retain the tick and the time to the closest
> > tick event is very small, it would be better to refine the state
> > selection so as to avoid returning a state with the target residency
> > above that time (which essentially is wasting energy).  That's what
> > delta_tick above is for, but now tick_nohz_get_sleep_length() is never
> > called when the tick is not going to be stopped.
>
> Right, so I did ponder using something like
> ktime_sub(tick_nohz_get_next_hrtimer(), now) instead of TICK_NSEC to get
> a more accurate measure, but I didn't do so yet.
>
> > Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC
> > every time the tick is not stopped will not really work on systems
> > where there are real idle states with target residencies beyond
> > TICK_NSEC.
>
> It does work; you really don't want to select such a state if the tick
> is still active -- you'll never get your residency. Such a state should
> really only be used when the tick is off.
>
> > > +
> > >         /*
> > >          * Find the deepest idle state whose target residency does not exceed
> > >          * the current sleep length and the deepest idle state not deeper than
> > > @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri
> > >                 idx_recent_sum = recent_sum;
> > >         }
> > >
> > > -       /* Avoid unnecessary overhead. */
> > > -       if (idx < 0) {
> > > -               idx = 0; /* No states enabled, must use 0. */
> > > -               goto end;
> > > -       } else if (idx == idx0) {
> > > -               goto end;
> > > -       }
> > > +       /* No states enabled, must use 0 */
> > > +       if (idx < 0)
> > > +               return 0;
> > > +
> > > +       /* No point looking for something shallower than the first enabled state */
> > > +       if (idx == idx0)
> > > +               return idx;
> > >
> > >         /*
> > >          * If the sum of the intercepts metric for all of the idle states
> > > @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri
> > >          * If the CPU is being utilized over the threshold, choose a shallower
> > >          * non-polling state to improve latency
> > >          */
> > > -       if (cpu_data->utilized)
> > > +       if (teo_cpu_is_utilized(dev->cpu, cpu_data))
> > >                 idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> > >
> > > -end:
> > > -       /*
> > > -        * Don't stop the tick if the selected state is a polling one or if the
> > > -        * expected idle duration is shorter than the tick period length.
> > > -        */
> > > -       if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
> > > -           duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
> > > -               *stop_tick = false;
> > > -
> > > -               /*
> > > -                * The tick is not going to be stopped, so if the target
> > > -                * residency of the state to be returned is not within the time
> > > -                * till the closest timer including the tick, try to correct
> > > -                * that.
> > > -                */
> > > -               if (idx > idx0 &&
> > > -                   drv->states[idx].target_residency_ns > delta_tick)
> > > -                       idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
> > > -       }
> > > -
> > >         return idx;
> > >  }
> >
> > Overall, I think that the problem with calling
> > tick_nohz_get_sleep_length() is limited to the cases when the CPU is
> > almost fully loaded, so the overall amount of idle time on it is tiny.
> > I would rather use a special pah for those cases and I would register
> > all of the wakeups as "intercepts" in those cases.
>
> I'm not sure what you're proposing.

Something really simple like:

1. Check sched_cpu_util() (which is done by teo anyway).
2. If that is around 90% of the maximum CPU capacity, select the first
non-polling idle state and be done (don't stop the tick, as per my other
reply earlier today).

In 2, tick_nohz_get_sleep_length() need not be checked, because the
idle state selection doesn't depend on it, and sleep_length_ns can be
set to KTIME_MAX (or indeed anything greater than TICK_NSEC), because
if the CPU is woken up by the tick, teo_reflect() will take care of
this and otherwise the wakeup should be recorded as an "intercept".
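
As an illustrative sketch only (reusing the existing teo_cpu_is_utilized()
check as a stand-in for whatever form the ~90%-of-capacity test would
take), the top of teo_select() could then look something like:

	if (teo_cpu_is_utilized(dev->cpu, cpu_data)) {
		/* Heavily loaded: keep the tick, don't query the next timer. */
		*stop_tick = false;
		cpu_data->sleep_length_ns = KTIME_MAX;

		/* Pick the shallowest enabled non-polling state. */
		for (i = 0; i < drv->state_count; i++) {
			if (!dev->states_usage[i].disable &&
			    !(drv->states[i].flags & CPUIDLE_FLAG_POLLING))
				return i;
		}
	}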

> If we track the tick+ bucket -- as
> we must in order to say anything useful about it, then we can decide the
> tick state before (as I do here) calling sleep_length().
>
> The timer-pull rework from Anna-Maria unfortunately makes the
> tick_nohz_get_sleep_length() thing excessively expensive and it really
> doesn't make sense to call it when we retain the tick.
>
> It's all a bit of a chicken-egg situation, cpuidle wants to know when
> the next timer is, but telling when that is, wants to know if the tick
> stays. We need to break that somehow -- I propose by not calling it when
> we know we'll keep the tick.

By selecting a state whose target residency will not be met, we lose
on both energy and performance, so doing this really should be
avoided, unless the state is really shallow in which case there may be
no time for making this consideration.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31  9:09         ` Peter Zijlstra
@ 2023-07-31 10:35           ` Rafael J. Wysocki
  2023-07-31 11:00             ` Rafael J. Wysocki
  2023-07-31 11:38             ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31 10:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, anna-maria, tglx, frederic, gautham.shenoy,
	linux-kernel, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

On Mon, Jul 31, 2023 at 11:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 31, 2023 at 10:01:53AM +0200, Rafael J. Wysocki wrote:
> > On Sat, Jul 29, 2023 at 10:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> > > > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > >
> > > > > In order to facilitate governors that track history in idle-state
> > > > > buckets (TEO) making a useful decision about NOHZ, make sure we have a
> > > > > bucket that counts tick-and-longer.
> > > > >
> > > > > In order to be inclusive of the tick itself -- after all, if we do not
> > > > > disable NOHZ we'll sleep for a full tick, the actual boundary should
> > > > > be just short of a full tick.
> > > > >
> > > > > IOW, when registering the idle-states, add one that is always
> > > > > disabled, just to have a bucket.
> > > >
> > > > This extra bucket can be created in the governor itself, can't it?
> > >
> > > I couldn't find a nice spot for the governor to add idle-states.
> >
> > Well, I've thought this through and recalled a couple of things and my
> > conclusion is that the decision whether or not to stop the tick really
> > depends on the idle state choice.
> >
> > There are three cases:
> >
> > 1. The selected idle state is shallow (that is, its target residency
> > is below the tick period length), but it is not the deepest one.
> > 2. The selected idle state is shallow, but it is the deepest one (or
> > at least the deepest enabled one).
> > 3. The selected idle state is deep (that is, its target residency is
> > above the tick length period).
> >
> > In case 1, the tick should not be stopped so as to prevent the CPU
> > from spending too much time in a suboptimal idle state.
> >
> > In case 3, the tick needs to be stopped, because otherwise the target
> > residency of the selected state would not be met.
> >
> > Case 2 is somewhat a mixed bag, but generally speaking stopping the
> > tick doesn't hurt if the selected idle state is the deepest one,
> > because in that case the governor kind of expects a significant exit
> > latency anyway.  If it is not the deepest one (which is disabled),
> > it's better to let the tick tick.
>
> So I agree with 1.
>
> I do not agree with 2. Disabling the tick is costly, doubly so with the
> timer-pull thing, but even today. Simply disabling it because we picked
> the deepest idle state, irrespective of the expected duration is wrong
> as it will incur this significant cost.
>
> With 3 there is the question of how we get the expected sleep duration;
> this is especially important with timer-pull, where we have this
> chicken-and-egg thing.
>
> Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> disabled

Well, it shouldn't.  Or at least it didn't before.

It is expected to produce two values, one with the tick stopped (this
is the return value of the function) and the other with the tick
ticking (this is the one written under the address passed as the arg).
This cannot depend on whether or not the tick will be stopped.  Both
are good to know.
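
For reference, the way the governors call it today is roughly:

	ktime_t delta_tick;
	s64 duration_ns;

	/*
	 * Return value: sleep length assuming the tick gets stopped;
	 * *delta_tick: time till the next event (including the tick)
	 * if the tick keeps running.
	 */
	duration_ns = tick_nohz_get_sleep_length(&delta_tick);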

Now, I understand that getting these two values may be costly, so
there is an incentive to avoid calling it, but then the governor needs
to figure this out from its crystal ball and so care needs to be taken
to limit the possible damage in case the crystal ball is not right.

> and cpuidle wants to use tick_nohz_get_sleep_length() to
> determine if to disable the tick. This cycle needs to be broken for
> timer-pull.
>
> Hence my proposal to introduce the extra tick state, that allows fixing
> both 2 and 3.

I'm not sure about 3 TBH.

Say there are 2 idle states, one shallow (say its target residency is
10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).

Currently, there are 3 bins in this case, 0 (0 - 10 us), 1 (10 us - T)
and 2 (T - infinity) and the governor will need to check bins 1 and 2
unless it somehow knows that it will use state 0.

Note that all of the events falling into bin 1 will cause the governor
to select state 0 more often, so I don't see how adding one more bin
between 1 and 2 changes this.
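
Schematically (just an illustration of the bins above, not actual teo
data structures):

	/*
	 *   bin 0:  [0, 10 us)              - shorter than state 0's residency
	 *   bin 1:  [10 us, 2 * TICK_NSEC)  - fits state 0, too short for state 1
	 *   bin 2:  [2 * TICK_NSEC, inf)    - fits state 1
	 *
	 * Wakeups landing in bins 0 and 1 both push the governor towards
	 * state 0, and a new bin splitting bin 1 at roughly TICK_NSEC would
	 * not add a state that either half maps to.
	 */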

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31 10:35           ` Rafael J. Wysocki
@ 2023-07-31 11:00             ` Rafael J. Wysocki
  2023-07-31 11:38             ` Peter Zijlstra
  1 sibling, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, Jul 31, 2023 at 12:35 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Jul 31, 2023 at 11:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jul 31, 2023 at 10:01:53AM +0200, Rafael J. Wysocki wrote:
> > > On Sat, Jul 29, 2023 at 10:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> > > > > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > >
> > > > > > In order to facilitate governors that track history in idle-state
> > > > > > buckets (TEO) making a useful decision about NOHZ, make sure we have a
> > > > > > bucket that counts tick-and-longer.
> > > > > >
> > > > > > In order to be inclusive of the tick itself -- after all, if we do not
> > > > > > disable NOHZ we'll sleep for a full tick, the actual boundary should
> > > > > > be just short of a full tick.
> > > > > >
> > > > > > IOW, when registering the idle-states, add one that is always
> > > > > > disabled, just to have a bucket.
> > > > >
> > > > > This extra bucket can be created in the governor itself, can't it?
> > > >
> > > > I couldn't find a nice spot for the governor to add idle-states.
> > >
> > > Well, I've thought this through and recalled a couple of things and my
> > > conclusion is that the decision whether or not to stop the tick really
> > > depends on the idle state choice.
> > >
> > > There are three cases:
> > >
> > > 1. The selected idle state is shallow (that is, its target residency
> > > is below the tick period length), but it is not the deepest one.
> > > 2. The selected idle state is shallow, but it is the deepest one (or
> > > at least the deepest enabled one).
> > > 3. The selected idle state is deep (that is, its target residency is
> > > above the tick length period).
> > >
> > > In case 1, the tick should not be stopped so as to prevent the CPU
> > > from spending too much time in a suboptimal idle state.
> > >
> > > In case 3, the tick needs to be stopped, because otherwise the target
> > > residency of the selected state would not be met.
> > >
> > > Case 2 is somewhat a mixed bag, but generally speaking stopping the
> > > tick doesn't hurt if the selected idle state is the deepest one,
> > > because in that case the governor kind of expects a significant exit
> > > latency anyway.  If it is not the deepest one (which is disabled),
> > > it's better to let the tick tick.
> >
> > So I agree with 1.
> >
> > I do not agree with 2. Disabling the tick is costly, doubly so with the
> > timer-pull thing, but even today. Simply disabling it because we picked
> > the deepest idle state, irrespective of the expected duration is wrong
> > as it will incur this significant cost.
> >
> > With 3 there is the question of how we get the expected sleep duration;
> > this is especially important with timer-pull, where we have this
> > chicken-and-egg thing.
> >
> > Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> > disabled
>
> Well, it shouldn't.  Or at least it didn't before.
>
> It is expected to produce two values, one with the tick stopped (this
> is the return value of the function) and the other with the tick
> ticking (this is the one written under the address passed as the arg).
> This cannot depend on whether or not the tick will be stopped.  Both
> are good to know.
>
> Now, I understand that getting these two values may be costly, so
> there is an incentive to avoid calling it, but then the governor needs
> to figure this out from its crystal ball and so care needs to be taken
> to limit the possible damage in case the crystal ball is not right.
>
> > and cpuidle wants to use tick_nohz_get_sleep_length() to
> > determine if to disable the tick. This cycle needs to be broken for
> > timer-pull.
> >
> > Hence my proposal to introduce the extra tick state, that allows fixing
> > both 2 and 3.
>
> I'm not sure about 3 TBH.
>
> Say there are 2 idle states, one shallow (say its target residency is
> 10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).
>
> Currently, there are 3 bins in this case, 0 (0 - 10 us), 1 (10 us - T)
> and 2 (T - infinity) and the governor will need to check bins 1 and 2
> unless it somehow knows that it will use state 0.
>
> Note that all of the events falling into bin 1 will cause the governor
> to select state 0 more often, so I don't see how adding one more bin
> between 1 and 2 changes this.

BTW, in teo, using the return value of tick_nohz_get_sleep_length()
for the initial state selection is an optimization.  In principle, it
could consider all of the bins every time and use the
tick_nohz_get_sleep_length() return value to refine the state
selection in case it turns out to be small enough.

So if this is changed, it would avoid calling
tick_nohz_get_sleep_length() when it knows that, say, state 0 will be
selected anyway (or that the first non-polling state will be selected
anyway or some such).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31 10:35           ` Rafael J. Wysocki
  2023-07-31 11:00             ` Rafael J. Wysocki
@ 2023-07-31 11:38             ` Peter Zijlstra
  2023-07-31 16:55               ` Rafael J. Wysocki
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-31 11:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, Jul 31, 2023 at 12:35:20PM +0200, Rafael J. Wysocki wrote:

> > So I agree with 1.
> >
> > I do not agree with 2. Disabling the tick is costly, doubly so with the
> > timer-pull thing, but even today. Simply disabling it because we picked
> > the deepest idle state, irrespective of the expected duration is wrong
> > as it will incur this significant cost.
> >
> > With 3 there is the question of how we get the expected sleep duration;
> > this is especially important with timer-pull, where we have this
> > chicken-and-egg thing.
> >
> > Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> > disabled
> 
> Well, it shouldn't.  Or at least it didn't before.

Correct, this is new in the timer-pull thing.

> It is expected to produce two values, one with the tick stopped (this
> is the return value of the function) and the other with the tick
> ticking (this is the one written under the address passed as the arg).
> This cannot depend on whether or not the tick will be stopped.  Both
> are good to know.
> 
> Now, I understand that getting these two values may be costly, so
> there is an incentive to avoid calling it, but then the governor needs
> to figure this out from its crystal ball and so care needs to be taken
> to limit the possible damage in case the crystal ball is not right.

If we can get the governor to decide the tick state up-front we can
avoid a lot of the expensive parts.

> > and cpuidle wants to use tick_nohz_get_sleep_length() to
> > determine if to disable the tick. This cycle needs to be broken for
> > timer-pull.
> >
> > Hence my proposal to introduce the extra tick state, that allows fixing
> > both 2 and 3.
> 
> I'm not sure about 3 TBH.
> 
> Say there are 2 idle states, one shallow (say its target residency is
> 10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).

This is the easy case and that actually 'works' today. The
interesting case is where your deepest state has a target residency that
is below the tick (because for HZ=100, we have a 10ms tick and pretty
much all idle states are below that).

In that case you cannot tell the difference between I'm good to use this
state and I'm good to disable the tick and still use this state.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions
  2023-07-28 14:55 [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Peter Zijlstra
                   ` (2 preceding siblings ...)
  2023-07-28 14:55 ` [RFC][PATCH 3/3] cpuidle,teo: Improve state selection Peter Zijlstra
@ 2023-07-31 11:46 ` Anna-Maria Behnsen
  2023-07-31 17:51   ` Rafael J. Wysocki
  3 siblings, 1 reply; 27+ messages in thread
From: Anna-Maria Behnsen @ 2023-07-31 11:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rafael, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

Hi,

On Fri, 28 Jul 2023, Peter Zijlstra wrote:

> Hi,
> 
> Wanted to send this yesterday, but my home server died and took everything
> down :/
> 
> These patches are lightly tested but appear to behave as expected.
> 
> 

As I was asked to see if Rafael's patches improve the behavior, I
reran the tests with Rafael's v2 as well as with Peter's RFC patchset. Here
are the results (compared to upstream):

			upstream		Rafael v2		Peter RFC

Idle Total		2533	100.00%		1183	100.00%		5563	100.00%
x >= 4ms		1458	57.56%		1151	97.30%		3385	60.85%
4ms > x >= 2ms		91	3.59%		12	1.01%		133	2.39%
2ms > x >= 1ms		56	2.21%		3	0.25%		80	1.44%
1ms > x >= 500us	64	2.53%		1	0.08%		98	1.76%
500us > x >= 250us	73	2.88%		0	0.00%		113	2.03%
250us > x >=100us	76	3.00%		2	0.17%		106	1.91%
100us > x >= 50us	33	1.30%		4	0.34%		75	1.35%
50us > x >= 25us	39	1.54%		4	0.34%		152	2.73%
25us > x >= 10us	199	7.86%		4	0.34%		404	7.26%
10us > x > 5us		156	6.16%		0	0.00%		477	8.57%
5us > x			288	11.37%		2	0.17%		540	9.71%

tick_nohz_next_event()
count			8839790			6142079			36623


Rafael's approach still executes tick_nohz_get_sleep_length()
unconditionally. It calls tick_nohz_next_event() roughly 5000 times for
every time the tick is stopped. This ratio improves massively with
Peter's approach (the factor is ~7).

Thanks,

	Anna-Maria


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management
  2023-07-31 10:17       ` Rafael J. Wysocki
@ 2023-07-31 12:02         ` Peter Zijlstra
  2023-07-31 17:20           ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-31 12:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, Jul 31, 2023 at 12:17:27PM +0200, Rafael J. Wysocki wrote:

> Something really simple like:
> 
> 1. Check sched_cpu_util() (which is done by teo anyway).
> 2. If that is around 90% of the maximum CPU capacity, select the first
> non-polling idle state and be done (don't stop the tick as my other
> reply earlier today).

So I really don't like using cpu_util() here, yes, 90% is a high number,
but it doesn't say *anything* about the idle duration. Remember, this is
a 32ms window, so 90% of that is 28.8ms.

(not entirely accurate, since it's an exponential average, but that
doesn't change the overall argument, only some of the particulars)

That is, 90% util, at best, says there is no idle longer than 3.2 ms.
But that is still vastly longer than pretty much all residencies. Heck,
that is still 3 ticks worth of HZ=1000 ticks. So 90% util should not
preclude disabling the tick (at HZ=1000).

Now, typically this won't be the case, and at 90% you'll have lots of
small idles adding up to 3.2ms total idle. But the point is, you can't
tell the difference. And as such util is a horrible measure to use for
cpuidle.

> > If we track the tick+ bucket -- as
> > we must in order to say anything useful about it, then we can decide the
> > tick state before (as I do here) calling sleep_length().
> >
> > The timer-pull rework from Anna-Maria unfortunately makes the
> > tick_nohz_get_sleep_length() thing excessively expensive and it really
> > doesn't make sense to call it when we retain the tick.
> >
> > It's all a bit of a chicken-egg situation, cpuidle wants to know when
> > the next timer is, but telling when that is, wants to know if the tick
> > stays. We need to break that somehow -- I propose by not calling it when
> > we know we'll keep the tick.
> 
> By selecting a state whose target residency will not be met, we lose
> on both energy and performance, so doing this really should be
> avoided, unless the state is really shallow in which case there may be
> no time for making this consideration.

I'm not sure how that relates to what I propose above. By adding the
tick+ bucket we have more historical information related to the tick
boundary; how does that make us select states we won't match residency
for?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31 11:38             ` Peter Zijlstra
@ 2023-07-31 16:55               ` Rafael J. Wysocki
  2023-07-31 17:27                 ` Rafael J. Wysocki
  2023-08-02 10:34                 ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, anna-maria, tglx, frederic, gautham.shenoy,
	linux-kernel, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

On Mon, Jul 31, 2023 at 1:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 31, 2023 at 12:35:20PM +0200, Rafael J. Wysocki wrote:
>
> > > So I agree with 1.
> > >
> > > I do not agree with 2. Disabling the tick is costly, doubly so with the
> > > timer-pull thing, but even today. Simply disabling it because we picked
> > > the deepest idle state, irrespective of the expected duration is wrong
> > > as it will incur this significant cost.
> > >
> > > With 3 there is the question of how we get the expected sleep duration;
> > > this is especially important with timer-pull, where we have this
> > > chicken-and-egg thing.
> > >
> > > Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> > > disabled
> >
> > Well, it shouldn't.  Or at least it didn't before.
>
> Correct, this is new in the timer-pull thing.
>
> > It is expected to produce two values, one with the tick stopped (this
> > is the return value of the function) and the other with the tick
> > ticking (this is the one written under the address passed as the arg).
> > This cannot depend on whether or not the tick will be stopped.  Both
> > are good to know.
> >
> > Now, I understand that getting these two values may be costly, so
> > there is an incentive to avoid calling it, but then the governor needs
> > to figure this out from its crystal ball and so care needs to be taken
> > to limit the possible damage in case the crystal ball is not right.
>
> If we can get the governor to decide the tick state up-front we can
> avoid a lot of the expensive parts.

I claim that in the vast majority of cases this is the same as
deciding the state.

The case when it is not is when the target residency of the deepest
idle state is below the tick period length and the governor is about
to select that state.  According to the data I've seen so far this is
a tiny fraction of all the cases.

> > > and cpuidle wants to use tick_nohz_get_sleep_length() to
> > > determine if to disable the tick. This cycle needs to be broken for
> > > timer-pull.
> > >
> > > Hence my proposal to introduce the extra tick state, that allows fixing
> > > both 2 and 3.
> >
> > I'm not sure about 3 TBH.
> >
> > Say there are 2 idle states, one shallow (say its target residency is
> > 10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).
>
> This is the easy case and that actually 'works' today.

But this is my case 3 which you said you didn't agree with.  I don't
see why it needs to be fixed.

> The interesting case is where your deepest state has a target residency that
> is below the tick (because for HZ=100, we have a 10ms tick and pretty
> much all idle states are below that).

What about HZ=1000, though?

> In that case you cannot tell the difference between I'm good to use this
> state and I'm good to disable the tick and still use this state.

No, you don't, but is it really worth the fuss?

The state is high-latency anyway and tick_nohz_get_sleep_length()
needs to be called anyway in that case in order to determine if a
shallower state wouldn't be better.  At this point the statistics have
already told the governor otherwise and a misprediction would be a
double loss.

So yes, you can gain performance by not calling
tick_nohz_get_sleep_length(), but then you can also lose it by
selecting a state that is too deep (and that can be determined exactly
with respect to timers).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management
  2023-07-31 12:02         ` Peter Zijlstra
@ 2023-07-31 17:20           ` Rafael J. Wysocki
  0 siblings, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31 17:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, anna-maria, tglx, frederic, gautham.shenoy,
	linux-kernel, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

On Mon, Jul 31, 2023 at 2:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 31, 2023 at 12:17:27PM +0200, Rafael J. Wysocki wrote:
>
> > Something really simple like:
> >
> > 1. Check sched_cpu_util() (which is done by teo anyway).
> > 2. If that is around 90% of the maximum CPU capacity, select the first
> > non-polling idle state and be done (don't stop the tick as my other
> > reply earlier today).
>
> So I really don't like using cpu_util() here, yes, 90% is a high number,
> but it doesn't say *anything* about the idle duration. Remember, this is
> a 32ms window, so 90% of that is 28.8ms.

It doesn't have to say anything about the idle duration as long as it
says something about the governor's "accuracy".

If it is observed experimentally that the governor is generally likely
to mispredict a deeper state if CPU utilization is above a certain
threshold, then it makes sense to use this information to counter
that.  That's how it is used today.

And yes, you are right about the most immediate idle duration, but
overall the rule that selecting deep idle states is not a good idea
when CPU utilization is high does seem to hold in general.

> (not entirely accurate, since it's an exponential average, but that
> doesn't change the overall argument, only some of the particulars)
>
> That is, 90% util, at best, says there is no idle longer than 3.2 ms.
> But that is still vastly longer than pretty much all residencies. Heck,
> that is still 3 ticks worth of HZ=1000 ticks. So 90% util should not
> preclude disabling the tick (at HZ=1000).
>
> Now, typically this won't be the case, and at 90% you'll have lots of
> small idles adding up to 3.2ms total idle. But the point is, you can't
> tell the difference. And as such util is a horrible measure to use for
> cpuidle.

No it is not IMO, because idle is all about the combined outcome of
multiple cycles.

> > > If we track the tick+ bucket -- as
> > > we must in order to say anything useful about it, then we can decide the
> > > tick state before (as I do here) calling sleep_length().
> > >
> > > The timer-pull rework from Anna-Maria unfortunately makes the
> > > tick_nohz_get_sleep_length() thing excessively expensive and it really
> > > doesn't make sense to call it when we retain the tick.
> > >
> > > It's all a bit of a chicken-egg situation, cpuidle wants to know when
> > > the next timer is, but telling when that is, wants to know if the tick
> > > stays. We need to break that somehow -- I propose by not calling it when
> > > we know we'll keep the tick.
> >
> > By selecting a state whose target residency will not be met, we lose
> > on both energy and performance, so doing this really should be
> > avoided, unless the state is really shallow in which case there may be
> > no time for making this consideration.
>
> I'm not sure how that relates to what I propose above. By adding the
> tick+ bucket we have more historical information as related to the tick
> boundary, how does that make us select states we won't match residency
> for?

As stated in my last reply, the only case in which it makes a
difference is when the deepest idle state's target residency is below
the tick and I'm not really sure if that difference is demonstrably
significant.

So I would do the following to start with:

1. Rearrange the teo code so that it considers all of the bins every
time without calling tick_nohz_get_sleep_length().

2. The sched_cpu_util() check will still be applied to the resulting
candidate state as it is now.

3. If it finds that the candidate state is shallow enough (for
instance, it is a polling state or the first non-polling one), it will
return this state without calling tick_nohz_get_sleep_length() and
without stopping the tick.

4. Otherwise it will call tick_nohz_get_sleep_length() to see what the
timers look like and refine the selection (towards the shallower
states) if need be.

5. If the candidate state is not the deepest one and its target
residency is below the tick, it will be returned and the tick will not
be stopped.

6. Otherwise, the candidate state will be returned and the tick will be stopped.

If this still doesn't get us where we want to be, the extra bin can be
added (as long as it makes a measurable difference).
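
In pseudo-C, the resulting teo_select() flow would look roughly like
this (teo_pick_candidate_from_bins() is a made-up name for step 1, and
the exact "shallow enough" and "deepest" checks are just placeholders):

	/* 1: pick a candidate from the bins alone, no timer information */
	idx = teo_pick_candidate_from_bins(drv, dev);

	/* 2: apply the utilization check to the candidate, as today */
	if (teo_cpu_is_utilized(dev->cpu, cpu_data))
		idx = teo_find_shallower_state(drv, dev, idx, KTIME_MAX, true);

	/* 3: shallow enough: return it, keep the tick, skip the timer query */
	if (idx == idx0 || (drv->states[idx].flags & CPUIDLE_FLAG_POLLING)) {
		*stop_tick = false;
		return idx;
	}

	/* 4: only now ask about timers and refine towards shallower states */
	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
	if (drv->states[idx].target_residency_ns > duration_ns)
		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, false);

	/* 5: not the deepest state and below the tick: keep the tick */
	if (idx < drv->state_count - 1 &&
	    drv->states[idx].target_residency_ns < TICK_NSEC)
		*stop_tick = false;

	/* 6: otherwise *stop_tick stays true and the tick gets stopped */
	return idx;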

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31 16:55               ` Rafael J. Wysocki
@ 2023-07-31 17:27                 ` Rafael J. Wysocki
  2023-07-31 18:06                   ` Peter Zijlstra
  2023-08-02 10:34                 ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, Jul 31, 2023 at 6:55 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Jul 31, 2023 at 1:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jul 31, 2023 at 12:35:20PM +0200, Rafael J. Wysocki wrote:
> >
> > > > So I agree with 1.
> > > >
> > > > I do not agree with 2. Disabling the tick is costly, doubly so with the
> > > > timer-pull thing, but even today. Simply disabling it because we picked
> > > > the deepest idle state, irrespective of the expected duration is wrong
> > > > as it will incur this significant cost.
> > > >
> > > > With 3 there is the question of how we get the expected sleep duration;
> > > > this is especially important with timer-pull, where we have this
> > > > chicken-and-egg thing.
> > > >
> > > > Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> > > > disabled
> > >
> > > Well, it shouldn't.  Or at least it didn't before.
> >
> > Correct, this is new in the timer-pull thing.
> >
> > > It is expected to produce two values, one with the tick stopped (this
> > > is the return value of the function) and the other with the tick
> > > ticking (this is the one written under the address passed as the arg).
> > > This cannot depend on whether or not the tick will be stopped.  Both
> > > are good to know.
> > >
> > > Now, I understand that getting these two values may be costly, so
> > > there is an incentive to avoid calling it, but then the governor needs
> > > to figure this out from its crystal ball and so care needs to be taken
> > > to limit the possible damage in case the crystal ball is not right.
> >
> > If we can get the governor to decide the tick state up-front we can
> > avoid a lot of the expensive parts.
>
> I claim that in the vast majority of cases this is the same as
> deciding the state.
>
> The case when it is not is when the target residency of the deepest
> idle state is below the tick period length and the governor is about
> to select that state.  According to the data I've seen so far this is
> a tiny fraction of all the cases.
>
> > > > and cpuidle wants to use tick_nohz_get_sleep_length() to
> > > > determine if to disable the tick. This cycle needs to be broken for
> > > > timer-pull.
> > > >
> > > > Hence my proposal to introduce the extra tick state, that allows fixing
> > > > both 2 and 3.
> > >
> > > I'm not sure about 3 TBH.
> > >
> > > Say there are 2 idle states, one shallow (say its target residency is
> > > 10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).
> >
> > This is the easy case and that actually 'works' today.
>
> But this is my case 3 which you said you didn't agree with.  I don't
> see why it needs to be fixed.
>
> > The interesting case is where your deepest state has a target residency that
> > is below the tick (because for HZ=100, we have a 10ms tick and pretty
> > much all idle states are below that).
>
> What about HZ=1000, though?
>
> > In that case you cannot tell the difference between I'm good to use this
> > state and I'm good to disable the tick and still use this state.
>
> No, you don't, but is it really worth the fuss?
>
> The state is high-latency anyway and tick_nohz_get_sleep_length()
> needs to be called anyway in that case in order to determine if a
> shallower state wouldn't be better.  At this point the statistics have
> already told the governor otherwise and a misprediction would be a
> double loss.
>
> So yes, you can gain performance by avoiding to call
> tick_nohz_get_sleep_length(), but then you can also lose it by
> selecting a state that is too deep (and that can be determined exactly
> with respect to timers).

BTW, note that teo records timers as "hits", because it has an idea
about when the next timer should occur and that's because it calls
tick_nohz_get_sleep_length().

If it doesn't call that function, it will not be able to tell the
difference between a non-tick timer and any other wakeup, so the
non-tick timer wakeups will then be recorded as "intercepts" which
will skew it towards shallow states over time.  That's one of the
reasons why I would prefer to only avoid calling
tick_nohz_get_sleep_length() when the candidate idle state is really
shallow.
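
A grossly simplified picture of that bookkeeping (the real teo_update()
uses decaying sums rather than plain counters, and bin_of() is a
made-up helper here):

	idx_timer = bin_of(cpu_data->sleep_length_ns);	/* anticipated timer */
	idx_duration = bin_of(measured_ns);		/* actual wakeup */

	if (idx_duration == idx_timer)
		bins[idx_timer].hits++;			/* woken by the timer */
	else
		bins[idx_duration].intercepts++;	/* woken early */

Without a meaningful sleep_length_ns, a wakeup caused by a non-tick
timer no longer lands in the same bin as idx_timer, so it ends up in
the "else" branch above.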

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions
  2023-07-31 11:46 ` [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Anna-Maria Behnsen
@ 2023-07-31 17:51   ` Rafael J. Wysocki
  2023-08-02  8:12     ` Anna-Maria Behnsen
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-07-31 17:51 UTC (permalink / raw)
  To: Anna-Maria Behnsen
  Cc: Peter Zijlstra, rafael, tglx, frederic, gautham.shenoy,
	linux-kernel, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

On Mon, Jul 31, 2023 at 1:47 PM Anna-Maria Behnsen
<anna-maria@linutronix.de> wrote:
>
> Hi,
>
> On Fri, 28 Jul 2023, Peter Zijlstra wrote:
>
> > Hi,
> >
> > Wanted to send this yesterday, but my home server died and took everything
> > down :/
> >
> > These patches are lightly tested but appear to behave as expected.
> >
> >
>
> As I was asked to see if Rafael's patches improve the behavior, I
> reran the tests with Rafael's v2 as well as with Peter's RFC patchset. Here
> are the results (compared to upstream):

Thanks for the numbers!

>                         upstream                raphael v2              peter RFC
>
> Idle Total              2533    100.00%         1183    100.00%         5563    100.00%
> x >= 4ms                1458    57.56%          1151    97.30%          3385    60.85%
> 4ms> x >= 2ms           91      3.59%           12      1.01%           133     2.39%
> 2ms > x >= 1ms          56      2.21%           3       0.25%           80      1.44%
> 1ms > x >= 500us        64      2.53%           1       0.08%           98      1.76%
> 500us > x >= 250us      73      2.88%           0       0.00%           113     2.03%
> 250us > x >=100us       76      3.00%           2       0.17%           106     1.91%
> 100us > x >= 50us       33      1.30%           4       0.34%           75      1.35%
> 50us > x >= 25us        39      1.54%           4       0.34%           152     2.73%
> 25us > x >= 10us        199     7.86%           4       0.34%           404     7.26%
> 10us > x > 5us          156     6.16%           0       0.00%           477     8.57%
> 5us > x                 288     11.37%          2       0.17%           540     9.71%
>
> tick_nohz_next_event()
> count                   8839790                 6142079                 36623
>
> Rafael's approach still executes tick_nohz_get_sleep_length()
> unconditionally. It calls tick_nohz_next_event() roughly 5000 times for
> every time the tick is stopped. This ratio improves massively with
> Peter's approach (the factor is ~7).

So I'm drawing slightly different conclusions from the above.

First, because there are different numbers of idle cycles in each
case, I think it only makes sense to look at the percentages.

So in both the "upstream" and "Peter RFC" cases there is a significant
percentage of times when the tick was stopped and the measured idle
duration was below 25 us (25.39% and 25.54%, respectively), whereas in
the "Rafael v2" case that is only 0.51% of times (not even 1%).  This
means a huge improvement to me, because all of these cases mean that
the governor incorrectly told the idle loop to stop the tick (it must
have selected a shallow state and so it should not have stopped the
tick in those cases).  To me, this clearly means fixing a real bug in
the governor.

Now, I said in the changelog of my v1 that the goal was not to reduce
the number of tick_nohz_get_sleep_length() invocations, so I'm not
sure why it is regarded as a comparison criterion.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31 17:27                 ` Rafael J. Wysocki
@ 2023-07-31 18:06                   ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2023-07-31 18:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, Jul 31, 2023 at 07:27:27PM +0200, Rafael J. Wysocki wrote:

> BTW, note that teo records timers as "hits", because it has an idea
> about when the next timer should occur and that's because it calls
> tick_nohz_get_sleep_length().
> 
> If it doesn't call that function, it will not be able to tell the
> difference between a non-tick timer and any other wakeup, so the
> non-tick timer wakeups will then be recorded as "intercepts" which
> will skew it towards shallow states over time.  That's one of the
> reasons why I would prefer to only avoid calling
> tick_nohz_get_sleep_length() when the candidate idle state is really
> shallow.

Can be fixed using tick_nohz_get_next_hrtimer() I think. Let me try that
tomorrow.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions
  2023-07-31 17:51   ` Rafael J. Wysocki
@ 2023-08-02  8:12     ` Anna-Maria Behnsen
  0 siblings, 0 replies; 27+ messages in thread
From: Anna-Maria Behnsen @ 2023-08-02  8:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, 31 Jul 2023, Rafael J. Wysocki wrote:

> On Mon, Jul 31, 2023 at 1:47 PM Anna-Maria Behnsen
> <anna-maria@linutronix.de> wrote:
> >
> > Hi,
> >
> > On Fri, 28 Jul 2023, Peter Zijlstra wrote:
> >
> > > Hi,
> > >
> > > Wanted to send this yesterday, but my home server died and took everything
> > > down :/
> > >
> > > These patches are lightly tested but appear to behave as expected.
> > >
> > >
> >
> > As I was asked to see if Rafael's patches improve the behavior, I
> > reran the tests with Rafael's v2 as well as with Peter's RFC patchset. Here
> > are the results (compared to upstream):
> 
> Thanks for the numbers!
> 
> >                         upstream                raphael v2              peter RFC
> >
> > Idle Total              2533    100.00%         1183    100.00%         5563    100.00%
> > x >= 4ms                1458    57.56%          1151    97.30%          3385    60.85%
> > 4ms> x >= 2ms           91      3.59%           12      1.01%           133     2.39%
> > 2ms > x >= 1ms          56      2.21%           3       0.25%           80      1.44%
> > 1ms > x >= 500us        64      2.53%           1       0.08%           98      1.76%
> > 500us > x >= 250us      73      2.88%           0       0.00%           113     2.03%
> > 250us > x >=100us       76      3.00%           2       0.17%           106     1.91%
> > 100us > x >= 50us       33      1.30%           4       0.34%           75      1.35%
> > 50us > x >= 25us        39      1.54%           4       0.34%           152     2.73%
> > 25us > x >= 10us        199     7.86%           4       0.34%           404     7.26%
> > 10us > x > 5us          156     6.16%           0       0.00%           477     8.57%
> > 5us > x                 288     11.37%          2       0.17%           540     9.71%
> >
> > tick_nohz_next_event()
> > count                   8839790                 6142079                 36623
> >
> > Rafael's approach still executes tick_nohz_get_sleep_length()
> > unconditionally. It calls tick_nohz_next_event() roughly 5000 times for
> > every time the tick is stopped. This ratio improves massively with
> > Peter's approach (the factor is ~7).
> 
> So I'm drawing slightly different conclusions from the above.
> 
> First, because there are different numbers of idle cycles in each
> case, I think it only makes sense to look at the percentages.
> 
> So in both the "upstream" and "Peter RFC" cases there is a significant
> percentage of times when the tick was stopped and the measure idle
> duration was below 25 us (25.39% and 25.54%, respectively), whereas in
> the "Rafael v2" case that is only 0.51% of times (not even 1%).  This
> means a huge improvement to me, because all of these cases mean that
> the governor incorrectly told the idle loop to stop the tick (it must
> have selected a shallow state and so it should have not stopped the
> tick in those cases).  To me, this clearly means fixing a real bug in
> the governor.
> 
> Now, I said in the changelog of my v1 that the goal was not to reduce
> the number of tick_nohz_get_sleep_length() invocations, so I'm not
> sure why it is regarded as a comparison criteria.
> 

I just mentioned this ratio as the percentage values are obvious. I
know your approach does not address the tick_nohz_get_sleep_length()
invocations, but it might be worth a try - I saw your new patchset
addressing it and will have a deeper look at it today/tomorrow. Thanks!

I am also trying to play around with the timer logic (moving the marking of
the timer base as idle into tick_nohz_stop_tick()), but the results do not
look very good at the moment.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-07-31 16:55               ` Rafael J. Wysocki
  2023-07-31 17:27                 ` Rafael J. Wysocki
@ 2023-08-02 10:34                 ` Peter Zijlstra
  2023-08-02 12:44                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2023-08-02 10:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Mon, Jul 31, 2023 at 06:55:35PM +0200, Rafael J. Wysocki wrote:

> > In that case you cannot tell the difference between I'm good to use this
> > state and I'm good to disable the tick and still use this state.
> 
> No, you don't, but is it really worth the fuss?

My somewhat aged IVB-EP sits around 25 us for restarting the tick.

Depending on the C state, that is a significant chunk of exit latency,
and depending on how often you do the whole NOHZ dance, this can add up
to significant lost runtime too.

And these are all machines that have a usable TSC; these numbers all go
up significantly when you somehow end up on the HPET or similar wreckage.

Stopping the tick is slightly more expensive, but in the same order, I
get around 30 us on the IVB, vs 25 for restarting it. Reprogramming the
timer (LAPIC/TSC-DEADLINE) is the main chunk of it I suspect.

So over-all that's 55 us extra latency for the full idle path, which can
definitely hurt.

So yeah, I would say this is all worth it.

My ADL is somewhat better, but also much higher clocked, and gets around
10 us for a big core and 16 us for a little core for restarting the
tick.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-08-02 10:34                 ` Peter Zijlstra
@ 2023-08-02 12:44                   ` Rafael J. Wysocki
  2023-08-02 13:23                     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2023-08-02 12:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, anna-maria, tglx, frederic, gautham.shenoy,
	linux-kernel, daniel.lezcano, linux-pm, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid

On Wed, Aug 2, 2023 at 12:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 31, 2023 at 06:55:35PM +0200, Rafael J. Wysocki wrote:
>
> > > In that case you cannot tell the difference between I'm good to use this
> > > state and I'm good to disable the tick and still use this state.
> >
> > No, you don't, but is it really worth the fuss?
>
> My somewhat aged IVB-EP sits around 25 us for restarting the tick.
>
> Depending on the C state, that is a significant chunk of exit latency,
> and depending on how often you do the whole NOHZ dance, this can add up
> to significant lost runtime too.
>
> And these are all machines that have a usable TSC, these numbers all go
> up significantly when you somehow end up on the HPET or similar wreckage.
>
> Stopping the tick is slightly more expensive, but in the same order, I
> get around 30 us on the IVB, vs 25 for restarting it. Reprogramming the
> timer (LAPIC/TSC-DEADLINE) is the main chunk of it I suspect.
>
> So over-all that's 55 us extra latency for the full idle path, which can
> definitely hurt.
>
> So yeah, I would say this is all worth it.

I agree that, in general, it is good to avoid stopping the tick when
it is not necessary to stop it.

> My ADL is somewhat better, but also much higher clocked, and gets around
> 10 us for a big core and 16 us for a little core for restarting the
> tick.

But my overall point is different.

An additional bin would possibly help if the deepest state has been
selected and its target residency is below the tick, and the closest
timer (other than the tick) is beyond the tick.  So how much of a
difference would be made by making this particular case more accurate?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state
  2023-08-02 12:44                   ` Rafael J. Wysocki
@ 2023-08-02 13:23                     ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2023-08-02 13:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: anna-maria, tglx, frederic, gautham.shenoy, linux-kernel,
	daniel.lezcano, linux-pm, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid

On Wed, Aug 02, 2023 at 02:44:33PM +0200, Rafael J. Wysocki wrote:
> On Wed, Aug 2, 2023 at 12:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jul 31, 2023 at 06:55:35PM +0200, Rafael J. Wysocki wrote:
> >
> > > > In that case you cannot tell the difference between I'm good to use this
> > > > state and I'm good to disable the tick and still use this state.
> > >
> > > No, you don't, but is it really worth the fuss?
> >
> > My somewhat aged IVB-EP sits around 25 us for restarting the tick.
> >
> > Depending on the C state, that is a significant chunk of exit latency,
> > and depending on how often you do the whole NOHZ dance, this can add up
> > to significant lost runtime too.
> >
> > And these are all machines that have a usable TSC, these numbers all go
> > up significantly when you somehow end up on the HPET or similar wreckage.
> >
> > Stopping the tick is slightly more expensive, but in the same order, I
> > get around 30 us on the IVB, vs 25 for restarting it. Reprogramming the
> > timer (LAPIC/TSC-DEADLINE) is the main chunk of it I suspect.
> >
> > So over-all that's 55 us extra latency for the full idle path, which can
> > definitely hurt.
> >
> > So yeah, I would say this is all worth it.
> 
> I agree that, in general, it is good to avoid stopping the tick when
> it is not necessary to stop it.
> 
> > My ADL is somewhat better, but also much higher clocked, and gets around
> > 10 us for a big core and 16 us for a little core for restarting the
> > tick.
> 
> But my overall point is different.
> 
> An additional bin would possibly help if the deepest state has been
> selected and its target residency is below the tick, and the closest
> timer (other than the tick) is beyond the tick.  So how much of a
> difference would be made by making this particular case more accurate?

Many of the server parts have a deepest idle state around 600us, and
distros have HZ=250. So every idle period with 600us < x < 4000us would
unnecessarily disable the tick.

How often this happens is of course workload dependent, but if you are
unlucky it could be a lot. It also adds the above-mentioned latency to
the idle state, which for those parts is a significant chunk of extra
exit latency.

The fix is 'trivial', why not do it?

Anyway, let me post my latest hackery :-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-08-02 13:23 UTC | newest]

Thread overview: 27+ messages
2023-07-28 14:55 [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Peter Zijlstra
2023-07-28 14:55 ` [RFC][PATCH 1/3] cpuidle: Inject tick boundary state Peter Zijlstra
2023-07-28 15:36   ` Rafael J. Wysocki
2023-07-29  8:44     ` Peter Zijlstra
2023-07-31  8:01       ` Rafael J. Wysocki
2023-07-31  9:09         ` Peter Zijlstra
2023-07-31 10:35           ` Rafael J. Wysocki
2023-07-31 11:00             ` Rafael J. Wysocki
2023-07-31 11:38             ` Peter Zijlstra
2023-07-31 16:55               ` Rafael J. Wysocki
2023-07-31 17:27                 ` Rafael J. Wysocki
2023-07-31 18:06                   ` Peter Zijlstra
2023-08-02 10:34                 ` Peter Zijlstra
2023-08-02 12:44                   ` Rafael J. Wysocki
2023-08-02 13:23                     ` Peter Zijlstra
2023-07-28 14:55 ` [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management Peter Zijlstra
2023-07-28 16:56   ` Rafael J. Wysocki
2023-07-28 22:01     ` Peter Zijlstra
2023-07-31 10:17       ` Rafael J. Wysocki
2023-07-31 12:02         ` Peter Zijlstra
2023-07-31 17:20           ` Rafael J. Wysocki
2023-07-28 14:55 ` [RFC][PATCH 3/3] cpuidle,teo: Improve state selection Peter Zijlstra
2023-07-28 17:07   ` Rafael J. Wysocki
2023-07-29  8:46     ` Peter Zijlstra
2023-07-31 11:46 ` [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions Anna-Maria Behnsen
2023-07-31 17:51   ` Rafael J. Wysocki
2023-08-02  8:12     ` Anna-Maria Behnsen
