public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/3] tick-sched cleanups
@ 2024-11-08 17:48 Joel Fernandes (Google)
  2024-11-08 17:48 ` [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now Joel Fernandes (Google)
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Joel Fernandes (Google) @ 2024-11-08 17:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: anna-maria, frederic, mingo, tglx, Joel Fernandes (Google)

Hello,
Here are some things I noticed which could be improved in the tick-sched code.
I still have to work on the third patch a bit more. I have done some basic
testing on the patches and I am looking for any feedback!

Thanks.

Joel Fernandes (Google) (3):
  tick-sched: Remove last_tick and calculate next tick from now
  tick-sched: Keep tick on if hrtimer is due imminently
  tick-sched: Replace jiffie readout with idle_entrytime

 kernel/time/tick-sched.c | 39 +++++++++++++--------------------------
 kernel/time/tick-sched.h |  1 -
 kernel/time/timer_list.c |  1 -
 3 files changed, 13 insertions(+), 28 deletions(-)

-- 
2.47.0.277.g8800431eea-goog


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-08 17:48 [RFC 0/3] tick-sched cleanups Joel Fernandes (Google)
@ 2024-11-08 17:48 ` Joel Fernandes (Google)
  2024-11-11 23:43   ` Frederic Weisbecker
  2024-11-08 17:48 ` [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently Joel Fernandes (Google)
  2024-11-08 17:48 ` [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime Joel Fernandes (Google)
  2 siblings, 1 reply; 22+ messages in thread
From: Joel Fernandes (Google) @ 2024-11-08 17:48 UTC (permalink / raw)
  To: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Thomas Gleixner
  Cc: Joel Fernandes (Google)

During tick restart, we use last_tick and forward it past now.

Since we are forwarding past now, we can simply use now as a reference
instead of last_tick. This patch removes last_tick and does so.

This patch potentially does more mul/imul than the existing code,
as sometimes forwarding past now need not be done if last_tick > now.
However, the patch is a cleanup which reduces LOC and reduces the size
of struct tick_sched.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/time/tick-sched.c | 7 ++-----
 kernel/time/tick-sched.h | 1 -
 kernel/time/timer_list.c | 1 -
 3 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 71a792cd8936..52a4eda664cf 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -837,11 +837,9 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
+	/* Set the time to expire on the next tick and not some far away future. */
 	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
-
-	/* Forward the time to expire in the future */
-	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
+	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);
 
 	if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
 		hrtimer_start_expires(&ts->sched_timer,
@@ -1043,7 +1041,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 		calc_load_nohz_start();
 		quiet_vmstat();
 
-		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		tick_sched_flag_set(ts, TS_FLAG_STOPPED);
 		trace_tick_stop(1, TICK_DEP_MASK_NONE);
 	}
diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index b4a7822f495d..7210cc473855 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -71,7 +71,6 @@ struct tick_sched {
 
 	/* Tick handling */
 	struct hrtimer			sched_timer;
-	ktime_t				last_tick;
 	ktime_t				next_tick;
 	unsigned long			idle_jiffies;
 	ktime_t				idle_waketime;
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 1c311c46da50..26688a3b8ea8 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -154,7 +154,6 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 		struct tick_sched *ts = tick_get_tick_sched(cpu);
 		P_flag(nohz, TS_FLAG_NOHZ);
 		P_flag(highres, TS_FLAG_HIGHRES);
-		P_ns(last_tick);
 		P_flag(tick_stopped, TS_FLAG_STOPPED);
 		P(idle_jiffies);
 		P(idle_calls);
-- 
2.47.0.277.g8800431eea-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently
  2024-11-08 17:48 [RFC 0/3] tick-sched cleanups Joel Fernandes (Google)
  2024-11-08 17:48 ` [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now Joel Fernandes (Google)
@ 2024-11-08 17:48 ` Joel Fernandes (Google)
  2024-11-11 12:37   ` Christian Loehle
  2024-11-08 17:48 ` [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime Joel Fernandes (Google)
  2 siblings, 1 reply; 22+ messages in thread
From: Joel Fernandes (Google) @ 2024-11-08 17:48 UTC (permalink / raw)
  To: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Thomas Gleixner
  Cc: Joel Fernandes (Google)

In highres mode, the kernel only considers timer wheel events when
considering whether to keep the tick on (via get_next_interrupt()).

This seems odd because it consider several other reasons to keep the
tick on. Further, turning off the tick does not help because once idle
exit happens due to that imminent hrtimer interrupt, the tick hrtimer
interrupt is requeued. That means more hrtimer rbtree operations for not
much benefit.

Ideally we should not have to do anything because the cpuidle governor
should not try to the stop the tick because it knows about this
situation, but apparently it still does try to stop the tick.

To be more efficient, check for any immminent non-sched hrtimer event
and keep the tick on if we know such events exist.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/time/tick-sched.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 52a4eda664cf..4aa64266f2b0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -890,7 +890,7 @@ u64 get_jiffies_update(unsigned long *basej)
  */
 static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	u64 basemono, next_tick, delta, expires;
+	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo;
 	unsigned long basejiff;
 	int tick_cpu;
 
@@ -932,7 +932,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 	 * force prod the timer.
 	 */
 	delta = next_tick - basemono;
-	if (delta <= (u64)TICK_NSEC) {
+	next_hr_wo = hrtimer_next_event_without(&ts->sched_timer);
+	delta_hr = next_hr_wo - basemono;
+
+	if (delta <= (u64)TICK_NSEC || delta_hr <= (u64)TICK_NSEC) {
 		/*
 		 * We've not stopped the tick yet, and there's a timer in the
 		 * next period, so no point in stopping it either, bail.
-- 
2.47.0.277.g8800431eea-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-08 17:48 [RFC 0/3] tick-sched cleanups Joel Fernandes (Google)
  2024-11-08 17:48 ` [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now Joel Fernandes (Google)
  2024-11-08 17:48 ` [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently Joel Fernandes (Google)
@ 2024-11-08 17:48 ` Joel Fernandes (Google)
  2024-11-10 22:55   ` Joel Fernandes
                     ` (2 more replies)
  2 siblings, 3 replies; 22+ messages in thread
From: Joel Fernandes (Google) @ 2024-11-08 17:48 UTC (permalink / raw)
  To: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Thomas Gleixner
  Cc: Joel Fernandes (Google)

This solves the issue where jiffies can be stale and inaccurate.

Putting some prints, I see that basemono can be quite stale:
tick_nohz_next_event: basemono=18692000000 basemono_from_idle_entrytime=18695000000

Since we have 'now' in ts->idle_entrytime, we can just use that. It is
more accurate, cleaner, reduces lines of code and reduces any lock
contention with the seq locks.

I was also concerned about issue where jiffies is not updated for a long
time, and then we receive a non-tick interrupt in the future. Relying on
stale jiffies value and using that as base can be inaccurate to determine
whether next event occurs within next tick. Fix that.

XXX: Need to fix issue in idle accounting which does 'jiffies -
idle_entrytime'. If idle_entrytime is more current than jiffies, it
could cause negative values. I could replace jiffies with idle_exittime
in this computation potentially to fix that.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/time/tick-sched.c | 27 +++++++--------------------
 1 file changed, 7 insertions(+), 20 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4aa64266f2b0..22a4f96d9585 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -860,24 +860,6 @@ static inline bool local_timer_softirq_pending(void)
 	return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
 }
 
-/*
- * Read jiffies and the time when jiffies were updated last
- */
-u64 get_jiffies_update(unsigned long *basej)
-{
-	unsigned long basejiff;
-	unsigned int seq;
-	u64 basemono;
-
-	do {
-		seq = read_seqcount_begin(&jiffies_seq);
-		basemono = last_jiffies_update;
-		basejiff = jiffies;
-	} while (read_seqcount_retry(&jiffies_seq, seq));
-	*basej = basejiff;
-	return basemono;
-}
-
 /**
  * tick_nohz_next_event() - return the clock monotonic based next event
  * @ts:		pointer to tick_sched struct
@@ -887,14 +869,19 @@ u64 get_jiffies_update(unsigned long *basej)
  * *%0		- When the next event is a maximum of TICK_NSEC in the future
  *		  and the tick is not stopped yet
  * *%next_event	- Next event based on clock monotonic
+ *
+ * Note: ts->idle_entrytime is updated with 'now' via tick_nohz_idle_enter().
  */
 static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo;
+	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo, boot_ticks;
 	unsigned long basejiff;
 	int tick_cpu;
 
-	basemono = get_jiffies_update(&basejiff);
+	boot_ticks = DIV_ROUND_DOWN_ULL(ts->idle_entrytime, TICK_NSEC);
+	basejiff = boot_ticks + INITIAL_JIFFIES;
+	basemono = boot_ticks * TICK_NSEC;
+
 	ts->last_jiffies = basejiff;
 	ts->timer_expires_base = basemono;
 
-- 
2.47.0.277.g8800431eea-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-08 17:48 ` [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime Joel Fernandes (Google)
@ 2024-11-10 22:55   ` Joel Fernandes
  2024-11-12 14:48     ` Thomas Gleixner
  2024-11-11 22:25   ` Frederic Weisbecker
  2024-11-12 14:30   ` Thomas Gleixner
  2 siblings, 1 reply; 22+ messages in thread
From: Joel Fernandes @ 2024-11-10 22:55 UTC (permalink / raw)
  To: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Thomas Gleixner

On Fri, Nov 08, 2024 at 05:48:36PM +0000, Joel Fernandes (Google) wrote:
> This solves the issue where jiffies can be stale and inaccurate.
> 
> Putting some prints, I see that basemono can be quite stale:
> tick_nohz_next_event: basemono=18692000000 basemono_from_idle_entrytime=18695000000
> 
> Since we have 'now' in ts->idle_entrytime, we can just use that. It is
> more accurate, cleaner, reduces lines of code and reduces any lock
> contention with the seq locks.
> 
> I was also concerned about issue where jiffies is not updated for a long
> time, and then we receive a non-tick interrupt in the future. Relying on
> stale jiffies value and using that as base can be inaccurate to determine
> whether next event occurs within next tick. Fix that.
> 
> XXX: Need to fix issue in idle accounting which does 'jiffies -
> idle_entrytime'. If idle_entrytime is more current than jiffies, it
> could cause negative values. I could replace jiffies with idle_exittime
> in this computation potentially to fix that.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/time/tick-sched.c | 27 +++++++--------------------
>  1 file changed, 7 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 4aa64266f2b0..22a4f96d9585 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -860,24 +860,6 @@ static inline bool local_timer_softirq_pending(void)
>  	return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
>  }
>  
> -/*
> - * Read jiffies and the time when jiffies were updated last
> - */
> -u64 get_jiffies_update(unsigned long *basej)
> -{
> -	unsigned long basejiff;
> -	unsigned int seq;
> -	u64 basemono;
> -
> -	do {
> -		seq = read_seqcount_begin(&jiffies_seq);
> -		basemono = last_jiffies_update;
> -		basejiff = jiffies;
> -	} while (read_seqcount_retry(&jiffies_seq, seq));
> -	*basej = basejiff;
> -	return basemono;
> -}
> -
>  /**
>   * tick_nohz_next_event() - return the clock monotonic based next event
>   * @ts:		pointer to tick_sched struct
> @@ -887,14 +869,19 @@ u64 get_jiffies_update(unsigned long *basej)
>   * *%0		- When the next event is a maximum of TICK_NSEC in the future
>   *		  and the tick is not stopped yet
>   * *%next_event	- Next event based on clock monotonic
> + *
> + * Note: ts->idle_entrytime is updated with 'now' via tick_nohz_idle_enter().
>   */
>  static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>  {
> -	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo;
> +	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo, boot_ticks;
>  	unsigned long basejiff;
>  	int tick_cpu;
>  
> -	basemono = get_jiffies_update(&basejiff);
> +	boot_ticks = DIV_ROUND_DOWN_ULL(ts->idle_entrytime, TICK_NSEC);
> +	basejiff = boot_ticks + INITIAL_JIFFIES;
> +	basemono = boot_ticks * TICK_NSEC;
> +

There is a bug here, I end up overcounting basejiff. I did something like
this and it now makes basejiff equivalent to the previous code so should be
good. I'll work more on it this week...

                                                                                       ─╯
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d88b13076b79..5387c67eea7a 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -34,6 +34,8 @@ DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
  */
 ktime_t tick_next_period;

+ktime_t tick_first_period;
+
 /*
  * tick_do_timer_cpu is a timer core internal variable which holds the CPU NR
  * which is responsible for calling do_timer(), i.e. the timekeeping stuff. This
@@ -219,6 +221,7 @@ static void tick_setup_device(struct tick_device *td,
                if (READ_ONCE(tick_do_timer_cpu) == TICK_DO_TIMER_BOOT) {
                        WRITE_ONCE(tick_do_timer_cpu, cpu);
                        tick_next_period = ktime_get();
+                       tick_first_period = tick_next_period;
 #ifdef CONFIG_NO_HZ_FULL
                        /*
                         * The boot CPU may be nohz_full, in which case set
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 5f2105e637bd..a15721516a85 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -20,6 +20,7 @@ struct timer_events {

 DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
 extern ktime_t tick_next_period;
+extern ktime_t tick_first_period;
 extern int tick_do_timer_cpu __read_mostly;

 extern void tick_setup_periodic(struct clock_event_device *dev, int broadcast);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8a245f8ceb56..8fdfda4b8af3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -895,11 +896,23 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
        u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo, boot_ticks;
        unsigned long basejiff;
        int tick_cpu;

        boot_ticks = DIV_ROUND_DOWN_ULL(ts->idle_entrytime, TICK_NSEC);
        basejiff = boot_ticks + INITIAL_JIFFIES;
        basemono = boot_ticks * TICK_NSEC;

+       /*
+        * There is some time that passes between when clocksource starts and the
+        * first time tick device is setup. Offset basejiff by that.
+       */
+       basejiff -= DIV_ROUND_DOWN_ULL(tick_first_period, TICK_NSEC);
+
        ts->last_jiffies = basejiff;
        ts->timer_expires_base = basemono;


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently
  2024-11-08 17:48 ` [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently Joel Fernandes (Google)
@ 2024-11-11 12:37   ` Christian Loehle
  2024-11-11 15:56     ` Joel Fernandes
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Loehle @ 2024-11-11 12:37 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel, Anna-Maria Behnsen,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner

On 11/8/24 17:48, Joel Fernandes (Google) wrote:
> In highres mode, the kernel only considers timer wheel events when
> considering whether to keep the tick on (via get_next_interrupt()).
> 
> This seems odd because it consider several other reasons to keep the
> tick on. Further, turning off the tick does not help because once idle
> exit happens due to that imminent hrtimer interrupt, the tick hrtimer
> interrupt is requeued. That means more hrtimer rbtree operations for not
> much benefit.
> 
> Ideally we should not have to do anything because the cpuidle governor
> should not try to the stop the tick because it knows about this
> situation, but apparently it still does try to stop the tick.

Any details on this? Which governor?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently
  2024-11-11 12:37   ` Christian Loehle
@ 2024-11-11 15:56     ` Joel Fernandes
  2024-11-11 16:55       ` Christian Loehle
  0 siblings, 1 reply; 22+ messages in thread
From: Joel Fernandes @ 2024-11-11 15:56 UTC (permalink / raw)
  To: Christian Loehle
  Cc: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Thomas Gleixner

On Mon, Nov 11, 2024 at 7:38 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 11/8/24 17:48, Joel Fernandes (Google) wrote:
> > In highres mode, the kernel only considers timer wheel events when
> > considering whether to keep the tick on (via get_next_interrupt()).
> >
> > This seems odd because it consider several other reasons to keep the
> > tick on. Further, turning off the tick does not help because once idle
> > exit happens due to that imminent hrtimer interrupt, the tick hrtimer
> > interrupt is requeued. That means more hrtimer rbtree operations for not
> > much benefit.
> >
> > Ideally we should not have to do anything because the cpuidle governor
> > should not try to the stop the tick because it knows about this
> > situation, but apparently it still does try to stop the tick.
>
> Any details on this? Which governor?

I noticed this in Qemu (virtualized hardware). Actually I need to
update the commit message. I think it is not because of the governor
but because of lack of guest cpuidle support.

static void cpuidle_idle_call(void)
{
....
  if (cpuidle_not_available(drv, dev)) {
    tick_nohz_idle_stop_tick();
    default_idle_call();
    goto exit_idle;
  }
...
Over here dev and drv are NULL for me. I will also test on real hardware.

Also maybe the " if (cpuidle_not_available(drv, dev))" condition
should do some more work to determine if tick_nohz_idle_stop_tick()
should be called instead of unconditionally calling it?

Pasting relevant parts of my .config:

# grep IDLE .config
CONFIG_NO_HZ_IDLE=y
CONFIG_ARCH_CPUIDLE_HALTPOLL=y
CONFIG_ACPI_PROCESSOR_IDLE=y
# CPU Idle
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
CONFIG_CPU_IDLE_GOV_HALTPOLL=y
CONFIG_HALTPOLL_CPUIDLE=y
# end of CPU Idle
CONFIG_INTEL_IDLE=y

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently
  2024-11-11 15:56     ` Joel Fernandes
@ 2024-11-11 16:55       ` Christian Loehle
  2024-11-11 17:17         ` Joel Fernandes
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Loehle @ 2024-11-11 16:55 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Thomas Gleixner

On 11/11/24 15:56, Joel Fernandes wrote:
> On Mon, Nov 11, 2024 at 7:38 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 11/8/24 17:48, Joel Fernandes (Google) wrote:
>>> In highres mode, the kernel only considers timer wheel events when
>>> considering whether to keep the tick on (via get_next_interrupt()).
>>>
>>> This seems odd because it consider several other reasons to keep the
>>> tick on. Further, turning off the tick does not help because once idle
>>> exit happens due to that imminent hrtimer interrupt, the tick hrtimer
>>> interrupt is requeued. That means more hrtimer rbtree operations for not
>>> much benefit.
>>>
>>> Ideally we should not have to do anything because the cpuidle governor
>>> should not try to the stop the tick because it knows about this
>>> situation, but apparently it still does try to stop the tick.
>>
>> Any details on this? Which governor?
> 
> I noticed this in Qemu (virtualized hardware). Actually I need to
> update the commit message. I think it is not because of the governor
> but because of lack of guest cpuidle support.

Ah indeed, then it makes sense.
FYI Anna-Maria proposed something like below a year ago:
https://lore.kernel.org/lkml/20231215130501.24542-1-anna-maria@linutronix.de/
I have no strong opinion on it either way.

Regards,
Christian

> 
> static void cpuidle_idle_call(void)
> {
> ....
>   if (cpuidle_not_available(drv, dev)) {
>     tick_nohz_idle_stop_tick();
>     default_idle_call();
>     goto exit_idle;
>   }
> ...
> Over here dev and drv are NULL for me. I will also test on real hardware.
> 
> Also maybe the " if (cpuidle_not_available(drv, dev))" condition
> should do some more work to determine if tick_nohz_idle_stop_tick()
> should be called instead of unconditionally calling it?
> 
> Pasting relevant parts of my .config:
> 
> # grep IDLE .config
> CONFIG_NO_HZ_IDLE=y
> CONFIG_ARCH_CPUIDLE_HALTPOLL=y
> CONFIG_ACPI_PROCESSOR_IDLE=y
> # CPU Idle
> CONFIG_CPU_IDLE=y
> # CONFIG_CPU_IDLE_GOV_LADDER is not set
> CONFIG_CPU_IDLE_GOV_MENU=y
> # CONFIG_CPU_IDLE_GOV_TEO is not set
> CONFIG_CPU_IDLE_GOV_HALTPOLL=y
> CONFIG_HALTPOLL_CPUIDLE=y
> # end of CPU Idle
> CONFIG_INTEL_IDLE=y
> 
> thanks,
> 
>  - Joel


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently
  2024-11-11 16:55       ` Christian Loehle
@ 2024-11-11 17:17         ` Joel Fernandes
  0 siblings, 0 replies; 22+ messages in thread
From: Joel Fernandes @ 2024-11-11 17:17 UTC (permalink / raw)
  To: Christian Loehle
  Cc: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar, Thomas Gleixner

On Mon, Nov 11, 2024 at 11:55 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 11/11/24 15:56, Joel Fernandes wrote:
> > On Mon, Nov 11, 2024 at 7:38 AM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 11/8/24 17:48, Joel Fernandes (Google) wrote:
> >>> In highres mode, the kernel only considers timer wheel events when
> >>> considering whether to keep the tick on (via get_next_interrupt()).
> >>>
> >>> This seems odd because it consider several other reasons to keep the
> >>> tick on. Further, turning off the tick does not help because once idle
> >>> exit happens due to that imminent hrtimer interrupt, the tick hrtimer
> >>> interrupt is requeued. That means more hrtimer rbtree operations for not
> >>> much benefit.
> >>>
> >>> Ideally we should not have to do anything because the cpuidle governor
> >>> should not try to the stop the tick because it knows about this
> >>> situation, but apparently it still does try to stop the tick.
> >>
> >> Any details on this? Which governor?
> >
> > I noticed this in Qemu (virtualized hardware). Actually I need to
> > update the commit message. I think it is not because of the governor
> > but because of lack of guest cpuidle support.
>
> Ah indeed, then it makes sense.
> FYI Anna-Maria proposed something like below a year ago:
> https://lore.kernel.org/lkml/20231215130501.24542-1-anna-maria@linutronix.de/
> I have no strong opinion on it either way.

Thanks for the pointer, it is great to know this was discussed. One
thing I am not fully sure about that patch is, if we can just drop
tick-stoppage like that. I share Pierre's concern [1]. Maybe a better
approach is to do some rudimentary checks for imminent events when
there is no driver/device loaded and only then stop the tick. I will
try to explore such an approach and see if I can come up with
something.

thanks,

 - Joel

[1]
https://lore.kernel.org/lkml/06a2561f-557b-4eaa-8f11-75883bbbaef9@arm.com/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-08 17:48 ` [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime Joel Fernandes (Google)
  2024-11-10 22:55   ` Joel Fernandes
@ 2024-11-11 22:25   ` Frederic Weisbecker
  2024-11-13 22:39     ` Joel Fernandes
  2024-11-12 14:30   ` Thomas Gleixner
  2 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2024-11-11 22:25 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner

Le Fri, Nov 08, 2024 at 05:48:36PM +0000, Joel Fernandes (Google) a écrit :
> This solves the issue where jiffies can be stale and inaccurate.
> 
> Putting some prints, I see that basemono can be quite stale:
> tick_nohz_next_event: basemono=18692000000 basemono_from_idle_entrytime=18695000000

That's 3 ms. If HZ < 1000 that's to be expected. But even with HZ = 1000 it can
happen.

> 
> Since we have 'now' in ts->idle_entrytime, we can just use that. It is
> more accurate, cleaner, reduces lines of code and reduces any lock
> contention with the seq locks.

Do we need such accuracy? The timers rely on jiffies anyway.
Also it's a seqcount read. Basically just a pair of smp_rmb().
Not sure a division would be cheaper.

> 
> I was also concerned about issue where jiffies is not updated for a long
> time, and then we receive a non-tick interrupt in the future. Relying on
> stale jiffies value and using that as base can be inaccurate to determine
> whether next event occurs within next tick. Fix that.
> 
> XXX: Need to fix issue in idle accounting which does 'jiffies -
> idle_entrytime'. If idle_entrytime is more current than jiffies, it
> could cause negative values. I could replace jiffies with idle_exittime
> in this computation potentially to fix that.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/time/tick-sched.c | 27 +++++++--------------------
>  1 file changed, 7 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 4aa64266f2b0..22a4f96d9585 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -860,24 +860,6 @@ static inline bool local_timer_softirq_pending(void)
>  	return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
>  }
>  
> -/*
> - * Read jiffies and the time when jiffies were updated last
> - */
> -u64 get_jiffies_update(unsigned long *basej)
> -{
> -	unsigned long basejiff;
> -	unsigned int seq;
> -	u64 basemono;
> -
> -	do {
> -		seq = read_seqcount_begin(&jiffies_seq);
> -		basemono = last_jiffies_update;
> -		basejiff = jiffies;
> -	} while (read_seqcount_retry(&jiffies_seq, seq));
> -	*basej = basejiff;
> -	return basemono;
> -}

This is used in tmigr as well.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-08 17:48 ` [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now Joel Fernandes (Google)
@ 2024-11-11 23:43   ` Frederic Weisbecker
  2024-11-12 13:46     ` Thomas Gleixner
  2024-11-12 18:33     ` Joel Fernandes
  0 siblings, 2 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-11-11 23:43 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner

Le Fri, Nov 08, 2024 at 05:48:34PM +0000, Joel Fernandes (Google) a écrit :
> During tick restart, we use last_tick and forward it past now.
> 
> Since we are forwarding past now, we can simply use now as a reference
> instead of last_tick. This patch removes last_tick and does so.
> 
> This patch potentially does more mul/imul than the existing code,
> as sometimes forwarding past now need not be done if last_tick > now.

Which is not uncommon if idle exited because of a non-timer interrupt
(remote wake up IPI or hardware interrupt).

It's also cheaper with hrtimer_forward() if now - last_tick < TICK_NSEC
which is not uncommon either if idle exited because of a wake-up from the tick
(schedule_timeout for example).

> However, the patch is a cleanup which reduces LOC and reduces the size
> of struct tick_sched.

Reducing the overhead of idle exit and consolidating its code within existing
forward API is more important than a per-cpu field.

> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/time/tick-sched.c | 7 ++-----
>  kernel/time/tick-sched.h | 1 -
>  kernel/time/timer_list.c | 1 -
>  3 files changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 71a792cd8936..52a4eda664cf 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -837,11 +837,9 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
>  
>  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
>  {
> +	/* Set the time to expire on the next tick and not some far away future. */
>  	hrtimer_cancel(&ts->sched_timer);
> -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
> -
> -	/* Forward the time to expire in the future */
> -	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> +	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);

We don't want to rewrite hrtimer_forward() but, after all, the current expiry is
enough a relevant information.

How about just this? It's worth it as it now forwards after the real last programmed
tick, which should be close enough from @now with a delta below TICK_NSEC, or even
better @now is below the expiry. Therefore it should resume as just a no-op
or at worst an addition within hrtimer_forward():

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 753a184c7090..ffd0c026a248 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -838,7 +838,6 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
 
 	/* Forward the time to expire in the future */
 	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);


As for removing last_tick, I think it's a precious debugging information. But
it's lagging behind the record of the first time only the tick got stopped within
the last trip to idle. So it could become this instead:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 753a184c7090..af013f7733b2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1042,12 +1041,11 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 	if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
 		calc_load_nohz_start();
 		quiet_vmstat();
-
-		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		tick_sched_flag_set(ts, TS_FLAG_STOPPED);
 		trace_tick_stop(1, TICK_DEP_MASK_NONE);
 	}
 
+	ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 	ts->next_tick = expires;
 
 	/*

Thanks!

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-11 23:43   ` Frederic Weisbecker
@ 2024-11-12 13:46     ` Thomas Gleixner
  2024-11-12 13:56       ` Frederic Weisbecker
  2024-11-12 18:20       ` Joel Fernandes
  2024-11-12 18:33     ` Joel Fernandes
  1 sibling, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2024-11-12 13:46 UTC (permalink / raw)
  To: Frederic Weisbecker, Joel Fernandes (Google)
  Cc: linux-kernel, Anna-Maria Behnsen, Ingo Molnar

On Tue, Nov 12 2024 at 00:43, Frederic Weisbecker wrote:
> Le Fri, Nov 08, 2024 at 05:48:34PM +0000, Joel Fernandes (Google) a écrit :

>> During tick restart, we use last_tick and forward it past now.
>>
>> Since we are forwarding past now, we can simply use now as a reference
>> instead of last_tick. This patch removes last_tick and does so.
>>
>> This patch potentially does more mul/imul than the existing code,
>> as sometimes forwarding past now need not be done if last_tick > now.
>> However, the patch is a cleanup which reduces LOC and reduces the size
>> of struct tick_sched.

May I politely ask you to read and follow the Documentation
vs. changelogs?

  https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog

Also

git grep 'This patch' Documentation/process

might give you a hint.

>> -	/* Forward the time to expire in the future */
>> -	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
>> +	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);

How is a division and multiplication in this hotpath helpful? That's
awfully slow on 32-bit machines and pointless on 64-bit too.

Using now is also wrong as it breaks the sched_skew_tick distribution by
aligning the tick on all CPUs again.

IOW, this "cleanup" is making things worse.

> We don't want to rewrite hrtimer_forward() but, after all, the current expiry is
> enough a relevant information.
>
> How about just this? It's worth it as it now forwards after the real last programmed
> tick, which should be close enough from @now with a delta below TICK_NSEC, or even
> better @now is below the expiry. Therefore it should resume as just a no-op
> or at worst an addition within hrtimer_forward():
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 753a184c7090..ffd0c026a248 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -838,7 +838,6 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
>  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
>  {
>  	hrtimer_cancel(&ts->sched_timer);
> -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
>  
>  	/* Forward the time to expire in the future */
>  	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);

That's just wrong. ts->sched_timer.expires contains a tick in the
future. If tick_nohz_stop_tick() set it to 10 ticks in the future and
the CPU goes out of idle due to a device interrupt before the timer
expires, then hrtimer_forward() will do nothing because expires is ahead
of now.

Which means the CPU is not idle and has no tick until the delayed tick
which was set by tick_nohz_stop_tick() expires. Not really correct.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-12 13:46     ` Thomas Gleixner
@ 2024-11-12 13:56       ` Frederic Weisbecker
  2024-11-12 18:20       ` Joel Fernandes
  1 sibling, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2024-11-12 13:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joel Fernandes (Google), linux-kernel, Anna-Maria Behnsen,
	Ingo Molnar

Le Tue, Nov 12, 2024 at 02:46:23PM +0100, Thomas Gleixner a écrit :
> On Tue, Nov 12 2024 at 00:43, Frederic Weisbecker wrote:
> > Le Fri, Nov 08, 2024 at 05:48:34PM +0000, Joel Fernandes (Google) a écrit :
> 
> >> During tick restart, we use last_tick and forward it past now.
> >>
> >> Since we are forwarding past now, we can simply use now as a reference
> >> instead of last_tick. This patch removes last_tick and does so.
> >>
> >> This patch potentially does more mul/imul than the existing code,
> >> as sometimes forwarding past now need not be done if last_tick > now.
> >> However, the patch is a cleanup which reduces LOC and reduces the size
> >> of struct tick_sched.
> 
> May I politely ask you to read and follow the Documentation
> vs. changelogs?
> 
>   https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> 
> Also
> 
> git grep 'This patch' Documentation/process
> 
> might give you a hint.
> 
> >> -	/* Forward the time to expire in the future */
> >> -	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> >> +	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);
> 
> How is a division and multiplication in this hotpath helpful? That's
> awfully slow on 32-bit machines and pointless on 64-bit too.
> 
> Using now is also wrong as it breaks the sched_skew_tick distribution by
> aligning the tick on all CPUs again.
> 
> IOW, this "cleanup" is making things worse.
> 
> > We don't want to rewrite hrtimer_forward() but, after all, the current expiry is
> > enough a relevant information.
> >
> > How about just this? It's worth it as it now forwards after the real last programmed
> > tick, which should be close enough from @now with a delta below TICK_NSEC, or even
> > better @now is below the expiry. Therefore it should resume as just a no-op
> > or at worst an addition within hrtimer_forward():
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 753a184c7090..ffd0c026a248 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -838,7 +838,6 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> >  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
> >  {
> >  	hrtimer_cancel(&ts->sched_timer);
> > -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
> >  
> >  	/* Forward the time to expire in the future */
> >  	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> 
> That's just wrong. ts->sched_timer.expires contains a tick in the
> future. If tick_nohz_stop_tick() set it to 10 ticks in the future and
> the CPU goes out of idle due to a device interrupt before the timer
> expires, then hrtimer_forward() will do nothing because expires is ahead
> of now.
> 
> Which means the CPU is not idle and has no tick until the delayed tick
> which was set by tick_nohz_stop_tick() expires. Not really correct.

Bah! Yes of course...

> 
> Thanks,
> 
>         tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-08 17:48 ` [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime Joel Fernandes (Google)
  2024-11-10 22:55   ` Joel Fernandes
  2024-11-11 22:25   ` Frederic Weisbecker
@ 2024-11-12 14:30   ` Thomas Gleixner
  2024-11-13 22:18     ` Joel Fernandes
  2 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2024-11-12 14:30 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel, Anna-Maria Behnsen,
	Frederic Weisbecker, Ingo Molnar
  Cc: Joel Fernandes (Google)

On Fri, Nov 08 2024 at 17:48, Joel Fernandes wrote:
> This solves the issue where jiffies can be stale and inaccurate.

Which issue?

> Putting some prints, I see that basemono can be quite stale:
> tick_nohz_next_event: basemono=18692000000 basemono_from_idle_entrytime=18695000000

What is your definition of stale? 3ms on a system with HZ < 1000 is
completely correct and within the margin of the next tick, no?

> Since we have 'now' in ts->idle_entrytime, we can just use that. It is
> more accurate, cleaner, reduces lines of code and reduces any lock
> contention with the seq locks.

What's more accurate and what is the actual problem you are trying to
solve. This handwaving about cleaner, less lines of code and contention
on a non existing lock is just not helpful.

> I was also concerned about issue where jiffies is not updated for a long
> time, and then we receive a non-tick interrupt in the future. Relying on
> stale jiffies value and using that as base can be inaccurate to determine
> whether next event occurs within next tick. Fix that.

I'm failing to decode this word salad.

> XXX: Need to fix issue in idle accounting which does 'jiffies -
> idle_entrytime'. If idle_entrytime is more current than jiffies, it
> could cause negative values. I could replace jiffies with idle_exittime
> in this computation potentially to fix that.

So you "fix" some yet to be correctly described issue by breaking stuff?

>  static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>  {
> -	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo;
> +	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo, boot_ticks;
>  	unsigned long basejiff;
>  	int tick_cpu;
>  
> -	basemono = get_jiffies_update(&basejiff);
> +	boot_ticks = DIV_ROUND_DOWN_ULL(ts->idle_entrytime, TICK_NSEC);

Again this div/mult is more expensive than the sequence count on 32bit.

> -/*
> - * Read jiffies and the time when jiffies were updated last
> - */
> -u64 get_jiffies_update(unsigned long *basej)

How does this even compile? This function is global for a reason.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-10 22:55   ` Joel Fernandes
@ 2024-11-12 14:48     ` Thomas Gleixner
  2024-11-13 21:46       ` Joel Fernandes
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2024-11-12 14:48 UTC (permalink / raw)
  To: Joel Fernandes, linux-kernel, Anna-Maria Behnsen,
	Frederic Weisbecker, Ingo Molnar

On Sun, Nov 10 2024 at 22:55, Joel Fernandes wrote:
>
> +       /*
> +        * There is some time that passes between when clocksource starts and the
> +        * first time tick device is setup. Offset basejiff by that.
> +       */
> +       basejiff -= DIV_ROUND_DOWN_ULL(tick_first_period, TICK_NSEC);

We clearly need yet another division here. Especially as that division
results in the exactly same value every time.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-12 13:46     ` Thomas Gleixner
  2024-11-12 13:56       ` Frederic Weisbecker
@ 2024-11-12 18:20       ` Joel Fernandes
  1 sibling, 0 replies; 22+ messages in thread
From: Joel Fernandes @ 2024-11-12 18:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, linux-kernel, Anna-Maria Behnsen,
	Ingo Molnar

On Tue, Nov 12, 2024 at 02:46:23PM +0100, Thomas Gleixner wrote:
> On Tue, Nov 12 2024 at 00:43, Frederic Weisbecker wrote:
> > Le Fri, Nov 08, 2024 at 05:48:34PM +0000, Joel Fernandes (Google) a écrit :
> 
> >> During tick restart, we use last_tick and forward it past now.
> >>
> >> Since we are forwarding past now, we can simply use now as a reference
> >> instead of last_tick. This patch removes last_tick and does so.
> >>
> >> This patch potentially does more mul/imul than the existing code,
> >> as sometimes forwarding past now need not be done if last_tick > now.
> >> However, the patch is a cleanup which reduces LOC and reduces the size
> >> of struct tick_sched.
> 
> May I politely ask you to read and follow the Documentation
> vs. changelogs?
> 
>   https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> 
> Also
> 
> git grep 'This patch' Documentation/process
> 
> might give you a hint.

Oops, sorry. I will go read that again. My bad.

> >> -	/* Forward the time to expire in the future */
> >> -	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> >> +	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);
> 
> How is a division and multiplication in this hotpath helpful? That's
> awfully slow on 32-bit machines and pointless on 64-bit too.

Yes, I was afraid of that but also hrtimer_forward() already does div and
mult:

        if (unlikely(delta >= interval)) {
                s64 incr = ktime_to_ns(interval);

                orun = ktime_divns(delta, incr);
                hrtimer_add_expires_ns(timer, incr * orun);

I am not fully sure if I am doing division and multiplication more often than
existing code (I'll go count that), because tick should not be stopped at a
distance of just 1 tick I think (otherwise why stop it in the first place..).

> Using now is also wrong as it breaks the sched_skew_tick distribution by
> aligning the tick on all CPUs again.

I am not very familiar with that so I'll do some research on it, thanks!

> IOW, this "cleanup" is making things worse.

Sorry and thanks for filling me in on the drawbacks of this. One of the goal
of this particular change I posted is to learn "why not" and this really
helped, thanks!

> > We don't want to rewrite hrtimer_forward() but, after all, the current expiry is
> > enough a relevant information.
> >
> > How about just this? It's worth it as it now forwards after the real last programmed
> > tick, which should be close enough from @now with a delta below TICK_NSEC, or even
> > better @now is below the expiry. Therefore it should resume as just a no-op
> > or at worst an addition within hrtimer_forward():
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 753a184c7090..ffd0c026a248 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -838,7 +838,6 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> >  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
> >  {
> >  	hrtimer_cancel(&ts->sched_timer);
> > -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
> >  
> >  	/* Forward the time to expire in the future */
> >  	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> 
> That's just wrong. ts->sched_timer.expires contains a tick in the
> future. If tick_nohz_stop_tick() set it to 10 ticks in the future and
> the CPU goes out of idle due to a device interrupt before the timer
> expires, then hrtimer_forward() will do nothing because expires is ahead
> of now.
> 
> Which means the CPU is not idle and has no tick until the delayed tick
> which was set by tick_nohz_stop_tick() expires. Not really correct.

I agree, Frederic's suggestion will break as we have to reset the hrtimer back
to reality.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-11 23:43   ` Frederic Weisbecker
  2024-11-12 13:46     ` Thomas Gleixner
@ 2024-11-12 18:33     ` Joel Fernandes
  2024-11-13 12:40       ` Frederic Weisbecker
  1 sibling, 1 reply; 22+ messages in thread
From: Joel Fernandes @ 2024-11-12 18:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner

On Tue, Nov 12, 2024 at 12:43:58AM +0100, Frederic Weisbecker wrote:
> Le Fri, Nov 08, 2024 at 05:48:34PM +0000, Joel Fernandes (Google) a écrit :
> > During tick restart, we use last_tick and forward it past now.
> > 
> > Since we are forwarding past now, we can simply use now as a reference
> > instead of last_tick. This patch removes last_tick and does so.
> > 
> > This patch potentially does more mul/imul than the existing code,
> > as sometimes forwarding past now need not be done if last_tick > now.
> 
> Which is not uncommon if idle exited because of a non-timer interrupt
> (remote wake up IPI or hardware interrupt).
> 
> It's also cheaper with hrtimer_forward() if now - last_tick < TICK_NSEC
> which is not uncommon either if idle exited because of a wake-up from the tick
> (schedule_timeout for example).
> 
> > However, the patch is a cleanup which reduces LOC and reduces the size
> > of struct tick_sched.
> 
> Reducing the overhead of idle exit and consolidating its code within existing
> forward API is more important than a per-cpu field.
> 
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/time/tick-sched.c | 7 ++-----
> >  kernel/time/tick-sched.h | 1 -
> >  kernel/time/timer_list.c | 1 -
> >  3 files changed, 2 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 71a792cd8936..52a4eda664cf 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -837,11 +837,9 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> >  
> >  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
> >  {
> > +	/* Set the time to expire on the next tick and not some far away future. */
> >  	hrtimer_cancel(&ts->sched_timer);
> > -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
> > -
> > -	/* Forward the time to expire in the future */
> > -	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> > +	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);
> 
> We don't want to rewrite hrtimer_forward() but, after all, the current expiry is
> enough a relevant information.

Thanks, do you envision any way we can get past the sched_skew_tick issue
Thomas mentioned, if we still want to do something like this patch?

> How about just this? It's worth it as it now forwards after the real last programmed
> tick, which should be close enough from @now with a delta below TICK_NSEC, or even
> better @now is below the expiry. Therefore it should resume as just a no-op
> or at worst an addition within hrtimer_forward():
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 753a184c7090..ffd0c026a248 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -838,7 +838,6 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
>  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
>  {
>  	hrtimer_cancel(&ts->sched_timer);
> -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
>  
>  	/* Forward the time to expire in the future */
>  	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);

For completeness, as we discussed on other thread and Thomas mentioned, we
break code if doing this.

> As for removing last_tick, I think it's a precious debugging information. But
> it's lagging behind the record of the first time only the tick got stopped within
> the last trip to idle. So it could become this instead:
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 753a184c7090..af013f7733b2 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1042,12 +1041,11 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  	if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
>  		calc_load_nohz_start();
>  		quiet_vmstat();
> -
> -		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
>  		tick_sched_flag_set(ts, TS_FLAG_STOPPED);
>  		trace_tick_stop(1, TICK_DEP_MASK_NONE);
>  	}
>  
> +	ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
>  	ts->next_tick = expires;

Are you suggesting we roll this part of your diff into a new patch (to
improve debug)? I could do that with attribution to you. But I guess I don't
understand this particular part of your diff.

If the tick was already stopped, how does
hrtimer_get_expires(&ts->sched_timer) change since the last time the tick was
stopped? ->last_tick should be set only when the tick was last running and a
stop was attempted? Otherwise your diff might set ->last_tick well into the
future after the tick was already stopped, AFAICS.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-12 18:33     ` Joel Fernandes
@ 2024-11-13 12:40       ` Frederic Weisbecker
  2024-11-13 21:40         ` Joel Fernandes
  0 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2024-11-13 12:40 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner

Le Tue, Nov 12, 2024 at 06:33:30PM +0000, Joel Fernandes a écrit :
> On Tue, Nov 12, 2024 at 12:43:58AM +0100, Frederic Weisbecker wrote:
> > > @@ -837,11 +837,9 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> > >  
> > >  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
> > >  {
> > > +	/* Set the time to expire on the next tick and not some far away future. */
> > >  	hrtimer_cancel(&ts->sched_timer);
> > > -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
> > > -
> > > -	/* Forward the time to expire in the future */
> > > -	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> > > +	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);
> > 
> > We don't want to rewrite hrtimer_forward() but, after all, the current expiry is
> > enough a relevant information.
> 
> Thanks, do you envision any way we can get past the sched_skew_tick issue
> Thomas mentioned, if we still want to do something like this patch?

First, do we still want to do something like this patch? :-)

> 
> > How about just this? It's worth it as it now forwards after the real last programmed
> > tick, which should be close enough from @now with a delta below TICK_NSEC, or even
> > better @now is below the expiry. Therefore it should resume as just a no-op
> > or at worst an addition within hrtimer_forward():
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 753a184c7090..ffd0c026a248 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -838,7 +838,6 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> >  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
> >  {
> >  	hrtimer_cancel(&ts->sched_timer);
> > -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
> >  
> >  	/* Forward the time to expire in the future */
> >  	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> 
> For completeness, as we discussed on other thread and Thomas mentioned, we
> break code if doing this.

Right!

> 
> > As for removing last_tick, I think it's a precious debugging information. But
> > it's lagging behind the record of the first time only the tick got stopped within
> > the last trip to idle. So it could become this instead:
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 753a184c7090..af013f7733b2 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1042,12 +1041,11 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
> >  	if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> >  		calc_load_nohz_start();
> >  		quiet_vmstat();
> > -
> > -		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
> >  		tick_sched_flag_set(ts, TS_FLAG_STOPPED);
> >  		trace_tick_stop(1, TICK_DEP_MASK_NONE);
> >  	}
> >  
> > +	ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
> >  	ts->next_tick = expires;
> 
> Are you suggesting we roll this part of your diff into a new patch (to
> improve debug)? I could do that with attribution to you. But I guess I don't
> understand this particular part of your diff.

No there is no point in doing this after all. I was trying to find a point for
this ->last_tick existence but there was one I overlooked like Thomas explained.

> If the tick was already stopped, how does
> hrtimer_get_expires(&ts->sched_timer) change since the last time the tick was
> stopped? ->last_tick should be set only when the tick was last running and a
> stop was attempted? Otherwise your diff might set ->last_tick well into the
> future after the tick was already stopped, AFAICS.

Right.

Thanks.

> 
> thanks,
> 
>  - Joel
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now
  2024-11-13 12:40       ` Frederic Weisbecker
@ 2024-11-13 21:40         ` Joel Fernandes
  0 siblings, 0 replies; 22+ messages in thread
From: Joel Fernandes @ 2024-11-13 21:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner

On Wed, Nov 13, 2024 at 01:40:17PM +0100, Frederic Weisbecker wrote:
> Le Tue, Nov 12, 2024 at 06:33:30PM +0000, Joel Fernandes a écrit :
> > On Tue, Nov 12, 2024 at 12:43:58AM +0100, Frederic Weisbecker wrote:
> > > > @@ -837,11 +837,9 @@ EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> > > >  
> > > >  static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
> > > >  {
> > > > +	/* Set the time to expire on the next tick and not some far away future. */
> > > >  	hrtimer_cancel(&ts->sched_timer);
> > > > -	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
> > > > -
> > > > -	/* Forward the time to expire in the future */
> > > > -	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> > > > +	hrtimer_set_expires(&ts->sched_timer, DIV_ROUND_UP_ULL(now, TICK_NSEC) * TICK_NSEC);
> > > 
> > > We don't want to rewrite hrtimer_forward() but, after all, the current expiry is
> > > enough a relevant information.
> > 
> > Thanks, do you envision any way we can get past the sched_skew_tick issue
> > Thomas mentioned, if we still want to do something like this patch?
> 
> First, do we still want to do something like this patch? :-)

I am leaning to dropping it due to the skew issues mentioned which is a
gaping hole. And also the debug usecase you mentioned. At least I appreciate
why this mechanism exists now :-) Thank you both :-)

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-12 14:48     ` Thomas Gleixner
@ 2024-11-13 21:46       ` Joel Fernandes
  0 siblings, 0 replies; 22+ messages in thread
From: Joel Fernandes @ 2024-11-13 21:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar

On Tue, Nov 12, 2024 at 03:48:43PM +0100, Thomas Gleixner wrote:
> On Sun, Nov 10 2024 at 22:55, Joel Fernandes wrote:
> >
> > +       /*
> > +        * There is some time that passes between when clocksource starts and the
> > +        * first time tick device is setup. Offset basejiff by that.
> > +       */
> > +       basejiff -= DIV_ROUND_DOWN_ULL(tick_first_period, TICK_NSEC);
> 
> We clearly need yet another division here. Especially as that division
> results in the exactly same value every time.

Yeah I fixed it in my v2 but did not post it yet:

https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=timers/tick-sched&id=60707e27418e3bca026be0bd9b1eacfe2a6ce72a

I will hold back on posting till we finish discussing on this RFC.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-12 14:30   ` Thomas Gleixner
@ 2024-11-13 22:18     ` Joel Fernandes
  0 siblings, 0 replies; 22+ messages in thread
From: Joel Fernandes @ 2024-11-13 22:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Anna-Maria Behnsen, Frederic Weisbecker,
	Ingo Molnar

On Tue, Nov 12, 2024 at 03:30:24PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 08 2024 at 17:48, Joel Fernandes wrote:
> > This solves the issue where jiffies can be stale and inaccurate.
> 
> Which issue?

I will describe below.

> > Putting some prints, I see that basemono can be quite stale:
> > tick_nohz_next_event: basemono=18692000000 basemono_from_idle_entrytime=18695000000
> 
> What is your definition of stale? 3ms on a system with HZ < 1000 is
> completely correct and within the margin of the next tick, no?

I am on HZ=1000. So 3ms is not within margin of the next tick for me.

> > Since we have 'now' in ts->idle_entrytime, we can just use that. It is
> > more accurate, cleaner, reduces lines of code and reduces any lock
> > contention with the seq locks.
> 
> What's more accurate and what is the actual problem you are trying to
> solve. This handwaving about cleaner, less lines of code and contention
> on a non existing lock is just not helpful.

Oh sure. the concern I have is that jiffies might be quite out of date for
whatever reason as I shared above with comparing the jiffies with
idle_entrytrime. I am not sure exactly why this happens, maybe because
interrupts were disabled on the do-timer CPU? But in any case I was concerned
about this code in tick_nohz_next_event():


	if (rcu_needs_cpu() || arch_needs_cpu() ||
	    irq_work_needs_cpu() || local_timer_softirq_pending()) {
		next_tick = basemono + TICK_NSEC;
	}

If we are using a stale basemono, that just seems wrong to me. If basemono
is stale, then next_tick could even be in the past?

> > I was also concerned about issue where jiffies is not updated for a long
> > time, and then we receive a non-tick interrupt in the future. Relying on
> > stale jiffies value and using that as base can be inaccurate to determine
> > whether next event occurs within next tick. Fix that.
> 
> I'm failing to decode this word salad.

I think I came up with an incorrect explanation of why basemono is lagging -
I have to look deeper into when basemono can be out-of-date - I certainly saw
it being lagging sometimes as I showed in the traces.

Anyway my broken word salad was something like: Say jiffies update did not
happen for a long time (can the tick be turned off on a do-timer CPU?). Then
say if a non-tick interrupt, like a device interrupt bring the CPU out of
idle, and then on the way back to idle, it tries to stop the tick. 'basemono'
would be quite out of date.

I am just speculating, I don't know exactly why I see basemono lagging. I
thought maybe we can fix that and avoid issues in the future that might show
up because of such inaccuracy.

Other theories?

I wonder if you're going to mention that this code works even if basemono is
out-of-date. But I am concerned that in 'low resolution mode', if we have an
hrtimer that is about to expire within the next tick, and if basemono is
lagging by several ticks, then could this code accidentally not keep the tick
alive thinking that the hrtimer is several ticks away?

> > XXX: Need to fix issue in idle accounting which does 'jiffies -
> > idle_entrytime'. If idle_entrytime is more current than jiffies, it
> > could cause negative values. I could replace jiffies with idle_exittime
> > in this computation potentially to fix that.
> 
> So you "fix" some yet to be correctly described issue by breaking stuff?

Its a side-effect of this patch, more adjustments are needed. This is just an
RFC, Thomas :-)

> >  static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> >  {
> > -	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo;
> > +	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo, boot_ticks;
> >  	unsigned long basejiff;
> >  	int tick_cpu;
> >  
> > -	basemono = get_jiffies_update(&basejiff);
> > +	boot_ticks = DIV_ROUND_DOWN_ULL(ts->idle_entrytime, TICK_NSEC);
> 
> Again this div/mult is more expensive than the sequence count on 32bit.

Got it, I was hoping that there was a better way to calculate the number of
ticks without expensive div/mult, let me know if you had any ideas on that?
But it seems to me there isn't an alternative. So we may have to table this
idea. Also I think it suffers from the same tick-skew issue you mentioned I
think.

> > -/*
> > - * Read jiffies and the time when jiffies were updated last
> > - */
> > -u64 get_jiffies_update(unsigned long *basej)
> 
> How does this even compile? This function is global for a reason.

Yeah, sorry I had fixed the issue in my next revision when I adjusted the
timer migration code:

https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=timers/tick-sched&id=1cf33bddf50341cf9802ed19374dc42d8466868b

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime
  2024-11-11 22:25   ` Frederic Weisbecker
@ 2024-11-13 22:39     ` Joel Fernandes
  0 siblings, 0 replies; 22+ messages in thread
From: Joel Fernandes @ 2024-11-13 22:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner

On Mon, Nov 11, 2024 at 11:25:15PM +0100, Frederic Weisbecker wrote:
> Le Fri, Nov 08, 2024 at 05:48:36PM +0000, Joel Fernandes (Google) a écrit :
> > This solves the issue where jiffies can be stale and inaccurate.
> > 
> > Putting some prints, I see that basemono can be quite stale:
> > tick_nohz_next_event: basemono=18692000000 basemono_from_idle_entrytime=18695000000
> 
> That's 3 ms. If HZ < 1000 that's to be expected. But even with HZ = 1000 it can
> happen.

For me HZ=1000 though.

> > Since we have 'now' in ts->idle_entrytime, we can just use that. It is
> > more accurate, cleaner, reduces lines of code and reduces any lock
> > contention with the seq locks.
> 
> Do we need such accuracy? The timers rely on jiffies anyway.
> Also it's a seqcount read. Basically just a pair of smp_rmb().
> Not sure a division would be cheaper.

In low resolution mode, hrtimers are also expired by the tick. It seems to
me, because of the error in jiffies, the clockevent programming for the tick
will also have errors which reflect in hrtimer expiry (granted if one does
not want errors in hrtimer, they shouldn't be using low-res mode, but
still a bounded error is better than an unpredictable unbounded one..).

I agree on the division issue though, I couldn't find an alternative.

Expanding on the previous paragraph, it seems to me that we will end up
programming the clockevent with incorrect values that are subject to the
error. Like if basemono is lagging by several ticks, then 'expires' value in
clock event may also be lagging.  Then after the clock event is programmed
with the erroneous value, the clockevent will wake up the CPU too soon I
think causing power wastage. If the jifies was accurate, the clockevent would
wake up the system more into the future..

Or do you see this scenario not playing out?

> > I was also concerned about issue where jiffies is not updated for a long
> > time, and then we receive a non-tick interrupt in the future. Relying on
> > stale jiffies value and using that as base can be inaccurate to determine
> > whether next event occurs within next tick. Fix that.
> > 
> > XXX: Need to fix issue in idle accounting which does 'jiffies -
> > idle_entrytime'. If idle_entrytime is more current than jiffies, it
> > could cause negative values. I could replace jiffies with idle_exittime
> > in this computation potentially to fix that.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/time/tick-sched.c | 27 +++++++--------------------
> >  1 file changed, 7 insertions(+), 20 deletions(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 4aa64266f2b0..22a4f96d9585 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -860,24 +860,6 @@ static inline bool local_timer_softirq_pending(void)
> >  	return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
> >  }
> >  
> > -/*
> > - * Read jiffies and the time when jiffies were updated last
> > - */
> > -u64 get_jiffies_update(unsigned long *basej)
> > -{
> > -	unsigned long basejiff;
> > -	unsigned int seq;
> > -	u64 basemono;
> > -
> > -	do {
> > -		seq = read_seqcount_begin(&jiffies_seq);
> > -		basemono = last_jiffies_update;
> > -		basejiff = jiffies;
> > -	} while (read_seqcount_retry(&jiffies_seq, seq));
> > -	*basej = basejiff;
> > -	return basemono;
> > -}
> 
> This is used in tmigr as well.

Yeah, sorry I had fixed the issue in my next revision when I adjusted the
timer migration code:

https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=timers/tick-sched&id=1cf33bddf50341cf9802ed19374dc42d8466868b

I am on work travel this week, apologies for slow replies and thanks so much
for your replies and your discussions!

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-11-13 22:39 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-08 17:48 [RFC 0/3] tick-sched cleanups Joel Fernandes (Google)
2024-11-08 17:48 ` [RFC 1/3] tick-sched: Remove last_tick and calculate next tick from now Joel Fernandes (Google)
2024-11-11 23:43   ` Frederic Weisbecker
2024-11-12 13:46     ` Thomas Gleixner
2024-11-12 13:56       ` Frederic Weisbecker
2024-11-12 18:20       ` Joel Fernandes
2024-11-12 18:33     ` Joel Fernandes
2024-11-13 12:40       ` Frederic Weisbecker
2024-11-13 21:40         ` Joel Fernandes
2024-11-08 17:48 ` [RFC 2/3] tick-sched: Keep tick on if hrtimer is due imminently Joel Fernandes (Google)
2024-11-11 12:37   ` Christian Loehle
2024-11-11 15:56     ` Joel Fernandes
2024-11-11 16:55       ` Christian Loehle
2024-11-11 17:17         ` Joel Fernandes
2024-11-08 17:48 ` [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime Joel Fernandes (Google)
2024-11-10 22:55   ` Joel Fernandes
2024-11-12 14:48     ` Thomas Gleixner
2024-11-13 21:46       ` Joel Fernandes
2024-11-11 22:25   ` Frederic Weisbecker
2024-11-13 22:39     ` Joel Fernandes
2024-11-12 14:30   ` Thomas Gleixner
2024-11-13 22:18     ` Joel Fernandes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox