* [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
@ 2025-01-28 6:32 John Stultz
2025-01-28 6:32 ` [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting John Stultz
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: John Stultz @ 2025-01-28 6:32 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Stephen Boyd, Yury Norov, Bitao Hu,
Andrew Morton, kernel-team
The HZ value has long been a compile time constant. This is
really useful, as there is a lot of hotpath code that uses HZ
when setting timers (needing to convert nanoseconds to ticks),
and the division is much faster with a compile time constant
divisor.
However, having to select the system HZ value at build time is
somewhat limiting. Distros have to choose on behalf of their
users what the best HZ value would be, balancing latency
against power usage.
With Android, this is a major issue, as we have one GKI binary
that runs across a wide array of devices, from top of the line
flagship phones to watches. Balancing the choice for HZ is
difficult: we currently have HZ=250, but some devices would
love to have HZ=1000, while other devices aren't willing to
pay the power cost of 4x the timer slots, which results in
shorter idle times.
(As an aside, some suggested RCU_LAZY would avoid the cost
of bumping to HZ=1000, and it does indeed help a good bit, but
we still see higher power usage compared with HZ=250)
So I've been thinking and talking about an idea for a while to
try to address this: DynamicHZ.
The idea is we just set HZ=1000, with 1ms granular timers.
However, using a boot time argument, we can optionally program
the clockevent that generates the timer tick to fire at a lower
frequency. Effectively this is just like the lost ticks handling
done for SMIs. The jiffies accounting is handled by the time the
clocksource sees pass, so time progresses properly, and timers
for all the ticks we skip will fire when the next timer
interrupt occurs. So with dyn_hz=250, the interrupt fires every
fourth tick and with dyn_hz=100, every 10th.
So far, this approach has seemed to work ok.
One area that needed adjustments was the cputime accounting, as
it assumes we only account one tick per interrupt, so I’ve
reworked some of that logic to pipe through the actual tick
count.
Once that was addressed, in testing with HZ=1000, dyn_hz=250,
all of the time/timer tests I've tried seem to be working
properly. I don’t see any performance regressions on tests like
Geekbench(compared to HZ=250), nor have I observed any power
regressions.
Though as the tick and scheduler code are intertwined and
neither subsystem is particularly simple, I expect I may still
be missing or forgetting things. So I wanted to send this out
for review and feedback.
I’d love to hear your thoughts or concerns!
Also, I've not yet gotten this to work for the fixed
periodic-tick paths (you need a oneshot-capable clockevent),
mostly because in that case we always just increment by a
single tick. For dyn_hz=250 or dyn_hz=100, calculating the
periodic tick count is pretty simple (4 ticks, 10 ticks). But
for dyn_hz=300, or other possible values, it doesn't evenly
divide, so we would have to do a 3,3,4,3,3,4 style interval to
stay on time, and I've not yet thought through how to do the
remainder handling efficiently.
thanks
-john
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Bitao Hu <yaoma@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@android.com
John Stultz (3):
time/tick: Pipe tick count down through cputime accounting
time/tick: Introduce a dyn_hz boot option
Kconfig: Add CONFIG_DYN_HZ_DEFAULT to specify the default dynhz= boot
option value
include/linux/kernel_stat.h | 4 ++--
include/linux/tick.h | 11 +++++++++--
kernel/Kconfig.hz | 19 +++++++++++++++++++
kernel/sched/cputime.c | 6 +++---
kernel/time/tick-common.c | 32 +++++++++++++++++++++++++++++++-
kernel/time/tick-legacy.c | 2 +-
kernel/time/tick-sched.c | 33 ++++++++++++++++++---------------
kernel/time/timekeeping.h | 2 +-
kernel/time/timer.c | 4 ++--
9 files changed, 86 insertions(+), 27 deletions(-)
--
2.48.1.262.g85cc9f2d1e-goog
* [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting
2025-01-28 6:32 [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time John Stultz
@ 2025-01-28 6:32 ` John Stultz
2025-01-28 14:44 ` Thomas Gleixner
2025-01-28 6:32 ` [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option John Stultz
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: John Stultz @ 2025-01-28 6:32 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Stephen Boyd, Yury Norov, Bitao Hu,
Andrew Morton, kernel-team
In working up the dynHZ patch, I found that the skipping of
ticks would result in large latencies for itimers.
As I dug into it, I realized there is still some logic that
assumes we don't miss ticks, resulting in late expiration of
cputime timers.
So this patch pipes the actual number of elapsed ticks down
through the cputime accounting.
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Bitao Hu <yaoma@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
include/linux/kernel_stat.h | 4 ++--
kernel/sched/cputime.c | 6 +++---
kernel/time/tick-common.c | 2 +-
kernel/time/tick-legacy.c | 2 +-
kernel/time/tick-sched.c | 29 ++++++++++++++++-------------
kernel/time/timekeeping.h | 2 +-
kernel/time/timer.c | 4 ++--
7 files changed, 26 insertions(+), 23 deletions(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index b97ce2df376f9..4b5169cd8db04 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -127,12 +127,12 @@ extern void account_idle_time(u64);
extern u64 get_idle_time(struct kernel_cpustat *kcs, int cpu);
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-static inline void account_process_tick(struct task_struct *tsk, int user)
+static inline void account_process_tick(struct task_struct *tsk, unsigned long ticks, int user)
{
vtime_flush(tsk);
}
#else
-extern void account_process_tick(struct task_struct *, int user);
+extern void account_process_tick(struct task_struct *, unsigned long ticks, int user);
#endif
extern void account_idle_ticks(unsigned long ticks);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0bed0fa1acd98..9948da0d80842 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -471,7 +471,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
* @p: the process that the CPU time gets accounted to
* @user_tick: indicates if the tick is a user or a system tick
*/
-void account_process_tick(struct task_struct *p, int user_tick)
+void account_process_tick(struct task_struct *p, unsigned long ticks, int user_tick)
{
u64 cputime, steal;
@@ -479,11 +479,11 @@ void account_process_tick(struct task_struct *p, int user_tick)
return;
if (sched_clock_irqtime) {
- irqtime_account_process_tick(p, user_tick, 1);
+ irqtime_account_process_tick(p, user_tick, ticks);
return;
}
- cputime = TICK_NSEC;
+ cputime = ticks * TICK_NSEC;
steal = steal_account_process_time(ULONG_MAX);
if (steal >= cputime)
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index a47bcf71defcf..ae5c5befdc58b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -98,7 +98,7 @@ static void tick_periodic(int cpu)
update_wall_time();
}
- update_process_times(user_mode(get_irq_regs()));
+ update_process_times(1, user_mode(get_irq_regs()));
profile_tick(CPU_PROFILING);
}
diff --git a/kernel/time/tick-legacy.c b/kernel/time/tick-legacy.c
index af225b32f5b37..dbc156e69802b 100644
--- a/kernel/time/tick-legacy.c
+++ b/kernel/time/tick-legacy.c
@@ -32,6 +32,6 @@ void legacy_timer_tick(unsigned long ticks)
raw_spin_unlock(&jiffies_lock);
update_wall_time();
}
- update_process_times(user_mode(get_irq_regs()));
+ update_process_times(ticks, user_mode(get_irq_regs()));
profile_tick(CPU_PROFILING);
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fa058510af9c1..983790923aee9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -54,7 +54,7 @@ static ktime_t last_jiffies_update;
/*
* Must be called with interrupts disabled !
*/
-static void tick_do_update_jiffies64(ktime_t now)
+static unsigned long tick_do_update_jiffies64(ktime_t now)
{
unsigned long ticks = 1;
ktime_t delta, nextp;
@@ -70,7 +70,7 @@ static void tick_do_update_jiffies64(ktime_t now)
*/
if (IS_ENABLED(CONFIG_64BIT)) {
if (ktime_before(now, smp_load_acquire(&tick_next_period)))
- return;
+ return 0;
} else {
unsigned int seq;
@@ -84,7 +84,7 @@ static void tick_do_update_jiffies64(ktime_t now)
} while (read_seqcount_retry(&jiffies_seq, seq));
if (ktime_before(now, nextp))
- return;
+ return 0;
}
/* Quick check failed, i.e. update is required. */
@@ -95,7 +95,7 @@ static void tick_do_update_jiffies64(ktime_t now)
*/
if (ktime_before(now, tick_next_period)) {
raw_spin_unlock(&jiffies_lock);
- return;
+ return 0;
}
write_seqcount_begin(&jiffies_seq);
@@ -147,6 +147,7 @@ static void tick_do_update_jiffies64(ktime_t now)
raw_spin_unlock(&jiffies_lock);
update_wall_time();
+ return ticks;
}
/*
@@ -203,10 +204,10 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
#define MAX_STALLED_JIFFIES 5
-static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
+static unsigned long tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
{
int tick_cpu, cpu = smp_processor_id();
-
+ unsigned long ticks = 0;
/*
* Check if the do_timer duty was dropped. We don't care about
* concurrency: This happens only when the CPU in charge went
@@ -229,7 +230,7 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
/* Check if jiffies need an update */
if (tick_cpu == cpu)
- tick_do_update_jiffies64(now);
+ ticks = tick_do_update_jiffies64(now);
/*
* If the jiffies update stalled for too long (timekeeper in stop_machine()
@@ -240,7 +241,7 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
ts->last_tick_jiffies = READ_ONCE(jiffies);
} else {
if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
- tick_do_update_jiffies64(now);
+ ticks += tick_do_update_jiffies64(now);
ts->stalled_jiffies = 0;
ts->last_tick_jiffies = READ_ONCE(jiffies);
}
@@ -248,9 +249,10 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
if (tick_sched_flag_test(ts, TS_FLAG_INIDLE))
ts->got_idle_tick = 1;
+ return ticks;
}
-static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
+static void tick_sched_handle(struct tick_sched *ts, unsigned long ticks, struct pt_regs *regs)
{
/*
* When we are idle and the tick is stopped, we have to touch
@@ -264,7 +266,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
touch_softlockup_watchdog_sched();
if (is_idle_task(current))
- ts->idle_jiffies++;
+ ts->idle_jiffies += ticks;
/*
* In case the current tick fired too early past its expected
* expiration, make sure we don't bypass the next clock reprogramming
@@ -273,7 +275,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
ts->next_tick = 0;
}
- update_process_times(user_mode(regs));
+ update_process_times(ticks, user_mode(regs));
profile_tick(CPU_PROFILING);
}
@@ -286,15 +288,16 @@ static enum hrtimer_restart tick_nohz_handler(struct hrtimer *timer)
struct tick_sched *ts = container_of(timer, struct tick_sched, sched_timer);
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();
+ unsigned long ticks;
- tick_sched_do_timer(ts, now);
+ ticks = tick_sched_do_timer(ts, now);
/*
* Do not call when we are not in IRQ context and have
* no valid 'regs' pointer
*/
if (regs)
- tick_sched_handle(ts, regs);
+ tick_sched_handle(ts, ticks, regs);
else
ts->next_tick = 0;
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 543beba096c75..3f93af06ce5c3 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -22,7 +22,7 @@ static inline int sched_clock_suspend(void) { return 0; }
static inline void sched_clock_resume(void) { }
#endif
-extern void update_process_times(int user);
+extern void update_process_times(unsigned long ticks, int user);
extern void do_timer(unsigned long ticks);
extern void update_wall_time(void);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a5860bf6d16f9..335ed23d1cb2d 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2509,12 +2509,12 @@ static void run_local_timers(void)
* Called from the timer interrupt handler to charge one tick to the current
* process. user_tick is 1 if the tick is user time, 0 for system.
*/
-void update_process_times(int user_tick)
+void update_process_times(unsigned long ticks, int user_tick)
{
struct task_struct *p = current;
/* Note: this timer irq context must be accounted for as well. */
- account_process_tick(p, user_tick);
+ account_process_tick(p, ticks, user_tick);
run_local_timers();
rcu_sched_clock_irq(user_tick);
#ifdef CONFIG_IRQ_WORK
--
2.48.1.262.g85cc9f2d1e-goog
* [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option
2025-01-28 6:32 [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time John Stultz
2025-01-28 6:32 ` [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting John Stultz
@ 2025-01-28 6:32 ` John Stultz
2025-01-28 9:07 ` Peter Zijlstra
2025-01-28 6:32 ` [RFC][PATCH 3/3] Kconfig: Add CONFIG_DYN_HZ_DEFAULT to specify the default dynhz= boot option value John Stultz
2025-01-28 16:46 ` [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time Thomas Gleixner
3 siblings, 1 reply; 16+ messages in thread
From: John Stultz @ 2025-01-28 6:32 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Stephen Boyd, Yury Norov, Bitao Hu,
Andrew Morton, kernel-team
Introduce dyn_hz= option, which allows the actual timer tick to
be scaled down at boot time.
This allows kernels to be built with HZ=1000 but systems to
effectively run as if HZ=100 when specified.
The system will still run with the configured HZ value, but
ticks will just arrive "late".
The valid values are between 100 and the build time CONFIG_HZ
value.
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Bitao Hu <yaoma@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
include/linux/tick.h | 11 +++++++++--
kernel/Kconfig.hz | 9 +++++++++
kernel/time/tick-common.c | 30 ++++++++++++++++++++++++++++++
kernel/time/tick-sched.c | 4 ++--
4 files changed, 50 insertions(+), 4 deletions(-)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b8ddc8e631a3c..734ee1e08c9ef 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -14,6 +14,13 @@
#include <linux/rcupdate.h>
#include <linux/static_key.h>
+#ifdef CONFIG_DYN_HZ
+extern long long dyn_tick_nsec;
+#define DYN_TICK_NSEC (dyn_tick_nsec)
+#else
+#define DYN_TICK_NSEC TICK_NSEC
+#endif
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void __init tick_init(void);
/* Should be core only, but ARM BL switcher requires it */
@@ -153,11 +160,11 @@ static inline bool tick_nohz_idle_got_tick(void) { return false; }
static inline ktime_t tick_nohz_get_next_hrtimer(void)
{
/* Next wake up is the tick period, assume it starts now */
- return ktime_add(ktime_get(), TICK_NSEC);
+ return ktime_add(ktime_get(), DYN_TICK_NSEC);
}
static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
- *delta_next = TICK_NSEC;
+ *delta_next = DYN_TICK_NSEC;
return *delta_next;
}
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 38ef6d06888ef..76714317674c5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,5 +55,14 @@ config HZ
default 300 if HZ_300
default 1000 if HZ_1000
+config DYN_HZ
+ bool "Support dynamic HZ via boot argument"
+ default n
+ help
+ Allow for the tick rate to be set at boot time via the dynhz=
+ boot argument. This allows the tick rate to be lower than the
+ build time configured HZ value.
+ If you are unsure, say no.
+
config SCHED_HRTICK
def_bool HIGH_RES_TIMERS
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index ae5c5befdc58b..75fd9dadb8273 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -80,6 +80,30 @@ int tick_is_oneshot_available(void)
return tick_broadcast_oneshot_available();
}
+#ifdef CONFIG_DYN_HZ
+long long dyn_tick_nsec = TICK_NSEC;
+
+static int __init set_dyn_hz(char *str)
+{
+ int ret, dyn_hz;
+
+ ret = kstrtoint(str, 0, &dyn_hz);
+ if (ret)
+ return ret;
+ if (dyn_hz > HZ || dyn_hz < 100)
+ dyn_hz = HZ;
+ dyn_tick_nsec = TICK_NSEC * HZ / dyn_hz;
+ return 1;
+}
+#else /* !CONFIG_DYN_HZ */
+static int __init set_dyn_hz(char *str)
+{
+ pr_warn("CONFIG_DYN_HZ not enabled, ignoring dyn_hz boot argument\n");
+ return -1;
+}
+#endif /* CONFIG_DYN_HZ */
+__setup("dyn_hz=", set_dyn_hz);
+
/*
* Periodic tick
*/
@@ -575,4 +599,10 @@ void __init tick_init(void)
{
tick_broadcast_init();
tick_nohz_init();
+ if (DYN_TICK_NSEC != TICK_NSEC) {
+ long long dynhz = TICK_NSEC * HZ;
+
+ do_div(dynhz, DYN_TICK_NSEC);
+ pr_info("dynHZ in use! HZ=%ld dynHZ=%lld\n", (long)HZ, dynhz);
+ }
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 983790923aee9..040ad70c31d38 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -309,7 +309,7 @@ static enum hrtimer_restart tick_nohz_handler(struct hrtimer *timer)
if (unlikely(tick_sched_flag_test(ts, TS_FLAG_STOPPED)))
return HRTIMER_NORESTART;
- hrtimer_forward(timer, now, TICK_NSEC);
+ hrtimer_forward(timer, now, DYN_TICK_NSEC);
return HRTIMER_RESTART;
}
@@ -842,7 +842,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
/* Forward the time to expire in the future */
- hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
+ hrtimer_forward(&ts->sched_timer, now, DYN_TICK_NSEC);
if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
hrtimer_start_expires(&ts->sched_timer,
--
2.48.1.262.g85cc9f2d1e-goog
* [RFC][PATCH 3/3] Kconfig: Add CONFIG_DYN_HZ_DEFAULT to specify the default dynhz= boot option value
2025-01-28 6:32 [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time John Stultz
2025-01-28 6:32 ` [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting John Stultz
2025-01-28 6:32 ` [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option John Stultz
@ 2025-01-28 6:32 ` John Stultz
2025-01-28 16:46 ` [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time Thomas Gleixner
3 siblings, 0 replies; 16+ messages in thread
From: John Stultz @ 2025-01-28 6:32 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Stephen Boyd, Yury Norov, Bitao Hu,
Andrew Morton, kernel-team
Allow a default dynhz= boot option value to be specified via config.
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Bitao Hu <yaoma@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
kernel/Kconfig.hz | 10 ++++++++++
kernel/time/tick-common.c | 2 +-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 76714317674c5..27ad1400ad64e 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -64,5 +64,15 @@ config DYN_HZ
build time configured HZ value.
If you are unsure, say no.
+config DYN_HZ_DEFAULT
+ int "Default DynHZ value (valid range: 100-CONFIG_HZ)"
+ depends on DYN_HZ
+ default HZ
+ range 100 HZ
+ help
+ Default value for dynhz. This allows ticks to be configured
+ to arrive at slower than HZ rates. This is useful when you
+ want to allow boot-time configurable tick rates.
+
config SCHED_HRTICK
def_bool HIGH_RES_TIMERS
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 75fd9dadb8273..46463bfb75d3d 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -81,7 +81,7 @@ int tick_is_oneshot_available(void)
}
#ifdef CONFIG_DYN_HZ
-long long dyn_tick_nsec = TICK_NSEC;
+long long dyn_tick_nsec = TICK_NSEC * HZ / CONFIG_DYN_HZ_DEFAULT;
static int __init set_dyn_hz(char *str)
{
--
2.48.1.262.g85cc9f2d1e-goog
* Re: [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option
2025-01-28 6:32 ` [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option John Stultz
@ 2025-01-28 9:07 ` Peter Zijlstra
2025-01-28 17:29 ` John Stultz
0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-01-28 9:07 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team
On Mon, Jan 27, 2025 at 10:32:54PM -0800, John Stultz wrote:
> Introduce dyn_hz= option, which allows the actual timer tick to
> be scaled down at boot time.
>
> This allows kernels to be built with HZ=1000 but systems to
> effectively run as if HZ=100 if specified.
>
> The system will still run with the configured HZ value, but
> ticks will just arrive "late".
>
> The valid values are between 100 and the build time CONFIG_HZ
> value.
>
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
> include/linux/tick.h | 11 +++++++++--
> kernel/Kconfig.hz | 9 +++++++++
> kernel/time/tick-common.c | 30 ++++++++++++++++++++++++++++++
> kernel/time/tick-sched.c | 4 ++--
> 4 files changed, 50 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index b8ddc8e631a3c..734ee1e08c9ef 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -14,6 +14,13 @@
> #include <linux/rcupdate.h>
> #include <linux/static_key.h>
>
> +#ifdef CONFIG_DYN_HZ
> +extern long long dyn_tick_nsec;
> +#define DYN_TICK_NSEC (dyn_tick_nsec)
> +#else
> +#define DYN_TICK_NSEC TICK_NSEC
> +#endif
My git-grep TICK_NSEC spies a whole bunch of TICK_NSEC users that seem
sad now.
That is, why don't they all need updating?
* Re: [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting
2025-01-28 6:32 ` [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting John Stultz
@ 2025-01-28 14:44 ` Thomas Gleixner
2025-01-29 4:10 ` John Stultz
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2025-01-28 14:44 UTC (permalink / raw)
To: John Stultz, LKML
Cc: John Stultz, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team
On Mon, Jan 27 2025 at 22:32, John Stultz wrote:
> In working up the dynHZ patch, I found that the skipping of
> ticks would result in large latencies for itimers.
>
> As I dug into it, I realized there is still some logic that
> assumes we don't miss ticks, resulting in late expiration of
> cputime timers.
>
> So this patch pipes the actual number of ticks passed down
> through cputime accounting.
This word salad does not qualify as a proper technical changelog. See
Documentation/
> /*
> * Must be called with interrupts disabled !
> */
> -static void tick_do_update_jiffies64(ktime_t now)
> +static unsigned long tick_do_update_jiffies64(ktime_t now)
> {
> unsigned long ticks = 1;
> ktime_t delta, nextp;
> @@ -70,7 +70,7 @@ static void tick_do_update_jiffies64(ktime_t now)
> */
> if (IS_ENABLED(CONFIG_64BIT)) {
> if (ktime_before(now, smp_load_acquire(&tick_next_period)))
> - return;
> + return 0;
So if the CPU's tick handler does not update jiffies, then this returns
zero ticks....
> -static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> +static unsigned long tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> {
> int tick_cpu, cpu = smp_processor_id();
> -
> + unsigned long ticks = 0;
And you have also zero ticks, when the CPU is not the tick_cpu:
> /* Check if jiffies need an update */
> if (tick_cpu == cpu)
> - tick_do_update_jiffies64(now);
> + ticks = tick_do_update_jiffies64(now);
...
> + return ticks;
> }
>
> -static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
> +static void tick_sched_handle(struct tick_sched *ts, unsigned long ticks, struct pt_regs *regs)
> {
> /*
> * When we are idle and the tick is stopped, we have to touch
> @@ -264,7 +266,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
> tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> touch_softlockup_watchdog_sched();
> if (is_idle_task(current))
> - ts->idle_jiffies++;
> + ts->idle_jiffies += ticks;
> /*
> * In case the current tick fired too early past its expected
> * expiration, make sure we don't bypass the next clock reprogramming
> @@ -273,7 +275,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
> ts->next_tick = 0;
> }
>
> - update_process_times(user_mode(regs));
> + update_process_times(ticks, user_mode(regs));
Which is then fed to update_process_times() and subsequently to
account_process_ticks().
IOW, any CPUs tick handler which does not actually advance jiffies is
going to account ZERO ticks...
Seriously?
Thanks,
tglx
* Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
2025-01-28 6:32 [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time John Stultz
` (2 preceding siblings ...)
2025-01-28 6:32 ` [RFC][PATCH 3/3] Kconfig: Add CONFIG_DYN_HZ_DEFAULT to specify the default dynhz= boot option value John Stultz
@ 2025-01-28 16:46 ` Thomas Gleixner
2025-01-29 6:10 ` John Stultz
` (2 more replies)
3 siblings, 3 replies; 16+ messages in thread
From: Thomas Gleixner @ 2025-01-28 16:46 UTC (permalink / raw)
To: John Stultz, LKML
Cc: John Stultz, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team
John!
On Mon, Jan 27 2025 at 22:32, John Stultz wrote:
> The HZ value has long been a compile time constant. This is
> really useful as there is a lot of hotpath code that uses HZ
> when setting timers (needing to convert nanoseconds to ticks),
> thus the division is much faster with a compile time constant
> divisor.
To some extent, yes. Though the meaning of the 'HZ tick' has become
pretty blurry over time.
If you actually look at the timer wheel timer usage, then the vast
majority is converting SI units based timeouts/delays to ticks and they
do not care about the actual tick frequency at all. Just grep for
'_to_jiffies()' aside of the tons of 'HZ * $N' places which are
sprinkled across the code base.
Code which relies on accurate wakeups is mostly using hrtimers anyway.
The only two places which are truly tick bound are the scheduler and the
timer wheel itself, where the latter is not really about it.
> One area that needed adjustments was the cputime accounting, as
> it assumes we only account one tick per interrupt, so I’ve
> reworked some of that logic to pipe through the actual tick
> count.
And you got that patently wrong...
> However, having to select the system HZ value at build time is
> somewhat limiting. Distros have to make choices for their users
> as to what the best HZ value would be balancing latency and
> power usage.
>
> With Android, this is a major issue, as we have one GKI binary
> that runs across a wide array of devices from top of the line
> flagship phones to watches. Balancing the choice for HZ is
> difficult, we currently have HZ=250, but some devices would love
> to have HZ=1000, while other devices aren’t willing to pay the
> power cost of 4x the timer slots, resulting in shorter idle
> times.
The shorter idle times are because timer wheel timers wake up more
accurately with HZ=1000 and not because the scheduler is more aggressive?
> Also, I've not yet gotten this to work for the fixed
> periodic-tick paths (you need a oneshot capable clockevent).
Which is not a given on the museum pieces we insist on
supporting just because we can. But with periodic timers it
should be easy enough to
make clockevents::set_state_periodic() take a tick frequency argument
and convert the ~70 callbacks to handle it.
> Mostly because in that case we always just increment by a single
> tick. While for dyn_hz=250 or dyn_hz=1000 calculating the
> periodic tick count is pretty simple (4 ticks, 10 ticks). But
> for dyn_hz=300, or other possible values, it doesn’t evenly
> divide, so we would have to do a 3,3,4,3,3,4 style interval to
> stay on time and I’ve not yet thought through how to do
remainder handling efficiently.
I doubt you need that. Programming it to the next closest value is good
enough and there is no reason to overengineer it for a marginal benefit
of "accuracy". But that's obviously not really working with your chosen
minimalistic approach.
Aside from that, using random HZ values is a pretty academic exercise and
HZ=300 had been introduced for multimedia to cater for 30FPS. But that
was long ago when high resolution timers, NOHZ and modern graphic
devices did not exist.
I seriously doubt that HZ=300 has any actual advantage on modern
systems. Sure, I know that SteamOS uses HZ=300, but AFAICT from public
discussions this just caters to the HZ=300 myth and is not backed by any
factual evidence that HZ=300 is so superior. Quite the contrary, there
are enough people who actually want HZ=1000 for better responsiveness.
But let me come back to your proposed hack, which is admittedly cute.
Though I'm not really convinced that it is more than a bandaid, which
papers over the most obvious places to make it "work".
Let's take a step back and look at the usage of 'HZ':
1) Jiffies and related timer wheel interfaces
jiffies should just go away completely and be replaced by a simple
millisecond counter, which is accessible in the same way as
jiffies today.
That removes the bulk of HZ usage all over the place and makes the
usage sites simpler as the interfaces just use SI units and the
gazillion back and forth conversions (~4500 to jiffies and ~1000
from jiffies) just go away.
We obviously need to keep the time_before/after/*() interfaces for
32bit, unless we decide to limit the uptime for 32-bit machines to
~8 years and force reboot them before the counter can overflow :)
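The wrap handling that makes those interfaces necessary boils down to one signed subtraction. A minimal sketch of the idiom for a free-running 32-bit millisecond counter (macro names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wraparound-safe "a is before b" for a free-running 32-bit ms
 * counter: interpreting the unsigned difference as signed gives the
 * right answer as long as the two timestamps are less than half the
 * counter range (~24.8 days) apart. This is why the time_before()/
 * time_after() style interfaces have to survive even if jiffies
 * becomes a plain millisecond counter.
 */
#define ms_time_before(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)
#define ms_time_after(a, b)  ms_time_before(b, a)
```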
On the timer wheel side that means that the base granularity is
always 1ms, which only affects the maximum timeout. The timer
expiry is just batched on the actual tick frequency and should not
have any other side effects except for slightly moving the
granularity boundaries depending on the tick frequency. But that's
not any different from the hard coded HZ values.
The other minor change is to make the next timer interrupt
retrieval for NOHZ round up the next event to the tick boundary,
but that's trivial enough.
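That NOHZ rounding is essentially a one-liner. A sketch, under the assumption that the dynamic tick period is available as a nanosecond value (the names here are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round the next timer-wheel expiry up to the next dynamic-tick
 * boundary, so NOHZ does not program a wakeup between ticks for a
 * timer that can only be serviced on the tick anyway.
 */
static uint64_t roundup_to_dyn_tick(uint64_t expiry_ns, uint64_t dyn_tick_nsec)
{
	uint64_t rem = expiry_ns % dyn_tick_nsec;

	return rem ? expiry_ns + (dyn_tick_nsec - rem) : expiry_ns;
}
```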
2) Clock events
Periodic mode is trivial to fix with a tick frequency argument to
the set_state_periodic() callback.
Oneshot mode just works as it programs the hardware to the next
closest event. Not much different from the current situation with a
hard-coded HZ value.
3) Accounting
The accounting has to be separated from the jiffies advancement and
it has to feed the delta to the last tick in nanoseconds into the
accounting path, which internally operates in nanoseconds already
today.
4) Scheduler
I leave that part to Peter as he definitely has a better overview
of what needs to be done than me.
Thanks,
tglx
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option
2025-01-28 9:07 ` Peter Zijlstra
@ 2025-01-28 17:29 ` John Stultz
2025-01-28 19:30 ` Peter Zijlstra
0 siblings, 1 reply; 16+ messages in thread
From: John Stultz @ 2025-01-28 17:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team
On Tue, Jan 28, 2025 at 1:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jan 27, 2025 at 10:32:54PM -0800, John Stultz wrote:
> > +#ifdef CONFIG_DYN_HZ
> > +extern long long dyn_tick_nsec;
> > +#define DYN_TICK_NSEC (dyn_tick_nsec)
> > +#else
> > +#define DYN_TICK_NSEC TICK_NSEC
> > +#endif
>
> My git-grep TICK_NSEC spies a whole bunch of TICK_NSEC users that seem
> sad now.
>
> That is, why don't they all need updating?
Thanks for taking a look, Peter!
So TICK_NSEC is still the length of the HZ tick. The idea with this
series is we're leaving HZ as a high frequency, but then just
programming the interrupt to come in at a lower frequency.
So it's only in the logic involved in setting the next interrupt that
we need to use DYN_TICK_NSEC.
One of the issues I hit and addressed before sending (which I
mentioned in the cover letter) was that previously I was also using
DYN_TICK_NSEC in the periodic tick path, as this approach of skipping
ticks should be usable there. But the problem was that since we're
using tick counting there instead of time counting, I didn't have
proper handling for cases where there's a remainder from the HZ/dynHZ
value.
So for now I've dropped that and only use this when we're in oneshot
mode.
Let me know if you have other thoughts or concerns! I'll reword the
commit message further to better explain this.
thanks
-john
* Re: [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option
2025-01-28 17:29 ` John Stultz
@ 2025-01-28 19:30 ` Peter Zijlstra
0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-01-28 19:30 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Thomas Gleixner, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team
On Tue, Jan 28, 2025 at 09:29:30AM -0800, John Stultz wrote:
> On Tue, Jan 28, 2025 at 1:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Jan 27, 2025 at 10:32:54PM -0800, John Stultz wrote:
> > > +#ifdef CONFIG_DYN_HZ
> > > +extern long long dyn_tick_nsec;
> > > +#define DYN_TICK_NSEC (dyn_tick_nsec)
> > > +#else
> > > +#define DYN_TICK_NSEC TICK_NSEC
> > > +#endif
> >
> > My git-grep TICK_NSEC spies a whole bunch of TICK_NSEC users that seem
> > sad now.
> >
> > That is, why don't they all need updating?
>
> Thanks for taking a look, Peter!
>
> So TICK_NSEC is still the length of the HZ tick. The idea with this
> series is we're leaving HZ as a high frequency, but then just
> programming the interrupt to come in at a lower frequency.
That's word salad. Either the timer is slower, as alluded to below, or
it is not. It can't be both.
> So it's only the logic involved in setting the next interrupt, where
> we need to use DYN_TICK_NSEC.
In other words, the actual interrupts will happen dyn_tick_nsec apart,
not TICK_NSEC.
This means that pretty much every TICK_NSEC user in kernel/events/ is
now wrong. They're assuming the time between interrupts is TICK_NSEC --
we even inhibit NOHZ for perf to ensure this.
Really, you need to audit each and every TICK_NSEC user at the very
least.
* Re: [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting
2025-01-28 14:44 ` Thomas Gleixner
@ 2025-01-29 4:10 ` John Stultz
0 siblings, 0 replies; 16+ messages in thread
From: John Stultz @ 2025-01-29 4:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team
On Tue, Jan 28, 2025 at 6:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Jan 27 2025 at 22:32, John Stultz wrote:
> > In working up the dynHZ patch, I found that the skipping of
> > ticks would result in large latencies for itimers.
> >
> > As I dug into it, I realized there is still some logic that
> > assumes we don't miss ticks, resulting in late expiration of
> > cputime timers.
> >
> > So this patch pipes the actual number of ticks passed down
> > through cputime accounting.
>
> This word salad does not qualify as a proper technical changelog. See
> Documentation/
>
> > /*
> > * Must be called with interrupts disabled !
> > */
> > -static void tick_do_update_jiffies64(ktime_t now)
> > +static unsigned long tick_do_update_jiffies64(ktime_t now)
> > {
> > unsigned long ticks = 1;
> > ktime_t delta, nextp;
> > @@ -70,7 +70,7 @@ static void tick_do_update_jiffies64(ktime_t now)
> > */
> > if (IS_ENABLED(CONFIG_64BIT)) {
> > if (ktime_before(now, smp_load_acquire(&tick_next_period)))
> > - return;
> > + return 0;
>
> So if the CPU's tick handler does not update jiffies, then this returns
> zero ticks....
Oh, bah! You're right, of course.
I was thinking I could grab the tick delta that way, but blanked on the
fact that it wouldn't work anywhere but on the one tick_cpu.
I guess I was getting lucky with my tests running mainly on the tick_cpu.
I'll go back to the drawing board on this one. Thanks for pointing this out.
-john
* Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
2025-01-28 16:46 ` [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time Thomas Gleixner
@ 2025-01-29 6:10 ` John Stultz
2025-01-29 8:09 ` Thomas Gleixner
2025-02-03 11:14 ` Peter Zijlstra
2025-02-10 1:09 ` Qais Yousef
2 siblings, 1 reply; 16+ messages in thread
From: John Stultz @ 2025-01-29 6:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team,
Saravana Kannan, Samuel Wu, Qais Yousef
On Tue, Jan 28, 2025 at 8:46 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, Jan 27 2025 at 22:32, John Stultz wrote:
> > With Android, this is a major issue, as we have one GKI binary
> > that runs across a wide array of devices from top of the line
> > flagship phones to watches. Balancing the choice for HZ is
> > difficult, we currently have HZ=250, but some devices would love
> > to have HZ=1000, while other devices aren’t willing to pay the
> > power cost of 4x the timer slots, resulting in shorter idle
> > times.
>
> The shorter idle times are because timer wheel timers wake up more
> accurately with HZ=1000 and not because the scheduler is more aggressive?
Saravana and Samuel gave a talk on this at plumbers that has more details
https://lpc.events/event/18/contributions/1842/
Specifically, starting on slide 9 here:
https://lpc.events/event/18/contributions/1842/attachments/1476/3126/LPC%202024%20-%20Wattson.pdf
The load there is a fairly light one. I'm regenerating some data on
that now to look at it a little more closely, to make sure it wasn't
faster up-migration to costlier cpus (but I'd not expect that given
how light the load is).
But even if it is the faster up-migration, the example showed HZ=1000
to have added power costs for workloads that were performing
acceptably at HZ=250. Which makes it a hard sell for folks who are
focused on avoiding power regressions. Potentially folks could tune
their way out of that problem, but it's a larger ask. Thus the desire
to thread the needle somehow to allow HZ=1000 as an option without
disrupting folks happy with current HZ=250 behavior.
> > Also, I've not yet gotten this to work for the fixed
> > periodic-tick paths (you need a oneshot capable clockevent).
>
> Which is not a given on the museum pieces we insist on supporting just
> because we can. But with periodic timers it should be easy enough to
> make clockevents::set_state_periodic() take a tick frequency argument
> and convert the ~70 callbacks to handle it.
I'll take a look at this. Mostly I was just highlighting a deficiency
in the patch, where DynHZ wouldn't have an impact, as periodic
clockevents didn't seem the most critical case for the benefit of the
feature.
> > Mostly because in that case we always just increment by a single
> > tick. While for dyn_hz=250 or dyn_hz=1000 calculating the
> > periodic tick count is pretty simple (4 ticks, 10 ticks). But
> > for dyn_hz=300, or other possible values, it doesn’t evenly
> > divide, so we would have to do a 3,3,4,3,3,4 style interval to
> > stay on time and I’ve not yet thought through how to do
> > remainder handling efficiently.
>
> I doubt you need that. Programming it to the next closest value is good
> enough and there is no reason to overengineer it for a marginal benefit
> of "accuracy". But that's obviously not really working with your chosen
> minimalistic approach.
>
> Aside of that, using random HZ values is a pretty academic exercise and
> HZ=300 had been introduced for multimedia to cater for 30FPS. But that
> was long ago when high resolution timers, NOHZ and modern graphic
> devices did not exist.
>
> I seriously doubt that HZ=300 has any actual advantage on modern
> systems. Sure, I know that SteamOS uses HZ=300, but AFAICT from public
> discussions this just caters to the HZ=300 myth and is not backed by any
> factual evidence that HZ=300 is so superior. Quite the contrary there
> are enough people who actually want HZ=1000 for better responsiveness.
Yeah. Initially I was shooting for dyn_hz to be possibly even more
flexible than HZ (500! 10! why not?) since it initially seemed it
would be simple, but I obviously underestimated the accounting
complexity, so I'm fine if it's relegated to 250 or 100. :)
> But let me come back to your proposed hack, which is admittedly cute.
> Though I'm not really convinced that it is more than a bandaid, which
> papers over the most obvious places to make it "work".
>
> Let's take a step back and look at the usage of 'HZ':
>
> 1) Jiffies and related timer wheel interfaces
>
> jiffies should just go away completely and be replaced by a simple
> millisecond counter, which is accessible in the same way as
> jiffies today.
>
> That removes the bulk of HZ usage all over the place and makes the
> usage sites simpler as the interfaces just use SI units and the
> gazillions (~4500 to jiffies and ~1000 from jiffies) back and
> forth conversions just go away.
Yeah, this was basically where I was hoping this would allow us to go.
I was imagining once dyn_hz was possible, we could basically fix HZ to
1000 internally, leaving jiffies as that 1ms counter, and let the
actual interrupt rate be set via the dynhz default config value. Then
iterating through all the code switching HZ usage to MSEC_PER_SEC, etc
would be possible.
> We obviously need to keep the time_before/after/*() interfaces for
> 32bit, unless we decide to limit the uptime for 32-bit machines to
> ~8 years and force reboot them before the counter can overflow :)
>
> On the timer wheel side that means that the base granularity is
> always 1ms, which only affects the maximum timeout. The timer
> expiry is just batched on the actual tick frequency and should not
> have any other side effects except for slightly moving the
> granularity boundaries depending on the tick frequency. But that's
> not any different from the hard coded HZ values.
>
> The other minor change is to make the next timer interrupt
> retrieval for NOHZ round up the next event to the tick boundary,
> but that's trivial enough.
>
> 2) Clock events
>
> Periodic mode is trivial to fix with a tick frequency argument to
> the set_state_periodic() callback.
>
> Oneshot mode just works as it programs the hardware to the next
> closest event. Not much different from the current situation with a
> hard-coded HZ value.
Ack.
> 3) Accounting
>
> The accounting has to be separated from the jiffies advancement and
> it has to feed the delta to the last tick in nanoseconds into the
> accounting path, which internally operates in nanoseconds already
> today.
I'll take a look at this next, as the accounting has clearly been my
biggest fumbling point. Keeping the per-cpu last tick times is
probably the key change (my main concern is the extra cost of reading
the ns time, but deriving it from the 1ms jiffies counter could be an
option), but yeah, I think this sounds like a nice direction. Thank
you for the suggestion!
> 4) Scheduler
>
> I leave that part to Peter as he definitely has a better overview
> of what needs to be done than me.
Yeah, I'll look more into his concerns around the sched/events code.
Thanks again for all the feedback! Much appreciated!
-john
* Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
2025-01-29 6:10 ` John Stultz
@ 2025-01-29 8:09 ` Thomas Gleixner
2025-02-10 16:54 ` David Laight
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2025-01-29 8:09 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team,
Saravana Kannan, Samuel Wu, Qais Yousef
On Tue, Jan 28 2025 at 22:10, John Stultz wrote:
> On Tue, Jan 28, 2025 at 8:46 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> 1) Jiffies and related timer wheel interfaces
>>
>> jiffies should just go away completely and be replaced by a simple
>> millisecond counter, which is accessible in the same way as
>> jiffies today.
>>
>> That removes the bulk of HZ usage all over the place and makes the
>> usage sites simpler as the interfaces just use SI units and the
>> gazillions (~4500 to jiffies and ~1000 from jiffies) back and
>> forth conversions just go away.
>
> Yeah, this was basically where I was hoping this would allow us to go.
> I was imagining once dyn_hz was possible, we could basically fix HZ to
> 1000 internally, leaving jiffies as that 1ms counter, and let the
> actual interrupt rate be set via the dynhz default config value. Then
> iterating through all the code switching HZ usage to MSEC_PER_SEC, etc
> would be possible.
I strongly suggest to start with exactly this because it significantly
reduces the problem space and has a valuable benefit in general.
>> 3) Accounting
>>
>> The accounting has to be separated from the jiffies advancement and
>> it has to feed the delta to the last tick in nanoseconds into the
>> accounting path, which internally operates in nanoseconds already
>> today.
>
> I'll take a look at this next, as the accounting has clearly been my
> biggest fumbling point. Keeping the per-cpu last tick times is
> probably the key change (my main concern is the extra cost of reading
> the ns time, but deriving it from 1ms jiffies counter could be an
> option), but yeah, I think this sounds like a nice direction. Thank
> you for the suggestion!
You don't need to read the time for that. The delta to the previous tick
is known already. It's the amount of nanoseconds which corresponds to
the chosen dynamic HZ value. So you can just do:
- cputime = TICK_NSEC;
+ cputime = tick_nsec;
i.e. read the variable which stores the tick length in nanoseconds, no?
Thanks,
tglx
* Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
2025-01-28 16:46 ` [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time Thomas Gleixner
2025-01-29 6:10 ` John Stultz
@ 2025-02-03 11:14 ` Peter Zijlstra
2025-02-10 1:14 ` Qais Yousef
2025-02-10 1:09 ` Qais Yousef
2 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-02-03 11:14 UTC (permalink / raw)
To: Thomas Gleixner
Cc: John Stultz, LKML, Anna-Maria Behnsen, Frederic Weisbecker,
Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Stephen Boyd, Yury Norov, Bitao Hu, Andrew Morton, kernel-team
On Tue, Jan 28, 2025 at 05:46:10PM +0100, Thomas Gleixner wrote:
> 4) Scheduler
>
> I leave that part to Peter as he definitely has a better overview
> of what needs to be done than me.
Ponies, scheduler wants ponies :-)
So scheduler tick does waaay too much:
- time keeping / accounting:
. internally
. psi
. cgroup.cpuacct
. posix timers
. a million other things
- periodic update/aging of things like:
. global load avg
. hw pressure
. freq scale
- tied into perf
(which I've briefly touched upon earlier)
- drives load balance
- drives mm scanning for NUMA crud
- drives tick based preemption
The whole load-balance and global-load-avg are basically internal tick
based timers. Not sure replacing them with timer wheel timers makes
sense due to the buckets, but it might also not be the worst.
The whole preemption thing could probably be replaced with HRTICK (which
might be suffering from bitrot), but the problem has always been with
hrtimers being too expensive (on x86). But ideally we'd move away from
tick based preemption.
That said, driving preemption with dynamic HZ should work just fine.
Most of the time accounting is TSC (or sched_clock()) based, and derives
the measure of time from that. But things like perf use TICK_NSEC to
tell us how much time is between ticks -- so if you go and make that
dynamic you really do have to fix that.
Anyway, I would really like to understand what exactly is driving the
cost in your case. It should be possible to move things out of the tick,
or run them at a lower rate without running all of it lower.
* Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
2025-01-28 16:46 ` [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time Thomas Gleixner
2025-01-29 6:10 ` John Stultz
2025-02-03 11:14 ` Peter Zijlstra
@ 2025-02-10 1:09 ` Qais Yousef
2 siblings, 0 replies; 16+ messages in thread
From: Qais Yousef @ 2025-02-10 1:09 UTC (permalink / raw)
To: Thomas Gleixner
Cc: John Stultz, LKML, Anna-Maria Behnsen, Frederic Weisbecker,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Stephen Boyd, Yury Norov, Bitao Hu,
Andrew Morton, kernel-team
On 01/28/25 17:46, Thomas Gleixner wrote:
> > However, having to select the system HZ value at build time is
> > somewhat limiting. Distros have to make choices for their users
> > as to what the best HZ value would be balancing latency and
> > power usage.
> >
> > With Android, this is a major issue, as we have one GKI binary
> > that runs across a wide array of devices from top of the line
> > flagship phones to watches. Balancing the choice for HZ is
> > difficult, we currently have HZ=250, but some devices would love
> > to have HZ=1000, while other devices aren’t willing to pay the
> > power cost of 4x the timer slots, resulting in shorter idle
> > times.
>
> The shorter idle times are because timer wheel timers wake up more
> accurately with HZ=1000 and not because the scheduler is more aggressive?
You'll all find another patch [1] in your inbox which changes the default
to 1ms, and (hopefully) explains the details of why higher values are bad
for modern systems/workloads, and why power could expectedly be worse in
a number of use cases.
TLDR, beside the reason above, the system is generally more responsive.
It was by accident ending up stuck at lower frequencies and, in the case
of HMP systems, on the smaller cores. With a faster tick we respond
faster to shifts in demand, moving frequencies higher and biasing task
placement to the bigger cores, fixing performance issues along the way.
> Aside of that, using random HZ values is a pretty academic exercise and
> HZ=300 had been introduced for multimedia to cater for 30FPS. But that
> was long ago when high resolution timers, NOHZ and modern graphic
> devices did not exist.
>
> I seriously doubt that HZ=300 has any actual advantage on modern
> systems. Sure, I know that SteamOS uses HZ=300, but AFAICT from public
> discussions this just caters to the HZ=300 myth and is not backed by any
> factual evidence that HZ=300 is so superior. Quite the contrary there
> are enough people who actually want HZ=1000 for better responsiveness.
These lower HZ values don't make sense to me either, and I think keeping
them just gives people more means to shoot themselves in the foot. HZ=100
irks me particularly, as to my humble understanding it is there to help
throughput, but I truly doubt it is a sensible configuration anymore. I
believe such users are better off setting the base_slice to 10ms by
default while still keeping HZ=1000. But there could be sensible reasons
why it is useful beyond my knowledge.
My biggest worry is that this could be a common source of 'sched
latency'. The ancient trade-offs don't hold anymore IMHO.
FWIW I had a series that converted HZ into a variable and did a good
chunk of the work to convert a large number of users and conversion
code. Sadly I lost it :''( Though it seems you had another idea of how
it should be done, so maybe I shouldn't cry too hard about losing it.
Not trying to sidetrack the discussion, but I just wanted to point out
that maybe we should do some cleanup of the currently supported HZ
values and think harder about whether anything but HZ=1000 still makes
sense. Having the dynamic option is great, but my gut feeling is that we
want a single HZ=1000 and to fix the potential remaining issues (if any)
that could make the tick disturb idle. I am not buying the throughput
argument, but I have little experience with the monstrous machines. If
folks with big machines see throughput issues, it'd be great to learn
about those. As I tried to argue in my patch [1], I believe latency is
dominant on modern systems, which has important consequences for
scheduler decisions. Peter and Vincent can correct me if I went astray.
[1] https://lore.kernel.org/lkml/20250210001915.123424-1-qyousef@layalina.io/
* Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
2025-02-03 11:14 ` Peter Zijlstra
@ 2025-02-10 1:14 ` Qais Yousef
0 siblings, 0 replies; 16+ messages in thread
From: Qais Yousef @ 2025-02-10 1:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, John Stultz, LKML, Anna-Maria Behnsen,
Frederic Weisbecker, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Stephen Boyd, Yury Norov, Bitao Hu,
Andrew Morton, kernel-team
On 02/03/25 12:14, Peter Zijlstra wrote:
> The whole preemption thing could probably be replaced with HRTICK (which
> might be suffering from bitrot), but the problem has always been with
> hrtimers being too expensive (on x86). But ideally we'd move away from
> tick based preemption.
I do think HRTICK must be on by default: precise time slices are
important, especially if your plans to attach changing of the base_slice
to sched_attr->runtime are still happening. People will find lots of
surprises when they ask for a 500us runtime but end up running (much)
longer, especially with the default 4ms tick value (which I just posted
a patch to change).
Can we re-evaluate the HRTICK impact?
* Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time
2025-01-29 8:09 ` Thomas Gleixner
@ 2025-02-10 16:54 ` David Laight
0 siblings, 0 replies; 16+ messages in thread
From: David Laight @ 2025-02-10 16:54 UTC (permalink / raw)
To: Thomas Gleixner
Cc: John Stultz, LKML, Anna-Maria Behnsen, Frederic Weisbecker,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Stephen Boyd, Yury Norov, Bitao Hu,
Andrew Morton, kernel-team, Saravana Kannan, Samuel Wu,
Qais Yousef
On Wed, 29 Jan 2025 09:09:02 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, Jan 28 2025 at 22:10, John Stultz wrote:
> > On Tue, Jan 28, 2025 at 8:46 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> 1) Jiffies and related timer wheel interfaces
> >>
> >> jiffies should just go away completely and be replaced by a simple
> >> millisecond counter, which is accessible in the same way as
> >> jiffies today.
> >>
> >> That removes the bulk of HZ usage all over the place and makes the
> >> usage sites simpler as the interfaces just use SI units and the
> >> gazillions (~4500 to jiffies and ~1000 from jiffies) back and
> >> forth conversions just go away.
> >
> > Yeah, this was basically where I was hoping this would allow us to go.
> > I was imagining once dyn_hz was possible, we could basically fix HZ to
> > 1000 internally, leaving jiffies as that 1ms counter, and let the
> > actual interrupt rate be set via the dynhz default config value. Then
> > iterating through all the code switching HZ usage to MSEC_PER_SEC, etc
> > would be possible.
>
> I strongly suggest to start with exactly this because it significantly
> reduces the problem space and has a valuable benefit in general.
I doubt anyone will notice if a 250Hz timer interrupt always adds 4 to 'jiffies'.
One problem with increasing the frequency of the interrupt is the sheer amount
of code that runs every timer tick.
I suspect most of it is in the scheduler!
Do I recall correctly that the timer wheels no longer move long timers
onto the high resolution wheels?
So longer timers have a much increased granularity.
That is going to be made worse for anyone currently using a 250Hz 'tick'.
Making 'jiffies' milliseconds does stop you from having a faster timer tick.
I know one architecture kernel (might not have been Linux) defaulted to 1024Hz.
But I'm not sure that is a problem.
32bit systems must be able to handle wrap - we all know (to our cost)
that a 1ms counter wraps in ~49.7 days. So even a 4ms one wraps quite often.
Although having a 64bit (long) counter in 64bit mode really just hides bugs.
David
Thread overview: 16+ messages
2025-01-28 6:32 [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time John Stultz
2025-01-28 6:32 ` [RFC][PATCH 1/3] time/tick: Pipe tick count down through cputime accounting John Stultz
2025-01-28 14:44 ` Thomas Gleixner
2025-01-29 4:10 ` John Stultz
2025-01-28 6:32 ` [RFC][PATCH 2/3] time/tick: Introduce a dyn_hz boot option John Stultz
2025-01-28 9:07 ` Peter Zijlstra
2025-01-28 17:29 ` John Stultz
2025-01-28 19:30 ` Peter Zijlstra
2025-01-28 6:32 ` [RFC][PATCH 3/3] Kconfig: Add CONFIG_DYN_HZ_DEFAULT to specify the default dynhz= boot option value John Stultz
2025-01-28 16:46 ` [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time Thomas Gleixner
2025-01-29 6:10 ` John Stultz
2025-01-29 8:09 ` Thomas Gleixner
2025-02-10 16:54 ` David Laight
2025-02-03 11:14 ` Peter Zijlstra
2025-02-10 1:14 ` Qais Yousef
2025-02-10 1:09 ` Qais Yousef