* [PATCH v2 1/6] sched/eevdf: Fix HRTICK duration
2026-01-21 16:20 [PATCH v2 0/6] hrtimer/sched: Improve hrtick Peter Zijlstra
@ 2026-01-21 16:20 ` Peter Zijlstra
2026-01-22 10:53 ` Juri Lelli
2026-02-05 8:38 ` Peter Zijlstra
2026-01-21 16:20 ` [PATCH v2 2/6] hrtimer: Optimize __hrtimer_start_range_ns() Peter Zijlstra
` (4 subsequent siblings)
5 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-21 16:20 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
The nominal duration for an EEVDF task to run is until its deadline,
at which point the deadline is moved ahead and a new task selection is
done.
Try and predict the time 'lost' to higher scheduling classes. Since
this is an estimate, the timer can be either early or late. In case it
is early, task_tick_fair() will take the !need_resched() path and
restart the timer.
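[Editor's note: the arithmetic this patch introduces -- virtual-deadline distance converted to wall time by the entity's weight, then inflated for CPU time lost to other activity -- can be sketched in userspace C. NICE_0_LOAD and the 1024 utilization ceiling below are the classic illustrative values; the kernel's effective constants depend on configuration.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the kernel's effective values are config-dependent. */
#define NICE_0_LOAD	1024ULL
#define UTIL_MAX	1024ULL

/*
 * vruntime advances at a NICE_0_LOAD/weight rate, so the wall time
 * until the virtual deadline is weight * vdelta / NICE_0_LOAD --
 * exactly the patch's: delta = (se->load.weight * vdelta) / NICE_0_LOAD.
 */
static uint64_t vdelta_to_wall(uint64_t weight, uint64_t vdelta)
{
	return (weight * vdelta) / NICE_0_LOAD;
}

/*
 * Inflate the duration by the fraction of CPU stolen by other activity
 * (util in [0, 1024)), mirroring the patch's scale computation.
 */
static uint64_t scale_for_util(uint64_t delta, uint64_t util)
{
	uint64_t scale = 1024;

	if (util && util < UTIL_MAX) {
		scale *= 1024;
		scale /= (UTIL_MAX - util);
	}
	return (scale * delta) / 1024;
}
```

A nice-0 task (weight == NICE_0_LOAD) 3ms of virtual time from its deadline gets a 3ms timer; if half the CPU is eaten by other activity (util == 512), the timer is stretched to 6ms so it still fires near the actual deadline.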
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 55 +++++++++++++++++++++++++++++-----------------------
1 file changed, 31 insertions(+), 24 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5511,7 +5511,7 @@ static void put_prev_entity(struct cfs_r
}
static void
-entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
+entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
/*
* Update run-time statistics of the 'current'.
@@ -5523,17 +5523,6 @@ entity_tick(struct cfs_rq *cfs_rq, struc
*/
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);
-
-#ifdef CONFIG_SCHED_HRTICK
- /*
- * queued ticks are scheduled to match the slice, so don't bother
- * validating it and just reschedule.
- */
- if (queued) {
- resched_curr_lazy(rq_of(cfs_rq));
- return;
- }
-#endif
}
@@ -6735,21 +6724,39 @@ static inline void sched_fair_update_sto
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
+ unsigned long scale = 1024;
+ unsigned long util = 0;
+ u64 vdelta;
+ u64 delta;
WARN_ON_ONCE(task_rq(p) != rq);
- if (rq->cfs.h_nr_queued > 1) {
- u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
- u64 slice = se->slice;
- s64 delta = slice - ran;
-
- if (delta < 0) {
- if (task_current_donor(rq, p))
- resched_curr(rq);
- return;
- }
- hrtick_start(rq, delta);
+ if (rq->cfs.h_nr_queued <= 1)
+ return;
+
+ /*
+ * Compute time until virtual deadline
+ */
+ vdelta = se->deadline - se->vruntime;
+ if ((s64)vdelta < 0) {
+ if (task_current_donor(rq, p))
+ resched_curr(rq);
+ return;
+ }
+ delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+ /*
+ * Correct for instantaneous load of other classes.
+ */
+ util += cpu_util_dl(rq);
+ util += cpu_util_rt(rq);
+ util += cpu_util_irq(rq);
+ if (util && util < 1024) {
+ scale *= 1024;
+ scale /= (1024 - util);
}
+
+ hrtick_start(rq, (scale * delta) / 1024);
}
/*
@@ -13373,7 +13380,7 @@ static void task_tick_fair(struct rq *rq
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- entity_tick(cfs_rq, se, queued);
+ entity_tick(cfs_rq, se);
}
if (queued) {
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/6] sched/eevdf: Fix HRTICK duration
2026-01-21 16:20 ` [PATCH v2 1/6] sched/eevdf: Fix HRTICK duration Peter Zijlstra
@ 2026-01-22 10:53 ` Juri Lelli
2026-02-05 8:38 ` Peter Zijlstra
1 sibling, 0 replies; 25+ messages in thread
From: Juri Lelli @ 2026-01-22 10:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, oliver.sang
Hello,
On 21/01/26 17:20, Peter Zijlstra wrote:
> The nominal duration for an EEVDF task to run is until its deadline.
> At which point the deadline is moved ahead and a new task selection is
> done.
>
> Try and predict the time 'lost' to higher scheduling classes. Since
> this is an estimate, the timer can be both early or late. In case it
> is early task_tick_fair() will take the !need_resched() path and
> restarts the timer.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
...
> @@ -6735,21 +6724,39 @@ static inline void sched_fair_update_sto
> static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> {
> struct sched_entity *se = &p->se;
> + unsigned long scale = 1024;
> + unsigned long util = 0;
> + u64 vdelta;
> + u64 delta;
>
> WARN_ON_ONCE(task_rq(p) != rq);
>
> - if (rq->cfs.h_nr_queued > 1) {
> - u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> - u64 slice = se->slice;
> - s64 delta = slice - ran;
> -
> - if (delta < 0) {
> - if (task_current_donor(rq, p))
> - resched_curr(rq);
> - return;
> - }
> - hrtick_start(rq, delta);
> + if (rq->cfs.h_nr_queued <= 1)
> + return;
> +
> + /*
> + * Compute time until virtual deadline
> + */
> + vdelta = se->deadline - se->vruntime;
> + if ((s64)vdelta < 0) {
> + if (task_current_donor(rq, p))
> + resched_curr(rq);
> + return;
> + }
> + delta = (se->load.weight * vdelta) / NICE_0_LOAD;
Nit.. guess we don't fear overflow since vdelta should be bounded
anyway.
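[Editor's note: the overflow nit can be checked with a quick bound; 88761 below is the assumed maximum entry of sched_prio_to_weight (nice -20).]

```c
#include <assert.h>
#include <stdint.h>

/* Assumed maximum load weight (sched_prio_to_weight[0], nice -20). */
#define MAX_WEIGHT	88761ULL

/* Largest vdelta before weight * vdelta wraps a u64. */
static uint64_t max_safe_vdelta(void)
{
	return UINT64_MAX / MAX_WEIGHT;
}
```

That bound is roughly 2.08e14 ns of virtual time, i.e. more than two days, which is far beyond any sane deadline - vruntime gap, so the multiplication indeed cannot overflow in practice.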
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Thanks,
Juri
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/6] sched/eevdf: Fix HRTICK duration
2026-01-21 16:20 ` [PATCH v2 1/6] sched/eevdf: Fix HRTICK duration Peter Zijlstra
2026-01-22 10:53 ` Juri Lelli
@ 2026-02-05 8:38 ` Peter Zijlstra
1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2026-02-05 8:38 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Wed, Jan 21, 2026 at 05:20:11PM +0100, Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6735,21 +6724,39 @@ static inline void sched_fair_update_sto
> static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> {
> struct sched_entity *se = &p->se;
> + unsigned long scale = 1024;
> + unsigned long util = 0;
> + u64 vdelta;
> + u64 delta;
>
> WARN_ON_ONCE(task_rq(p) != rq);
>
> + if (rq->cfs.h_nr_queued <= 1)
> + return;
> +
> + /*
> + * Compute time until virtual deadline
> + */
> + vdelta = se->deadline - se->vruntime;
> + if ((s64)vdelta < 0) {
> + if (task_current_donor(rq, p))
> + resched_curr(rq);
> + return;
> + }
> + delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> +
> + /*
> + * Correct for instantaneous load of other classes.
> + */
> + util += cpu_util_dl(rq);
> + util += cpu_util_rt(rq);
Since this is all about current, the other scheduling classes are
irrelevant; they cannot run without causing a schedule(), which will
cause the hrtick to be reprogrammed anyway.
So I'm thinking those two lines above ought to go.
> + util += cpu_util_irq(rq);
> + if (util && util < 1024) {
> + scale *= 1024;
> + scale /= (1024 - util);
> }
> +
> + hrtick_start(rq, (scale * delta) / 1024);
> }
>
> /*
> @@ -5511,7 +5511,7 @@ static void put_prev_entity(struct cfs_r
> }
>
> static void
> -entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> +entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> {
> /*
> * Update run-time statistics of the 'current'.
> @@ -5523,17 +5523,6 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> */
> update_load_avg(cfs_rq, curr, UPDATE_TG);
> update_cfs_group(curr);
> -
> -#ifdef CONFIG_SCHED_HRTICK
> - /*
> - * queued ticks are scheduled to match the slice, so don't bother
> - * validating it and just reschedule.
> - */
> - if (queued) {
> - resched_curr_lazy(rq_of(cfs_rq));
> - return;
> - }
> -#endif
> }
>
>
> @@ -13373,7 +13380,7 @@ static void task_tick_fair(struct rq *rq
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> - entity_tick(cfs_rq, se, queued);
> + entity_tick(cfs_rq, se);
> }
>
> if (queued) {
>
So Thomas did observe some really short hrtimer reprogramming intervals
because of this. If we just miss the normal deadline, it will retry
with a stupid sliver of time.
Perhaps it makes sense to leave these two hunks, and simply hard preempt
when the hrtick goes, irrespective of slightly missing the vruntime due
to the approximation on task-clock.
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 2/6] hrtimer: Optimize __hrtimer_start_range_ns()
2026-01-21 16:20 [PATCH v2 0/6] hrtimer/sched: Improve hrtick Peter Zijlstra
2026-01-21 16:20 ` [PATCH v2 1/6] sched/eevdf: Fix HRTICK duration Peter Zijlstra
@ 2026-01-21 16:20 ` Peter Zijlstra
2026-01-22 11:00 ` Juri Lelli
2026-02-02 12:28 ` Thomas Gleixner
2026-01-21 16:20 ` [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK Peter Zijlstra
` (3 subsequent siblings)
5 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-21 16:20 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Much like hrtimer_reprogram(), skip programming if the cpu_base is
running the hrtimer interrupt.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/time/hrtimer.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1261,6 +1261,14 @@ static int __hrtimer_start_range_ns(stru
}
first = enqueue_hrtimer(timer, new_base, mode);
+
+ /*
+ * If the hrtimer interrupt is running, then it will reevaluate the
+ * clock bases and reprogram the clock event device.
+ */
+ if (new_base->cpu_base->in_hrtirq)
+ return 0;
+
if (!force_local) {
/*
* If the current CPU base is online, then the timer is
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH v2 2/6] hrtimer: Optimize __hrtimer_start_range_ns()
2026-01-21 16:20 ` [PATCH v2 2/6] hrtimer: Optimize __hrtimer_start_range_ns() Peter Zijlstra
@ 2026-01-22 11:00 ` Juri Lelli
2026-02-02 12:28 ` Thomas Gleixner
1 sibling, 0 replies; 25+ messages in thread
From: Juri Lelli @ 2026-01-22 11:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, oliver.sang
Hello,
On 21/01/26 17:20, Peter Zijlstra wrote:
> Much like hrtimer_reprogram(), skip programming if the cpu_base is
> running the hrtimer interrupt.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Thanks,
Juri
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/6] hrtimer: Optimize __hrtimer_start_range_ns()
2026-01-21 16:20 ` [PATCH v2 2/6] hrtimer: Optimize __hrtimer_start_range_ns() Peter Zijlstra
2026-01-22 11:00 ` Juri Lelli
@ 2026-02-02 12:28 ` Thomas Gleixner
1 sibling, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-02-02 12:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Wed, Jan 21 2026 at 17:20, Peter Zijlstra wrote:
> Much like hrtimer_reprogram(), skip programming if the cpu_base is
> running the hrtimer interrupt.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK
2026-01-21 16:20 [PATCH v2 0/6] hrtimer/sched: Improve hrtick Peter Zijlstra
2026-01-21 16:20 ` [PATCH v2 1/6] sched/eevdf: Fix HRTICK duration Peter Zijlstra
2026-01-21 16:20 ` [PATCH v2 2/6] hrtimer: Optimize __hrtimer_start_range_ns() Peter Zijlstra
@ 2026-01-21 16:20 ` Peter Zijlstra
2026-01-22 13:12 ` Juri Lelli
2026-02-02 14:02 ` Thomas Gleixner
2026-01-21 16:20 ` [PATCH v2 4/6] hrtimer: Re-arrange hrtimer_interrupt() Peter Zijlstra
` (2 subsequent siblings)
5 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-21 16:20 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Upon schedule() HRTICK will cancel the current timer, pick the next
task and reprogram the timer. When schedule() consistently triggers
due to blocking conditions instead of the timer, this leads to endless
reprogramming without ever firing.
Mitigate this with a new hrtimer mode: fuzzy (not really happy with
that name); this mode does two things:
- skip reprogramming the hardware on timer remove;
- skip reprogramming the hardware when the new timer
is after cpu_base->expires_next
Both things are already possible:
- removing a remote timer will leave the hardware programmed and
cause a spurious interrupt.
- this remote CPU adding a timer can skip the reprogramming
when the timer's expiration is after the (spurious) expiration.
This new timer mode simply causes more of this 'fuzzy' behaviour; it
causes a few spurious interrupts, but similarly avoids endlessly
reprogramming the timer.
This makes the HRTICK match the NO_HRTICK hackbench runs -- the case
where a task never runs until its slice is complete but always goes
to sleep early.
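[Editor's note: the two skips the fuzzy mode adds can be modelled in a few lines of userspace C. This is a toy model of the decisions only, not the kernel's control flow; field and flag names merely mirror the patch.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the per-CPU base; only the field this sketch needs. */
struct base_model {
	uint64_t expires_next;	/* expiry the hardware is programmed for */
};

/*
 * On remove: a fuzzy timer leaves the hardware alone, accepting a
 * possible spurious interrupt instead of a reprogram.
 */
static bool remove_reprograms(bool is_fuzzy)
{
	return !is_fuzzy;
}

/*
 * On start: a fuzzy timer only reprograms when its expiry is earlier
 * than what the hardware already has queued.
 */
static bool start_reprograms(const struct base_model *b,
			     uint64_t expires, bool is_fuzzy)
{
	if (is_fuzzy)
		return expires < b->expires_next;
	return true;	/* the non-fuzzy local path may force-reprogram */
}
```

In the cancel/reprogram-per-schedule() pattern described above, both calls become no-ops whenever the new expiry is at or after the stale one, trading a spurious interrupt for the endless reprogramming.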
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/hrtimer.h | 1 +
include/linux/hrtimer_types.h | 1 +
kernel/sched/core.c | 3 ++-
kernel/time/hrtimer.c | 16 +++++++++++++++-
4 files changed, 19 insertions(+), 2 deletions(-)
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -38,6 +38,7 @@ enum hrtimer_mode {
HRTIMER_MODE_PINNED = 0x02,
HRTIMER_MODE_SOFT = 0x04,
HRTIMER_MODE_HARD = 0x08,
+ HRTIMER_MODE_FUZZY = 0x10,
HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -45,6 +45,7 @@ struct hrtimer {
u8 is_rel;
u8 is_soft;
u8 is_hard;
+ u8 is_fuzzy;
};
#endif /* _LINUX_HRTIMER_TYPES_H */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -928,7 +928,8 @@ void hrtick_start(struct rq *rq, u64 del
static void hrtick_rq_init(struct rq *rq)
{
INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
- hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+ hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL_HARD | HRTIMER_MODE_FUZZY);
}
#else /* !CONFIG_SCHED_HRTICK: */
static inline void hrtick_clear(struct rq *rq)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1122,7 +1122,7 @@ static void __remove_hrtimer(struct hrti
* an superfluous call to hrtimer_force_reprogram() on the
* remote cpu later on if the same timer gets enqueued again.
*/
- if (reprogram && timer == cpu_base->next_timer)
+ if (!timer->is_fuzzy && reprogram && timer == cpu_base->next_timer)
hrtimer_force_reprogram(cpu_base, 1);
}
@@ -1269,6 +1269,19 @@ static int __hrtimer_start_range_ns(stru
if (new_base->cpu_base->in_hrtirq)
return 0;
+ if (timer->is_fuzzy) {
+ /*
+ * XXX fuzzy implies pinned! not sure how to deal with
+ * retrigger_next_event() for the !local case.
+ */
+ WARN_ON_ONCE(!(mode & HRTIMER_MODE_PINNED));
+ /*
+ * Notably, by going into hrtimer_reprogram(), it will
+ * not reprogram if cpu_base->expires_next is earlier.
+ */
+ return first;
+ }
+
if (!force_local) {
/*
* If the current CPU base is online, then the timer is
@@ -1645,6 +1658,7 @@ static void __hrtimer_setup(struct hrtim
base += hrtimer_clockid_to_base(clock_id);
timer->is_soft = softtimer;
timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
+ timer->is_fuzzy = !!(mode & HRTIMER_MODE_FUZZY);
timer->base = &cpu_base->clock_base[base];
timerqueue_init(&timer->node);
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK
2026-01-21 16:20 ` [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK Peter Zijlstra
@ 2026-01-22 13:12 ` Juri Lelli
2026-01-23 20:04 ` Steven Rostedt
2026-02-02 14:02 ` Thomas Gleixner
1 sibling, 1 reply; 25+ messages in thread
From: Juri Lelli @ 2026-01-22 13:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, oliver.sang
Hello,
On 21/01/26 17:20, Peter Zijlstra wrote:
> Upon schedule() HRTICK will cancel the current timer, pick the next
> task and reprogram the timer. When schedule() consistently triggers
> due to blocking conditions instead of the timer, this leads to endless
> reprogramming without ever firing.
>
> Mitigate this with a new hrtimer mode: fuzzy (not really happy with
> that name); this mode does two things:
Does the more common (lazier :) 'lazy' work better?
>
> - skip reprogramming the hardware on timer remove;
> - skip reprogramming the hardware when the new timer
> is after cpu_base->expires_next
>
> Both things are already possible;
>
> - removing a remote timer will leave the hardware programmed and
> cause a spurious interrupt.
> - this remote CPU adding a timer can skip the reprogramming
> when the timer's expiration is after the (spurious) expiration.
>
> This new timer mode simply causes more of this 'fuzzy' behaviour; it
> causes a few spurious interrupts, but similarly avoids endlessly
> reprogramming the timer.
>
> This makes the HRTICK match the NO_HRTICK hackbench runs -- the case
> where a task never runs until its slice is complete but always goes
> sleep early.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
...
> @@ -1269,6 +1269,19 @@ static int __hrtimer_start_range_ns(stru
> if (new_base->cpu_base->in_hrtirq)
> return 0;
>
> + if (timer->is_fuzzy) {
> + /*
> + * XXX fuzzy implies pinned! not sure how to deal with
> + * retrigger_next_event() for the !local case.
> + */
> + WARN_ON_ONCE(!(mode & HRTIMER_MODE_PINNED));
Not sure either, but since it's improving things for local already,
maybe it's an acceptable first step?
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Thanks,
Juri
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK
2026-01-22 13:12 ` Juri Lelli
@ 2026-01-23 20:04 ` Steven Rostedt
0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2026-01-23 20:04 UTC (permalink / raw)
To: Juri Lelli
Cc: Peter Zijlstra, tglx, arnd, anna-maria, frederic, luto, mingo,
vincent.guittot, dietmar.eggemann, bsegall, mgorman, vschneid,
linux-kernel, oliver.sang
On Thu, 22 Jan 2026 14:12:28 +0100
Juri Lelli <juri.lelli@redhat.com> wrote:
> Hello,
>
> On 21/01/26 17:20, Peter Zijlstra wrote:
> > Upon schedule() HRTICK will cancel the current timer, pick the next
> > task and reprogram the timer. When schedule() consistently triggers
> > due to blocking conditions instead of the timer, this leads to endless
> > reprogramming without ever firing.
> >
> > Mitigate this with a new hrtimer mode: fuzzy (not really happy with
> > that name); this mode does two things:
>
> Does the more common (lazier :) 'lazy' work better?
I don't like either fuzzy or lazy.
Fuzzy makes me think of just random entries (for fuzz testing and such).
Lazy is to postpone things to do things less often.
What about "speculative"? Like branch prediction and such. Where a timer
is expected to be used at a certain time but it may not be?
-- Steve
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK
2026-01-21 16:20 ` [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK Peter Zijlstra
2026-01-22 13:12 ` Juri Lelli
@ 2026-02-02 14:02 ` Thomas Gleixner
1 sibling, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-02-02 14:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Wed, Jan 21 2026 at 17:20, Peter Zijlstra wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -928,7 +928,8 @@ void hrtick_start(struct rq *rq, u64 del
> static void hrtick_rq_init(struct rq *rq)
> {
> INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
> - hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
> + hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
> + HRTIMER_MODE_REL_HARD | HRTIMER_MODE_FUZZY);
Shouldn't this be HRTIMER_MODE_REL_PINNED_HARD? I know it's set when
starting the timer, but I had to double check it.
> }
> #else /* !CONFIG_SCHED_HRTICK: */
> static inline void hrtick_clear(struct rq *rq)
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1122,7 +1122,7 @@ static void __remove_hrtimer(struct hrti
> * an superfluous call to hrtimer_force_reprogram() on the
> * remote cpu later on if the same timer gets enqueued again.
> */
> - if (reprogram && timer == cpu_base->next_timer)
> + if (!timer->is_fuzzy && reprogram && timer == cpu_base->next_timer)
> hrtimer_force_reprogram(cpu_base, 1);
> }
>
> @@ -1269,6 +1269,19 @@ static int __hrtimer_start_range_ns(stru
> if (new_base->cpu_base->in_hrtirq)
> return 0;
>
> + if (timer->is_fuzzy) {
> + /*
> + * XXX fuzzy implies pinned! not sure how to deal with
> + * retrigger_next_event() for the !local case.
I'd rather say:
Fuzzy requires pinned as the lazy reprogramming only works
for CPU local timers.
> + */
> + WARN_ON_ONCE(!(mode & HRTIMER_MODE_PINNED));
Other than that:
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Thanks,
tglx
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 4/6] hrtimer: Re-arrange hrtimer_interrupt()
2026-01-21 16:20 [PATCH v2 0/6] hrtimer/sched: Improve hrtick Peter Zijlstra
` (2 preceding siblings ...)
2026-01-21 16:20 ` [PATCH v2 3/6] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK Peter Zijlstra
@ 2026-01-21 16:20 ` Peter Zijlstra
2026-02-02 14:05 ` Thomas Gleixner
2026-01-21 16:20 ` [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
2026-01-21 16:20 ` [PATCH v2 6/6] sched: Default enable HRTICK Peter Zijlstra
5 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-21 16:20 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Rework hrtimer_interrupt() such that reprogramming is split out into
an independent function at the end of the interrupt.
This prepares for reprogramming getting delayed beyond the end of
hrtimer_interrupt().
Notably, this changes the hang handling to always wait 100ms instead
of trying to keep it proportional to the actual delay. This simplifies
the state; besides, this really shouldn't be happening.
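[Editor's note: the reworked retry/backoff policy can be sketched as a toy model in userspace C. The real code loops via goto retry and re-reads the clock; this captures only the decision: up to three passes, then a flat 100ms backoff.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC	1000000ULL

/*
 * One pass of the reworked logic: if the next expiry is still in the
 * past after three attempts, flag a hang and program a flat 100ms
 * backoff instead of the old delay-proportional wait.
 */
static uint64_t next_program(uint64_t expires_next, uint64_t now,
			     int *retries, bool *hang)
{
	if (expires_next < now && ++(*retries) >= 3)
		*hang = true;

	if (*hang)
		return now + 100 * NSEC_PER_MSEC;
	return expires_next;
}
```

With two retries already burned and the expiry still in the past, the third pass trips the hang path and pushes the next event a full 100ms out; an expiry in the future programs the hardware as usual.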
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/time/hrtimer.c | 87 ++++++++++++++++++++++----------------------------
1 file changed, 39 insertions(+), 48 deletions(-)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1889,6 +1889,29 @@ static __latent_entropy void hrtimer_run
#ifdef CONFIG_HIGH_RES_TIMERS
/*
+ * Very similar to hrtimer_force_reprogram(), except it deals with
+ * in_hrirq and hang_detected.
+ */
+static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+{
+ ktime_t expires_next = hrtimer_update_next_event(cpu_base);
+
+ cpu_base->expires_next = expires_next;
+ cpu_base->in_hrtirq = 0;
+
+ if (unlikely(cpu_base->hang_detected)) {
+ /*
+ * Give the system a chance to do something else than looping
+ * on hrtimer interrupts.
+ */
+ expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
+ cpu_base->hang_detected = 0;
+ }
+
+ tick_program_event(expires_next, 1);
+}
+
+/*
* High resolution timer interrupt
* Called with interrupts disabled
*/
@@ -1924,63 +1947,31 @@ void hrtimer_interrupt(struct clock_even
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the [soft] next expiry */
- expires_next = hrtimer_update_next_event(cpu_base);
- /*
- * Store the new expiry value so the migration code can verify
- * against it.
- */
- cpu_base->expires_next = expires_next;
- cpu_base->in_hrtirq = 0;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
- /* Reprogramming necessary ? */
- if (!tick_program_event(expires_next, 0)) {
- cpu_base->hang_detected = 0;
- return;
- }
-
/*
* The next timer was already expired due to:
* - tracing
* - long lasting callbacks
* - being scheduled away when running in a VM
*
- * We need to prevent that we loop forever in the hrtimer
- * interrupt routine. We give it 3 attempts to avoid
- * overreacting on some spurious event.
- *
- * Acquire base lock for updating the offsets and retrieving
- * the current time.
+ * We need to prevent that we loop forever in the hrtimer interrupt
+ * routine. We give it 3 attempts to avoid overreacting on some
+ * spurious event.
*/
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
+ expires_next = hrtimer_update_next_event(cpu_base);
now = hrtimer_update_base(cpu_base);
- cpu_base->nr_retries++;
- if (++retries < 3)
- goto retry;
- /*
- * Give the system a chance to do something else than looping
- * here. We stored the entry time, so we know exactly how long
- * we spent here. We schedule the next event this amount of
- * time away.
- */
- cpu_base->nr_hangs++;
- cpu_base->hang_detected = 1;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ if (expires_next < now) {
+ if (++retries < 3)
+ goto retry;
+
+ delta = ktime_sub(now, entry_time);
+ cpu_base->max_hang_time = max_t(unsigned int,
+ cpu_base->max_hang_time, delta);
+ cpu_base->nr_hangs++;
+ cpu_base->hang_detected = 1;
+ }
- delta = ktime_sub(now, entry_time);
- if ((unsigned int)delta > cpu_base->max_hang_time)
- cpu_base->max_hang_time = (unsigned int) delta;
- /*
- * Limit it to a sensible value as we enforce a longer
- * delay. Give the CPU at least 100ms to catch up.
- */
- if (delta > 100 * NSEC_PER_MSEC)
- expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
- else
- expires_next = ktime_add(now, delta);
- tick_program_event(expires_next, 1);
- pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+ __hrtimer_rearm(cpu_base, now);
+ raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
}
#endif /* !CONFIG_HIGH_RES_TIMERS */
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/6] hrtimer: Re-arrange hrtimer_interrupt()
2026-01-21 16:20 ` [PATCH v2 4/6] hrtimer: Re-arrange hrtimer_interrupt() Peter Zijlstra
@ 2026-02-02 14:05 ` Thomas Gleixner
0 siblings, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-02-02 14:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Wed, Jan 21 2026 at 17:20, Peter Zijlstra wrote:
> Rework hrtimer_interrupt() such that reprogramming is split out into
> an independent function at the end of the interrupt.
>
> This prepares for reprogramming getting delayed beyond the end of
> hrtimer_interrupt().
>
> Notably, this changes the hang handling to always wait 100ms instead
> of trying to keep it proportional to the actual delay. This simplifies
> the state, also this really shouldn't be happening.
Indeed.
> /*
> + * Very similar to hrtimer_force_reprogram(), except it deals with
> + * in_hrirq and hang_detected.
in_hrtirq
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-01-21 16:20 [PATCH v2 0/6] hrtimer/sched: Improve hrtick Peter Zijlstra
` (3 preceding siblings ...)
2026-01-21 16:20 ` [PATCH v2 4/6] hrtimer: Re-arrange hrtimer_interrupt() Peter Zijlstra
@ 2026-01-21 16:20 ` Peter Zijlstra
2026-01-23 20:08 ` Steven Rostedt
2026-02-02 14:37 ` Thomas Gleixner
2026-01-21 16:20 ` [PATCH v2 6/6] sched: Default enable HRTICK Peter Zijlstra
5 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-21 16:20 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Currently hrtimer_interrupt() runs expired timers, which can re-arm
themselves, after which it computes the next expiration time and
re-programs the hardware.
However, things like HRTICK, a highres timer driving preemption,
cannot re-arm itself at the point of running, since the next task has
not been determined yet. The schedule() in the interrupt return path
will switch to the next task, which then causes a new hrtimer to be
programmed.
This then results in reprogramming the hardware at least twice, once
after running the timers, and once upon selecting the new task.
Notably, *both* events happen in the interrupt.
By pushing the hrtimer reprogram all the way into the interrupt return
path, it runs after schedule() and this double reprogram can be
avoided.
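[Editor's note: the deferred-rearm protocol this patch builds can be modelled with a plain flag. This is a sketch only; the real flag is a TIF_ bit in the thread-info word, tested on the irqentry exit paths and in __schedule() before switch_to().]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one "thread flag" and a counter of hardware reprograms. */
static bool tif_hrtimer_rearm;
static int hw_programs;

/* The hrtimer interrupt defers: set the flag, touch no hardware. */
static void hrtimer_interrupt_model(void)
{
	tif_hrtimer_rearm = true;
}

/*
 * The last code to run before returning from the interrupt -- either
 * schedule() after picking the next task, or the exit-to-user path --
 * tests and clears the flag, so the whole interrupt costs at most one
 * hardware reprogram no matter how many timers were re-armed.
 */
static void exit_path_model(void)
{
	if (tif_hrtimer_rearm) {
		tif_hrtimer_rearm = false;
		hw_programs++;
	}
}
```

Even if both the schedule() path and the exit-to-user path run, only the first one finds the flag set, which is exactly the double-reprogram the changelog wants to avoid.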
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/asm-generic/thread_info_tif.h | 5 ++++-
include/linux/hrtimer.h | 17 +++++++++++++++++
include/linux/irq-entry-common.h | 2 ++
kernel/entry/common.c | 13 +++++++++++++
kernel/sched/core.c | 10 ++++++++++
kernel/time/hrtimer.c | 28 ++++++++++++++++++++++++----
6 files changed, 70 insertions(+), 5 deletions(-)
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -41,11 +41,14 @@
#define _TIF_PATCH_PENDING BIT(TIF_PATCH_PENDING)
#ifdef HAVE_TIF_RESTORE_SIGMASK
-# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal() */
+# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal()
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
#define TIF_RSEQ 11 // Run RSEQ fast path
#define _TIF_RSEQ BIT(TIF_RSEQ)
+#define TIF_HRTIMER_REARM 12 // re-arm the timer
+#define _TIF_HRTIMER_REARM BIT(TIF_HRTIMER_REARM)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -175,10 +175,27 @@ extern void hrtimer_interrupt(struct clo
extern unsigned int hrtimer_resolution;
+#ifdef TIF_HRTIMER_REARM
+extern void _hrtimer_rearm(void);
+/*
+ * This is to be called on all irqentry_exit() paths that will enable
+ * interrupts; as well as in the context switch path before switch_to().
+ */
+static inline void hrtimer_rearm(void)
+{
+ if (test_thread_flag(TIF_HRTIMER_REARM))
+ _hrtimer_rearm();
+}
+#else
+static inline void hrtimer_rearm(void) { }
+#endif /* TIF_HRTIMER_REARM */
+
#else
#define hrtimer_resolution (unsigned int)LOW_RES_NSEC
+static inline void hrtimer_rearm(void) { }
+
#endif
static inline ktime_t
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -224,6 +224,8 @@ static __always_inline void __exit_to_us
ti_work = read_thread_flags();
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
+ else
+ hrtimer_rearm();
arch_exit_to_user_mode_prepare(regs, ti_work);
}
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -7,6 +7,7 @@
#include <linux/kmsan.h>
#include <linux/livepatch.h>
#include <linux/tick.h>
+#include <linux/hrtimer.h>
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
@@ -26,6 +27,16 @@ static __always_inline unsigned long __e
*/
while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
+ /*
+ * If hrtimer need re-arming, do so before enabling IRQs,
+ * except when a reschedule is needed, in that case schedule()
+ * will do this.
+ */
+ if ((ti_work & (_TIF_NEED_RESCHED |
+ _TIF_NEED_RESCHED_LAZY |
+ _TIF_HRTIMER_REARM)) == _TIF_HRTIMER_REARM)
+ hrtimer_rearm();
+
local_irq_enable_exit_to_user(ti_work);
if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
@@ -202,6 +213,7 @@ noinstr void irqentry_exit(struct pt_reg
*/
if (state.exit_rcu) {
instrumentation_begin();
+ hrtimer_rearm();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare();
@@ -215,6 +227,7 @@ noinstr void irqentry_exit(struct pt_reg
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
+ hrtimer_rearm();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6814,6 +6814,16 @@ static void __sched notrace __schedule(i
keep_resched:
rq->last_seen_need_resched_ns = 0;
+ /*
+ * Notably, this must be called after pick_next_task() but before
+ * switch_to(), since the new task need not be on the return from
+ * interrupt path. Additionally, exit_to_user_mode_loop() relies on
+ * any schedule() call to imply this call, so do it unconditionally.
+ *
+ * We've just cleared TIF_NEED_RESCHED, TIF word should be in cache.
+ */
+ hrtimer_rearm();
+
is_switch = prev != next;
if (likely(is_switch)) {
rq->nr_switches++;
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1892,10 +1892,9 @@ static __latent_entropy void hrtimer_run
* Very similar to hrtimer_force_reprogram(), except it deals with
* in_hrirq and hang_detected.
*/
-static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base,
+ ktime_t now, ktime_t expires_next)
{
- ktime_t expires_next = hrtimer_update_next_event(cpu_base);
-
cpu_base->expires_next = expires_next;
cpu_base->in_hrtirq = 0;
@@ -1970,9 +1969,30 @@ void hrtimer_interrupt(struct clock_even
cpu_base->hang_detected = 1;
}
- __hrtimer_rearm(cpu_base, now);
+#ifdef TIF_HRTIMER_REARM
+ set_thread_flag(TIF_HRTIMER_REARM);
+#else
+ __hrtimer_rearm(cpu_base, now, expires_next);
+#endif
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
}
+
+#ifdef TIF_HRTIMER_REARM
+void _hrtimer_rearm(void)
+{
+ struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+ ktime_t now, expires_next;
+
+ lockdep_assert_irqs_disabled();
+
+ scoped_guard (raw_spinlock, &cpu_base->lock) {
+ now = hrtimer_update_base(cpu_base);
+ expires_next = hrtimer_update_next_event(cpu_base);
+ __hrtimer_rearm(cpu_base, now, expires_next);
+ clear_thread_flag(TIF_HRTIMER_REARM);
+ }
+}
+#endif /* TIF_HRTIMER_REARM */
#endif /* !CONFIG_HIGH_RES_TIMERS */
/*
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-01-21 16:20 ` [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
@ 2026-01-23 20:08 ` Steven Rostedt
2026-01-23 21:04 ` Peter Zijlstra
2026-02-02 14:37 ` Thomas Gleixner
1 sibling, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2026-01-23 20:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, bsegall, mgorman, vschneid,
linux-kernel, oliver.sang
On Wed, 21 Jan 2026 17:20:15 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> +#ifdef TIF_HRTIMER_REARM
> +void _hrtimer_rearm(void)
> +{
> + struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
> + ktime_t now, expires_next;
> +
> + lockdep_assert_irqs_disabled();
> +
> + scoped_guard (raw_spinlock, &cpu_base->lock) {
> + now = hrtimer_update_base(cpu_base);
> + expires_next = hrtimer_update_next_event(cpu_base);
> + __hrtimer_rearm(cpu_base, now, expires_next);
> + clear_thread_flag(TIF_HRTIMER_REARM);
> + }
> +}
I'm curious to why you decided to use scoped_guard() here and not just
guard() and not add the extra indentation? The function is small enough
where everything is expected to be protected by the spinlock.
-- Steve
* Re: [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-01-23 20:08 ` Steven Rostedt
@ 2026-01-23 21:04 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-23 21:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, bsegall, mgorman, vschneid,
linux-kernel, oliver.sang
On Fri, Jan 23, 2026 at 03:08:43PM -0500, Steven Rostedt wrote:
> On Wed, 21 Jan 2026 17:20:15 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > +#ifdef TIF_HRTIMER_REARM
> > +void _hrtimer_rearm(void)
> > +{
> > + struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
> > + ktime_t now, expires_next;
> > +
> > + lockdep_assert_irqs_disabled();
> > +
> > + scoped_guard (raw_spinlock, &cpu_base->lock) {
> > + now = hrtimer_update_base(cpu_base);
> > + expires_next = hrtimer_update_next_event(cpu_base);
> > + __hrtimer_rearm(cpu_base, now, expires_next);
> > + clear_thread_flag(TIF_HRTIMER_REARM);
> > + }
> > +}
>
> I'm curious to why you decided to use scoped_guard() here and not just
> guard() and not add the extra indentation? The function is small enough
> where everything is expected to be protected by the spinlock.
Yeah, I'm not entirely sure... it's been over 6 months since I wrote this
code :-/
* Re: [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-01-21 16:20 ` [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
2026-01-23 20:08 ` Steven Rostedt
@ 2026-02-02 14:37 ` Thomas Gleixner
2026-02-02 16:33 ` Peter Zijlstra
1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2026-02-02 14:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Wed, Jan 21 2026 at 17:20, Peter Zijlstra wrote:
> while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
>
> + /*
> + * If hrtimer need re-arming, do so before enabling IRQs,
> + * except when a reschedule is needed, in that case schedule()
> + * will do this.
> + */
> + if ((ti_work & (_TIF_NEED_RESCHED |
> + _TIF_NEED_RESCHED_LAZY |
> + _TIF_HRTIMER_REARM)) == _TIF_HRTIMER_REARM)
> + hrtimer_rearm();
Two things I'm not convinced are handled correctly:
1) Interrupts
After reenabling interrupts and before reaching schedule() an
interrupt comes in and runs soft interrupt processing for a while
on the way back, which delays the update until that processing
completes.
2) Time slice extension
When the time slice is granted this will not rearm the clockevent
device unless the slice hrtimer becomes the first expiring timer
on that CPU, but even then that misses the full reevaluation of
the next timer event.
> -static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
> +static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base,
> + ktime_t now, ktime_t expires_next)
> {
> - ktime_t expires_next = hrtimer_update_next_event(cpu_base);
> -
> cpu_base->expires_next = expires_next;
> cpu_base->in_hrtirq = 0;
>
> @@ -1970,9 +1969,30 @@ void hrtimer_interrupt(struct clock_even
> cpu_base->hang_detected = 1;
> }
>
> - __hrtimer_rearm(cpu_base, now);
> +#ifdef TIF_HRTIMER_REARM
> + set_thread_flag(TIF_HRTIMER_REARM);
> +#else
> + __hrtimer_rearm(cpu_base, now, expires_next);
> +#endif
in hrtimer.h where you already have the #ifdef TIF_HRTIMER_REARM
section:
static inline bool hrtimer_set_rearm_delayed(void)
{
set_thread_flag(TIF_HRTIMER_REARM);
return true;
}
and an empty stub returning false for the other case, then this becomes:
if (!hrtimer_set_rearm_delayed())
hrtimer_rearm(...);
and the ugly ifdef in the code goes away.
> raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> }
> +
> +#ifdef TIF_HRTIMER_REARM
> +void _hrtimer_rearm(void)
Grr. I had to read this five times to figure out that we now have
hrtimer_rearm()
_hrtimer_rearm()
__hrtimer_rearm()
You clearly ran out of characters to make that obvious:
hrtimer_rearm_delayed()
hrtimer_rearm()
hrtimer_do_rearm()
or something like that.
Thanks,
tglx
* Re: [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-02-02 14:37 ` Thomas Gleixner
@ 2026-02-02 16:33 ` Peter Zijlstra
2026-02-02 23:28 ` Thomas Gleixner
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-02-02 16:33 UTC (permalink / raw)
To: Thomas Gleixner
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Mon, Feb 02, 2026 at 03:37:13PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 21 2026 at 17:20, Peter Zijlstra wrote:
> > while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
> >
> > + /*
> > + * If hrtimer need re-arming, do so before enabling IRQs,
> > + * except when a reschedule is needed, in that case schedule()
> > + * will do this.
> > + */
> > + if ((ti_work & (_TIF_NEED_RESCHED |
> > + _TIF_NEED_RESCHED_LAZY |
> > + _TIF_HRTIMER_REARM)) == _TIF_HRTIMER_REARM)
> > + hrtimer_rearm();
>
> Two things I'm not convinced that they are handled correctly:
>
> 1) Interrupts
>
> After reenabling interrupts and before reaching schedule() an
> interrupt comes in and runs soft interrupt processing for a while
> on the way back, which delays the update until that processing
> completes.
So the basic thing looks like:
<USER-MODE>
irqentry_enter()
run_irq_on_irqstack_cond()
if (user_mode() || hardirq_stack_inuse)
irq_enter_rcu();
func_c();
irq_exit_rcu()
__irq_exit_rcu()
invoke_softirq()
irqentry_exit()
irqentry_exit_to_user_mode()
irqentry_exit_to_user_mode_prepare()
__exit_to_user_mode_prepare()
exit_to_user_mode_loop()
...here...
So a nested IRQ at this point will have !user_mode(), but I think it can
still end up in softirqs due to that hardirq_stack_inuse. Should we
perhaps make sure only user_mode() ends up in softirqs?
> 2) Time slice extension
>
> When the time slice is granted this will not rearm the clockevent
> device unless the slice hrtimer becomes the first expiring timer
> on that CPU, but even then that misses the full reevaluation of
> the next timer event.
Oh crud yes, that should be something like:
if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
schedule();
else
hrtimer_rearm();
* Re: [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-02-02 16:33 ` Peter Zijlstra
@ 2026-02-02 23:28 ` Thomas Gleixner
2026-02-03 8:14 ` Thomas Gleixner
2026-02-04 13:58 ` Peter Zijlstra
0 siblings, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-02-02 23:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Mon, Feb 02 2026 at 17:33, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 03:37:13PM +0100, Thomas Gleixner wrote:
>> On Wed, Jan 21 2026 at 17:20, Peter Zijlstra wrote:
>> > while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
>> >
>> > + /*
>> > + * If hrtimer need re-arming, do so before enabling IRQs,
>> > + * except when a reschedule is needed, in that case schedule()
>> > + * will do this.
>> > + */
>> > + if ((ti_work & (_TIF_NEED_RESCHED |
>> > + _TIF_NEED_RESCHED_LAZY |
>> > + _TIF_HRTIMER_REARM)) == _TIF_HRTIMER_REARM)
>> > + hrtimer_rearm();
>>
>> Two things I'm not convinced that they are handled correctly:
>>
>> 1) Interrupts
>>
>> After reenabling interrupts and before reaching schedule() an
>> interrupt comes in and runs soft interrupt processing for a while
>> on the way back, which delays the update until that processing
>> completes.
>
> So the basic thing looks like:
>
> <USER-MODE>
> irqentry_enter()
> run_irq_on_irqstack_cond()
> if (user_mode() || hardirq_stack_inuse)
> irq_enter_rcu();
> func_c();
> irq_exit_rcu()
> __irq_exit_rcu()
> invoke_softirq()
> irqentry_exit()
> irqentry_exit_to_user_mode()
> irqentry_exit_to_user_mode_prepare()
> __exit_to_user_mode_prepare()
> exit_to_user_mode_loop()
> ...here...
>
> So a nested IRQ at this point will have !user_mode(), but I think it can
> still end up in softirqs due to that hardirq_stack_inuse. Should we
> perhaps make sure only user_mode() ends up in softirqs?
All interrupts, independent of the mode they hit in, end up in
irq_exit_rcu() and therefore in __irq_exit_rcu():
run_irq_on_irqstack_cond()
if (user_mode() || hardirq_stack_inuse)
// Stay on user or hardirq stack
irq_enter_rcu();
func_c();
irq_exit_rcu()
else
// MAGIC ASM to switch to hardirq stack
call irq_enter_rcu
call func_c
call irq_exit_rcu
The only reason why invoke_softirq() won't be called is when the
interrupt hits into the softirq processing region of the previous
interrupt, which means it's already on the hardirq stack.
But looking at this there is already a problem without interrupt
nesting:
irq_enter_rcu();
timer_interrupt()
hrtimer_interrupt()
delay_rearm();
irq_exit_rcu()
__irq_exit_rcu()
invoke_softirq() <- Here
Soft interrupts can run for quite some time, which means this already
can cause timers being delayed for way too long. I think in
__irq_exit_rcu() you want to do:
if (!in_interrupt() && local_softirq_pending()) {
hrtimer_rearm();
invoke_softirq();
}
Thanks,
tglx
* Re: [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-02-02 23:28 ` Thomas Gleixner
@ 2026-02-03 8:14 ` Thomas Gleixner
2026-02-04 13:58 ` Peter Zijlstra
1 sibling, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2026-02-03 8:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Tue, Feb 03 2026 at 00:28, Thomas Gleixner wrote:
> On Mon, Feb 02 2026 at 17:33, Peter Zijlstra wrote:
>> So a nested IRQ at this point will have !user_mode(), but I think it can
>> still end up in softirqs due to that hardirq_stack_inuse. Should we
>> perhaps make sure only user_mode() ends up in softirqs?
>
> All interrupts independent of the mode they hit are ending up in
> irq_exit_rcu() and therefore in __irq_exit_rcu()
>
> run_irq_on_irqstack_cond()
> if (user_mode() || hardirq_stack_inuse)
> // Stay on user or hardirq stack
> irq_enter_rcu();
> func_c();
> irq_exit_rcu()
> else
> // MAGIC ASM to switch to hardirq stack
> call irq_enter_rcu
> call func_c
> call irq_exit_rcu
>
> The only reason why invoke_softirq() won't be called is when the
> interrupt hits into the softirq processing region of the previous
> interrupt, which means it's already on the hardirq stack.
In the case I pointed out, where the second interrupt hits right after
exit to user has enabled interrupts, there is no nesting and it will happily
take the second path which switches to the hardirq stack and then on
return processes soft interrupts.
> But looking at this there is already a problem without interrupt
> nesting:
>
> irq_enter_rcu();
> timer_interrupt()
> hrtimer_interrupt()
> delay_rearm();
> irq_exit_rcu()
> __irq_exit_rcu()
> invoke_softirq() <- Here
>
> Soft interrupts can run for quite some time, which means this already
> can cause timers being delayed for way too long. I think in
> __irq_exit_rcu() you want to do:
>
> if (!in_interrupt() && local_softirq_pending()) {
> hrtimer_rearm();
> invoke_softirq();
> }
Actually it's worse. Assume the CPU on which this happens has the
jiffies duty. As the timer does not fire, jiffies become stale. So
anything which relies on jiffies going forward will get stuck until some
other condition breaks the tie. That's going to be fun to debug :)
Thanks,
tglx
* Re: [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path
2026-02-02 23:28 ` Thomas Gleixner
2026-02-03 8:14 ` Thomas Gleixner
@ 2026-02-04 13:58 ` Peter Zijlstra
1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2026-02-04 13:58 UTC (permalink / raw)
To: Thomas Gleixner
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Tue, Feb 03, 2026 at 12:28:13AM +0100, Thomas Gleixner wrote:
> But looking at this there is already a problem without interrupt
> nesting:
>
> irq_enter_rcu();
> timer_interrupt()
> hrtimer_interrupt()
> delay_rearm();
> irq_exit_rcu()
> __irq_exit_rcu()
> invoke_softirq() <- Here
>
> Soft interrupts can run for quite some time, which means this already
> can cause timers being delayed for way too long. I think in
> __irq_exit_rcu() you want to do:
>
> if (!in_interrupt() && local_softirq_pending()) {
> hrtimer_rearm();
> invoke_softirq();
> }
Right, and we can do the same on (nested) IRQ entry. Something like so:
---
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -63,6 +63,8 @@ static __always_inline unsigned long __e
if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
schedule();
+ else
+ hrtimer_rearm();
}
if (ti_work & _TIF_UPROBE)
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -663,6 +663,13 @@ void irq_enter_rcu(void)
{
__irq_enter_raw();
+ /*
+ * If this is a nested IRQ that hits the exit_to_user_mode_loop
+ * where it has enabled IRQs but before it has hit schedule()
+ * we could have hrtimers in an undefined state. Fix it up here.
+ */
+ hrtimer_rearm();
+
if (tick_nohz_full_cpu(smp_processor_id()) ||
(is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
tick_irq_enter();
@@ -719,8 +726,14 @@ static inline void __irq_exit_rcu(void)
#endif
account_hardirq_exit(current);
preempt_count_sub(HARDIRQ_OFFSET);
- if (!in_interrupt() && local_softirq_pending())
+ if (!in_interrupt() && local_softirq_pending()) {
+ /*
+ * If we left hrtimers unarmed, make sure to arm them now,
+ * before enabling interrupts to run SoftIRQ.
+ */
+ hrtimer_rearm();
invoke_softirq();
+ }
if (IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) && force_irqthreads() &&
local_timers_pending_force_th() && !(in_nmi() | in_hardirq()))
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1279,8 +1279,8 @@ static int __hrtimer_start_range_ns(stru
if (timer->is_fuzzy) {
/*
- * XXX fuzzy implies pinned! not sure how to deal with
- * retrigger_next_event() for the !local case.
+ * Fuzzy requires pinned as the lazy programming only works
+ * for CPU local timers.
*/
WARN_ON_ONCE(!(mode & HRTIMER_MODE_PINNED));
/*
@@ -1898,7 +1898,7 @@ static __latent_entropy void hrtimer_run
/*
* Very similar to hrtimer_force_reprogram(), except it deals with
- * in_hrirq and hang_detected.
+ * in_hrtirq and hang_detected.
*/
static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base,
ktime_t now, ktime_t expires_next)
* [PATCH v2 6/6] sched: Default enable HRTICK
2026-01-21 16:20 [PATCH v2 0/6] hrtimer/sched: Improve hrtick Peter Zijlstra
` (4 preceding siblings ...)
2026-01-21 16:20 ` [PATCH v2 5/6] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
@ 2026-01-21 16:20 ` Peter Zijlstra
2026-01-21 22:24 ` Phil Auld
5 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-21 16:20 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
... for generic entry architectures. This decouples preemption from
CONFIG_HZ, leaving only the periodic load-balancer and various
accounting things relying on the tick.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/features.h | 5 +++++
1 file changed, 5 insertions(+)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
+#ifdef TIF_HRTIMER_REARM
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(HRTICK_DL, false)
+#endif
/*
* Decrement CPU capacity based on time not spent running tasks
* Re: [PATCH v2 6/6] sched: Default enable HRTICK
2026-01-21 16:20 ` [PATCH v2 6/6] sched: Default enable HRTICK Peter Zijlstra
@ 2026-01-21 22:24 ` Phil Auld
2026-01-22 11:40 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Phil Auld @ 2026-01-21 22:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Hi Peter,
On Wed, Jan 21, 2026 at 05:20:16PM +0100 Peter Zijlstra wrote:
> ... for generic entry architectures. This decouples preemption from
> CONFIG_HZ, leaving only the periodic load-balancer and various
> accounting things relying on the tick.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/features.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
> */
> SCHED_FEAT(WAKEUP_PREEMPTION, true)
>
> +#ifdef TIF_HRTIMER_REARM
> +SCHED_FEAT(HRTICK, true)
> +SCHED_FEAT(HRTICK_DL, true)
> +#else
> SCHED_FEAT(HRTICK, false)
> SCHED_FEAT(HRTICK_DL, false)
> +#endif
I may be missing something, but the title of this patch
and the above code do not seem to match.
Cheers,
Phil
>
> /*
> * Decrement CPU capacity based on time not spent running tasks
>
>
>
--
* Re: [PATCH v2 6/6] sched: Default enable HRTICK
2026-01-21 22:24 ` Phil Auld
@ 2026-01-22 11:40 ` Peter Zijlstra
2026-01-22 12:31 ` Phil Auld
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-01-22 11:40 UTC (permalink / raw)
To: Phil Auld
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Wed, Jan 21, 2026 at 05:24:44PM -0500, Phil Auld wrote:
> Hi Peter,
>
> On Wed, Jan 21, 2026 at 05:20:16PM +0100 Peter Zijlstra wrote:
> > ... for generic entry architectures. This decouples preemption from
> > CONFIG_HZ, leaving only the periodic load-balancer and various
> > accounting things relying on the tick.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > kernel/sched/features.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
> > */
> > SCHED_FEAT(WAKEUP_PREEMPTION, true)
> >
> > +#ifdef TIF_HRTIMER_REARM
Arguably this should be CONFIG_GENERIC_ENTRY I suppose
> > +SCHED_FEAT(HRTICK, true)
> > +SCHED_FEAT(HRTICK_DL, true)
> > +#else
> > SCHED_FEAT(HRTICK, false)
> > SCHED_FEAT(HRTICK_DL, false)
> > +#endif
>
> I maybe be missing something. But the title of this patch
> and the above code do not seem to match.
You mean it only default enables it for a subset of architectures?
* Re: [PATCH v2 6/6] sched: Default enable HRTICK
2026-01-22 11:40 ` Peter Zijlstra
@ 2026-01-22 12:31 ` Phil Auld
0 siblings, 0 replies; 25+ messages in thread
From: Phil Auld @ 2026-01-22 12:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Thu, Jan 22, 2026 at 12:40:54PM +0100 Peter Zijlstra wrote:
> On Wed, Jan 21, 2026 at 05:24:44PM -0500, Phil Auld wrote:
> > Hi Peter,
> >
> > On Wed, Jan 21, 2026 at 05:20:16PM +0100 Peter Zijlstra wrote:
> > > ... for generic entry architectures. This decouples preemption from
> > > CONFIG_HZ, leaving only the periodic load-balancer and various
> > > accounting things relying on the tick.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > > kernel/sched/features.h | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > --- a/kernel/sched/features.h
> > > +++ b/kernel/sched/features.h
> > > @@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
> > > */
> > > SCHED_FEAT(WAKEUP_PREEMPTION, true)
> > >
> > > +#ifdef TIF_HRTIMER_REARM
>
> Arguably this should be CONFIG_GENERIC_ENTRY I suppose
>
> > > +SCHED_FEAT(HRTICK, true)
> > > +SCHED_FEAT(HRTICK_DL, true)
> > > +#else
> > > SCHED_FEAT(HRTICK, false)
> > > SCHED_FEAT(HRTICK_DL, false)
> > > +#endif
> >
> > I maybe be missing something. But the title of this patch
> > and the above code do not seem to match.
>
> You mean it only default enables it for a subset of architectures?
>
Nope, I mean I can't read... nevermind.
Cheers,
Phil
--