public inbox for linux-kernel@vger.kernel.org
* [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement
@ 2026-02-24 16:35 Thomas Gleixner
  2026-02-24 16:35 ` [patch 01/48] sched/eevdf: Fix HRTICK duration Thomas Gleixner
                   ` (49 more replies)
  0 siblings, 50 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Peter recently posted a series tweaking the hrtimer subsystem to reduce the
overhead of the scheduler hrtick timer so it can be enabled by default:

   https://lore.kernel.org/20260121162010.647043073@infradead.org

That turned out to be incomplete and led to a deeper investigation of the
related bits and pieces.

The problem is that the hrtick deadline changes on every context switch and
is also modified by wakeups and balancing. On a hackbench run this results
in about 2500 clockevent reprogramming cycles per second, which is
especially hurtful in a VM as accessing the clockevent device implies a
VM-Exit.

The following series addresses various aspects of the overall related
problem space:

    1) Scheduler

       Aside from some trivial fixes, the handling of the hrtick timer in
       the scheduler is suboptimal:

        - schedule() modifies the hrtick when picking the next task

	- schedule() can modify the hrtick when the balance callback runs
          before releasing rq::lock

	- the expiry time is unfiltered, so really tiny changes of the
          expiry time, which are functionally completely irrelevant,
          still cause reprogramming

       Solve this by deferring the hrtick update to the end of schedule()
       and filtering out tiny changes.
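The filtering part of this scheme can be sketched in plain userspace C. This is a minimal illustration, not code from the series; the function names, the cached-expiry bookkeeping and the 10us threshold are all made up for the example.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int64_t s64;

#define HRTICK_SLACK_NS	10000LL	/* hypothetical filter window: 10us */

static s64 cached_expiry;	/* last expiry written to the device */
static int reprogram_count;	/* stands in for expensive clockevent writes */

/* Returns 1 when the clock event device would actually be reprogrammed */
static int hrtick_update_filtered(s64 new_expiry)
{
	/* Tiny changes are functionally irrelevant: keep the old timer */
	if (llabs(new_expiry - cached_expiry) < HRTICK_SLACK_NS)
		return 0;

	cached_expiry = new_expiry;
	reprogram_count++;	/* would be a VM-Exit in a guest */
	return 1;
}
```

With such a filter, the ~2500 reprogramming cycles per second observed in the hackbench run collapse to only those updates that actually move the expiry by a meaningful amount.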


    2) Clocksource, clockevents, timekeeping

        - Reading the current clocksource involves an indirect call, which
          is expensive especially for clocksources where the actual read is
          a single instruction like the TSC read on x86.

	  This could be solved with a static call, but the architecture
	  coverage for static calls is meager and that still has the
	  overhead of a function call and in the worst case a return
	  speculation mitigation.

	  As x86 and other architectures like S390 have one preferred
	  clocksource which is normally used on all contemporary systems,
	  this begs for a fully inlined solution.

	  This is achieved by a config option which tells the core code to
	  use an architecture provided inline, guarded by a static branch.

	  If the branch is disabled, the indirect function call is used as
	  before. If enabled, the inlined read is used.

	  The branch is disabled by default and only enabled after a
	  clocksource is installed which has the INLINE feature flag
	  set. When the clocksource is replaced the branch is disabled
	  before the clocksource change happens.
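The dispatch scheme can be modeled in userspace C, with the static branch stood in by a plain boolean and the "single instruction" architecture read by a trivial function. All names here are illustrative, not the proposed kernel API.

```c
#include <stdint.h>

static _Bool inline_key_enabled;	/* stand-in for the static branch */
static int inline_reads, indirect_reads;

/* Would be a single instruction on x86, e.g. a TSC read */
static inline uint64_t arch_clocksource_read_inline(void)
{
	inline_reads++;
	return 1000;
}

/* Stand-in for the generic cs->read(cs) indirect call */
static uint64_t generic_read(void)
{
	indirect_reads++;
	return 1000;
}

static uint64_t (*cs_read)(void) = generic_read;

static uint64_t clocksource_read(void)
{
	/*
	 * Disabled by default; enabled once a clocksource with the
	 * INLINE feature flag is installed, and disabled again before
	 * a clocksource change happens.
	 */
	if (inline_key_enabled)
		return arch_clocksource_read_inline();	/* fully inlined */
	return cs_read();	/* indirect call as before */
}
```

The point of the real static branch is that the disabled path costs a single patched NOP rather than a runtime conditional, so the common inlined case pays neither the indirect call nor a return speculation mitigation.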


        - Programming clock events is based on calculating a relative
          expiry time, converting it to the clock cycles corresponding to
          the clockevent device frequency and invoking the set_next_event()
          callback of the clockevent device.

	  That works perfectly fine as most hardware timers are count down
	  implementations which require a relative time for programming.

	  But clockevent devices which are coupled to the clocksource and
	  provide a less-than-or-equal comparator suffer from this scheme:
	  the core calculates the relative expiry time based on a clock
	  read, and the set_next_event() callback has to read the same
	  clock again to convert it back to an absolute time which can be
	  programmed into the comparator.

	  The other issue is that the conversion factor of the clockevent
	  device is calculated at boot time and does not take the NTP/PTP
	  adjustments of the clocksource frequency into account. Depending
	  on the direction of the adjustment this can cause timers to fire
	  early or late. Early is the more problematic case: since timers
	  must not expire early, the timer interrupt has to reprogram the
	  device with a very short delta.

	  This can be optimized by introducing a 'coupled' mode for the
	  clocksource and the clockevent device.

	    A) If the clocksource indicates support for 'coupled' mode, the
	       timekeeping core calculates a reverse conversion factor for
	       the clocksource to nanoseconds conversion. This takes the
	       NTP adjustments into account and keeps the two conversions
	       in sync.

	    B) The timekeeping core provides a function to convert an
	       absolute CLOCK_MONOTONIC expiry time into an absolute time in
	       clocksource cycles which can be programmed directly into the
	       comparator without reading the clocksource at all.

	       This is possible because timekeeping keeps a time pair of
	       the base cycle count and the corresponding CLOCK_MONOTONIC base
	       time at the last update of the timekeeper.

	       So the absolute cycle time can be calculated by calculating
	       the relative time to the CLOCK_MONOTONIC base time,
	       converting the delta into cycles with the help of #A and
	       adding the base cycle count. Pure math, no hardware access.

	    C) The clockevent reprogramming code invokes this conversion
	       function when the clockevent device indicates 'coupled'
	       mode.  The function returns false when the corresponding
	       clocksource is not the current system clocksource (based on
	       a clocksource ID check) and true if the clocksource matches
	       and the conversion is successful.

	       If false, the regular relative set_next_event() mechanism is
	       used; otherwise a new set_next_coupled() callback is invoked,
	       which takes the calculated absolute expiry time as argument.

	       Similar to the clocksource, this new callback can optionally
	       be inlined.
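The pure-math conversion of step #B can be sketched as follows. The structure layout, field names and fixed-point factor below are illustrative assumptions, not the actual timekeeper internals.

```c
#include <stdint.h>

/*
 * Sketch of the coupled-mode conversion: the timekeeper caches a
 * (base_cycles, base_mono) pair at its last update plus an NTP
 * adjusted reverse (ns -> cycles) conversion factor.
 */
struct tk_sketch {
	uint64_t base_cycles;	/* cycle count at last timekeeper update */
	uint64_t base_mono;	/* CLOCK_MONOTONIC ns at that update */
	uint64_t rev_mult;	/* ns -> cycles factor, fixed point */
	uint32_t rev_shift;
};

/* Absolute CLOCK_MONOTONIC expiry -> absolute clocksource cycles */
static uint64_t mono_to_cycles(const struct tk_sketch *tk, uint64_t expiry_ns)
{
	uint64_t delta_ns = expiry_ns - tk->base_mono;

	/*
	 * Relative ns converted to cycles, anchored at the base cycle
	 * count. Pure math, no hardware access.
	 */
	return tk->base_cycles + ((delta_ns * tk->rev_mult) >> tk->rev_shift);
}
```

The result can be written straight into a less-than-or-equal comparator, which is what removes the second clock read from the reprogramming path.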


    3) hrtimers

       It turned out that the hrtimer code needed a long overdue spring
       cleaning independent of the problem at hand. That was conducted
       before tackling the actual performance issues:

       - Timer locality

	 The handling of timer locality is suboptimal and often results in
	 pointless invocations of switch_hrtimer_base() which end up
	 keeping the CPU base unchanged.

	 Aside from the pointless overhead, this prevents further
	 optimizations for the common local case.

	 Address this by improving the decision logic for keeping the clock
	 base local and splitting out the (re)arm handling into a unified
	 operation.


       - Evaluation of the clock base expiries

	 The clock bases (MONOTONIC, REALTIME, BOOT, TAI) cache the first
	 expiring timer, but not the corresponding expiry time, which means
	 a re-evaluation of the clock bases for the next expiring timer on
	 the CPU requires touching up to four extra cache lines.

	 Trivial to solve by caching the earliest expiry time in the clock
	 base itself.


       - Reprogramming of the clock event device

	 The hrtimer interrupt already defers reprogramming until the
       	 interrupt handler completes, but in case of the hrtick timer
       	 that's not sufficient because the hrtick timer callback only sets
       	 the NEED_RESCHED flag but has no information about the next hrtick
       	 timer expiry time, which can only be determined in the scheduler.

	 Expand the deferred reprogramming so it can ideally be handled in
	 the subsequent schedule() after the new hrtick value has been
	 established. If there is no schedule(), if soft interrupts have to
	 be processed on return from interrupt, or if a nested interrupt
	 hits before reaching schedule(), the deferred reprogramming is
	 handled in those contexts.


       - Modification of queued timers

	 If a timer is already queued, modifying the expiry time requires
	 dequeueing it from the RB tree and requeueing it after the expiry
	 value has been updated. It turned out that hrtick timer
	 modifications very often end up at the same spot in the RB tree as
	 before, which means the dequeue/enqueue cycle along with the
	 related rebalancing could have been avoided. The timer wheel
	 timers have a similar mechanism, checking upfront whether the
	 resulting expiry time keeps them in the same hash bucket.

	 Using rb_prev() and rb_next() to evaluate whether the modification
	 keeps the timer in the same spot was tried first, but that turned
	 out to be really inefficient.

	 Solve this by providing a RB tree variant which extends the node
	 with links to the previous and next nodes. These links are
	 established when the node is linked into the tree and adjusted
	 when a node is removed. They allow a quick peek at the previous
	 and next expiry times, and if the new expiry stays within those
	 boundaries the whole RB tree operation can be avoided.

	 This also simplifies the caching and update of the leftmost node,
	 as on removal the rb_next() walk can be completely avoided. It
	 could obviously provide a cached rightmost pointer too, but there
	 is no use case for that (yet).

	 On a hackbench run this results in about 35% of the updates being
	 handled that way, which cuts the execution time of
	 hrtimer_start_range_ns() down to 50ns on a 2GHz machine.
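The quick-peek check can be illustrated with a stripped-down node carrying the neighbour links. The structure and function names are made up for the example; the real variant lives in the RB tree and timerqueue code.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of the neighbour-link optimization: each queued node caches
 * pointers to its in-order predecessor and successor, maintained when
 * nodes are inserted or removed.
 */
struct tq_node {
	uint64_t expires;
	struct tq_node *prev, *next;
};

/* Returns 1 when the new expiry can be applied without requeueing */
static int timer_update_in_place(struct tq_node *node, uint64_t new_expires)
{
	if (node->prev && new_expires < node->prev->expires)
		return 0;	/* would have to move left in the tree */
	if (node->next && new_expires > node->next->expires)
		return 0;	/* would have to move right in the tree */

	/* Order unchanged: the dequeue/enqueue and rebalancing is avoided */
	node->expires = new_expires;
	return 1;
}
```

Two pointer comparisons replace the full dequeue/enqueue cycle whenever the modification keeps the timer between its neighbours, which the hackbench run showed happens for about 35% of the updates.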


       - Cancellation of queued timers

       	 Cancelling a timer or moving its expiry time past the programmed
       	 time can result in reprogramming the clock event device.
       	 Frequent modifications of a queued timer thus result in
       	 substantial overhead, especially in VMs.

	 Provide an option for hrtimers to tell the core to handle
	 reprogramming lazily in those cases, which trades frequent
	 reprogramming against an occasional pointless hrtimer interrupt.

	 For the hrtick timer this turned out to be a reasonable
	 tradeoff. It's especially valuable when transitioning to idle,
	 where the timer has to be cancelled but then the NOHZ idle code
	 will reprogram it in case of a long idle sleep anyway. But also in
	 high frequency scheduling scenarios this turned out to be
	 beneficial.
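The lazy tradeoff can be sketched with a few counters. All names are illustrative; the point is only that a lazy cancel saves the device write at the cost of a possibly spurious interrupt later.

```c
#include <stdint.h>

static uint64_t hw_programmed;	/* pending clockevent expiry, 0 = none */
static int hw_writes;		/* stand-in for expensive device access */
static int spurious;		/* interrupts that found nothing to expire */

static uint64_t queued_expiry;	/* first queued timer, 0 = none */

static void timer_start(uint64_t expiry)
{
	queued_expiry = expiry;
	hw_programmed = expiry;
	hw_writes++;		/* a VM-Exit when running in a guest */
}

static void timer_cancel_lazy(void)
{
	/* Lazy mode: drop the timer but leave the device armed */
	queued_expiry = 0;
}

static void timer_interrupt(void)
{
	hw_programmed = 0;
	if (!queued_expiry)
		spurious++;	/* the occasional pointless interrupt */
}
```

In the NOHZ idle transition the spurious interrupt often never happens, because the idle code reprograms the device for the next long sleep anyway.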


With all the above modifications in place, enabling hrtick no longer
results in regressions compared to the hrtick disabled mode.

The reprogramming frequency of the clockevent device dropped from
~2500/sec to ~100/sec for a hackbench run, with a spurious hrtimer
interrupt ratio of about 25%.

What's interesting is the astonishing improvement of a hackbench run with
the following command line parameters: '-l$LOOPS -p -s8'. That uses pipes
with a message size of 8 bytes. On a 112 CPU SKL machine this results in:

       	   NO HRTICK[_DL]		HRTICK[_DL]
runtime:   0.840s			0.481s		~-42%

With other message sizes up to 256, HRTICK still results in improvements,
but not of that magnitude. The cause of that has not been investigated yet.

While quite a few parts of the series are independent enhancements, I've
decided to keep them together in one big pile for now, as all of the
components are required to actually achieve the overall goal.

The patches have been already structured in a way that they can be
distributed to different subsystem branches without causing major cross
subsystem contamination or merge conflict headaches.

The series applies on v7.0-rc1 and is also available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick

Thanks,

	tglx
---
 arch/x86/Kconfig                      |    2 
 arch/x86/include/asm/clock_inlined.h  |   22 
 arch/x86/kernel/apic/apic.c           |   41 -
 arch/x86/kernel/tsc.c                 |    4 
 include/asm-generic/thread_info_tif.h |    5 
 include/linux/clockchips.h            |    8 
 include/linux/clocksource.h           |    3 
 include/linux/hrtimer.h               |   59 -
 include/linux/hrtimer_defs.h          |   79 +-
 include/linux/hrtimer_rearm.h         |   83 ++
 include/linux/hrtimer_types.h         |   19 
 include/linux/irq-entry-common.h      |   25 
 include/linux/rbtree.h                |   81 ++
 include/linux/rbtree_types.h          |   16 
 include/linux/rseq_entry.h            |   14 
 include/linux/timekeeper_internal.h   |    8 
 include/linux/timerqueue.h            |   56 +
 include/linux/timerqueue_types.h      |   15 
 include/trace/events/timer.h          |   35 -
 kernel/entry/common.c                 |    4 
 kernel/sched/core.c                   |   89 ++
 kernel/sched/deadline.c               |    2 
 kernel/sched/fair.c                   |   55 -
 kernel/sched/features.h               |    5 
 kernel/sched/sched.h                  |   41 -
 kernel/softirq.c                      |   15 
 kernel/time/Kconfig                   |   16 
 kernel/time/clockevents.c             |   48 +
 kernel/time/hrtimer.c                 | 1116 +++++++++++++++++++---------------
 kernel/time/tick-broadcast-hrtimer.c  |    1 
 kernel/time/tick-sched.c              |   27 
 kernel/time/timekeeping.c             |  184 +++++
 kernel/time/timekeeping.h             |    2 
 kernel/time/timer_list.c              |   12 
 lib/rbtree.c                          |   17 
 lib/timerqueue.c                      |   14 
 36 files changed, 1497 insertions(+), 728 deletions(-)



^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 01/48] sched/eevdf: Fix HRTICK duration
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:35 ` [patch 02/48] sched/fair: Simplify hrtick_update() Thomas Gleixner
                   ` (48 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The nominal duration for an EEVDF task to run is until its deadline, at
which point the deadline is moved ahead and a new task selection is done.

Try and predict the time 'lost' to higher scheduling classes. Since this is
an estimate, the timer can be either early or late. In case it is early,
task_tick_fair() will take the !need_resched() path and restart the timer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>

---
 kernel/sched/fair.c |   43 ++++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6735,21 +6735,37 @@ static inline void sched_fair_update_sto
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	unsigned long scale = 1024;
+	unsigned long util = 0;
+	u64 vdelta;
+	u64 delta;
 
 	WARN_ON_ONCE(task_rq(p) != rq);
 
-	if (rq->cfs.h_nr_queued > 1) {
-		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-		u64 slice = se->slice;
-		s64 delta = slice - ran;
-
-		if (delta < 0) {
-			if (task_current_donor(rq, p))
-				resched_curr(rq);
-			return;
-		}
-		hrtick_start(rq, delta);
+	if (rq->cfs.h_nr_queued <= 1)
+		return;
+
+	/*
+	 * Compute time until virtual deadline
+	 */
+	vdelta = se->deadline - se->vruntime;
+	if ((s64)vdelta < 0) {
+		if (task_current_donor(rq, p))
+			resched_curr(rq);
+		return;
 	}
+	delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+	/*
+	 * Correct for instantaneous load of other classes.
+	 */
+	util += cpu_util_irq(rq);
+	if (util && util < 1024) {
+		scale *= 1024;
+		scale /= (1024 - util);
+	}
+
+	hrtick_start(rq, (scale * delta) / 1024);
 }
 
 /*
@@ -13365,11 +13381,8 @@ static void task_tick_fair(struct rq *rq
 		entity_tick(cfs_rq, se, queued);
 	}
 
-	if (queued) {
-		if (!need_resched())
-			hrtick_start_fair(rq, curr);
+	if (queued)
 		return;
-	}
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);



* [patch 02/48] sched/fair: Simplify hrtick_update()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
  2026-02-24 16:35 ` [patch 01/48] sched/eevdf: Fix HRTICK duration Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra (Intel)
  2026-02-24 16:35 ` [patch 03/48] sched/fair: Make hrtick resched hard Thomas Gleixner
                   ` (47 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra (Intel) <peterz@infradead.org>

hrtick_update() was needed when the slice depended on nr_running; all that
code is gone. All that remains is starting the hrtick when nr_running
becomes more than 1.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---

---
 kernel/sched/fair.c  |   12 ++++--------
 kernel/sched/sched.h |    4 ++++
 2 files changed, 8 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6769,9 +6769,7 @@ static void hrtick_start_fair(struct rq
 }
 
 /*
- * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * Called on enqueue to start the hrtick when h_nr_queued becomes more than 1.
  */
 static void hrtick_update(struct rq *rq)
 {
@@ -6780,6 +6778,9 @@ static void hrtick_update(struct rq *rq)
 	if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
 		return;
 
+	if (hrtick_active(rq))
+		return;
+
 	hrtick_start_fair(rq, donor);
 }
 #else /* !CONFIG_SCHED_HRTICK: */
@@ -7102,9 +7103,6 @@ static int dequeue_entities(struct rq *r
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
 
-		/* Fix-up what dequeue_task_fair() skipped */
-		hrtick_update(rq);
-
 		/*
 		 * Fix-up what block_task() skipped.
 		 *
@@ -7138,8 +7136,6 @@ static bool dequeue_task_fair(struct rq
 	/*
 	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
 	 */
-
-	hrtick_update(rq);
 	return true;
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3041,6 +3041,10 @@ static inline int hrtick_enabled_dl(stru
 }
 
 extern void hrtick_start(struct rq *rq, u64 delay);
+static inline bool hrtick_active(struct rq *rq)
+{
+	return hrtimer_active(&rq->hrtick_timer);
+}
 
 #else /* !CONFIG_SCHED_HRTICK: */
 



* [patch 03/48] sched/fair: Make hrtick resched hard
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
  2026-02-24 16:35 ` [patch 01/48] sched/eevdf: Fix HRTICK duration Thomas Gleixner
  2026-02-24 16:35 ` [patch 02/48] sched/fair: Simplify hrtick_update() Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra (Intel)
  2026-02-24 16:35 ` [patch 04/48] sched: Avoid ktime_get() indirection Thomas Gleixner
                   ` (46 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra (Intel) <peterz@infradead.org>

Since the tick causes hard preemption, the hrtick should too.

Letting the hrtick do lazy preemption completely defeats the purpose, since
it will then still be delayed until the next regular tick and be dependent
on CONFIG_HZ.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5530,7 +5530,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr_lazy(rq_of(cfs_rq));
+		resched_curr(rq_of(cfs_rq));
 		return;
 	}
 #endif



* [patch 04/48] sched: Avoid ktime_get() indirection
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (2 preceding siblings ...)
  2026-02-24 16:35 ` [patch 03/48] sched/fair: Make hrtick resched hard Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:35 ` [patch 05/48] hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
                   ` (45 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The clock of the hrtick and deadline timers is known to be CLOCK_MONOTONIC.
No point in looking it up via hrtimer_cb_get_time().

Just use ktime_get() directly.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/sched/core.c     |    3 +--
 kernel/sched/deadline.c |    2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -925,7 +925,6 @@ static void __hrtick_start(void *arg)
  */
 void hrtick_start(struct rq *rq, u64 delay)
 {
-	struct hrtimer *timer = &rq->hrtick_timer;
 	s64 delta;
 
 	/*
@@ -933,7 +932,7 @@ void hrtick_start(struct rq *rq, u64 del
 	 * doesn't make sense and can cause timer DoS.
 	 */
 	delta = max_t(s64, delay, 10000LL);
-	rq->hrtick_time = ktime_add_ns(hrtimer_cb_get_time(timer), delta);
+	rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
 
 	if (rq == this_rq())
 		__hrtick_restart(rq);
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,7 +1097,7 @@ static int start_dl_timer(struct sched_d
 		act = ns_to_ktime(dl_next_period(dl_se));
 	}
 
-	now = hrtimer_cb_get_time(timer);
+	now = ktime_get();
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
 



* [patch 05/48] hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (3 preceding siblings ...)
  2026-02-24 16:35 ` [patch 04/48] sched: Avoid ktime_get() indirection Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:35 ` [patch 06/48] hrtimer: Provide a static branch based hrtimer_hres_enabled() Thomas Gleixner
                   ` (44 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Much like hrtimer_reprogram(), skip programming if the cpu_base is running
the hrtimer interrupt.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |    8 ++++++++
 1 file changed, 8 insertions(+)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1269,6 +1269,14 @@ static int __hrtimer_start_range_ns(stru
 	}
 
 	first = enqueue_hrtimer(timer, new_base, mode);
+
+	/*
+	 * If the hrtimer interrupt is running, then it will reevaluate the
+	 * clock bases and reprogram the clock event device.
+	 */
+	if (new_base->cpu_base->in_hrtirq)
+		return false;
+
 	if (!force_local) {
 		/*
 		 * If the current CPU base is online, then the timer is



* [patch 06/48] hrtimer: Provide a static branch based hrtimer_hres_enabled()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (4 preceding siblings ...)
  2026-02-24 16:35 ` [patch 05/48] hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:35 ` [patch 07/48] sched: Use hrtimer_highres_enabled() Thomas Gleixner
                   ` (43 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The scheduler evaluates this via hrtimer_is_hres_active() every time it has
to update HRTICK. That requires following three pointers, which is expensive.

Provide a static branch based mechanism to avoid that.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer.h |   13 +++++++++----
 kernel/time/hrtimer.c   |   28 +++++++++++++++++++++++++---
 2 files changed, 34 insertions(+), 7 deletions(-)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -153,17 +153,22 @@ static inline int hrtimer_is_hres_active
 }
 
 #ifdef CONFIG_HIGH_RES_TIMERS
+extern unsigned int hrtimer_resolution;
 struct clock_event_device;
 
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
-extern unsigned int hrtimer_resolution;
+extern struct static_key_false hrtimer_highres_enabled_key;
 
-#else
+static inline bool hrtimer_highres_enabled(void)
+{
+	return static_branch_likely(&hrtimer_highres_enabled_key);
+}
 
+#else  /* CONFIG_HIGH_RES_TIMERS */
 #define hrtimer_resolution	(unsigned int)LOW_RES_NSEC
-
-#endif
+static inline bool hrtimer_highres_enabled(void) { return false; }
+#endif  /* !CONFIG_HIGH_RES_TIMERS */
 
 static inline ktime_t
 __hrtimer_expires_remaining_adjusted(const struct hrtimer *timer, ktime_t now)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -126,6 +126,25 @@ static inline bool hrtimer_base_is_onlin
 		return likely(base->online);
 }
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+DEFINE_STATIC_KEY_FALSE(hrtimer_highres_enabled_key);
+
+static void hrtimer_hres_workfn(struct work_struct *work)
+{
+	static_branch_enable(&hrtimer_highres_enabled_key);
+}
+
+static DECLARE_WORK(hrtimer_hres_work, hrtimer_hres_workfn);
+
+static inline void hrtimer_schedule_hres_work(void)
+{
+	if (!hrtimer_highres_enabled())
+		schedule_work(&hrtimer_hres_work);
+}
+#else
+static inline void hrtimer_schedule_hres_work(void) { }
+#endif
+
 /*
  * Functions and macros which are different for UP/SMP systems are kept in a
  * single place
@@ -649,7 +668,9 @@ static inline ktime_t hrtimer_update_bas
 }
 
 /*
- * Is the high resolution mode active ?
+ * Is the high resolution mode active in the CPU base. This cannot use the
+ * static key as the CPUs are switched to high resolution mode
+ * asynchronously.
  */
 static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
 {
@@ -750,6 +771,7 @@ static void hrtimer_switch_to_hres(void)
 	tick_setup_sched_timer(true);
 	/* "Retrigger" the interrupt to get things going */
 	retrigger_next_event(NULL);
+	hrtimer_schedule_hres_work();
 }
 
 #else
@@ -947,11 +969,10 @@ static bool update_needs_ipi(struct hrti
  */
 void clock_was_set(unsigned int bases)
 {
-	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
 	cpumask_var_t mask;
 	int cpu;
 
-	if (!hrtimer_hres_active(cpu_base) && !tick_nohz_is_active())
+	if (!hrtimer_highres_enabled() && !tick_nohz_is_active())
 		goto out_timerfd;
 
 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
@@ -962,6 +983,7 @@ void clock_was_set(unsigned int bases)
 	/* Avoid interrupting CPUs if possible */
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
+		struct hrtimer_cpu_base *cpu_base;
 		unsigned long flags;
 
 		cpu_base = &per_cpu(hrtimer_bases, cpu);



* [patch 07/48] sched: Use hrtimer_highres_enabled()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (5 preceding siblings ...)
  2026-02-24 16:35 ` [patch 06/48] hrtimer: Provide a static branch based hrtimer_hres_enabled() Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:35 ` [patch 08/48] sched: Optimize hrtimer handling Thomas Gleixner
                   ` (42 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Use the static branch based variant and thereby avoid following three
pointers.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer.h |    6 ------
 kernel/sched/sched.h    |   37 +++++++++----------------------------
 2 files changed, 9 insertions(+), 34 deletions(-)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -146,12 +146,6 @@ static inline ktime_t hrtimer_expires_re
 	return ktime_sub(timer->node.expires, hrtimer_cb_get_time(timer));
 }
 
-static inline int hrtimer_is_hres_active(struct hrtimer *timer)
-{
-	return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
-		timer->base->cpu_base->hres_active : 0;
-}
-
 #ifdef CONFIG_HIGH_RES_TIMERS
 extern unsigned int hrtimer_resolution;
 struct clock_event_device;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3019,25 +3019,19 @@ extern unsigned int sysctl_numa_balancin
  *  - enabled by features
  *  - hrtimer is actually high res
  */
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_enabled(struct rq *rq)
 {
-	if (!cpu_active(cpu_of(rq)))
-		return 0;
-	return hrtimer_is_hres_active(&rq->hrtick_timer);
+	return cpu_active(cpu_of(rq)) && hrtimer_highres_enabled();
 }
 
-static inline int hrtick_enabled_fair(struct rq *rq)
+static inline bool hrtick_enabled_fair(struct rq *rq)
 {
-	if (!sched_feat(HRTICK))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK) && hrtick_enabled(rq);
 }
 
-static inline int hrtick_enabled_dl(struct rq *rq)
+static inline bool hrtick_enabled_dl(struct rq *rq)
 {
-	if (!sched_feat(HRTICK_DL))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK_DL) && hrtick_enabled(rq);
 }
 
 extern void hrtick_start(struct rq *rq, u64 delay);
@@ -3047,22 +3041,9 @@ static inline bool hrtick_active(struct
 }
 
 #else /* !CONFIG_SCHED_HRTICK: */
-
-static inline int hrtick_enabled_fair(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled_dl(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled(struct rq *rq)
-{
-	return 0;
-}
-
+static inline bool hrtick_enabled_fair(struct rq *rq) { return false; }
+static inline bool hrtick_enabled_dl(struct rq *rq) { return false; }
+static inline bool hrtick_enabled(struct rq *rq) { return false; }
 #endif /* !CONFIG_SCHED_HRTICK */
 
 #ifndef arch_scale_freq_tick


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 08/48] sched: Optimize hrtimer handling
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (6 preceding siblings ...)
  2026-02-24 16:35 ` [patch 07/48] sched: Use hrtimer_highres_enabled() Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:35 ` [patch 09/48] sched/hrtick: Avoid tiny hrtick rearms Thomas Gleixner
                   ` (41 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

schedule() updates the hrtick timer in two places:

  1) When the next task is picked

  2) When the balance callbacks are invoked before rq::lock is released

Each of them can result in a new first expiring timer and cause a reprogram
of the clock event device.

Solve this by deferring the rearm to the end of schedule(): a flag set on
entry tells hrtick_start() to only cache the runtime constraint in
rq::hrtick_delay without touching the timer itself.

Right before releasing rq::lock, evaluate the flags and either rearm or
cancel the hrtick timer.
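The enter/start/exit dance can be modeled in plain C. This is a userspace
sketch, not the kernel code: struct rq_model and the timer_programs counter
are inventions to make the deferral observable, while the flag values mirror
the patch.

```c
#include <assert.h>
#include <stdint.h>

enum {
	HRTICK_SCHED_NONE	= 0,
	HRTICK_SCHED_DEFER	= 1 << 1,
	HRTICK_SCHED_START	= 1 << 2,
};

/* Hypothetical stand-in for struct rq; timer_programs counts clockevent writes */
struct rq_model {
	unsigned int	hrtick_sched;
	int64_t		hrtick_delay;
	int		timer_programs;
};

static void hrtick_start(struct rq_model *rq, int64_t delta)
{
	/* Inside schedule(): only cache the constraint, don't touch the timer */
	if (rq->hrtick_sched) {
		rq->hrtick_sched |= HRTICK_SCHED_START;
		rq->hrtick_delay = delta;
		return;
	}
	rq->timer_programs++;	/* outside schedule(): arm immediately */
}

static void hrtick_schedule_enter(struct rq_model *rq)
{
	rq->hrtick_sched = HRTICK_SCHED_DEFER;
}

static void hrtick_schedule_exit(struct rq_model *rq)
{
	if (rq->hrtick_sched & HRTICK_SCHED_START)
		rq->timer_programs++;	/* single rearm right before unlock */
	rq->hrtick_sched = HRTICK_SCHED_NONE;
}
```

No matter how often hrtick_start() is called between enter and exit (pick,
wakeup, balancing), the model ends up with exactly one timer rearm.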

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/sched/core.c  |   57 ++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |    2 +
 2 files changed, 50 insertions(+), 9 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -872,6 +872,12 @@ void update_rq_clock(struct rq *rq)
  * Use HR-timers to deliver accurate preemption points.
  */
 
+enum {
+	HRTICK_SCHED_NONE		= 0,
+	HRTICK_SCHED_DEFER		= BIT(1),
+	HRTICK_SCHED_START		= BIT(2),
+};
+
 static void hrtick_clear(struct rq *rq)
 {
 	if (hrtimer_active(&rq->hrtick_timer))
@@ -932,6 +938,17 @@ void hrtick_start(struct rq *rq, u64 del
 	 * doesn't make sense and can cause timer DoS.
 	 */
 	delta = max_t(s64, delay, 10000LL);
+
+	/*
+	 * If this is in the middle of schedule() only note the delay
+	 * and let hrtick_schedule_exit() deal with it.
+	 */
+	if (rq->hrtick_sched) {
+		rq->hrtick_sched |= HRTICK_SCHED_START;
+		rq->hrtick_delay = delta;
+		return;
+	}
+
 	rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
 
 	if (rq == this_rq())
@@ -940,19 +957,40 @@ void hrtick_start(struct rq *rq, u64 del
 		smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
 }
 
-static void hrtick_rq_init(struct rq *rq)
+static inline void hrtick_schedule_enter(struct rq *rq)
 {
-	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
-	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+	rq->hrtick_sched = HRTICK_SCHED_DEFER;
 }
-#else /* !CONFIG_SCHED_HRTICK: */
-static inline void hrtick_clear(struct rq *rq)
+
+static inline void hrtick_schedule_exit(struct rq *rq)
 {
+	if (rq->hrtick_sched & HRTICK_SCHED_START) {
+		rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
+		__hrtick_restart(rq);
+	} else if (idle_rq(rq)) {
+		/*
+		 * No need for using hrtimer_is_active(). The timer is CPU local
+		 * and interrupts are disabled, so the callback cannot be
+		 * running and the queued state is valid.
+		 */
+		if (hrtimer_is_queued(&rq->hrtick_timer))
+			hrtimer_cancel(&rq->hrtick_timer);
+	}
+
+	rq->hrtick_sched = HRTICK_SCHED_NONE;
 }
 
-static inline void hrtick_rq_init(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
 {
+	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
+	rq->hrtick_sched = HRTICK_SCHED_NONE;
+	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
 }
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline void hrtick_clear(struct rq *rq) { }
+static inline void hrtick_rq_init(struct rq *rq) { }
+static inline void hrtick_schedule_enter(struct rq *rq) { }
+static inline void hrtick_schedule_exit(struct rq *rq) { }
 #endif /* !CONFIG_SCHED_HRTICK */
 
 /*
@@ -5028,6 +5066,7 @@ static inline void finish_lock_switch(st
 	 */
 	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq, NULL);
+	hrtick_schedule_exit(rq);
 	raw_spin_rq_unlock_irq(rq);
 }
 
@@ -6781,9 +6820,6 @@ static void __sched notrace __schedule(i
 
 	schedule_debug(prev, preempt);
 
-	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
-		hrtick_clear(rq);
-
 	klp_sched_try_switch(prev);
 
 	local_irq_disable();
@@ -6810,6 +6846,8 @@ static void __sched notrace __schedule(i
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
 
+	hrtick_schedule_enter(rq);
+
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
@@ -6911,6 +6949,7 @@ static void __sched notrace __schedule(i
 
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq, NULL);
+		hrtick_schedule_exit(rq);
 		raw_spin_rq_unlock_irq(rq);
 	}
 	trace_sched_exit_tp(is_switch);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1285,6 +1285,8 @@ struct rq {
 	call_single_data_t	hrtick_csd;
 	struct hrtimer		hrtick_timer;
 	ktime_t			hrtick_time;
+	ktime_t			hrtick_delay;
+	unsigned int		hrtick_sched;
 #endif
 
 #ifdef CONFIG_SCHEDSTATS



* [patch 09/48] sched/hrtick: Avoid tiny hrtick rearms
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (7 preceding siblings ...)
  2026-02-24 16:35 ` [patch 08/48] sched: Optimize hrtimer handling Thomas Gleixner
@ 2026-02-24 16:35 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 10/48] hrtimer: Provide LAZY_REARM mode Thomas Gleixner
                   ` (40 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:35 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Tiny adjustments of the hrtick expiry time, below 5 microseconds, just
cause extra work for no real value. Filter them out when restarting the
hrtick.
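The filter condition in isolation, as a userspace sketch (the 5000 ns
threshold comes straight from the patch; the queued flag stands in for
hrtimer_is_queued()):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Rearm when the timer is not queued (inactive or running the callback),
 * or when the new expiry differs from the armed one by more than 5us.
 */
static bool hrtick_needs_rearm(bool queued, int64_t armed_ns, int64_t new_ns)
{
	return !queued || llabs(new_ns - armed_ns) > 5000;
}
```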

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/sched/core.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -903,12 +903,24 @@ static enum hrtimer_restart hrtick(struc
 	return HRTIMER_NORESTART;
 }
 
-static void __hrtick_restart(struct rq *rq)
+static inline bool hrtick_needs_rearm(struct hrtimer *timer, ktime_t expires)
+{
+	/*
+	 * Queued is false when the timer is not started or currently
+	 * running the callback. In both cases, restart. If queued check
+	 * whether the expiry time actually changes substantially.
+	 */
+	return !hrtimer_is_queued(timer) ||
+		abs(expires - hrtimer_get_expires(timer)) > 5000;
+}
+
+static void hrtick_cond_restart(struct rq *rq)
 {
 	struct hrtimer *timer = &rq->hrtick_timer;
 	ktime_t time = rq->hrtick_time;
 
-	hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
+	if (hrtick_needs_rearm(timer, time))
+		hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
 }
 
 /*
@@ -920,7 +932,7 @@ static void __hrtick_start(void *arg)
 	struct rq_flags rf;
 
 	rq_lock(rq, &rf);
-	__hrtick_restart(rq);
+	hrtick_cond_restart(rq);
 	rq_unlock(rq, &rf);
 }
 
@@ -950,9 +962,11 @@ void hrtick_start(struct rq *rq, u64 del
 	}
 
 	rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
+	if (!hrtick_needs_rearm(&rq->hrtick_timer, rq->hrtick_time))
+		return;
 
 	if (rq == this_rq())
-		__hrtick_restart(rq);
+		hrtimer_start(&rq->hrtick_timer, rq->hrtick_time, HRTIMER_MODE_ABS_PINNED_HARD);
 	else
 		smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
 }
@@ -966,7 +980,7 @@ static inline void hrtick_schedule_exit(
 {
 	if (rq->hrtick_sched & HRTICK_SCHED_START) {
 		rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
-		__hrtick_restart(rq);
+		hrtick_cond_restart(rq);
 	} else if (idle_rq(rq)) {
 		/*
 		 * No need for using hrtimer_is_active(). The timer is CPU local



* [patch 10/48] hrtimer: Provide LAZY_REARM mode
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (8 preceding siblings ...)
  2026-02-24 16:35 ` [patch 09/48] sched/hrtick: Avoid tiny hrtick rearms Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:36 ` [patch 11/48] sched/hrtick: Mark hrtick timer LAZY_REARM Thomas Gleixner
                   ` (39 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The hrtick timer is frequently rearmed before expiry and most of the time
the new expiry is past the armed one. As this happens on every context
switch it becomes expensive with scheduling heavy workloads, especially in
virtual machines, where the "hardware" reprogramming implies a VM exit.

Add a lazy rearm mode flag which skips the reprogramming if:

    1) The timer was the first expiring timer before the rearm

    2) The new expiry time is farther out than the armed time

This avoids a massive amount of reprogramming operations of the hrtick
timer at the price of eventually taking the already armed interrupt for
nothing.
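The skip condition reduces to a single comparison against the CPU base's
first expiry. A userspace sketch of just that decision (expires_next models
cpu_base::expires_next, i.e. the time the hardware is already armed for):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Skip the hardware reprogram when the timer is lazy and an interrupt is
 * already armed at or before the new expiry. Moving the expiry earlier
 * than the armed time still forces a reprogram.
 */
static bool lazy_rearm_skips_reprogram(bool is_lazy, int64_t expires_next,
				       int64_t new_expiry)
{
	return is_lazy && expires_next <= new_expiry;
}
```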

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer.h       |    8 ++++++++
 include/linux/hrtimer_types.h |    3 +++
 kernel/time/hrtimer.c         |   17 ++++++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -31,6 +31,13 @@
  *				  soft irq context
  * HRTIMER_MODE_HARD		- Timer callback function will be executed in
  *				  hard irq context even on PREEMPT_RT.
+ * HRTIMER_MODE_LAZY_REARM	- Avoid reprogramming if the timer was the
+ *				  first expiring timer and is moved into the
+ *				  future. Special mode for the HRTICK timer to
+ *				  avoid extensive reprogramming of the hardware,
+ *				  which is expensive in virtual machines. Risks
+ *				  a pointless expiry, but that's better than
+ *				  reprogramming on every context switch.
  */
 enum hrtimer_mode {
 	HRTIMER_MODE_ABS	= 0x00,
@@ -38,6 +45,7 @@ enum hrtimer_mode {
 	HRTIMER_MODE_PINNED	= 0x02,
 	HRTIMER_MODE_SOFT	= 0x04,
 	HRTIMER_MODE_HARD	= 0x08,
+	HRTIMER_MODE_LAZY_REARM	= 0x10,
 
 	HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
 	HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -33,6 +33,8 @@ enum hrtimer_restart {
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
  * @is_hard:	Set if hrtimer will be expired in hard interrupt context
  *		even on RT.
+ * @is_lazy:	Set if the timer is frequently rearmed to avoid updates
+ *		of the clock event device
  *
  * The hrtimer structure must be initialized by hrtimer_setup()
  */
@@ -45,6 +47,7 @@ struct hrtimer {
 	u8				is_rel;
 	u8				is_soft;
 	u8				is_hard;
+	u8				is_lazy;
 };
 
 #endif /* _LINUX_HRTIMER_TYPES_H */
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1152,7 +1152,7 @@ static void __remove_hrtimer(struct hrti
 	 * an superfluous call to hrtimer_force_reprogram() on the
 	 * remote cpu later on if the same timer gets enqueued again.
 	 */
-	if (reprogram && timer == cpu_base->next_timer)
+	if (reprogram && timer == cpu_base->next_timer && !timer->is_lazy)
 		hrtimer_force_reprogram(cpu_base, 1);
 }
 
@@ -1322,6 +1322,20 @@ static int __hrtimer_start_range_ns(stru
 	}
 
 	/*
+	 * Special case for the HRTICK timer. It is frequently rearmed and most
+	 * of the time moves the expiry into the future. That's expensive in
+	 * virtual machines and it's better to take the pointless already armed
+	 * interrupt than reprogramming the hardware on every context switch.
+	 *
+	 * If the new expiry is before the armed time, then reprogramming is
+	 * required.
+	 */
+	if (timer->is_lazy) {
+		if (new_base->cpu_base->expires_next <= hrtimer_get_expires(timer))
+			return 0;
+	}
+
+	/*
 	 * Timer was forced to stay on the current CPU to avoid
 	 * reprogramming on removal and enqueue. Force reprogram the
 	 * hardware by evaluating the new first expiring timer.
@@ -1675,6 +1689,7 @@ static void __hrtimer_setup(struct hrtim
 	base += hrtimer_clockid_to_base(clock_id);
 	timer->is_soft = softtimer;
 	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
+	timer->is_lazy = !!(mode & HRTIMER_MODE_LAZY_REARM);
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
 



* [patch 11/48] sched/hrtick: Mark hrtick timer LAZY_REARM
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (9 preceding siblings ...)
  2026-02-24 16:36 ` [patch 10/48] hrtimer: Provide LAZY_REARM mode Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:36 ` [patch 12/48] tick/sched: Avoid hrtimer_cancel/start() sequence Thomas Gleixner
                   ` (38 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The hrtick timer is frequently rearmed before expiry and most of the time
the new expiry is past the armed one. As this happens on every context
switch it becomes expensive with scheduling heavy workloads, especially in
virtual machines, where the "hardware" reprogramming implies a VM exit.

hrtimers now provide a lazy rearm mode flag which skips the reprogramming if:

    1) The timer was the first expiring timer before the rearm

    2) The new expiry time is farther out than the armed time

This avoids a massive amount of reprogramming operations of the hrtick
timer at the price of eventually taking the already armed interrupt for
nothing.

Mark the hrtick timer accordingly.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/sched/core.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -998,7 +998,8 @@ static void hrtick_rq_init(struct rq *rq
 {
 	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
 	rq->hrtick_sched = HRTICK_SCHED_NONE;
-	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+		      HRTIMER_MODE_REL_HARD | HRTIMER_MODE_LAZY_REARM);
 }
 #else /* !CONFIG_SCHED_HRTICK: */
 static inline void hrtick_clear(struct rq *rq) { }



* [patch 12/48] tick/sched: Avoid hrtimer_cancel/start() sequence
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (10 preceding siblings ...)
  2026-02-24 16:36 ` [patch 11/48] sched/hrtick: Mark hrtick timer LAZY_REARM Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 13/48] clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME Thomas Gleixner
                   ` (37 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The sequence of cancel and start is inefficient. It takes and drops the
timer base lock twice and in the worst case has to reprogram the underlying
clock event device twice.

The reason why it is done this way is the usage of hrtimer_forward_now(),
which requires the timer to be inactive.

But that can be completely avoided as the forward can be done on a variable
and does not need any of the overrun accounting provided by
hrtimer_forward_now().

Implement a trivial forwarding mechanism and replace the cancel/reprogram
sequence with hrtimer_start(..., new_expiry).

For the non-high-resolution case the timer is not actually armed, but used
for storage so that code checking for expiry times can unconditionally look
it up in the timer. So it is safe for that case to set the new expiry time
directly.
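The simplified forward can be checked in isolation. A userspace sketch of
tick_forward_now() from the patch, with an assumed 1 ms TICK_NSEC (the real
value depends on CONFIG_HZ):

```c
#include <stdint.h>

#define TICK_NSEC_MODEL 1000000LL	/* assumed 1ms tick period */

/* Push 'expires' past 'now' in whole tick increments, no overrun accounting */
static int64_t tick_forward_now(int64_t expires, int64_t now)
{
	int64_t delta = now - expires;

	/* Common case: less than one tick behind, one increment suffices */
	if (delta < TICK_NSEC_MODEL)
		return expires + TICK_NSEC_MODEL;

	/* Skip the whole ticks that were missed, then land past 'now' */
	expires += TICK_NSEC_MODEL * (delta / TICK_NSEC_MODEL);
	if (expires > now)
		return expires;
	return expires + TICK_NSEC_MODEL;
}
```

The result is always strictly after 'now', which is all the tick path needs.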

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/tick-sched.c |   27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -864,19 +864,32 @@ u64 get_cpu_iowait_time_us(int cpu, u64
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/* Simplified variant of hrtimer_forward_now() */
+static ktime_t tick_forward_now(ktime_t expires, ktime_t now)
+{
+	ktime_t delta = now - expires;
+
+	if (likely(delta < TICK_NSEC))
+		return expires + TICK_NSEC;
+
+	expires += TICK_NSEC * ktime_divns(delta, TICK_NSEC);
+	if (expires > now)
+		return expires;
+	return expires + TICK_NSEC;
+}
+
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
-	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
+	ktime_t expires = ts->last_tick;
 
-	/* Forward the time to expire in the future */
-	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
+	if (now >= expires)
+		expires = tick_forward_now(expires, now);
 
 	if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
-		hrtimer_start_expires(&ts->sched_timer,
-				      HRTIMER_MODE_ABS_PINNED_HARD);
+		hrtimer_start(&ts->sched_timer,	expires, HRTIMER_MODE_ABS_PINNED_HARD);
 	} else {
-		tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
+		hrtimer_set_expires(&ts->sched_timer, expires);
+		tick_program_event(expires, 1);
 	}
 
 	/*



* [patch 13/48] clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (11 preceding siblings ...)
  2026-02-24 16:36 ` [patch 12/48] tick/sched: Avoid hrtimer_cancel/start() sequence Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 14/48] timekeeping: Allow inlining clocksource::read() Thomas Gleixner
                   ` (36 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The only real use case for this is the hrtimer based broadcast device.
There is no point in using two different feature flags for it.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/clockchips.h           |    1 -
 kernel/time/clockevents.c            |    4 ++--
 kernel/time/tick-broadcast-hrtimer.c |    1 -
 3 files changed, 2 insertions(+), 4 deletions(-)

--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -45,7 +45,6 @@ enum clock_event_state {
  */
 # define CLOCK_EVT_FEAT_PERIODIC	0x000001
 # define CLOCK_EVT_FEAT_ONESHOT		0x000002
-# define CLOCK_EVT_FEAT_KTIME		0x000004
 
 /*
  * x86(64) specific (mis)features:
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -319,8 +319,8 @@ int clockevents_program_event(struct clo
 	WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n",
 		  clockevent_get_state(dev));
 
-	/* Shortcut for clockevent devices that can deal with ktime. */
-	if (dev->features & CLOCK_EVT_FEAT_KTIME)
+	/* ktime_t based reprogramming for the broadcast hrtimer device */
+	if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
 		return dev->set_next_ktime(expires, dev);
 
 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -78,7 +78,6 @@ static struct clock_event_device ce_broa
 	.set_state_shutdown	= bc_shutdown,
 	.set_next_ktime		= bc_set_next,
 	.features		= CLOCK_EVT_FEAT_ONESHOT |
-				  CLOCK_EVT_FEAT_KTIME |
 				  CLOCK_EVT_FEAT_HRTIMER,
 	.rating			= 0,
 	.bound_on		= -1,



* [patch 14/48] timekeeping: Allow inlining clocksource::read()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (12 preceding siblings ...)
  2026-02-24 16:36 ` [patch 13/48] clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 15/48] x86: Inline TSC reads in timekeeping Thomas Gleixner
                   ` (35 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

On some architectures clocksource::read() boils down to a single
instruction, so the indirect function call is just massive overhead,
especially with speculative execution mitigations in effect.

Allow architectures to enable conditional inlining of that read to avoid
that by:

   - providing a static branch to switch to the inlined variant

   - disabling the branch before clocksource changes

   - enabling the branch after a clocksource change, when the clocksource
     indicates in a feature flag that it is the one which provides the
     inlined variant

This is intentionally not a static call as that would only remove the
indirect call, but not the rest of the overhead.
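The dispatch shape can be modeled in userspace with a plain bool standing in
for the static key (the real mechanism patches the instruction stream, so
the disabled path costs nothing); clocksource_model and both readers are
hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

struct clocksource_model {
	uint64_t (*read)(struct clocksource_model *cs);
	uint64_t counter;
};

static bool read_inlined;	/* stand-in for the static key */

/* On x86_64 this would collapse to a single RDTSC instruction */
static uint64_t arch_inlined_clocksource_read(struct clocksource_model *cs)
{
	return cs->counter;
}

static uint64_t generic_read(struct clocksource_model *cs)
{
	return cs->counter + 1;	/* offset only to tell the two paths apart */
}

static uint64_t tk_clock_read(struct clocksource_model *cs)
{
	if (read_inlined)			/* static_branch_likely() */
		return arch_inlined_clocksource_read(cs);
	return cs->read(cs);			/* indirect call */
}
```

Flipping the key switches every reader from the indirect call to the
inlined fast path, which is why it must be disabled across a clocksource
change.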

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/clocksource.h |    2 +
 kernel/time/Kconfig         |    3 +
 kernel/time/timekeeping.c   |   74 ++++++++++++++++++++++++++++++++------------
 3 files changed, 60 insertions(+), 19 deletions(-)

--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -149,6 +149,8 @@ struct clocksource {
 #define CLOCK_SOURCE_SUSPEND_NONSTOP		0x80
 #define CLOCK_SOURCE_RESELECT			0x100
 #define CLOCK_SOURCE_VERIFY_PERCPU		0x200
+#define CLOCK_SOURCE_CAN_INLINE_READ		0x400
+
 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
 
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -17,6 +17,9 @@ config ARCH_CLOCKSOURCE_DATA
 config ARCH_CLOCKSOURCE_INIT
 	bool
 
+config ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+	bool
+
 # Timekeeping vsyscall support
 config GENERIC_TIME_VSYSCALL
 	bool
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -3,34 +3,30 @@
  *  Kernel timekeeping code and accessor functions. Based on code from
  *  timer.c, moved in commit 8524070b7982.
  */
-#include <linux/timekeeper_internal.h>
-#include <linux/module.h>
-#include <linux/interrupt.h>
+#include <linux/audit.h>
+#include <linux/clocksource.h>
+#include <linux/compiler.h>
+#include <linux/jiffies.h>
 #include <linux/kobject.h>
-#include <linux/percpu.h>
-#include <linux/init.h>
-#include <linux/mm.h>
+#include <linux/module.h>
 #include <linux/nmi.h>
-#include <linux/sched.h>
-#include <linux/sched/loadavg.h>
+#include <linux/pvclock_gtod.h>
+#include <linux/random.h>
 #include <linux/sched/clock.h>
+#include <linux/sched/loadavg.h>
+#include <linux/static_key.h>
+#include <linux/stop_machine.h>
 #include <linux/syscore_ops.h>
-#include <linux/clocksource.h>
-#include <linux/jiffies.h>
+#include <linux/tick.h>
 #include <linux/time.h>
 #include <linux/timex.h>
-#include <linux/tick.h>
-#include <linux/stop_machine.h>
-#include <linux/pvclock_gtod.h>
-#include <linux/compiler.h>
-#include <linux/audit.h>
-#include <linux/random.h>
+#include <linux/timekeeper_internal.h>
 
 #include <vdso/auxclock.h>
 
 #include "tick-internal.h"
-#include "ntp_internal.h"
 #include "timekeeping_internal.h"
+#include "ntp_internal.h"
 
 #define TK_CLEAR_NTP		(1 << 0)
 #define TK_CLOCK_WAS_SET	(1 << 1)
@@ -275,6 +271,11 @@ static inline void tk_update_sleep_time(
 	tk->monotonic_to_boot = ktime_to_timespec64(tk->offs_boot);
 }
 
+#ifdef CONFIG_ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+#include <asm/clock_inlined.h>
+
+static DEFINE_STATIC_KEY_FALSE(clocksource_read_inlined);
+
 /*
  * tk_clock_read - atomic clocksource read() helper
  *
@@ -288,13 +289,36 @@ static inline void tk_update_sleep_time(
  * a read of the fast-timekeeper tkrs (which is protected by its own locking
  * and update logic).
  */
-static inline u64 tk_clock_read(const struct tk_read_base *tkr)
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
 {
 	struct clocksource *clock = READ_ONCE(tkr->clock);
 
+	if (static_branch_likely(&clocksource_read_inlined))
+		return arch_inlined_clocksource_read(clock);
+
 	return clock->read(clock);
 }
 
+static inline void clocksource_disable_inline_read(void)
+{
+	static_branch_disable(&clocksource_read_inlined);
+}
+
+static inline void clocksource_enable_inline_read(void)
+{
+	static_branch_enable(&clocksource_read_inlined);
+}
+#else
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
+{
+	struct clocksource *clock = READ_ONCE(tkr->clock);
+
+	return clock->read(clock);
+}
+static inline void clocksource_disable_inline_read(void) { }
+static inline void clocksource_enable_inline_read(void) { }
+#endif
+
 /**
  * tk_setup_internals - Set up internals to use clocksource clock.
  *
@@ -375,7 +399,7 @@ static noinline u64 delta_to_ns_safe(con
 	return mul_u64_u32_add_u64_shr(delta, tkr->mult, tkr->xtime_nsec, tkr->shift);
 }
 
-static inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
+static __always_inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
 {
 	/* Calculate the delta since the last update_wall_time() */
 	u64 mask = tkr->mask, delta = (cycles - tkr->cycle_last) & mask;
@@ -1631,7 +1655,19 @@ int timekeeping_notify(struct clocksourc
 
 	if (tk->tkr_mono.clock == clock)
 		return 0;
+
+	/* Disable inlined reads across the clocksource switch */
+	clocksource_disable_inline_read();
+
 	stop_machine(change_clocksource, clock, NULL);
+
+	/*
+	 * If the clocksource has been selected and supports inlined reads
+	 * enable the branch.
+	 */
+	if (tk->tkr_mono.clock == clock && clock->flags & CLOCK_SOURCE_CAN_INLINE_READ)
+		clocksource_enable_inline_read();
+
 	tick_clock_notify();
 	return tk->tkr_mono.clock == clock ? 0 : -1;
 }



* [patch 15/48] x86: Inline TSC reads in timekeeping
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (13 preceding siblings ...)
  2026-02-24 16:36 ` [patch 14/48] timekeeping: Allow inlining clocksource::read() Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 16/48] x86/apic: Remove pointless fence in lapic_next_deadline() Thomas Gleixner
                   ` (34 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Avoid the overhead of the indirect call for a single instruction to read
the TSC.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 arch/x86/Kconfig                     |    1 +
 arch/x86/include/asm/clock_inlined.h |   14 ++++++++++++++
 arch/x86/kernel/tsc.c                |    1 +
 3 files changed, 16 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -141,6 +141,7 @@ config X86
 	select ARCH_USE_SYM_ANNOTATIONS
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_DEFAULT_BPF_JIT	if X86_64
+	select ARCH_WANTS_CLOCKSOURCE_READ_INLINE	if X86_64
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANT_GENERAL_HUGETLB
--- /dev/null
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CLOCK_INLINED_H
+#define _ASM_X86_CLOCK_INLINED_H
+
+#include <asm/tsc.h>
+
+struct clocksource;
+
+static __always_inline u64 arch_inlined_clocksource_read(struct clocksource *cs)
+{
+	return (u64)rdtsc_ordered();
+}
+
+#endif
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1201,6 +1201,7 @@ static struct clocksource clocksource_ts
 	.mask			= CLOCKSOURCE_MASK(64),
 	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
 				  CLOCK_SOURCE_VALID_FOR_HRES |
+				  CLOCK_SOURCE_CAN_INLINE_READ |
 				  CLOCK_SOURCE_MUST_VERIFY |
 				  CLOCK_SOURCE_VERIFY_PERCPU,
 	.id			= CSID_X86_TSC,



* [patch 16/48] x86/apic: Remove pointless fence in lapic_next_deadline()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (14 preceding siblings ...)
  2026-02-24 16:36 ` [patch 15/48] x86: Inline TSC reads in timekeeping Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 17/48] x86/apic: Avoid the PVOPS indirection for the TSC deadline timer Thomas Gleixner
                   ` (33 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

lapic_next_deadline() contains a fence before the TSC read and the write to
the TSC_DEADLINE MSR, accompanied by a content-free and therefore useless
comment:
    /* This MSR is special and need a special fence: */

The MSR is not really special. It is just not a serializing MSR, but that
does not matter at all in this context as all of these operations are
strictly CPU local.

The only thing the fence prevents is that the RDTSC is speculated ahead,
but that's not really relevant as the delta is calculated way before based
on a previous TSC read and therefore inaccurate by definition.

So removing the fence just makes it slightly more inaccurate in the worst
case, but that is irrelevant as it's well below the system's inherent
latencies and variations.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 arch/x86/kernel/apic/apic.c |   16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -412,22 +412,20 @@ EXPORT_SYMBOL_GPL(setup_APIC_eilvt);
 /*
  * Program the next event, relative to now
  */
-static int lapic_next_event(unsigned long delta,
-			    struct clock_event_device *evt)
+static int lapic_next_event(unsigned long delta, struct clock_event_device *evt)
 {
 	apic_write(APIC_TMICT, delta);
 	return 0;
 }
 
-static int lapic_next_deadline(unsigned long delta,
-			       struct clock_event_device *evt)
+static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt)
 {
-	u64 tsc;
+	/*
+	 * There is no weak_wrmsr_fence() required here as all of this is purely
+	 * CPU local. Avoid the [ml]fence overhead.
+	 */
+	u64 tsc = rdtsc();
 
-	/* This MSR is special and need a special fence: */
-	weak_wrmsr_fence();
-
-	tsc = rdtsc();
 	wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
 	return 0;
 }


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 17/48] x86/apic: Avoid the PVOPS indirection for the TSC deadline timer
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (15 preceding siblings ...)
  2026-02-24 16:36 ` [patch 16/48] x86/apic: Remove pointless fence in lapic_next_deadline() Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 18/48] timekeeping: Provide infrastructure for coupled clockevents Thomas Gleixner
                   ` (32 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

XEN PV does not emulate the TSC deadline timer, so the PVOPS indirection
for writing the deadline MSR can be avoided completely.

Use native_wrmsrq() instead.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 arch/x86/kernel/apic/apic.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -426,7 +426,7 @@ static int lapic_next_deadline(unsigned
 	 */
 	u64 tsc = rdtsc();
 
-	wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+	native_wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
 	return 0;
 }
 
@@ -450,7 +450,7 @@ static int lapic_timer_shutdown(struct c
 	 * the timer _and_ zero the counter registers:
 	 */
 	if (v & APIC_LVT_TIMER_TSCDEADLINE)
-		wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
+		native_wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
 	else
 		apic_write(APIC_TMICT, 0);
 
@@ -547,6 +547,11 @@ static __init bool apic_validate_deadlin
 
 	if (!boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
 		return false;
+
+	/* XEN_PV does not support it, but be paranoid about it */
+	if (boot_cpu_has(X86_FEATURE_XENPV))
+		goto clear;
+
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return true;
 
@@ -559,9 +564,11 @@ static __init bool apic_validate_deadlin
 	if (boot_cpu_data.microcode >= rev)
 		return true;
 
-	setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 	pr_err(FW_BUG "TSC_DEADLINE disabled due to Errata; "
 	       "please update microcode to version: 0x%x (or later)\n", rev);
+
+clear:
+	setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 	return false;
 }
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 18/48] timekeeping: Provide infrastructure for coupled clockevents
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (16 preceding siblings ...)
  2026-02-24 16:36 ` [patch 17/48] x86/apic: Avoid the PVOPS indirection for the TSC deadline timer Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 19/48] clockevents: Provide support for clocksource coupled comparators Thomas Gleixner
                   ` (31 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Some architectures have clockevent devices which are coupled to the system
clocksource by implementing a less than or equal comparator which compares
the programmed absolute expiry time against the underlying time
counter. Well known examples are TSC/TSC deadline timer and the S390 TOD
clocksource/comparator.

While the concept is nice it has some downsides:

  1) The clockevents core code is strictly based on relative expiry times
     as that's the most common case for clockevent device hardware. That
     requires to convert the absolute expiry time provided by the caller
     (hrtimers, NOHZ code) to a relative expiry time by reading and
     subtracting the current time.

     The clockevent::set_next_event() callback must then read the counter
     again to convert the relative expiry back into an absolute one.

  2) The conversion factors from nanoseconds to counter clock cycles are
     set up when the clockevent is registered. When NTP applies corrections
     then the clockevent conversion factors can deviate from the
     clocksource conversion substantially which either results in timers
     firing late or in the worst case early. The early expiry then needs to
     do a reprogram with a short delta.

     In most cases this is papered over by the fact that the read in the
     set_next_event() callback happens after the read which is used to
     calculate the delta. So the tendency is that timers expire mostly
     late.

All of this can be avoided by providing support for these devices in the
core code:

  1) The timekeeping core keeps track of the last update to the clocksource
     by storing the base nanoseconds and the corresponding clocksource
     counter value. That's used to keep the conversion math for reading the
     time within 64-bit in the common case.

     This information can be used to avoid both reads of the underlying
     clocksource in the clockevents reprogramming path:

     delta = expiry - base_ns;
     cycles = base_cycles + ((delta * clockevent::mult) >> clockevent::shift);

     The resulting cycles value can be directly used to program the
     comparator.

  2) As #1 no longer provides the "compensation" through the second read,
     the deviation between the clocksource and clockevent conversions
     caused by NTP becomes more prominent.

     This can be cured by letting the timekeeping core compute and store
     the reverse conversion factors when the clocksource cycles to
     nanoseconds factors are modified by NTP:

         CS::MULT      (1 << NS_TO_CYC_SHIFT)
     --------------- = ----------------------
     (1 << CS:SHIFT)       NS_TO_CYC_MULT
	
     Ergo: NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
     
     The NS_TO_CYC_SHIFT value is calculated when the clocksource is
     installed so that it aims for a one hour maximum sleep time.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/clocksource.h         |    1 
 include/linux/timekeeper_internal.h |    8 ++
 kernel/time/Kconfig                 |    3 
 kernel/time/timekeeping.c           |  110 ++++++++++++++++++++++++++++++++++++
 kernel/time/timekeeping.h           |    2 
 5 files changed, 124 insertions(+)
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -150,6 +150,7 @@ struct clocksource {
 #define CLOCK_SOURCE_RESELECT			0x100
 #define CLOCK_SOURCE_VERIFY_PERCPU		0x200
 #define CLOCK_SOURCE_CAN_INLINE_READ		0x400
+#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT	0x800
 
 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -72,6 +72,10 @@ struct tk_read_base {
  * @id:				The timekeeper ID
  * @tkr_raw:			The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec:			CLOCK_MONOTONIC_RAW  time in seconds
+ * @cs_id:			The ID of the current clocksource
+ * @cs_ns_to_cyc_mult:		Multiplicator for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_shift:		Shift value for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_maxns:		Maximum nanoseconds to cycles conversion range
  * @clock_was_set_seq:		The sequence number of clock was set events
  * @cs_was_changed_seq:		The sequence number of clocksource change events
  * @clock_valid:		Indicator for valid clock
@@ -159,6 +163,10 @@ struct timekeeper {
 	u64			raw_sec;
 
 	/* Cachline 3 and 4 (timekeeping internal variables): */
+	enum clocksource_ids	cs_id;
+	u32			cs_ns_to_cyc_mult;
+	u32			cs_ns_to_cyc_shift;
+	u64			cs_ns_to_cyc_maxns;
 	unsigned int		clock_was_set_seq;
 	u8			cs_was_changed_seq;
 	u8			clock_valid;
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -47,6 +47,9 @@ config GENERIC_CLOCKEVENTS_BROADCAST_IDL
 config GENERIC_CLOCKEVENTS_MIN_ADJUST
 	bool
 
+config GENERIC_CLOCKEVENTS_COUPLED
+	bool
+
 # Generic update of CMOS clock
 config GENERIC_CMOS_UPDATE
 	bool
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -391,6 +391,20 @@ static void tk_setup_internals(struct ti
 	tk->tkr_raw.mult = clock->mult;
 	tk->ntp_err_mult = 0;
 	tk->skip_second_overflow = 0;
+
+	tk->cs_id = clock->id;
+
+	/* Coupled clockevent data */
+	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) &&
+	    clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT) {
+		/*
+	 * Aim for a one hour maximum delta and use KHz to handle
+		 * clocksources with a frequency above 4GHz correctly as
+		 * the frequency argument of clocks_calc_mult_shift() is u32.
+		 */
+		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
+				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+	}
 }
 
 /* Timekeeper helper functions. */
@@ -720,6 +734,36 @@ static inline void tk_update_ktime_data(
 	tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);
 }
 
+static inline void tk_update_ns_to_cyc(struct timekeeper *tks, struct timekeeper *tkc)
+{
+	struct tk_read_base *tkrs = &tks->tkr_mono;
+	struct tk_read_base *tkrc = &tkc->tkr_mono;
+	unsigned int shift;
+
+	if (!IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) ||
+	    !(tkrs->clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+		return;
+
+	if (tkrs->mult == tkrc->mult && tkrs->shift == tkrc->shift)
+		return;
+	/*
+	 * The conversion math is simple:
+	 *
+	 *      CS::MULT       (1 << NS_TO_CYC_SHIFT)
+	 *   --------------- = ----------------------
+	 *   (1 << CS:SHIFT)       NS_TO_CYC_MULT
+	 *
+	 * Ergo:
+	 *
+	 *   NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
+	 *
+	 * NS_TO_CYC_SHIFT has been set up in tk_setup_internals()
+	 */
+	shift = tkrs->shift + tks->cs_ns_to_cyc_shift;
+	tks->cs_ns_to_cyc_mult = (u32)div_u64(1ULL << shift, tkrs->mult);
+	tks->cs_ns_to_cyc_maxns = div_u64(tkrs->clock->mask, tks->cs_ns_to_cyc_mult);
+}
+
 /*
  * Restore the shadow timekeeper from the real timekeeper.
  */
@@ -754,6 +798,7 @@ static void timekeeping_update_from_shad
 	tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real;
 
 	if (tk->id == TIMEKEEPER_CORE) {
+		tk_update_ns_to_cyc(tk, &tkd->timekeeper);
 		update_vsyscall(tk);
 		update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
 
@@ -808,6 +853,71 @@ static void timekeeping_forward_now(stru
 	tk_update_coarse_nsecs(tk);
 }
 
+/*
+ * ktime_expiry_to_cycles - Convert an expiry time to clocksource cycles
+ * @id:		Clocksource ID which is required for validity
+ * @expires_ns:	Absolute CLOCK_MONOTONIC expiry time (nsecs) to be converted
+ * @cycles:	Pointer to storage for corresponding absolute cycles value
+ *
+ * Convert a CLOCK_MONOTONIC based absolute expiry time to a cycles value
+ * based on the correlated clocksource of the clockevent device by using
+ * the base nanoseconds and cycles values of the last timekeeper update and
+ * converting the delta between @expires_ns and base nanoseconds to cycles.
+ *
+ * This only works for clockevent devices which are using a less than or
+ * equal comparator against the clocksource.
+ *
+ * Utilizing this avoids two clocksource reads for such devices, the
+ * ktime_get() in clockevents_program_event() to calculate the delta expiry
+ * value and the readout in the device::set_next_event() callback to
+ * convert the delta back to an absolute comparator value.
+ *
+ * Returns: True if @id matches the current clocksource ID, false otherwise
+ */
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	struct tk_read_base *tkrm = &tk->tkr_mono;
+	ktime_t base_ns, delta_ns, max_ns;
+	u64 base_cycles, delta_cycles;
+	unsigned int seq;
+	u32 mult, shift;
+
+	/*
+	 * Racy check to avoid the seqcount overhead when ID does not match. If
+	 * the relevant clocksource is installed concurrently, then this will
+	 * just delay the switch over to this mechanism until the next event is
+	 * programmed. If the ID is not matching the clock events code will use
+	 * the regular relative set_next_event() callback as before.
+	 */
+	if (data_race(tk->cs_id) != id)
+		return false;
+
+	do {
+		seq = read_seqcount_begin(&tk_core.seq);
+
+		if (tk->cs_id != id)
+			return false;
+
+		base_cycles = tkrm->cycle_last;
+		base_ns = tkrm->base + (tkrm->xtime_nsec >> tkrm->shift);
+
+		mult = tk->cs_ns_to_cyc_mult;
+		shift = tk->cs_ns_to_cyc_shift;
+		max_ns = tk->cs_ns_to_cyc_maxns;
+
+	} while (read_seqcount_retry(&tk_core.seq, seq));
+
+	/* Prevent negative deltas and multiplication overflows */
+	delta_ns = min(expires_ns - base_ns, max_ns);
+	delta_ns = max(delta_ns, 0);
+
+	/* Convert to cycles */
+	delta_cycles = ((u64)delta_ns * mult) >> shift;
+	*cycles = base_cycles + delta_cycles;
+	return true;
+}
+
 /**
  * ktime_get_real_ts64 - Returns the time of day in a timespec64.
  * @ts:		pointer to the timespec to be set
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -9,6 +9,8 @@ extern ktime_t ktime_get_update_offsets_
 					    ktime_t *offs_boot,
 					    ktime_t *offs_tai);
 
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles);
+
 extern int timekeeping_valid_for_hres(void);
 extern u64 timekeeping_max_deferment(void);
 extern void timekeeping_warp_clock(void);


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (17 preceding siblings ...)
  2026-02-24 16:36 ` [patch 18/48] timekeeping: Provide infrastructure for coupled clockevents Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-03-03 18:44   ` [patch 19/48] " Michael Kelley
  2026-02-24 16:36 ` [patch 20/48] x86/apic: Enable TSC coupled programming mode Thomas Gleixner
                   ` (30 subsequent siblings)
  49 siblings, 2 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Some clockevent devices are coupled to the system clocksource by
implementing a less than or equal comparator which compares the programmed
absolute expiry time against the underlying time counter.

The timekeeping core provides a function to convert an absolute
CLOCK_MONOTONIC based expiry time to an absolute clock cycles value which
can be directly fed into the comparator. That spares two time reads in the
next event programming path: one to convert the absolute nanoseconds time
to a delta value and the other to convert the delta value back to an
absolute time value suitable for the comparator.

Provide a new clocksource callback which takes the absolute cycle value and
wire it up in clockevents_program_event(). Similar to clocksources allow
architectures to inline the rearm operation.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/clockchips.h |    7 +++++--
 kernel/time/Kconfig        |    4 ++++
 kernel/time/clockevents.c  |   44 +++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 48 insertions(+), 7 deletions(-)

--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -43,8 +43,9 @@ enum clock_event_state {
 /*
  * Clock event features
  */
-# define CLOCK_EVT_FEAT_PERIODIC	0x000001
-# define CLOCK_EVT_FEAT_ONESHOT		0x000002
+# define CLOCK_EVT_FEAT_PERIODIC		0x000001
+# define CLOCK_EVT_FEAT_ONESHOT			0x000002
+# define CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED	0x000004
 
 /*
  * x86(64) specific (mis)features:
@@ -100,6 +101,7 @@ struct clock_event_device {
 	void			(*event_handler)(struct clock_event_device *);
 	int			(*set_next_event)(unsigned long evt, struct clock_event_device *);
 	int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
+	void			(*set_next_coupled)(u64 cycles, struct clock_event_device *);
 	ktime_t			next_event;
 	u64			max_delta_ns;
 	u64			min_delta_ns;
@@ -107,6 +109,7 @@ struct clock_event_device {
 	u32			shift;
 	enum clock_event_state	state_use_accessors;
 	unsigned int		features;
+	enum clocksource_ids	cs_id;
 	unsigned long		retries;
 
 	int			(*set_state_periodic)(struct clock_event_device *);
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -50,6 +50,10 @@ config GENERIC_CLOCKEVENTS_MIN_ADJUST
 config GENERIC_CLOCKEVENTS_COUPLED
 	bool
 
+config GENERIC_CLOCKEVENTS_COUPLED_INLINE
+	select GENERIC_CLOCKEVENTS_COUPLED
+	bool
+
 # Generic update of CMOS clock
 config GENERIC_CMOS_UPDATE
 	bool
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -292,6 +292,38 @@ static int clockevents_program_min_delta
 
 #endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE
+#include <asm/clock_inlined.h>
+#else
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *dev) { }
+#endif
+
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+	u64 cycles;
+
+	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
+		return false;
+
+	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
+		return false;
+
+	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
+		arch_inlined_clockevent_set_next_coupled(cycles, dev);
+	else
+		dev->set_next_coupled(cycles, dev);
+	return true;
+}
+
+#else
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+	return false;
+}
+#endif
+
 /**
  * clockevents_program_event - Reprogram the clock event device.
  * @dev:	device to program
@@ -300,11 +332,10 @@ static int clockevents_program_min_delta
  *
  * Returns 0 on success, -ETIME when the event is in the past.
  */
-int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
-			      bool force)
+int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, bool force)
 {
-	unsigned long long clc;
 	int64_t delta;
+	u64 cycles;
 	int rc;
 
 	if (WARN_ON_ONCE(expires < 0))
@@ -323,6 +354,9 @@ int clockevents_program_event(struct clo
 	if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
 		return dev->set_next_ktime(expires, dev);
 
+	if (likely(clockevent_set_next_coupled(dev, expires)))
+		return 0;
+
 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
 	if (delta <= 0)
 		return force ? clockevents_program_min_delta(dev) : -ETIME;
@@ -330,8 +364,8 @@ int clockevents_program_event(struct clo
 	delta = min(delta, (int64_t) dev->max_delta_ns);
 	delta = max(delta, (int64_t) dev->min_delta_ns);
 
-	clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
-	rc = dev->set_next_event((unsigned long) clc, dev);
+	cycles = ((u64)delta * dev->mult) >> dev->shift;
+	rc = dev->set_next_event((unsigned long) cycles, dev);
 
 	return (rc && force) ? clockevents_program_min_delta(dev) : rc;
 }


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (18 preceding siblings ...)
  2026-02-24 16:36 ` [patch 19/48] clockevents: Provide support for clocksource coupled comparators Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-03-03  1:29   ` [patch 20/48] " Nathan Chancellor
  2026-02-24 16:36 ` [patch 21/48] hrtimer: Add debug object init assertion Thomas Gleixner
                   ` (29 subsequent siblings)
  49 siblings, 2 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The TSC deadline timer is directly coupled to the TSC and setting the next
deadline is tedious as the clockevents core code converts the
CLOCK_MONOTONIC based absolute expiry time to a relative expiry by reading
the current time from the TSC. It converts that delta to cycles and hands
the result to lapic_next_deadline(), which then has to read the TSC again
and add the delta to program the timer.

The core code now supports coupled clock event devices and can provide the
expiry time in TSC cycles directly without reading the TSC at all.

This obviously works only when the TSC is the current clocksource, but
that's the default for all modern CPUs which implement the TSC deadline
timer. If the TSC is not the current clocksource (e.g. early boot) then the
core code falls back to the relative set_next_event() callback as before.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: x86@kernel.org
---
 arch/x86/Kconfig                     |    1 +
 arch/x86/include/asm/clock_inlined.h |    8 ++++++++
 arch/x86/kernel/apic/apic.c          |   12 ++++++------
 arch/x86/kernel/tsc.c                |    3 ++-
 4 files changed, 17 insertions(+), 7 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -164,6 +164,7 @@ config X86
 	select EDAC_SUPPORT
 	select GENERIC_CLOCKEVENTS_BROADCAST	if X86_64 || (X86_32 && X86_LOCAL_APIC)
 	select GENERIC_CLOCKEVENTS_BROADCAST_IDLE	if GENERIC_CLOCKEVENTS_BROADCAST
+	select GENERIC_CLOCKEVENTS_COUPLED_INLINE	if X86_64
 	select GENERIC_CLOCKEVENTS_MIN_ADJUST
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_CPU_AUTOPROBE
--- a/arch/x86/include/asm/clock_inlined.h
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -11,4 +11,12 @@ static __always_inline u64 arch_inlined_
 	return (u64)rdtsc_ordered();
 }
 
+struct clock_event_device;
+
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *evt)
+{
+	native_wrmsrq(MSR_IA32_TSC_DEADLINE, cycles);
+}
+
 #endif
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -591,14 +591,14 @@ static void setup_APIC_timer(void)
 
 	if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
 		levt->name = "lapic-deadline";
-		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
-				    CLOCK_EVT_FEAT_DUMMY);
+		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_DUMMY);
+		levt->features |= CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED;
+		levt->cs_id = CSID_X86_TSC;
 		levt->set_next_event = lapic_next_deadline;
-		clockevents_config_and_register(levt,
-						tsc_khz * (1000 / TSC_DIVISOR),
-						0xF, ~0UL);
-	} else
+		clockevents_config_and_register(levt, tsc_khz * (1000 / TSC_DIVISOR), 0xF, ~0UL);
+	} else {
 		clockevents_register_device(levt);
+	}
 
 	apic_update_vector(smp_processor_id(), LOCAL_TIMER_VECTOR, true);
 }
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1203,7 +1203,8 @@ static struct clocksource clocksource_ts
 				  CLOCK_SOURCE_VALID_FOR_HRES |
 				  CLOCK_SOURCE_CAN_INLINE_READ |
 				  CLOCK_SOURCE_MUST_VERIFY |
-				  CLOCK_SOURCE_VERIFY_PERCPU,
+				  CLOCK_SOURCE_VERIFY_PERCPU |
+				  CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT,
 	.id			= CSID_X86_TSC,
 	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
 	.enable			= tsc_cs_enable,


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 21/48] hrtimer: Add debug object init assertion
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (19 preceding siblings ...)
  2026-02-24 16:36 ` [patch 20/48] x86/apic: Enable TSC coupled programming mode Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:36 ` [patch 22/48] hrtimer: Reduce trace noise in hrtimer_start() Thomas Gleixner
                   ` (28 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The debug object coverage in hrtimer_start_range_ns() happens too late to
do anything useful. Implement the init assertion part and invoke it early
in hrtimer_start_range_ns().

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |   43 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -441,12 +441,37 @@ static bool hrtimer_fixup_free(void *add
 	}
 }
 
+/* Stub timer callback for improperly used timers. */
+static enum hrtimer_restart stub_timer(struct hrtimer *unused)
+{
+	WARN_ON_ONCE(1);
+	return HRTIMER_NORESTART;
+}
+
+/*
+ * hrtimer_fixup_assert_init is called when:
+ * - an untracked/uninit-ed object is found
+ */
+static bool hrtimer_fixup_assert_init(void *addr, enum debug_obj_state state)
+{
+	struct hrtimer *timer = addr;
+
+	switch (state) {
+	case ODEBUG_STATE_NOTAVAILABLE:
+		hrtimer_setup(timer, stub_timer, CLOCK_MONOTONIC, 0);
+		return true;
+	default:
+		return false;
+	}
+}
+
 static const struct debug_obj_descr hrtimer_debug_descr = {
-	.name		= "hrtimer",
-	.debug_hint	= hrtimer_debug_hint,
-	.fixup_init	= hrtimer_fixup_init,
-	.fixup_activate	= hrtimer_fixup_activate,
-	.fixup_free	= hrtimer_fixup_free,
+	.name			= "hrtimer",
+	.debug_hint		= hrtimer_debug_hint,
+	.fixup_init		= hrtimer_fixup_init,
+	.fixup_activate		= hrtimer_fixup_activate,
+	.fixup_free		= hrtimer_fixup_free,
+	.fixup_assert_init	= hrtimer_fixup_assert_init,
 };
 
 static inline void debug_hrtimer_init(struct hrtimer *timer)
@@ -470,6 +495,11 @@ static inline void debug_hrtimer_deactiv
 	debug_object_deactivate(timer, &hrtimer_debug_descr);
 }
 
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer)
+{
+	debug_object_assert_init(timer, &hrtimer_debug_descr);
+}
+
 void destroy_hrtimer_on_stack(struct hrtimer *timer)
 {
 	debug_object_free(timer, &hrtimer_debug_descr);
@@ -483,6 +513,7 @@ static inline void debug_hrtimer_init_on
 static inline void debug_hrtimer_activate(struct hrtimer *timer,
 					  enum hrtimer_mode mode) { }
 static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer) { }
 #endif
 
 static inline void debug_setup(struct hrtimer *timer, clockid_t clockid, enum hrtimer_mode mode)
@@ -1359,6 +1390,8 @@ void hrtimer_start_range_ns(struct hrtim
 	struct hrtimer_clock_base *base;
 	unsigned long flags;
 
+	debug_hrtimer_assert_init(timer);
+
 	/*
 	 * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
 	 * match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 22/48] hrtimer: Reduce trace noise in hrtimer_start()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (20 preceding siblings ...)
  2026-02-24 16:36 ` [patch 21/48] hrtimer: Add debug object init assertion Thomas Gleixner
@ 2026-02-24 16:36 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 23/48] hrtimer: Use guards where appropriate Thomas Gleixner
                   ` (27 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:36 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

hrtimer_start() when invoked with an already armed timer traces like:
    
 <comm>-..   [032] d.h2. 5.002263: hrtimer_cancel: hrtimer= ....
 <comm>-..   [032] d.h1. 5.002263: hrtimer_start: hrtimer= ....
    
This is misleading as the timer doesn't get canceled; just the expiry time
changes. The internal dequeue operation which is required for that is not
really interesting for trace analysis, but it makes it tedious to tell
real cancellations and the above case apart.

Remove the cancel tracing in hrtimer_start() and add a 'was_armed'
indicator to the hrtimer start tracepoint, which clearly indicates what the
state of the hrtimer is when hrtimer_start() is invoked:

 <comm>-..   [032] d.h1. 6.200103: hrtimer_start: hrtimer= .... was_armed=0
 <comm>-..   [032] d.h1. 6.200558: hrtimer_start: hrtimer= .... was_armed=1
    
Fixes: c6a2a1770245 ("hrtimer: Add tracepoint for hrtimers")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/trace/events/timer.h |   11 +++++++----
 kernel/time/hrtimer.c        |   43 ++++++++++++++++++++-----------------------
 2 files changed, 27 insertions(+), 27 deletions(-)

--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -218,12 +218,13 @@ TRACE_EVENT(hrtimer_setup,
  * hrtimer_start - called when the hrtimer is started
  * @hrtimer:	pointer to struct hrtimer
  * @mode:	the hrtimers mode
+ * @was_armed:	Whether the timer was armed when hrtimer_start*() was invoked
  */
 TRACE_EVENT(hrtimer_start,
 
-	TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode),
+	TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode, bool was_armed),
 
-	TP_ARGS(hrtimer, mode),
+	TP_ARGS(hrtimer, mode, was_armed),
 
 	TP_STRUCT__entry(
 		__field( void *,	hrtimer		)
@@ -231,6 +232,7 @@ TRACE_EVENT(hrtimer_start,
 		__field( s64,		expires		)
 		__field( s64,		softexpires	)
 		__field( enum hrtimer_mode,	mode	)
+		__field( bool,		was_armed	)
 	),
 
 	TP_fast_assign(
@@ -239,13 +241,14 @@ TRACE_EVENT(hrtimer_start,
 		__entry->expires	= hrtimer_get_expires(hrtimer);
 		__entry->softexpires	= hrtimer_get_softexpires(hrtimer);
 		__entry->mode		= mode;
+		__entry->was_armed	= was_armed;
 	),
 
 	TP_printk("hrtimer=%p function=%ps expires=%llu softexpires=%llu "
-		  "mode=%s", __entry->hrtimer, __entry->function,
+		  "mode=%s was_armed=%d", __entry->hrtimer, __entry->function,
 		  (unsigned long long) __entry->expires,
 		  (unsigned long long) __entry->softexpires,
-		  decode_hrtimer_mode(__entry->mode))
+		  decode_hrtimer_mode(__entry->mode), __entry->was_armed)
 );
 
 /**
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -529,17 +529,10 @@ static inline void debug_setup_on_stack(
 	trace_hrtimer_setup(timer, clockid, mode);
 }
 
-static inline void debug_activate(struct hrtimer *timer,
-				  enum hrtimer_mode mode)
+static inline void debug_activate(struct hrtimer *timer, enum hrtimer_mode mode, bool was_armed)
 {
 	debug_hrtimer_activate(timer, mode);
-	trace_hrtimer_start(timer, mode);
-}
-
-static inline void debug_deactivate(struct hrtimer *timer)
-{
-	debug_hrtimer_deactivate(timer);
-	trace_hrtimer_cancel(timer);
+	trace_hrtimer_start(timer, mode, was_armed);
 }
 
 static struct hrtimer_clock_base *
@@ -1137,9 +1130,9 @@ EXPORT_SYMBOL_GPL(hrtimer_forward);
  * Returns true when the new timer is the leftmost timer in the tree.
  */
 static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-			    enum hrtimer_mode mode)
+			    enum hrtimer_mode mode, bool was_armed)
 {
-	debug_activate(timer, mode);
+	debug_activate(timer, mode, was_armed);
 	WARN_ON_ONCE(!base->cpu_base->online);
 
 	base->cpu_base->active_bases |= 1 << base->index;
@@ -1199,6 +1192,8 @@ remove_hrtimer(struct hrtimer *timer, st
 	if (state & HRTIMER_STATE_ENQUEUED) {
 		bool reprogram;
 
+		debug_hrtimer_deactivate(timer);
+
 		/*
 		 * Remove the timer and force reprogramming when high
 		 * resolution mode is active and the timer is on the current
@@ -1207,7 +1202,6 @@ remove_hrtimer(struct hrtimer *timer, st
 		 * reprogramming happens in the interrupt handler. This is a
 		 * rare case and less expensive than a smp call.
 		 */
-		debug_deactivate(timer);
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
 		/*
@@ -1274,15 +1268,15 @@ static int __hrtimer_start_range_ns(stru
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *new_base;
-	bool force_local, first;
+	bool force_local, first, was_armed;
 
 	/*
 	 * If the timer is on the local cpu base and is the first expiring
 	 * timer then this might end up reprogramming the hardware twice
-	 * (on removal and on enqueue). To avoid that by prevent the
-	 * reprogram on removal, keep the timer local to the current CPU
-	 * and enforce reprogramming after it is queued no matter whether
-	 * it is the new first expiring timer again or not.
+	 * (on removal and on enqueue). To avoid that, prevent the reprogram
+	 * on removal, keep the timer local to the current CPU and enforce
+	 * reprogramming after it is queued no matter whether it is the new
+	 * first expiring timer again or not.
 	 */
 	force_local = base->cpu_base == this_cpu_base;
 	force_local &= base->cpu_base->next_timer == timer;
@@ -1304,7 +1298,7 @@ static int __hrtimer_start_range_ns(stru
 	 * avoids programming the underlying clock event twice (once at
 	 * removal and once after enqueue).
 	 */
-	remove_hrtimer(timer, base, true, force_local);
+	was_armed = remove_hrtimer(timer, base, true, force_local);
 
 	if (mode & HRTIMER_MODE_REL)
 		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
@@ -1321,7 +1315,7 @@ static int __hrtimer_start_range_ns(stru
 		new_base = base;
 	}
 
-	first = enqueue_hrtimer(timer, new_base, mode);
+	first = enqueue_hrtimer(timer, new_base, mode, was_armed);
 
 	/*
 	 * If the hrtimer interrupt is running, then it will reevaluate the
@@ -1439,8 +1433,11 @@ int hrtimer_try_to_cancel(struct hrtimer
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (!hrtimer_callback_running(timer))
+	if (!hrtimer_callback_running(timer)) {
 		ret = remove_hrtimer(timer, base, false, false);
+		if (ret)
+			trace_hrtimer_cancel(timer);
+	}
 
 	unlock_hrtimer_base(timer, &flags);
 
@@ -1877,7 +1874,7 @@ static void __run_hrtimer(struct hrtimer
 	 */
 	if (restart != HRTIMER_NORESTART &&
 	    !(timer->state & HRTIMER_STATE_ENQUEUED))
-		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);
+		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
 
 	/*
 	 * Separate the ->running assignment from the ->state assignment.
@@ -2356,7 +2353,7 @@ static void migrate_hrtimer_list(struct
 	while ((node = timerqueue_getnext(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
 		BUG_ON(hrtimer_callback_running(timer));
-		debug_deactivate(timer);
+		debug_hrtimer_deactivate(timer);
 
 		/*
 		 * Mark it as ENQUEUED not INACTIVE otherwise the
@@ -2373,7 +2370,7 @@ static void migrate_hrtimer_list(struct
 		 * sort out already expired timers and reprogram the
 		 * event device.
 		 */
-		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
+		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS, true);
 	}
 }
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 23/48] hrtimer: Use guards where appropriate
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (21 preceding siblings ...)
  2026-02-24 16:36 ` [patch 22/48] hrtimer: Reduce trace noise in hrtimer_start() Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 24/48] hrtimer: Cleanup coding style and comments Thomas Gleixner
                   ` (26 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Simplify and tidy up the code where possible.
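The guard()/scoped_guard() helpers used below release the lock automatically
at scope exit. The mechanism can be sketched outside the kernel with the
compiler's cleanup attribute, which is also what the kernel's cleanup.h
infrastructure builds on. Everything here (toy_guard, toy_unlock, the
counters) is an invented illustration, not the kernel's implementation:

```c
#include <assert.h>

/* Toy sketch of a scope-based guard built on the GNU cleanup attribute.
 * The "lock" is just a pair of counters so the example stays
 * self-contained. */
static int lock_count, unlock_count;

/* Runs automatically when the guarded variable goes out of scope. */
static void toy_unlock(int **unused)
{
	(void)unused;
	unlock_count++;
}

/* "Acquire" on declaration, "release" at end of scope - no explicit
 * unlock call and no early-return leak, which is the point of guard(). */
#define toy_guard() \
	int *_g __attribute__((cleanup(toy_unlock))) = \
		(lock_count++, &lock_count)

static int counter;

static void bump(void)
{
	toy_guard();	/* released automatically when bump() returns */
	counter++;
}
```

Every return path out of bump() runs toy_unlock(), which is why the patch
can delete the explicit unlock calls on each exit path.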

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |   48 +++++++++++++++---------------------------------
 1 file changed, 15 insertions(+), 33 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -838,13 +838,12 @@ static void retrigger_next_event(void *a
 	 * In periodic low resolution mode, the next softirq expiration
 	 * must also be updated.
 	 */
-	raw_spin_lock(&base->lock);
+	guard(raw_spinlock)(&base->lock);
 	hrtimer_update_base(base);
 	if (hrtimer_hres_active(base))
 		hrtimer_force_reprogram(base, 0);
 	else
 		hrtimer_update_next_event(base);
-	raw_spin_unlock(&base->lock);
 }
 
 /*
@@ -994,7 +993,6 @@ static bool update_needs_ipi(struct hrti
 void clock_was_set(unsigned int bases)
 {
 	cpumask_var_t mask;
-	int cpu;
 
 	if (!hrtimer_highres_enabled() && !tick_nohz_is_active())
 		goto out_timerfd;
@@ -1005,24 +1003,19 @@ void clock_was_set(unsigned int bases)
 	}
 
 	/* Avoid interrupting CPUs if possible */
-	cpus_read_lock();
-	for_each_online_cpu(cpu) {
-		struct hrtimer_cpu_base *cpu_base;
-		unsigned long flags;
+	scoped_guard(cpus_read_lock) {
+		int cpu;
 
-		cpu_base = &per_cpu(hrtimer_bases, cpu);
-		raw_spin_lock_irqsave(&cpu_base->lock, flags);
+		for_each_online_cpu(cpu) {
+			struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 
-		if (update_needs_ipi(cpu_base, bases))
-			cpumask_set_cpu(cpu, mask);
-
-		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+			guard(raw_spinlock_irqsave)(&cpu_base->lock);
+			if (update_needs_ipi(cpu_base, bases))
+				cpumask_set_cpu(cpu, mask);
+		}
+		scoped_guard(preempt)
+			smp_call_function_many(mask, retrigger_next_event, NULL, 1);
 	}
-
-	preempt_disable();
-	smp_call_function_many(mask, retrigger_next_event, NULL, 1);
-	preempt_enable();
-	cpus_read_unlock();
 	free_cpumask_var(mask);
 
 out_timerfd:
@@ -1600,15 +1593,11 @@ u64 hrtimer_get_next_event(void)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 
+	guard(raw_spinlock_irqsave)(&cpu_base->lock);
 	if (!hrtimer_hres_active(cpu_base))
 		expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
 
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
 	return expires;
 }
 
@@ -1623,25 +1612,18 @@ u64 hrtimer_next_event_without(const str
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 
+	guard(raw_spinlock_irqsave)(&cpu_base->lock);
 	if (hrtimer_hres_active(cpu_base)) {
 		unsigned int active;
 
 		if (!cpu_base->softirq_activated) {
 			active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-			expires = __hrtimer_next_event_base(cpu_base, exclude,
-							    active, KTIME_MAX);
+			expires = __hrtimer_next_event_base(cpu_base, exclude, active, KTIME_MAX);
 		}
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
-		expires = __hrtimer_next_event_base(cpu_base, exclude, active,
-						    expires);
+		expires = __hrtimer_next_event_base(cpu_base, exclude, active, expires);
 	}
-
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
 	return expires;
 }
 #endif


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 24/48] hrtimer: Cleanup coding style and comments
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (22 preceding siblings ...)
  2026-02-24 16:37 ` [patch 23/48] hrtimer: Use guards where appropriate Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 25/48] hrtimer: Evaluate timer expiry only once Thomas Gleixner
                   ` (25 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

As this code has some major surgery ahead, clean up coding style and bring
comments up to date.

No functional change intended.
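One of the cleanups below replaces the open-coded clock_base initializers
with a BASE_INIT() macro using designated array initializers. The pattern in
isolation, with simplified stand-in names (toy_clock_base, the BASE_*
constants and clockid values are invented for this sketch):

```c
#include <assert.h>

/* Designated array initializer keyed by index, which also stores the
 * index back into each element - the shape of the BASE_INIT() macro
 * introduced by this patch. */
enum { BASE_MONOTONIC, BASE_REALTIME, BASE_MAX };

struct toy_clock_base {
	int index;
	int clockid;
};

#define BASE_INIT(idx, cid)	[idx] = { .index = idx, .clockid = cid }

static const struct toy_clock_base bases[BASE_MAX] = {
	BASE_INIT(BASE_MONOTONIC, 1),	/* 1: stand-in for CLOCK_MONOTONIC */
	BASE_INIT(BASE_REALTIME,  0),	/* 0: stand-in for CLOCK_REALTIME */
};
```

Each entry lands at its named slot regardless of ordering, and the index
field can never drift out of sync with the array position.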

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |  364 +++++++++++++++++++-------------------------------
 1 file changed, 143 insertions(+), 221 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -77,43 +77,22 @@ static ktime_t __hrtimer_cb_get_time(clo
  * to reach a base using a clockid, hrtimer_clockid_to_base()
  * is used to convert from clockid to the proper hrtimer_base_type.
  */
+
+#define BASE_INIT(idx, cid)			\
+	[idx] = { .index = idx, .clockid = cid }
+
 DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
 	.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
-	.clock_base =
-	{
-		{
-			.index = HRTIMER_BASE_MONOTONIC,
-			.clockid = CLOCK_MONOTONIC,
-		},
-		{
-			.index = HRTIMER_BASE_REALTIME,
-			.clockid = CLOCK_REALTIME,
-		},
-		{
-			.index = HRTIMER_BASE_BOOTTIME,
-			.clockid = CLOCK_BOOTTIME,
-		},
-		{
-			.index = HRTIMER_BASE_TAI,
-			.clockid = CLOCK_TAI,
-		},
-		{
-			.index = HRTIMER_BASE_MONOTONIC_SOFT,
-			.clockid = CLOCK_MONOTONIC,
-		},
-		{
-			.index = HRTIMER_BASE_REALTIME_SOFT,
-			.clockid = CLOCK_REALTIME,
-		},
-		{
-			.index = HRTIMER_BASE_BOOTTIME_SOFT,
-			.clockid = CLOCK_BOOTTIME,
-		},
-		{
-			.index = HRTIMER_BASE_TAI_SOFT,
-			.clockid = CLOCK_TAI,
-		},
+	.clock_base = {
+		BASE_INIT(HRTIMER_BASE_MONOTONIC,	CLOCK_MONOTONIC),
+		BASE_INIT(HRTIMER_BASE_REALTIME,	CLOCK_REALTIME),
+		BASE_INIT(HRTIMER_BASE_BOOTTIME,	CLOCK_BOOTTIME),
+		BASE_INIT(HRTIMER_BASE_TAI,		CLOCK_TAI),
+		BASE_INIT(HRTIMER_BASE_MONOTONIC_SOFT,	CLOCK_MONOTONIC),
+		BASE_INIT(HRTIMER_BASE_REALTIME_SOFT,	CLOCK_REALTIME),
+		BASE_INIT(HRTIMER_BASE_BOOTTIME_SOFT,	CLOCK_BOOTTIME),
+		BASE_INIT(HRTIMER_BASE_TAI_SOFT,	CLOCK_TAI),
 	},
 	.csd = CSD_INIT(retrigger_next_event, NULL)
 };
@@ -150,18 +129,19 @@ static inline void hrtimer_schedule_hres
  * single place
  */
 #ifdef CONFIG_SMP
-
 /*
  * We require the migration_base for lock_hrtimer_base()/switch_hrtimer_base()
  * such that hrtimer_callback_running() can unconditionally dereference
  * timer->base->cpu_base
  */
 static struct hrtimer_cpu_base migration_cpu_base = {
-	.clock_base = { {
-		.cpu_base = &migration_cpu_base,
-		.seq      = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
-						     &migration_cpu_base.lock),
-	}, },
+	.clock_base = {
+		[0] = {
+			.cpu_base = &migration_cpu_base,
+			.seq      = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
+							     &migration_cpu_base.lock),
+		},
+	},
 };
 
 #define migration_base	migration_cpu_base.clock_base[0]
@@ -178,15 +158,13 @@ static struct hrtimer_cpu_base migration
  * possible to set timer->base = &migration_base and drop the lock: the timer
  * remains locked.
  */
-static
-struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
-					     unsigned long *flags)
+static struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+						    unsigned long *flags)
 	__acquires(&timer->base->lock)
 {
-	struct hrtimer_clock_base *base;
-
 	for (;;) {
-		base = READ_ONCE(timer->base);
+		struct hrtimer_clock_base *base = READ_ONCE(timer->base);
+
 		if (likely(base != &migration_base)) {
 			raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
 			if (likely(base == timer->base))
@@ -239,7 +217,7 @@ static bool hrtimer_suitable_target(stru
 	return expires >= new_base->cpu_base->expires_next;
 }
 
-static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned)
+static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
 {
 	if (!hrtimer_base_is_online(base)) {
 		int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
@@ -267,8 +245,7 @@ static inline struct hrtimer_cpu_base *g
  * the timer callback is currently running.
  */
 static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
-		    int pinned)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base, bool pinned)
 {
 	struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
 	struct hrtimer_clock_base *new_base;
@@ -281,13 +258,12 @@ switch_hrtimer_base(struct hrtimer *time
 
 	if (base != new_base) {
 		/*
-		 * We are trying to move timer to new_base.
-		 * However we can't change timer's base while it is running,
-		 * so we keep it on the same CPU. No hassle vs. reprogramming
-		 * the event source in the high resolution case. The softirq
-		 * code will take care of this when the timer function has
-		 * completed. There is no conflict as we hold the lock until
-		 * the timer is enqueued.
+		 * We are trying to move timer to new_base. However we can't
+		 * change timer's base while it is running, so we keep it on
+		 * the same CPU. No hassle vs. reprogramming the event source
+		 * in the high resolution case. The remote CPU will take care
+		 * of this when the timer function has completed. There is no
+		 * conflict as we hold the lock until the timer is enqueued.
 		 */
 		if (unlikely(hrtimer_callback_running(timer)))
 			return base;
@@ -297,8 +273,7 @@ switch_hrtimer_base(struct hrtimer *time
 		raw_spin_unlock(&base->cpu_base->lock);
 		raw_spin_lock(&new_base->cpu_base->lock);
 
-		if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
-					     this_cpu_base)) {
+		if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) {
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
 			new_cpu_base = this_cpu_base;
@@ -317,14 +292,13 @@ switch_hrtimer_base(struct hrtimer *time
 
 #else /* CONFIG_SMP */
 
-static inline struct hrtimer_clock_base *
-lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+static inline struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+							   unsigned long *flags)
 	__acquires(&timer->base->cpu_base->lock)
 {
 	struct hrtimer_clock_base *base = timer->base;
 
 	raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
-
 	return base;
 }
 
@@ -484,8 +458,7 @@ static inline void debug_hrtimer_init_on
 	debug_object_init_on_stack(timer, &hrtimer_debug_descr);
 }
 
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
-					  enum hrtimer_mode mode)
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode)
 {
 	debug_object_activate(timer, &hrtimer_debug_descr);
 }
@@ -510,8 +483,7 @@ EXPORT_SYMBOL_GPL(destroy_hrtimer_on_sta
 
 static inline void debug_hrtimer_init(struct hrtimer *timer) { }
 static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer) { }
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
-					  enum hrtimer_mode mode) { }
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode) { }
 static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
 static inline void debug_hrtimer_assert_init(struct hrtimer *timer) { }
 #endif
@@ -549,13 +521,12 @@ static struct hrtimer_clock_base *
 	return &cpu_base->clock_base[idx];
 }
 
-#define for_each_active_base(base, cpu_base, active)	\
+#define for_each_active_base(base, cpu_base, active)		\
 	while ((base = __next_base((cpu_base), &(active))))
 
 static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
 					 const struct hrtimer *exclude,
-					 unsigned int active,
-					 ktime_t expires_next)
+					 unsigned int active, ktime_t expires_next)
 {
 	struct hrtimer_clock_base *base;
 	ktime_t expires;
@@ -618,29 +589,24 @@ static ktime_t __hrtimer_next_event_base
  *  - HRTIMER_ACTIVE_SOFT, or
  *  - HRTIMER_ACTIVE_HARD.
  */
-static ktime_t
-__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
+static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
 {
-	unsigned int active;
 	struct hrtimer *next_timer = NULL;
 	ktime_t expires_next = KTIME_MAX;
+	unsigned int active;
 
 	if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
 		cpu_base->softirq_next_timer = NULL;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL,
-							 active, KTIME_MAX);
-
+		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, KTIME_MAX);
 		next_timer = cpu_base->softirq_next_timer;
 	}
 
 	if (active_mask & HRTIMER_ACTIVE_HARD) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
 		cpu_base->next_timer = next_timer;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
-							 expires_next);
+		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, expires_next);
 	}
-
 	return expires_next;
 }
 
@@ -681,8 +647,8 @@ static inline ktime_t hrtimer_update_bas
 	ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
 	ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;
 
-	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
-					    offs_real, offs_boot, offs_tai);
+	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq, offs_real,
+						   offs_boot, offs_tai);
 
 	base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
 	base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
@@ -702,8 +668,7 @@ static inline int hrtimer_hres_active(st
 		cpu_base->hres_active : 0;
 }
 
-static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
-				struct hrtimer *next_timer,
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtimer *next_timer,
 				ktime_t expires_next)
 {
 	cpu_base->expires_next = expires_next;
@@ -736,12 +701,9 @@ static void __hrtimer_reprogram(struct h
  * next event
  * Called with interrupts disabled and base->lock held
  */
-static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, bool skip_equal)
 {
-	ktime_t expires_next;
-
-	expires_next = hrtimer_update_next_event(cpu_base);
+	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
 
 	if (skip_equal && expires_next == cpu_base->expires_next)
 		return;
@@ -752,41 +714,31 @@ hrtimer_force_reprogram(struct hrtimer_c
 /* High resolution timer related functions */
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-/*
- * High resolution timer enabled ?
- */
+/* High resolution timer enabled ? */
 static bool hrtimer_hres_enabled __read_mostly  = true;
 unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
 EXPORT_SYMBOL_GPL(hrtimer_resolution);
 
-/*
- * Enable / Disable high resolution mode
- */
+/* Enable / Disable high resolution mode */
 static int __init setup_hrtimer_hres(char *str)
 {
 	return (kstrtobool(str, &hrtimer_hres_enabled) == 0);
 }
-
 __setup("highres=", setup_hrtimer_hres);
 
-/*
- * hrtimer_high_res_enabled - query, if the highres mode is enabled
- */
-static inline int hrtimer_is_hres_enabled(void)
+/* hrtimer_is_hres_enabled - query if the highres mode is enabled */
+static inline bool hrtimer_is_hres_enabled(void)
 {
 	return hrtimer_hres_enabled;
 }
 
-/*
- * Switch to high resolution mode
- */
+/* Switch to high resolution mode */
 static void hrtimer_switch_to_hres(void)
 {
 	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
 
 	if (tick_init_highres()) {
-		pr_warn("Could not switch to high resolution mode on CPU %u\n",
-			base->cpu);
+		pr_warn("Could not switch to high resolution mode on CPU %u\n", base->cpu);
 		return;
 	}
 	base->hres_active = 1;
@@ -800,10 +752,11 @@ static void hrtimer_switch_to_hres(void)
 
 #else
 
-static inline int hrtimer_is_hres_enabled(void) { return 0; }
+static inline bool hrtimer_is_hres_enabled(void) { return false; }
 static inline void hrtimer_switch_to_hres(void) { }
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Retrigger next event is called after clock was set with interrupts
  * disabled through an SMP function call or directly from low level
@@ -841,7 +794,7 @@ static void retrigger_next_event(void *a
 	guard(raw_spinlock)(&base->lock);
 	hrtimer_update_base(base);
 	if (hrtimer_hres_active(base))
-		hrtimer_force_reprogram(base, 0);
+		hrtimer_force_reprogram(base, /* skip_equal */ false);
 	else
 		hrtimer_update_next_event(base);
 }
@@ -887,8 +840,7 @@ static void hrtimer_reprogram(struct hrt
 		timer_cpu_base->softirq_next_timer = timer;
 		timer_cpu_base->softirq_expires_next = expires;
 
-		if (!ktime_before(expires, timer_cpu_base->expires_next) ||
-		    !reprogram)
+		if (!ktime_before(expires, timer_cpu_base->expires_next) || !reprogram)
 			return;
 	}
 
@@ -914,8 +866,7 @@ static void hrtimer_reprogram(struct hrt
 	__hrtimer_reprogram(cpu_base, timer, expires);
 }
 
-static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
-			     unsigned int active)
+static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int active)
 {
 	struct hrtimer_clock_base *base;
 	unsigned int seq;
@@ -1050,11 +1001,8 @@ void hrtimers_resume_local(void)
 	retrigger_next_event(NULL);
 }
 
-/*
- * Counterpart to lock_hrtimer_base above:
- */
-static inline
-void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+/* Counterpart to lock_hrtimer_base above */
+static inline void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 	__releases(&timer->base->cpu_base->lock)
 {
 	raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
@@ -1071,7 +1019,7 @@ void unlock_hrtimer_base(const struct hr
  * .. note::
  *  This only updates the timer expiry value and does not requeue the timer.
  *
- * There is also a variant of the function hrtimer_forward_now().
+ * There is also a variant of this function: hrtimer_forward_now().
  *
  * Context: Can be safely called from the callback function of @timer. If called
  *          from other contexts @timer must neither be enqueued nor running the
@@ -1081,8 +1029,8 @@ void unlock_hrtimer_base(const struct hr
  */
 u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-	u64 orun = 1;
 	ktime_t delta;
+	u64 orun = 1;
 
 	delta = ktime_sub(now, hrtimer_get_expires(timer));
 
@@ -1118,13 +1066,15 @@ EXPORT_SYMBOL_GPL(hrtimer_forward);
  * enqueue_hrtimer - internal function to (re)start a timer
  *
  * The timer is inserted in expiry order. Insertion into the
- * red black tree is O(log(n)). Must hold the base lock.
+ * red black tree is O(log(n)).
  *
  * Returns true when the new timer is the leftmost timer in the tree.
  */
 static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
 			    enum hrtimer_mode mode, bool was_armed)
 {
+	lockdep_assert_held(&base->cpu_base->lock);
+
 	debug_activate(timer, mode, was_armed);
 	WARN_ON_ONCE(!base->cpu_base->online);
 
@@ -1139,20 +1089,19 @@ static bool enqueue_hrtimer(struct hrtim
 /*
  * __remove_hrtimer - internal function to remove a timer
  *
- * Caller must hold the base lock.
- *
  * High resolution timer mode reprograms the clock event device when the
  * timer is the one which expires next. The caller can disable this by setting
  * reprogram to zero. This is useful, when the context does a reprogramming
  * anyway (e.g. timer interrupt)
  */
-static void __remove_hrtimer(struct hrtimer *timer,
-			     struct hrtimer_clock_base *base,
-			     u8 newstate, int reprogram)
+static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+			     u8 newstate, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
 	u8 state = timer->state;
 
+	lockdep_assert_held(&cpu_base->lock);
+
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->state, newstate);
 	if (!(state & HRTIMER_STATE_ENQUEUED))
@@ -1162,26 +1111,25 @@ static void __remove_hrtimer(struct hrti
 		cpu_base->active_bases &= ~(1 << base->index);
 
 	/*
-	 * Note: If reprogram is false we do not update
-	 * cpu_base->next_timer. This happens when we remove the first
-	 * timer on a remote cpu. No harm as we never dereference
-	 * cpu_base->next_timer. So the worst thing what can happen is
-	 * an superfluous call to hrtimer_force_reprogram() on the
-	 * remote cpu later on if the same timer gets enqueued again.
+	 * If reprogram is false don't update cpu_base->next_timer and do not
+	 * touch the clock event device.
+	 *
+	 * This happens when removing the first timer on a remote CPU, which
+	 * will be handled by the remote CPU's interrupt. It also happens when
+	 * a local timer is removed to be immediately restarted. That's handled
+	 * at the call site.
 	 */
 	if (reprogram && timer == cpu_base->next_timer && !timer->is_lazy)
-		hrtimer_force_reprogram(cpu_base, 1);
+		hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
 }
 
-/*
- * remove hrtimer, called with base lock held
- */
-static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-	       bool restart, bool keep_local)
+static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+				 bool restart, bool keep_local)
 {
 	u8 state = timer->state;
 
+	lockdep_assert_held(&base->cpu_base->lock);
+
 	if (state & HRTIMER_STATE_ENQUEUED) {
 		bool reprogram;
 
@@ -1209,9 +1157,9 @@ remove_hrtimer(struct hrtimer *timer, st
 			reprogram &= !keep_local;
 
 		__remove_hrtimer(timer, base, state, reprogram);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
@@ -1230,34 +1178,27 @@ static inline ktime_t hrtimer_update_low
 	return tim;
 }
 
-static void
-hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
+static void hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
 {
-	ktime_t expires;
+	ktime_t expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
 
 	/*
-	 * Find the next SOFT expiration.
-	 */
-	expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
-
-	/*
-	 * reprogramming needs to be triggered, even if the next soft
-	 * hrtimer expires at the same time than the next hard
+	 * Reprogramming needs to be triggered, even if the next soft
+	 * hrtimer expires at the same time as the next hard
 	 * hrtimer. cpu_base->softirq_expires_next needs to be updated!
 	 */
 	if (expires == KTIME_MAX)
 		return;
 
 	/*
-	 * cpu_base->*next_timer is recomputed by __hrtimer_get_next_event()
-	 * cpu_base->*expires_next is only set by hrtimer_reprogram()
+	 * cpu_base->next_timer is recomputed by __hrtimer_get_next_event()
+	 * cpu_base->expires_next is only set by hrtimer_reprogram()
 	 */
 	hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
 }
 
-static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-				    u64 delta_ns, const enum hrtimer_mode mode,
-				    struct hrtimer_clock_base *base)
+static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *new_base;
@@ -1301,12 +1242,10 @@ static int __hrtimer_start_range_ns(stru
 	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
 	/* Switch the timer base, if necessary: */
-	if (!force_local) {
-		new_base = switch_hrtimer_base(timer, base,
-					       mode & HRTIMER_MODE_PINNED);
-	} else {
+	if (!force_local)
+		new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
+	else
 		new_base = base;
-	}
 
 	first = enqueue_hrtimer(timer, new_base, mode, was_armed);
 
@@ -1319,9 +1258,12 @@ static int __hrtimer_start_range_ns(stru
 
 	if (!force_local) {
 		/*
-		 * If the current CPU base is online, then the timer is
-		 * never queued on a remote CPU if it would be the first
-		 * expiring timer there.
+		 * If the current CPU base is online, then the timer is never
+		 * queued on a remote CPU if it would be the first expiring
+		 * timer there unless the timer callback is currently executed
+		 * on the remote CPU. In the latter case the remote CPU will
+		 * re-evaluate the first expiring timer after completing the
+		 * callbacks.
 		 */
 		if (hrtimer_base_is_online(this_cpu_base))
 			return first;
@@ -1336,7 +1278,7 @@ static int __hrtimer_start_range_ns(stru
 
 			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
 		}
-		return 0;
+		return false;
 	}
 
 	/*
@@ -1350,7 +1292,7 @@ static int __hrtimer_start_range_ns(stru
 	 */
 	if (timer->is_lazy) {
 		if (new_base->cpu_base->expires_next <= hrtimer_get_expires(timer))
-			return 0;
+			return false;
 	}
 
 	/*
@@ -1358,8 +1300,8 @@ static int __hrtimer_start_range_ns(stru
 	 * reprogramming on removal and enqueue. Force reprogram the
 	 * hardware by evaluating the new first expiring timer.
 	 */
-	hrtimer_force_reprogram(new_base->cpu_base, 1);
-	return 0;
+	hrtimer_force_reprogram(new_base->cpu_base, /* skip_equal */ true);
+	return false;
 }
 
 /**
@@ -1371,8 +1313,8 @@ static int __hrtimer_start_range_ns(stru
  *		relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
  *		softirq based mode is considered for debug purpose only!
  */
-void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-			    u64 delta_ns, const enum hrtimer_mode mode)
+void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+			    const enum hrtimer_mode mode)
 {
 	struct hrtimer_clock_base *base;
 	unsigned long flags;
@@ -1464,8 +1406,7 @@ static void hrtimer_cpu_base_unlock_expi
  * the timer callback to finish. Drop expiry_lock and reacquire it. That
  * allows the waiter to acquire the lock and make progress.
  */
-static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
-				      unsigned long flags)
+static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base, unsigned long flags)
 {
 	if (atomic_read(&cpu_base->timer_waiters)) {
 		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1530,14 +1471,10 @@ void hrtimer_cancel_wait_running(const s
 	spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
 }
 #else
-static inline void
-hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base,
-					     unsigned long flags) { }
+static inline void hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base, unsigned long fl) { }
 #endif
 
 /**
@@ -1668,8 +1605,7 @@ ktime_t hrtimer_cb_get_time(const struct
 }
 EXPORT_SYMBOL_GPL(hrtimer_cb_get_time);
 
-static void __hrtimer_setup(struct hrtimer *timer,
-			    enum hrtimer_restart (*function)(struct hrtimer *),
+static void __hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*fn)(struct hrtimer *),
 			    clockid_t clock_id, enum hrtimer_mode mode)
 {
 	bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
@@ -1705,10 +1641,10 @@ static void __hrtimer_setup(struct hrtim
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
 
-	if (WARN_ON_ONCE(!function))
+	if (WARN_ON_ONCE(!fn))
 		ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
 	else
-		ACCESS_PRIVATE(timer, function) = function;
+		ACCESS_PRIVATE(timer, function) = fn;
 }
 
 /**
@@ -1767,12 +1703,10 @@ bool hrtimer_active(const struct hrtimer
 		base = READ_ONCE(timer->base);
 		seq = raw_read_seqcount_begin(&base->seq);
 
-		if (timer->state != HRTIMER_STATE_INACTIVE ||
-		    base->running == timer)
+		if (timer->state != HRTIMER_STATE_INACTIVE || base->running == timer)
 			return true;
 
-	} while (read_seqcount_retry(&base->seq, seq) ||
-		 base != READ_ONCE(timer->base));
+	} while (read_seqcount_retry(&base->seq, seq) || base != READ_ONCE(timer->base));
 
 	return false;
 }
@@ -1795,11 +1729,9 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
  * a false negative if the read side got smeared over multiple consecutive
  * __run_hrtimer() invocations.
  */
-
-static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
-			  struct hrtimer_clock_base *base,
-			  struct hrtimer *timer, ktime_t *now,
-			  unsigned long flags) __must_hold(&cpu_base->lock)
+static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_clock_base *base,
+			  struct hrtimer *timer, ktime_t *now, unsigned long flags)
+	__must_hold(&cpu_base->lock)
 {
 	enum hrtimer_restart (*fn)(struct hrtimer *);
 	bool expires_in_hardirq;
@@ -1819,7 +1751,7 @@ static void __run_hrtimer(struct hrtimer
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
-	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
+	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, false);
 	fn = ACCESS_PRIVATE(timer, function);
 
 	/*
@@ -1854,8 +1786,7 @@ static void __run_hrtimer(struct hrtimer
 	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
 	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART &&
-	    !(timer->state & HRTIMER_STATE_ENQUEUED))
+	if (restart != HRTIMER_NORESTART && !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
 
 	/*
@@ -1874,8 +1805,8 @@ static void __run_hrtimer(struct hrtimer
 static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 				 unsigned long flags, unsigned int active_mask)
 {
-	struct hrtimer_clock_base *base;
 	unsigned int active = cpu_base->active_bases & active_mask;
+	struct hrtimer_clock_base *base;
 
 	for_each_active_base(base, cpu_base, active) {
 		struct timerqueue_node *node;
@@ -1951,11 +1882,10 @@ void hrtimer_interrupt(struct clock_even
 retry:
 	cpu_base->in_hrtirq = 1;
 	/*
-	 * We set expires_next to KTIME_MAX here with cpu_base->lock
-	 * held to prevent that a timer is enqueued in our queue via
-	 * the migration code. This does not affect enqueueing of
-	 * timers which run their callback and need to be requeued on
-	 * this CPU.
+	 * Set expires_next to KTIME_MAX, which prevents remote CPUs from
+	 * queueing timers while __hrtimer_run_queues() is expiring the clock bases.
+	 * Timers which are re/enqueued on the local CPU are not affected by
+	 * this.
 	 */
 	cpu_base->expires_next = KTIME_MAX;
 
@@ -2069,8 +1999,7 @@ void hrtimer_run_queues(void)
  */
 static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
 {
-	struct hrtimer_sleeper *t =
-		container_of(timer, struct hrtimer_sleeper, timer);
+	struct hrtimer_sleeper *t = container_of(timer, struct hrtimer_sleeper, timer);
 	struct task_struct *task = t->task;
 
 	t->task = NULL;
@@ -2088,8 +2017,7 @@ static enum hrtimer_restart hrtimer_wake
  * Wrapper around hrtimer_start_expires() for hrtimer_sleeper based timers
  * to allow PREEMPT_RT to tweak the delivery mode (soft/hardirq context)
  */
-void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
-				   enum hrtimer_mode mode)
+void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl, enum hrtimer_mode mode)
 {
 	/*
 	 * Make the enqueue delivery mode check work on RT. If the sleeper
@@ -2105,8 +2033,8 @@ void hrtimer_sleeper_start_expires(struc
 }
 EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires);
 
-static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
-				    clockid_t clock_id, enum hrtimer_mode mode)
+static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl, clockid_t clock_id,
+				    enum hrtimer_mode mode)
 {
 	/*
 	 * On PREEMPT_RT enabled kernels hrtimers which are not explicitly
@@ -2142,8 +2070,8 @@ static void __hrtimer_setup_sleeper(stru
  * @clock_id:	the clock to be used
  * @mode:	timer mode abs/rel
  */
-void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl,
-				    clockid_t clock_id, enum hrtimer_mode mode)
+void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl, clockid_t clock_id,
+				    enum hrtimer_mode mode)
 {
 	debug_setup_on_stack(&sl->timer, clock_id, mode);
 	__hrtimer_setup_sleeper(sl, clock_id, mode);
@@ -2216,8 +2144,7 @@ static long __sched hrtimer_nanosleep_re
 	return ret;
 }
 
-long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
-		       const clockid_t clockid)
+long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode, const clockid_t clockid)
 {
 	struct restart_block *restart;
 	struct hrtimer_sleeper t;
@@ -2260,8 +2187,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kern
 	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
 	current->restart_block.nanosleep.rmtp = rmtp;
-	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
-				 CLOCK_MONOTONIC);
+	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
 }
 
 #endif
@@ -2269,7 +2195,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kern
 #ifdef CONFIG_COMPAT_32BIT_TIME
 
 SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
-		       struct old_timespec32 __user *, rmtp)
+		struct old_timespec32 __user *, rmtp)
 {
 	struct timespec64 tu;
 
@@ -2282,8 +2208,7 @@ SYSCALL_DEFINE2(nanosleep_time32, struct
 	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
 	current->restart_block.nanosleep.compat_rmtp = rmtp;
-	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
-				 CLOCK_MONOTONIC);
+	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
 }
 #endif
 
@@ -2293,9 +2218,8 @@ SYSCALL_DEFINE2(nanosleep_time32, struct
 int hrtimers_prepare_cpu(unsigned int cpu)
 {
 	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
-	int i;
 
-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+	for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		struct hrtimer_clock_base *clock_b = &cpu_base->clock_base[i];
 
 		clock_b->cpu_base = cpu_base;
@@ -2329,8 +2253,8 @@ int hrtimers_cpu_starting(unsigned int c
 static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 				struct hrtimer_clock_base *new_base)
 {
-	struct hrtimer *timer;
 	struct timerqueue_node *node;
+	struct hrtimer *timer;
 
 	while ((node = timerqueue_getnext(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
@@ -2342,7 +2266,7 @@ static void migrate_hrtimer_list(struct
 		 * timer could be seen as !active and just vanish away
 		 * under us on another CPU
 		 */
-		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, false);
 		timer->base = new_base;
 		/*
 		 * Enqueue the timers on the new cpu. This does not
@@ -2358,7 +2282,7 @@ static void migrate_hrtimer_list(struct
 
 int hrtimers_cpu_dying(unsigned int dying_cpu)
 {
-	int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+	int ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
 	struct hrtimer_cpu_base *old_base, *new_base;
 
 	old_base = this_cpu_ptr(&hrtimer_bases);
@@ -2371,10 +2295,8 @@ int hrtimers_cpu_dying(unsigned int dyin
 	raw_spin_lock(&old_base->lock);
 	raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
 
-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		migrate_hrtimer_list(&old_base->clock_base[i],
-				     &new_base->clock_base[i]);
-	}
+	for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+		migrate_hrtimer_list(&old_base->clock_base[i], &new_base->clock_base[i]);
 
 	/* Tell the other CPU to retrigger the next event */
 	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 25/48] hrtimer: Evaluate timer expiry only once
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (23 preceding siblings ...)
  2026-02-24 16:37 ` [patch 24/48] hrtimer: Cleanup coding style and comments Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 26/48] hrtimer: Replace the bitfield in hrtimer_cpu_base Thomas Gleixner
                   ` (24 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

No point in accessing the timer twice.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -810,10 +810,11 @@ static void hrtimer_reprogram(struct hrt
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *base = timer->base;
-	ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
+	ktime_t expires = hrtimer_get_expires(timer);
 
-	WARN_ON_ONCE(hrtimer_get_expires(timer) < 0);
+	WARN_ON_ONCE(expires < 0);
 
+	expires = ktime_sub(expires, base->offset);
 	/*
 	 * CLOCK_REALTIME timer might be requested with an absolute
 	 * expiry time which is less than base->offset. Set it to 0.



* [patch 26/48] hrtimer: Replace the bitfield in hrtimer_cpu_base
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (24 preceding siblings ...)
  2026-02-24 16:37 ` [patch 25/48] hrtimer: Evaluate timer expiry only once Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 27/48] hrtimer: Convert state and properties to boolean Thomas Gleixner
                   ` (23 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Use bool for the various flags as that creates better code in the hot path.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer_defs.h |   10 +++++-----
 kernel/time/hrtimer.c        |   25 +++++++++++++------------
 2 files changed, 18 insertions(+), 17 deletions(-)

--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -83,11 +83,11 @@ struct hrtimer_cpu_base {
 	unsigned int			cpu;
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
-	unsigned int			hres_active		: 1,
-					in_hrtirq		: 1,
-					hang_detected		: 1,
-					softirq_activated       : 1,
-					online			: 1;
+	bool				hres_active;
+	bool				in_hrtirq;
+	bool				hang_detected;
+	bool				softirq_activated;
+	bool				online;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	unsigned int			nr_events;
 	unsigned short			nr_retries;
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -741,7 +741,7 @@ static void hrtimer_switch_to_hres(void)
 		pr_warn("Could not switch to high resolution mode on CPU %u\n",	base->cpu);
 		return;
 	}
-	base->hres_active = 1;
+	base->hres_active = true;
 	hrtimer_resolution = HIGH_RES_NSEC;
 
 	tick_setup_sched_timer(true);
@@ -1854,7 +1854,7 @@ static __latent_entropy void hrtimer_run
 	now = hrtimer_update_base(cpu_base);
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);
 
-	cpu_base->softirq_activated = 0;
+	cpu_base->softirq_activated = false;
 	hrtimer_update_softirq_timer(cpu_base, true);
 
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1881,7 +1881,7 @@ void hrtimer_interrupt(struct clock_even
 	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	entry_time = now = hrtimer_update_base(cpu_base);
 retry:
-	cpu_base->in_hrtirq = 1;
+	cpu_base->in_hrtirq = true;
 	/*
 	 * Set expires_next to KTIME_MAX, which prevents remote CPUs from
 	 * queueing timers while __hrtimer_run_queues() is expiring the clock bases.
@@ -1892,7 +1892,7 @@ void hrtimer_interrupt(struct clock_even
 
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
-		cpu_base->softirq_activated = 1;
+		cpu_base->softirq_activated = true;
 		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
@@ -1905,12 +1905,12 @@ void hrtimer_interrupt(struct clock_even
 	 * against it.
 	 */
 	cpu_base->expires_next = expires_next;
-	cpu_base->in_hrtirq = 0;
+	cpu_base->in_hrtirq = false;
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 
 	/* Reprogramming necessary ? */
 	if (!tick_program_event(expires_next, 0)) {
-		cpu_base->hang_detected = 0;
+		cpu_base->hang_detected = false;
 		return;
 	}
 
@@ -1939,7 +1939,7 @@ void hrtimer_interrupt(struct clock_even
 	 * time away.
 	 */
 	cpu_base->nr_hangs++;
-	cpu_base->hang_detected = 1;
+	cpu_base->hang_detected = true;
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 
 	delta = ktime_sub(now, entry_time);
@@ -1987,7 +1987,7 @@ void hrtimer_run_queues(void)
 
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
-		cpu_base->softirq_activated = 1;
+		cpu_base->softirq_activated = true;
 		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
@@ -2239,13 +2239,14 @@ int hrtimers_cpu_starting(unsigned int c
 
 	/* Clear out any left over state from a CPU down operation */
 	cpu_base->active_bases = 0;
-	cpu_base->hres_active = 0;
-	cpu_base->hang_detected = 0;
+	cpu_base->hres_active = false;
+	cpu_base->hang_detected = false;
 	cpu_base->next_timer = NULL;
 	cpu_base->softirq_next_timer = NULL;
 	cpu_base->expires_next = KTIME_MAX;
 	cpu_base->softirq_expires_next = KTIME_MAX;
-	cpu_base->online = 1;
+	cpu_base->softirq_activated = false;
+	cpu_base->online = true;
 	return 0;
 }
 
@@ -2303,7 +2304,7 @@ int hrtimers_cpu_dying(unsigned int dyin
 	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
 
 	raw_spin_unlock(&new_base->lock);
-	old_base->online = 0;
+	old_base->online = false;
 	raw_spin_unlock(&old_base->lock);
 
 	return 0;



* [patch 27/48] hrtimer: Convert state and properties to boolean
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (25 preceding siblings ...)
  2026-02-24 16:37 ` [patch 26/48] hrtimer: Replace the bitfield in hrtimer_cpu_base Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 28/48] hrtimer: Optimize for local timers Thomas Gleixner
                   ` (22 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

All 'u8' flags are true booleans, so make it entirely clear that these can
only contain true or false.

This is especially true for hrtimer::state, which is a historical leftover
still manipulated with bitwise operations. The early hrtimer implementation
used several state bits, but the state was later reduced to a boolean. That
conversion failed to replace the bit OR and bit test operations all over the
place, which creates suboptimal code. As of today 'state' is a misnomer
because its only purpose is to reflect whether the timer is enqueued in the
RB-tree or not. Rename it to 'is_queued' and make all operations on it
boolean.

This reduces text size from 8926 to 8732 bytes.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer.h       |   31 +---------------------
 include/linux/hrtimer_types.h |   12 ++++----
 kernel/time/hrtimer.c         |   58 ++++++++++++++++++++++++++++--------------
 kernel/time/timer_list.c      |    2 -
 4 files changed, 49 insertions(+), 54 deletions(-)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -63,33 +63,6 @@ enum hrtimer_mode {
 	HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
 };
 
-/*
- * Values to track state of the timer
- *
- * Possible states:
- *
- * 0x00		inactive
- * 0x01		enqueued into rbtree
- *
- * The callback state is not part of the timer->state because clearing it would
- * mean touching the timer after the callback, this makes it impossible to free
- * the timer from the callback function.
- *
- * Therefore we track the callback state in:
- *
- *	timer->base->cpu_base->running == timer
- *
- * On SMP it is possible to have a "callback function running and enqueued"
- * status. It happens for example when a posix timer expired and the callback
- * queued a signal. Between dropping the lock which protects the posix timer
- * and reacquiring the base lock of the hrtimer, another CPU can deliver the
- * signal and rearm the timer.
- *
- * All state transitions are protected by cpu_base->lock.
- */
-#define HRTIMER_STATE_INACTIVE	0x00
-#define HRTIMER_STATE_ENQUEUED	0x01
-
 /**
  * struct hrtimer_sleeper - simple sleeper structure
  * @timer:	embedded timer structure
@@ -300,8 +273,8 @@ extern bool hrtimer_active(const struct
  */
 static inline bool hrtimer_is_queued(struct hrtimer *timer)
 {
-	/* The READ_ONCE pairs with the update functions of timer->state */
-	return !!(READ_ONCE(timer->state) & HRTIMER_STATE_ENQUEUED);
+	/* The READ_ONCE pairs with the update functions of timer->is_queued */
+	return READ_ONCE(timer->is_queued);
 }
 
 /*
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -28,7 +28,7 @@ enum hrtimer_restart {
  *		was armed.
  * @function:	timer expiry callback function
  * @base:	pointer to the timer base (per cpu and per clock)
- * @state:	state information (See bit values above)
+ * @is_queued:	Indicates whether a timer is enqueued or not
  * @is_rel:	Set if the timer was armed relative
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
  * @is_hard:	Set if hrtimer will be expired in hard interrupt context
@@ -43,11 +43,11 @@ struct hrtimer {
 	ktime_t				_softexpires;
 	enum hrtimer_restart		(*__private function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
-	u8				state;
-	u8				is_rel;
-	u8				is_soft;
-	u8				is_hard;
-	u8				is_lazy;
+	bool				is_queued;
+	bool				is_rel;
+	bool				is_soft;
+	bool				is_hard;
+	bool				is_lazy;
 };
 
 #endif /* _LINUX_HRTIMER_TYPES_H */
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -50,6 +50,28 @@
 #include "tick-internal.h"
 
 /*
+ * Constants to set the queued state of the timer (INACTIVE, ENQUEUED)
+ *
+ * The callback state is kept separate in the CPU base because having it in
+ * the timer would require touching the timer after the callback, which
+ * makes it impossible to free the timer from the callback function.
+ *
+ * Therefore we track the callback state in:
+ *
+ *	timer->base->cpu_base->running == timer
+ *
+ * On SMP it is possible to have a "callback function running and enqueued"
+ * status. It happens for example when a posix timer expired and the callback
+ * queued a signal. Between dropping the lock which protects the posix timer
+ * and reacquiring the base lock of the hrtimer, another CPU can deliver the
+ * signal and rearm the timer.
+ *
+ * All state transitions are protected by cpu_base->lock.
+ */
+#define HRTIMER_STATE_INACTIVE	false
+#define HRTIMER_STATE_ENQUEUED	true
+
+/*
  * The resolution of the clocks. The resolution value is returned in
  * the clock_getres() system call to give application programmers an
  * idea of the (in)accuracy of timers. Timer values are rounded up to
@@ -1038,7 +1060,7 @@ u64 hrtimer_forward(struct hrtimer *time
 	if (delta < 0)
 		return 0;
 
-	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+	if (WARN_ON(timer->is_queued))
 		return 0;
 
 	if (interval < hrtimer_resolution)
@@ -1082,7 +1104,7 @@ static bool enqueue_hrtimer(struct hrtim
 	base->cpu_base->active_bases |= 1 << base->index;
 
 	/* Pairs with the lockless read in hrtimer_is_queued() */
-	WRITE_ONCE(timer->state, HRTIMER_STATE_ENQUEUED);
+	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
 	return timerqueue_add(&base->active, &timer->node);
 }
@@ -1096,18 +1118,18 @@ static bool enqueue_hrtimer(struct hrtim
  * anyway (e.g. timer interrupt)
  */
 static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-			     u8 newstate, bool reprogram)
+			     bool newstate, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
-	u8 state = timer->state;
 
 	lockdep_assert_held(&cpu_base->lock);
 
-	/* Pairs with the lockless read in hrtimer_is_queued() */
-	WRITE_ONCE(timer->state, newstate);
-	if (!(state & HRTIMER_STATE_ENQUEUED))
+	if (!timer->is_queued)
 		return;
 
+	/* Pairs with the lockless read in hrtimer_is_queued() */
+	WRITE_ONCE(timer->is_queued, newstate);
+
 	if (!timerqueue_del(&base->active, &timer->node))
 		cpu_base->active_bases &= ~(1 << base->index);
 
@@ -1127,11 +1149,11 @@ static void __remove_hrtimer(struct hrti
 static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
 				 bool restart, bool keep_local)
 {
-	u8 state = timer->state;
+	bool queued_state = timer->is_queued;
 
 	lockdep_assert_held(&base->cpu_base->lock);
 
-	if (state & HRTIMER_STATE_ENQUEUED) {
+	if (queued_state) {
 		bool reprogram;
 
 		debug_hrtimer_deactivate(timer);
@@ -1153,11 +1175,11 @@ static inline bool remove_hrtimer(struct
 		 * and a moment later when it's requeued).
 		 */
 		if (!restart)
-			state = HRTIMER_STATE_INACTIVE;
+			queued_state = HRTIMER_STATE_INACTIVE;
 		else
 			reprogram &= !keep_local;
 
-		__remove_hrtimer(timer, base, state, reprogram);
+		__remove_hrtimer(timer, base, queued_state, reprogram);
 		return true;
 	}
 	return false;
@@ -1704,7 +1726,7 @@ bool hrtimer_active(const struct hrtimer
 		base = READ_ONCE(timer->base);
 		seq = raw_read_seqcount_begin(&base->seq);
 
-		if (timer->state != HRTIMER_STATE_INACTIVE || base->running == timer)
+		if (timer->is_queued || base->running == timer)
 			return true;
 
 	} while (read_seqcount_retry(&base->seq, seq) || base != READ_ONCE(timer->base));
@@ -1721,7 +1743,7 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
 *  - callback:	the timer is being run
  *  - post:	the timer is inactive or (re)queued
  *
- * On the read side we ensure we observe timer->state and cpu_base->running
+ * On the read side we ensure we observe timer->is_queued and cpu_base->running
  * from the same section, if anything changed while we looked at it, we retry.
  * This includes timer->base changing because sequence numbers alone are
  * insufficient for that.
@@ -1744,11 +1766,11 @@ static void __run_hrtimer(struct hrtimer
 	base->running = timer;
 
 	/*
-	 * Separate the ->running assignment from the ->state assignment.
+	 * Separate the ->running assignment from the ->is_queued assignment.
 	 *
 	 * As with a regular write barrier, this ensures the read side in
 	 * hrtimer_active() cannot observe base->running == NULL &&
-	 * timer->state == INACTIVE.
+	 * timer->is_queued == INACTIVE.
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
@@ -1787,15 +1809,15 @@ static void __run_hrtimer(struct hrtimer
 	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
 	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART && !(timer->state & HRTIMER_STATE_ENQUEUED))
+	if (restart == HRTIMER_RESTART && !timer->is_queued)
 		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
 
 	/*
-	 * Separate the ->running assignment from the ->state assignment.
+	 * Separate the ->running assignment from the ->is_queued assignment.
 	 *
 	 * As with a regular write barrier, this ensures the read side in
 	 * hrtimer_active() cannot observe base->running.timer == NULL &&
-	 * timer->state == INACTIVE.
+	 * timer->is_queued == INACTIVE.
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -47,7 +47,7 @@ print_timer(struct seq_file *m, struct h
 	    int idx, u64 now)
 {
 	SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function));
-	SEQ_printf(m, ", S:%02x", timer->state);
+	SEQ_printf(m, ", S:%02x", timer->is_queued);
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n",
 		(unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)),



* [patch 28/48] hrtimer: Optimize for local timers
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (26 preceding siblings ...)
  2026-02-24 16:37 ` [patch 27/48] hrtimer: Convert state and properties to boolean Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 29/48] hrtimer: Use NOHZ information for locality Thomas Gleixner
                   ` (21 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The decision whether to keep timers on the local CPU or on the CPU they are
associated to is suboptimal and causes the expensive switch_hrtimer_base()
mechanism to be invoked more often than necessary. This is especially true for
pinned timers.

Rewrite the decision logic so that the current base is kept if:

   1) The callback is running on the base

   2) The timer is associated to the local CPU and is the first expiring
      timer, as that allows reprogramming of the clockevent device to be
      avoided

   3) The timer is associated to the local CPU and pinned

   4) The timer is associated to the local CPU and timer migration is
      disabled.

Only #2 was covered by the original code, but especially #3 makes a
difference for high frequency rearming timers like the scheduler hrtick
timer. If timer migration is disabled, then #4 avoids most of the base
switches.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |  101 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 65 insertions(+), 36 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1147,7 +1147,7 @@ static void __remove_hrtimer(struct hrti
 }
 
 static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-				 bool restart, bool keep_local)
+				 bool restart, bool keep_base)
 {
 	bool queued_state = timer->is_queued;
 
@@ -1177,7 +1177,7 @@ static inline bool remove_hrtimer(struct
 		if (!restart)
 			queued_state = HRTIMER_STATE_INACTIVE;
 		else
-			reprogram &= !keep_local;
+			reprogram &= !keep_base;
 
 		__remove_hrtimer(timer, base, queued_state, reprogram);
 		return true;
@@ -1220,29 +1220,57 @@ static void hrtimer_update_softirq_timer
 	hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+	if (static_branch_likely(&timers_migration_enabled)) {
+		/*
+		 * If it is local and the first expiring timer keep it on the local
+		 * CPU to optimize reprogramming of the clockevent device. Also
+		 * avoid switch_hrtimer_base() overhead when local and pinned.
+		 */
+		if (!is_local)
+			return false;
+		return is_first || is_pinned;
+	}
+	return is_local;
+}
+#else
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+	return is_local;
+}
+#endif
+
+static inline bool hrtimer_keep_base(struct hrtimer *timer, bool is_local, bool is_first,
+				     bool is_pinned)
+{
+	/* If the timer is running the callback it has to stay on its CPU base. */
+	if (unlikely(timer->base->running == timer))
+		return true;
+
+	return hrtimer_prefer_local(is_local, is_first, is_pinned);
+}
+
 static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
 				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
-	struct hrtimer_clock_base *new_base;
-	bool force_local, first, was_armed;
+	bool is_pinned, first, was_first, was_armed, keep_base = false;
+	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
 
-	/*
-	 * If the timer is on the local cpu base and is the first expiring
-	 * timer then this might end up reprogramming the hardware twice
-	 * (on removal and on enqueue). To avoid that prevent the reprogram
-	 * on removal, keep the timer local to the current CPU and enforce
-	 * reprogramming after it is queued no matter whether it is the new
-	 * first expiring timer again or not.
-	 */
-	force_local = base->cpu_base == this_cpu_base;
-	force_local &= base->cpu_base->next_timer == timer;
+	was_first = cpu_base->next_timer == timer;
+	is_pinned = !!(mode & HRTIMER_MODE_PINNED);
 
 	/*
-	 * Don't force local queuing if this enqueue happens on a unplugged
-	 * CPU after hrtimer_cpu_dying() has been invoked.
+	 * Don't keep it local if this enqueue happens on an unplugged CPU
+	 * after hrtimer_cpu_dying() has been invoked.
 	 */
-	force_local &= this_cpu_base->online;
+	if (likely(this_cpu_base->online)) {
+		bool is_local = cpu_base == this_cpu_base;
+
+		keep_base = hrtimer_keep_base(timer, is_local, was_first, is_pinned);
+	}
 
 	/*
 	 * Remove an active timer from the queue. In case it is not queued
@@ -1254,8 +1282,11 @@ static bool __hrtimer_start_range_ns(str
 	 * reprogramming later if it was the first expiring timer.  This
 	 * avoids programming the underlying clock event twice (once at
 	 * removal and once after enqueue).
+	 *
+	 * @keep_base is also true if the timer callback is running on a
+	 * remote CPU and for local pinned timers.
 	 */
-	was_armed = remove_hrtimer(timer, base, true, force_local);
+	was_armed = remove_hrtimer(timer, base, true, keep_base);
 
 	if (mode & HRTIMER_MODE_REL)
 		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
@@ -1265,21 +1296,21 @@ static bool __hrtimer_start_range_ns(str
 	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
 	/* Switch the timer base, if necessary: */
-	if (!force_local)
-		new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
-	else
-		new_base = base;
+	if (!keep_base) {
+		base = switch_hrtimer_base(timer, base, is_pinned);
+		cpu_base = base->cpu_base;
+	}
 
-	first = enqueue_hrtimer(timer, new_base, mode, was_armed);
+	first = enqueue_hrtimer(timer, base, mode, was_armed);
 
 	/*
 	 * If the hrtimer interrupt is running, then it will reevaluate the
 	 * clock bases and reprogram the clock event device.
 	 */
-	if (new_base->cpu_base->in_hrtirq)
+	if (cpu_base->in_hrtirq)
 		return false;
 
-	if (!force_local) {
+	if (!was_first || cpu_base != this_cpu_base) {
 		/*
 		 * If the current CPU base is online, then the timer is never
 		 * queued on a remote CPU if it would be the first expiring
@@ -1288,7 +1319,7 @@ static bool __hrtimer_start_range_ns(str
 		 * re-evaluate the first expiring timer after completing the
 		 * callbacks.
 		 */
-		if (hrtimer_base_is_online(this_cpu_base))
+		if (likely(hrtimer_base_is_online(this_cpu_base)))
 			return first;
 
 		/*
@@ -1296,11 +1327,8 @@ static bool __hrtimer_start_range_ns(str
 		 * already offline. If the timer is the first to expire,
 		 * kick the remote CPU to reprogram the clock event.
 		 */
-		if (first) {
-			struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
-
-			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
-		}
+		if (first)
+			smp_call_function_single_async(cpu_base->cpu, &cpu_base->csd);
 		return false;
 	}
 
@@ -1314,16 +1342,17 @@ static bool __hrtimer_start_range_ns(str
 	 * required.
 	 */
 	if (timer->is_lazy) {
-		if (new_base->cpu_base->expires_next <= hrtimer_get_expires(timer))
+		if (cpu_base->expires_next <= hrtimer_get_expires(timer))
 			return false;
 	}
 
 	/*
-	 * Timer was forced to stay on the current CPU to avoid
-	 * reprogramming on removal and enqueue. Force reprogram the
-	 * hardware by evaluating the new first expiring timer.
+	 * Timer was the first expiring timer and forced to stay on the
+	 * current CPU to avoid reprogramming on removal and enqueue. Force
+	 * reprogram the hardware by evaluating the new first expiring
+	 * timer.
 	 */
-	hrtimer_force_reprogram(new_base->cpu_base, /* skip_equal */ true);
+	hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
 	return false;
 }
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 29/48] hrtimer: Use NOHZ information for locality
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (27 preceding siblings ...)
  2026-02-24 16:37 ` [patch 28/48] hrtimer: Optimize for local timers Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 30/48] hrtimer: Separate remove/enqueue handling for local timers Thomas Gleixner
                   ` (20 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The decision to keep a timer which is associated to the local CPU on that
CPU does not take NOHZ information into account. As a result there are a
lot of hrtimer base switch invocations which end up not switching the base
and keep the timer on the local CPU. That's just work for nothing and can
be further improved.

If the local CPU is part of the KERNEL_NOISE housekeeping mask, then check:

  1) Whether the local CPU has the tick running, which means it is
     either not idle or already expecting a timer soon.

  2) Whether the tick is stopped and need_resched() is set, which
     means the CPU is about to exit idle.

This significantly reduces the number of hrtimer base switch attempts which
end up on the local CPU anyway, and prepares for further optimizations.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1231,7 +1231,18 @@ static __always_inline bool hrtimer_pref
 		 */
 		if (!is_local)
 			return false;
-		return is_first || is_pinned;
+		if (is_first || is_pinned)
+			return true;
+
+		/* Honour the NOHZ full restrictions */
+		if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+			return false;
+
+		/*
+		 * If the tick is not stopped or need_resched() is set, then
+		 * there is no point in moving the timer somewhere else.
+		 */
+		return !tick_nohz_tick_stopped() || need_resched();
 	}
 	return is_local;
 }


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 30/48] hrtimer: Separate remove/enqueue handling for local timers
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (28 preceding siblings ...)
  2026-02-24 16:37 ` [patch 29/48] hrtimer: Use NOHZ information for locality Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 31/48] hrtimer: Add hrtimer_rearm tracepoint Thomas Gleixner
                   ` (19 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

As the base switch can be avoided completely when the base stays the same,
the remove/enqueue handling can be streamlined.

Split it out into a separate function which handles both in one go, which
is way more efficient and makes the code simpler to follow.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |   72 +++++++++++++++++++++++++++++---------------------
 1 file changed, 43 insertions(+), 29 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1147,13 +1147,11 @@ static void __remove_hrtimer(struct hrti
 }
 
 static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-				 bool restart, bool keep_base)
+				  bool newstate)
 {
-	bool queued_state = timer->is_queued;
-
 	lockdep_assert_held(&base->cpu_base->lock);
 
-	if (queued_state) {
+	if (timer->is_queued) {
 		bool reprogram;
 
 		debug_hrtimer_deactivate(timer);
@@ -1168,23 +1166,35 @@ static inline bool remove_hrtimer(struct
 		 */
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
-		/*
-		 * If the timer is not restarted then reprogramming is
-		 * required if the timer is local. If it is local and about
-		 * to be restarted, avoid programming it twice (on removal
-		 * and a moment later when it's requeued).
-		 */
-		if (!restart)
-			queued_state = HRTIMER_STATE_INACTIVE;
-		else
-			reprogram &= !keep_base;
-
-		__remove_hrtimer(timer, base, queued_state, reprogram);
+		__remove_hrtimer(timer, base, newstate, reprogram);
 		return true;
 	}
 	return false;
 }
 
+static inline bool
+remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+			     const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
+{
+	/* Remove it from the timer queue if active */
+	if (timer->is_queued) {
+		debug_hrtimer_deactivate(timer);
+		timerqueue_del(&base->active, &timer->node);
+	}
+
+	/* Set the new expiry time */
+	hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+
+	debug_activate(timer, mode, timer->is_queued);
+	base->cpu_base->active_bases |= 1 << base->index;
+
+	/* Pairs with the lockless read in hrtimer_is_queued() */
+	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
+
+	/* Returns true if this is the first expiring timer */
+	return timerqueue_add(&base->active, &timer->node);
+}
+
 static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
 					    const enum hrtimer_mode mode)
 {
@@ -1267,7 +1277,7 @@ static bool __hrtimer_start_range_ns(str
 				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
-	bool is_pinned, first, was_first, was_armed, keep_base = false;
+	bool is_pinned, first, was_first, keep_base = false;
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
 
 	was_first = cpu_base->next_timer == timer;
@@ -1283,6 +1293,12 @@ static bool __hrtimer_start_range_ns(str
 		keep_base = hrtimer_keep_base(timer, is_local, was_first, is_pinned);
 	}
 
+	/* Calculate absolute expiry time for relative timers */
+	if (mode & HRTIMER_MODE_REL)
+		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
+	/* Compensate for low resolution granularity */
+	tim = hrtimer_update_lowres(timer, tim, mode);
+
 	/*
 	 * Remove an active timer from the queue. In case it is not queued
 	 * on the current CPU, make sure that remove_hrtimer() updates the
@@ -1297,22 +1313,20 @@ static bool __hrtimer_start_range_ns(str
 	 * @keep_base is also true if the timer callback is running on a
 	 * remote CPU and for local pinned timers.
 	 */
-	was_armed = remove_hrtimer(timer, base, true, keep_base);
-
-	if (mode & HRTIMER_MODE_REL)
-		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
-
-	tim = hrtimer_update_lowres(timer, tim, mode);
+	if (likely(keep_base)) {
+		first = remove_and_enqueue_same_base(timer, base, mode, tim, delta_ns);
+	} else {
+		/* Keep the ENQUEUED state in case it is queued */
+		bool was_armed = remove_hrtimer(timer, base, HRTIMER_STATE_ENQUEUED);
 
-	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
+		hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
-	/* Switch the timer base, if necessary: */
-	if (!keep_base) {
+		/* Switch the timer base, if necessary: */
 		base = switch_hrtimer_base(timer, base, is_pinned);
 		cpu_base = base->cpu_base;
-	}
 
-	first = enqueue_hrtimer(timer, base, mode, was_armed);
+		first = enqueue_hrtimer(timer, base, mode, was_armed);
+	}
 
 	/*
 	 * If the hrtimer interrupt is running, then it will reevaluate the
@@ -1432,7 +1446,7 @@ int hrtimer_try_to_cancel(struct hrtimer
 	base = lock_hrtimer_base(timer, &flags);
 
 	if (!hrtimer_callback_running(timer)) {
-		ret = remove_hrtimer(timer, base, false, false);
+		ret = remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
 		if (ret)
 			trace_hrtimer_cancel(timer);
 	}


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 31/48] hrtimer: Add hrtimer_rearm tracepoint
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (29 preceding siblings ...)
  2026-02-24 16:37 ` [patch 30/48] hrtimer: Separate remove/enqueue handling for local timers Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 32/48] hrtimer: Re-arrange hrtimer_interrupt() Thomas Gleixner
                   ` (18 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Analyzing the reprogramming of the clock event device is essential to debug
the behaviour of the hrtimer subsystem especially with the upcoming
deferred rearming scheme.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/trace/events/timer.h |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -325,6 +325,30 @@ DEFINE_EVENT(hrtimer_class, hrtimer_canc
 );
 
 /**
+ * hrtimer_rearm - Invoked when the clockevent device is rearmed
+ * @next_event:	The next expiry time (CLOCK_MONOTONIC)
+ */
+TRACE_EVENT(hrtimer_rearm,
+
+	TP_PROTO(ktime_t next_event, bool deferred),
+
+	TP_ARGS(next_event, deferred),
+
+	TP_STRUCT__entry(
+		__field( s64,		next_event	)
+		__field( bool,		deferred	)
+	),
+
+	TP_fast_assign(
+		__entry->next_event	= next_event;
+		__entry->deferred	= deferred;
+	),
+
+	TP_printk("next_event=%llu deferred=%d",
+		  (unsigned long long) __entry->next_event, __entry->deferred)
+);
+
+/**
  * itimer_state - called when itimer is started or canceled
  * @which:	name of the interval timer
  * @value:	the itimers value, itimer is canceled if value->it_value is


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 32/48] hrtimer: Re-arrange hrtimer_interrupt()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (30 preceding siblings ...)
  2026-02-24 16:37 ` [patch 31/48] hrtimer: Add hrtimer_rearm tracepoint Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:37 ` [patch 33/48] hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm Thomas Gleixner
                   ` (17 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Rework hrtimer_interrupt() such that reprogramming is split out into an
independent function at the end of the interrupt.

This prepares for reprogramming getting delayed beyond the end of
hrtimer_interrupt().

Notably, this changes the hang handling to always wait 100ms instead of
trying to keep the delay proportional to the actual hang time. This
simplifies the state handling; besides, hangs really shouldn't be
happening in the first place.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
tglx: Added the tracepoint and used a proper naming convention
---
 kernel/time/hrtimer.c |   93 +++++++++++++++++++++++---------------------------
 1 file changed, 44 insertions(+), 49 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -690,6 +690,12 @@ static inline int hrtimer_hres_active(st
 		cpu_base->hres_active : 0;
 }
 
+static inline void hrtimer_rearm_event(ktime_t expires_next, bool deferred)
+{
+	trace_hrtimer_rearm(expires_next, deferred);
+	tick_program_event(expires_next, 1);
+}
+
 static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtimer *next_timer,
 				ktime_t expires_next)
 {
@@ -715,7 +721,7 @@ static void __hrtimer_reprogram(struct h
 	if (!hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
 		return;
 
-	tick_program_event(expires_next, 1);
+	hrtimer_rearm_event(expires_next, false);
 }
 
 /*
@@ -1939,6 +1945,28 @@ static __latent_entropy void hrtimer_run
 #ifdef CONFIG_HIGH_RES_TIMERS
 
 /*
+ * Very similar to hrtimer_force_reprogram(), except it deals with
+ * in_hrtirq and hang_detected.
+ */
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+{
+	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
+
+	cpu_base->expires_next = expires_next;
+	cpu_base->in_hrtirq = false;
+
+	if (unlikely(cpu_base->hang_detected)) {
+		/*
+		 * Give the system a chance to do something else than looping
+		 * on hrtimer interrupts.
+		 */
+		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
+		cpu_base->hang_detected = false;
+	}
+	hrtimer_rearm_event(expires_next, false);
+}
+
+/*
  * High resolution timer interrupt
  * Called with interrupts disabled
  */
@@ -1973,63 +2001,30 @@ void hrtimer_interrupt(struct clock_even
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
 
-	/* Reevaluate the clock bases for the [soft] next expiry */
-	expires_next = hrtimer_update_next_event(cpu_base);
-	/*
-	 * Store the new expiry value so the migration code can verify
-	 * against it.
-	 */
-	cpu_base->expires_next = expires_next;
-	cpu_base->in_hrtirq = false;
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
-	/* Reprogramming necessary ? */
-	if (!tick_program_event(expires_next, 0)) {
-		cpu_base->hang_detected = false;
-		return;
-	}
-
 	/*
 	 * The next timer was already expired due to:
 	 * - tracing
 	 * - long lasting callbacks
 	 * - being scheduled away when running in a VM
 	 *
-	 * We need to prevent that we loop forever in the hrtimer
-	 * interrupt routine. We give it 3 attempts to avoid
-	 * overreacting on some spurious event.
-	 *
-	 * Acquire base lock for updating the offsets and retrieving
-	 * the current time.
+	 * We need to prevent that we loop forever in the hrtimer interrupt
+	 * routine. We give it 3 attempts to avoid overreacting on some
+	 * spurious event.
 	 */
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	now = hrtimer_update_base(cpu_base);
-	cpu_base->nr_retries++;
-	if (++retries < 3)
-		goto retry;
-	/*
-	 * Give the system a chance to do something else than looping
-	 * here. We stored the entry time, so we know exactly how long
-	 * we spent here. We schedule the next event this amount of
-	 * time away.
-	 */
-	cpu_base->nr_hangs++;
-	cpu_base->hang_detected = true;
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+	expires_next = hrtimer_update_next_event(cpu_base);
+	if (expires_next < now) {
+		if (++retries < 3)
+			goto retry;
 
-	delta = ktime_sub(now, entry_time);
-	if ((unsigned int)delta > cpu_base->max_hang_time)
-		cpu_base->max_hang_time = (unsigned int) delta;
-	/*
-	 * Limit it to a sensible value as we enforce a longer
-	 * delay. Give the CPU at least 100ms to catch up.
-	 */
-	if (delta > 100 * NSEC_PER_MSEC)
-		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
-	else
-		expires_next = ktime_add(now, delta);
-	tick_program_event(expires_next, 1);
-	pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+		delta = ktime_sub(now, entry_time);
+		cpu_base->max_hang_time = max_t(unsigned int, cpu_base->max_hang_time, delta);
+		cpu_base->nr_hangs++;
+		cpu_base->hang_detected = true;
+	}
+
+	hrtimer_rearm(cpu_base, now);
+	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 }
 #endif /* !CONFIG_HIGH_RES_TIMERS */
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 33/48] hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (31 preceding siblings ...)
  2026-02-24 16:37 ` [patch 32/48] hrtimer: Re-arrange hrtimer_interrupt() Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:37 ` [patch 34/48] hrtimer: Prepare stubs for deferred rearming Thomas Gleixner
                   ` (16 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The upcoming deferred rearming scheme has the same effect as the implicit
rearm deferral while the hrtimer interrupt is executing. So it can reuse
the in_hrtirq flag, but once rearming can be deferred beyond the hrtimer
interrupt path, the name does not make sense anymore.

Rename it to deferred_rearm upfront to keep the actual functional change
separate from the mechanical rename churn.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer_defs.h |    4 ++--
 kernel/time/hrtimer.c        |   28 +++++++++-------------------
 2 files changed, 11 insertions(+), 21 deletions(-)

--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -53,7 +53,7 @@ enum  hrtimer_base_type {
  * @active_bases:	Bitfield to mark bases with active timers
  * @clock_was_set_seq:	Sequence counter of clock was set events
  * @hres_active:	State of high resolution mode
- * @in_hrtirq:		hrtimer_interrupt() is currently executing
+ * @deferred_rearm:	A deferred rearm is pending
  * @hang_detected:	The last hrtimer interrupt detected a hang
  * @softirq_activated:	displays, if the softirq is raised - update of softirq
  *			related settings is not required then.
@@ -84,7 +84,7 @@ struct hrtimer_cpu_base {
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
 	bool				hres_active;
-	bool				in_hrtirq;
+	bool				deferred_rearm;
 	bool				hang_detected;
 	bool				softirq_activated;
 	bool				online;
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -883,11 +883,8 @@ static void hrtimer_reprogram(struct hrt
 	if (expires >= cpu_base->expires_next)
 		return;
 
-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending skip reprogramming the device */
+	if (cpu_base->deferred_rearm)
 		return;
 
 	cpu_base->next_timer = timer;
@@ -921,12 +918,8 @@ static bool update_needs_ipi(struct hrti
 	if (seq == cpu_base->clock_was_set_seq)
 		return false;
 
-	/*
-	 * If the remote CPU is currently handling an hrtimer interrupt, it
-	 * will reevaluate the first expiring timer of all clock bases
-	 * before reprogramming. Nothing to do here.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending the remote CPU will take care of it */
+	if (cpu_base->deferred_rearm)
 		return false;
 
 	/*
@@ -1334,11 +1327,8 @@ static bool __hrtimer_start_range_ns(str
 		first = enqueue_hrtimer(timer, base, mode, was_armed);
 	}
 
-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending skip reprogramming the device */
+	if (cpu_base->deferred_rearm)
 		return false;
 
 	if (!was_first || cpu_base != this_cpu_base) {
@@ -1946,14 +1936,14 @@ static __latent_entropy void hrtimer_run
 
 /*
  * Very similar to hrtimer_force_reprogram(), except it deals with
- * in_hrtirq and hang_detected.
+ * deferred_rearm and hang_detected.
  */
 static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 {
 	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
 
 	cpu_base->expires_next = expires_next;
-	cpu_base->in_hrtirq = false;
+	cpu_base->deferred_rearm = false;
 
 	if (unlikely(cpu_base->hang_detected)) {
 		/*
@@ -1984,7 +1974,7 @@ void hrtimer_interrupt(struct clock_even
 	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	entry_time = now = hrtimer_update_base(cpu_base);
 retry:
-	cpu_base->in_hrtirq = true;
+	cpu_base->deferred_rearm = true;
 	/*
 	 * Set expires_next to KTIME_MAX, which prevents that remote CPUs queue
 	 * timers while __hrtimer_run_queues() is expiring the clock bases.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 34/48] hrtimer: Prepare stubs for deferred rearming
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (32 preceding siblings ...)
  2026-02-24 16:37 ` [patch 33/48] hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm Thomas Gleixner
@ 2026-02-24 16:37 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:38 ` [patch 35/48] entry: Prepare for deferred hrtimer rearming Thomas Gleixner
                   ` (15 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:37 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK is
enabled then schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, then it
can be deferred until the end of schedule().

To make this correct, the affected code parts need to be made aware of it.

Provide empty stubs for the deferred rearming mechanism, so that the
relevant code changes for entry, softirq and scheduler can be split up into
separate changes independent of the actual enablement in the hrtimer code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
tglx: Split out to make it simpler to review and to make cross subsystem
      merge logistics trivial.
---
 include/linux/hrtimer.h       |    1 +
 include/linux/hrtimer_rearm.h |   21 +++++++++++++++++++++
 kernel/time/Kconfig           |    4 ++++
 3 files changed, 26 insertions(+)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -13,6 +13,7 @@
 #define _LINUX_HRTIMER_H
 
 #include <linux/hrtimer_defs.h>
+#include <linux/hrtimer_rearm.h>
 #include <linux/hrtimer_types.h>
 #include <linux/init.h>
 #include <linux/list.h>
--- /dev/null
+++ b/include/linux/hrtimer_rearm.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_HRTIMER_REARM_H
+#define _LINUX_HRTIMER_REARM_H
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+static __always_inline void __hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+#else  /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void __hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+#endif  /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
+#endif
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -58,6 +58,10 @@ config GENERIC_CLOCKEVENTS_COUPLED_INLIN
 config GENERIC_CMOS_UPDATE
 	bool
 
+# Deferred rearming of the hrtimer interrupt
+config HRTIMER_REARM_DEFERRED
+       def_bool n
+
 # Select to handle posix CPU timers from task_work
 # and not from the timer interrupt context
 config HAVE_POSIX_CPU_TIMERS_TASK_WORK


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 35/48] entry: Prepare for deferred hrtimer rearming
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (33 preceding siblings ...)
  2026-02-24 16:37 ` [patch 34/48] hrtimer: Prepare stubs for deferred rearming Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-27 15:57   ` Christian Loehle
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:38 ` [patch 36/48] softirq: " Thomas Gleixner
                   ` (14 subsequent siblings)
  49 siblings, 2 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK is
enabled then schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, then it
can be deferred until the end of schedule(), which avoids multiple rearms
and re-evaluations of the timer wheel.

As this is only relevant for the interrupt to user return, split the work
masks up and hand them in as arguments from the relevant exit to user
functions, which allows the compiler to optimize the deferred handling out
for the syscall exit to user case.

Add the rearm checks to the appropriate places in the exit to user loop and
the interrupt return to kernel path, so that the rearming is always
guaranteed.

In the return to user space path this is handled in the same way as
TIF_RSEQ to avoid extra instructions in the fast path. Those are truly
hurtful for device interrupt heavy workloads, as the extra instructions and
conditionals, while benign at first sight, quickly accumulate into
measurable regressions. The return from syscall path is completely
unaffected due to the above mentioned split, so syscall heavy workloads
won't carry any extra burden.

For now this is just placing empty stubs at the right places which are all
optimized out by the compiler until the actual functionality is in place.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
tglx: Split out to make it simpler to review and to make cross subsystem
      merge logistics trivial.
---
 include/linux/irq-entry-common.h |   25 +++++++++++++++++++------
 include/linux/rseq_entry.h       |   16 +++++++++++++---
 kernel/entry/common.c            |    4 +++-
 3 files changed, 35 insertions(+), 10 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -3,6 +3,7 @@
 #define __LINUX_IRQENTRYCOMMON_H
 
 #include <linux/context_tracking.h>
+#include <linux/hrtimer_rearm.h>
 #include <linux/kmsan.h>
 #include <linux/rseq_entry.h>
 #include <linux/static_call_types.h>
@@ -33,6 +34,14 @@
 	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ |		\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+# define EXIT_TO_USER_MODE_WORK_SYSCALL	(EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ	(EXIT_TO_USER_MODE_WORK | _TIF_HRTIMER_REARM)
+#else
+# define EXIT_TO_USER_MODE_WORK_SYSCALL	(EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ	(EXIT_TO_USER_MODE_WORK)
+#endif
+
 /**
  * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
  * @regs:	Pointer to currents pt_regs
@@ -203,6 +212,7 @@ unsigned long exit_to_user_mode_loop(str
 /**
  * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
+ * @work_mask:	Which TIF bits need to be evaluated
  *
  * 1) check that interrupts are disabled
  * 2) call tick_nohz_user_enter_prepare()
@@ -212,7 +222,8 @@ unsigned long exit_to_user_mode_loop(str
  *
  * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
  */
-static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs,
+							const unsigned long work_mask)
 {
 	unsigned long ti_work;
 
@@ -222,8 +233,10 @@ static __always_inline void __exit_to_us
 	tick_nohz_user_enter_prepare();
 
 	ti_work = read_thread_flags();
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+	if (unlikely(ti_work & work_mask)) {
+		if (!hrtimer_rearm_deferred_user_irq(&ti_work, work_mask))
+			ti_work = exit_to_user_mode_loop(regs, ti_work);
+	}
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 }
@@ -239,7 +252,7 @@ static __always_inline void __exit_to_us
 /* Temporary workaround to keep ARM64 alive */
 static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
 {
-	__exit_to_user_mode_prepare(regs);
+	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);
 	rseq_exit_to_user_mode_legacy();
 	__exit_to_user_mode_validate();
 }
@@ -253,7 +266,7 @@ static __always_inline void exit_to_user
  */
 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	__exit_to_user_mode_prepare(regs);
+	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_SYSCALL);
 	rseq_syscall_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }
@@ -267,7 +280,7 @@ static __always_inline void syscall_exit
  */
 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	__exit_to_user_mode_prepare(regs);
+	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_IRQ);
 	rseq_irqentry_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -40,6 +40,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #endif /* !CONFIG_RSEQ_STATS */
 
 #ifdef CONFIG_RSEQ
+#include <linux/hrtimer_rearm.h>
 #include <linux/jump_label.h>
 #include <linux/rseq.h>
 #include <linux/sched/signal.h>
@@ -110,7 +111,7 @@ static __always_inline void rseq_slice_c
 	t->rseq.slice.state.granted = false;
 }
 
-static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
 {
 	struct task_struct *curr = current;
 	struct rseq_slice_ctrl usr_ctrl;
@@ -215,11 +216,20 @@ static __always_inline bool rseq_grant_s
 	return false;
 }
 
+static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask)
+{
+	if (unlikely(__rseq_grant_slice_extension(ti_work & mask))) {
+		hrtimer_rearm_deferred_tif(ti_work);
+		return true;
+	}
+	return false;
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
 static inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static inline void rseq_slice_clear_grant(struct task_struct *t) { }
-static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -778,7 +788,7 @@ static inline void rseq_syscall_exit_to_
 static inline void rseq_irqentry_exit_to_user_mode(void) { }
 static inline void rseq_exit_to_user_mode_legacy(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
-static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
 #endif /* !CONFIG_RSEQ */
 
 #endif /* _LINUX_RSEQ_ENTRY_H */
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long __e
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
-			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+			if (!rseq_grant_slice_extension(ti_work, TIF_SLICE_EXT_DENY))
 				schedule();
 		}
 
@@ -225,6 +225,7 @@ noinstr void irqentry_exit(struct pt_reg
 		 */
 		if (state.exit_rcu) {
 			instrumentation_begin();
+			hrtimer_rearm_deferred();
 			/* Tell the tracer that IRET will enable interrupts */
 			trace_hardirqs_on_prepare();
 			lockdep_hardirqs_on_prepare();
@@ -238,6 +239,7 @@ noinstr void irqentry_exit(struct pt_reg
 		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
 
+		hrtimer_rearm_deferred();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 36/48] softirq: Prepare for deferred hrtimer rearming
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (34 preceding siblings ...)
  2026-02-24 16:38 ` [patch 35/48] entry: Prepare for deferred hrtimer rearming Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:38 ` [patch 37/48] sched/core: " Thomas Gleixner
                   ` (13 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK is
enabled then schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, the rearm
can be deferred until the end of schedule(), which avoids multiple rearms
and re-evaluations of the timer wheel.

In case the return from interrupt ends up handling softirqs before reaching
the rearm points in the return to user entry code, a deferred rearm has to
be handled before softirq handling enables interrupts, as soft interrupt
handling can take long and would otherwise introduce hard to diagnose
latencies into the timer interrupt.

Place the for now empty stub call right before invoking the softirq
handling routine.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
tglx: Split out to make it simpler to review and to make cross subsystem
      merge logistics trivial.
---
 kernel/softirq.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -663,6 +663,13 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
+	/*
+	 * If this is a nested interrupt that hits the exit_to_user_mode_loop
+	 * where it has enabled interrupts but before it has hit schedule() we
+	 * could have hrtimers in an undefined state. Fix it up here.
+	 */
+	hrtimer_rearm_deferred();
+
 	if (tick_nohz_full_cpu(smp_processor_id()) ||
 	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
@@ -719,8 +726,14 @@ static inline void __irq_exit_rcu(void)
 #endif
 	account_hardirq_exit(current);
 	preempt_count_sub(HARDIRQ_OFFSET);
-	if (!in_interrupt() && local_softirq_pending())
+	if (!in_interrupt() && local_softirq_pending()) {
+		/*
+		 * If we left hrtimers unarmed, make sure to arm them now,
+		 * before enabling interrupts to run SoftIRQ.
+		 */
+		hrtimer_rearm_deferred();
 		invoke_softirq();
+	}
 
 	if (IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) && force_irqthreads() &&
 	    local_timers_pending_force_th() && !(in_nmi() | in_hardirq()))


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 37/48] sched/core: Prepare for deferred hrtimer rearming
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (35 preceding siblings ...)
  2026-02-24 16:38 ` [patch 36/48] softirq: " Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:38 ` [patch 38/48] hrtimer: Push reprogramming timers into the interrupt return path Thomas Gleixner
                   ` (12 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK is
enabled then schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, the rearm
can be deferred until the end of schedule(), which avoids multiple rearms
and re-evaluations of the timer wheel.

Add the rearm checks to the existing hrtick_schedule_enter/exit()
functions, which already handle the batched rearm of the hrtick timer.

For now this is just placing empty stubs at the right places which are all
optimized out by the compiler until the guard condition becomes true.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
tglx: Split out to make it simpler to review and to make cross subsystem
      merge logistics trivial.
---
 kernel/sched/core.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -876,6 +876,7 @@ enum {
 	HRTICK_SCHED_NONE		= 0,
 	HRTICK_SCHED_DEFER		= BIT(1),
 	HRTICK_SCHED_START		= BIT(2),
+	HRTICK_SCHED_REARM_HRTIMER	= BIT(3)
 };
 
 static void hrtick_clear(struct rq *rq)
@@ -974,6 +975,8 @@ void hrtick_start(struct rq *rq, u64 del
 static inline void hrtick_schedule_enter(struct rq *rq)
 {
 	rq->hrtick_sched = HRTICK_SCHED_DEFER;
+	if (hrtimer_test_and_clear_rearm_deferred())
+		rq->hrtick_sched |= HRTICK_SCHED_REARM_HRTIMER;
 }
 
 static inline void hrtick_schedule_exit(struct rq *rq)
@@ -991,6 +994,9 @@ static inline void hrtick_schedule_exit(
 			hrtimer_cancel(&rq->hrtick_timer);
 	}
 
+	if (rq->hrtick_sched & HRTICK_SCHED_REARM_HRTIMER)
+		__hrtimer_rearm_deferred();
+
 	rq->hrtick_sched = HRTICK_SCHED_NONE;
 }
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 38/48] hrtimer: Push reprogramming timers into the interrupt return path
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (36 preceding siblings ...)
  2026-02-24 16:38 ` [patch 37/48] sched/core: " Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-24 16:38 ` [patch 39/48] hrtimer: Avoid re-evaluation when nothing changed Thomas Gleixner
                   ` (11 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

Currently hrtimer_interrupt() runs expired timers, which can re-arm
themselves, after which it computes the next expiration time and
re-programs the hardware.

However, a timer like HRTICK, the highres timer driving preemption, cannot
re-arm itself at the point of running, since the next task has not been
determined yet. The schedule() in the interrupt return path will switch to
the next task, which then causes a new hrtimer to be programmed.

This then results in reprogramming the hardware at least twice, once after
running the timers, and once upon selecting the new task.

Notably, *both* events happen in the interrupt.

By pushing the hrtimer reprogram all the way into the interrupt return
path, it runs after schedule() picks the new task and the double reprogram
can be avoided.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/asm-generic/thread_info_tif.h |    5 +-
 include/linux/hrtimer_rearm.h         |   72 +++++++++++++++++++++++++++++++---
 kernel/time/Kconfig                   |    4 +
 kernel/time/hrtimer.c                 |   38 +++++++++++++++--
 4 files changed, 107 insertions(+), 12 deletions(-)
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -41,11 +41,14 @@
 #define _TIF_PATCH_PENDING	BIT(TIF_PATCH_PENDING)
 
 #ifdef HAVE_TIF_RESTORE_SIGMASK
-# define TIF_RESTORE_SIGMASK	10	// Restore signal mask in do_signal() */
+# define TIF_RESTORE_SIGMASK	10	// Restore signal mask in do_signal()
 # define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #endif
 
 #define TIF_RSEQ		11	// Run RSEQ fast path
 #define _TIF_RSEQ		BIT(TIF_RSEQ)
 
+#define TIF_HRTIMER_REARM	12	// Re-arm the hrtimer hardware
+#define _TIF_HRTIMER_REARM	BIT(TIF_HRTIMER_REARM)
+
 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
--- a/include/linux/hrtimer_rearm.h
+++ b/include/linux/hrtimer_rearm.h
@@ -3,12 +3,74 @@
 #define _LINUX_HRTIMER_REARM_H
 
 #ifdef CONFIG_HRTIMER_REARM_DEFERRED
-static __always_inline void __hrtimer_rearm_deferred(void) { }
-static __always_inline void hrtimer_rearm_deferred(void) { }
-static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+#include <linux/thread_info.h>
+
+void __hrtimer_rearm_deferred(void);
+
+/*
+ * This is purely CPU local, so check the TIF bit first to avoid the overhead of
+ * the atomic test_and_clear_bit() operation for the common case where the bit
+ * is not set.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred_tif(unsigned long tif_work)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (unlikely(tif_work & _TIF_HRTIMER_REARM)) {
+		clear_thread_flag(TIF_HRTIMER_REARM);
+		return true;
+	}
+	return false;
+}
+
+#define TIF_REARM_MASK	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_HRTIMER_REARM)
+
+/* Invoked from the exit to user before invoking exit_to_user_mode_loop() */
 static __always_inline bool
-hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
-static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask)
+{
+	/* Help the compiler to optimize the function out for syscall returns */
+	if (!(tif_mask & _TIF_HRTIMER_REARM))
+		return false;
+	/*
+	 * Rearm the timer if none of the resched flags is set before going into
+	 * the loop which re-enables interrupts.
+	 */
+	if (unlikely((*tif_work & TIF_REARM_MASK) == _TIF_HRTIMER_REARM)) {
+		clear_thread_flag(TIF_HRTIMER_REARM);
+		__hrtimer_rearm_deferred();
+		/* Don't go into the loop if HRTIMER_REARM was the only flag */
+	*tif_work &= ~_TIF_HRTIMER_REARM;
+		return !*tif_work;
+	}
+	return false;
+}
+
+/* Invoked from the time slice extension decision function */
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work)
+{
+	if (hrtimer_test_and_clear_rearm_deferred_tif(tif_work))
+		__hrtimer_rearm_deferred();
+}
+
+/*
+ * This is to be called on all irqentry_exit() paths that will enable
+ * interrupts.
+ */
+static __always_inline void hrtimer_rearm_deferred(void)
+{
+	hrtimer_rearm_deferred_tif(read_thread_flags());
+}
+
+/*
+ * Invoked from the scheduler on entry to __schedule() so it can defer
+ * rearming after the load balancing callbacks which might change hrtick.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void)
+{
+	return hrtimer_test_and_clear_rearm_deferred_tif(read_thread_flags());
+}
+
 #else  /* CONFIG_HRTIMER_REARM_DEFERRED */
 static __always_inline void __hrtimer_rearm_deferred(void) { }
 static __always_inline void hrtimer_rearm_deferred(void) { }
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -60,7 +60,9 @@ config GENERIC_CMOS_UPDATE
 
 # Deferred rearming of the hrtimer interrupt
 config HRTIMER_REARM_DEFERRED
-       def_bool n
+       def_bool y
+       depends on GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+       depends on HIGH_RES_TIMERS && SCHED_HRTICK
 
 # Select to handle posix CPU timers from task_work
 # and not from the timer interrupt context
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1939,10 +1939,9 @@ static __latent_entropy void hrtimer_run
  * Very similar to hrtimer_force_reprogram(), except it deals with
  * deferred_rearm and hang_detected.
  */
-static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now,
+			  ktime_t expires_next, bool deferred)
 {
-	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
-
 	cpu_base->expires_next = expires_next;
 	cpu_base->deferred_rearm = false;
 
@@ -1954,9 +1953,37 @@ static void hrtimer_rearm(struct hrtimer
 		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
 		cpu_base->hang_detected = false;
 	}
-	hrtimer_rearm_event(expires_next, false);
+	hrtimer_rearm_event(expires_next, deferred);
 }
 
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+void __hrtimer_rearm_deferred(void)
+{
+	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+	ktime_t now, expires_next;
+
+	if (!cpu_base->deferred_rearm)
+		return;
+
+	guard(raw_spinlock)(&cpu_base->lock);
+	now = hrtimer_update_base(cpu_base);
+	expires_next = hrtimer_update_next_event(cpu_base);
+	hrtimer_rearm(cpu_base, now, expires_next, true);
+}
+
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+{
+	set_thread_flag(TIF_HRTIMER_REARM);
+}
+#else  /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+{
+	hrtimer_rearm(cpu_base, now, expires_next, false);
+}
+#endif  /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
 /*
  * High resolution timer interrupt
  * Called with interrupts disabled
@@ -2014,9 +2041,10 @@ void hrtimer_interrupt(struct clock_even
 		cpu_base->hang_detected = true;
 	}
 
-	hrtimer_rearm(cpu_base, now);
+	hrtimer_interrupt_rearm(cpu_base, now, expires_next);
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 }
+
 #endif /* !CONFIG_HIGH_RES_TIMERS */
 
 /*


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 39/48] hrtimer: Avoid re-evaluation when nothing changed
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (37 preceding siblings ...)
  2026-02-24 16:38 ` [patch 38/48] hrtimer: Push reprogramming timers into the interrupt return path Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:38 ` [patch 40/48] hrtimer: Keep track of first expiring timer per clock base Thomas Gleixner
                   ` (10 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Most of the time nothing changes between hrtimer_interrupt() deferring the
rearm and the invocation of hrtimer_rearm_deferred(). In those cases it's a
pointless exercise to re-evaluate the next expiring timer.

Cache the required data and use it if nothing changed.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer_defs.h |   53 +++++++++++++++++++++----------------------
 kernel/time/hrtimer.c        |   45 +++++++++++++++++++++++++-----------
 2 files changed, 58 insertions(+), 40 deletions(-)

--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -47,32 +47,31 @@ enum  hrtimer_base_type {
 
 /**
  * struct hrtimer_cpu_base - the per cpu clock bases
- * @lock:		lock protecting the base and associated clock bases
- *			and timers
- * @cpu:		cpu number
- * @active_bases:	Bitfield to mark bases with active timers
- * @clock_was_set_seq:	Sequence counter of clock was set events
- * @hres_active:	State of high resolution mode
- * @deferred_rearm:	A deferred rearm is pending
- * @hang_detected:	The last hrtimer interrupt detected a hang
- * @softirq_activated:	displays, if the softirq is raised - update of softirq
- *			related settings is not required then.
- * @nr_events:		Total number of hrtimer interrupt events
- * @nr_retries:		Total number of hrtimer interrupt retries
- * @nr_hangs:		Total number of hrtimer interrupt hangs
- * @max_hang_time:	Maximum time spent in hrtimer_interrupt
- * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
- *			 expired
- * @online:		CPU is online from an hrtimers point of view
- * @timer_waiters:	A hrtimer_cancel() invocation waits for the timer
- *			callback to finish.
- * @expires_next:	absolute time of the next event, is required for remote
- *			hrtimer enqueue; it is the total first expiry time (hard
- *			and soft hrtimer are taken into account)
- * @next_timer:		Pointer to the first expiring timer
- * @softirq_expires_next: Time to check, if soft queues needs also to be expired
- * @softirq_next_timer: Pointer to the first expiring softirq based timer
- * @clock_base:		array of clock bases for this cpu
+ * @lock:			lock protecting the base and associated clock bases and timers
+ * @cpu:			cpu number
+ * @active_bases:		Bitfield to mark bases with active timers
+ * @clock_was_set_seq:		Sequence counter of clock was set events
+ * @hres_active:		State of high resolution mode
+ * @deferred_rearm:		A deferred rearm is pending
+ * @deferred_needs_update:	The deferred rearm must re-evaluate the first timer
+ * @hang_detected:		The last hrtimer interrupt detected a hang
+ * @softirq_activated:		displays, if the softirq is raised - update of softirq
+ *				related settings is not required then.
+ * @nr_events:			Total number of hrtimer interrupt events
+ * @nr_retries:			Total number of hrtimer interrupt retries
+ * @nr_hangs:			Total number of hrtimer interrupt hangs
+ * @max_hang_time:		Maximum time spent in hrtimer_interrupt
+ * @softirq_expiry_lock:	Lock which is taken while softirq based hrtimer are expired
+ * @online:			CPU is online from an hrtimers point of view
+ * @timer_waiters:		A hrtimer_cancel() invocation waits for the timer callback to finish.
+ * @expires_next:		Absolute time of the next event, is required for remote
+ *				hrtimer enqueue; it is the total first expiry time (hard
+ *				and soft hrtimer are taken into account)
+ * @next_timer:			Pointer to the first expiring timer
+ * @softirq_expires_next:	Time to check, if soft queues needs also to be expired
+ * @softirq_next_timer:		Pointer to the first expiring softirq based timer
+ * @deferred_expires_next:	Cached expires next value for deferred rearm
+ * @clock_base:			Array of clock bases for this cpu
  *
  * Note: next_timer is just an optimization for __remove_hrtimer().
  *	 Do not dereference the pointer because it is not reliable on
@@ -85,6 +84,7 @@ struct hrtimer_cpu_base {
 	unsigned int			clock_was_set_seq;
 	bool				hres_active;
 	bool				deferred_rearm;
+	bool				deferred_needs_update;
 	bool				hang_detected;
 	bool				softirq_activated;
 	bool				online;
@@ -102,6 +102,7 @@ struct hrtimer_cpu_base {
 	struct hrtimer			*next_timer;
 	ktime_t				softirq_expires_next;
 	struct hrtimer			*softirq_next_timer;
+	ktime_t				deferred_expires_next;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
 	call_single_data_t		csd;
 } ____cacheline_aligned;
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -919,8 +919,10 @@ static bool update_needs_ipi(struct hrti
 		return false;
 
 	/* If a deferred rearm is pending the remote CPU will take care of it */
-	if (cpu_base->deferred_rearm)
+	if (cpu_base->deferred_rearm) {
+		cpu_base->deferred_needs_update = true;
 		return false;
+	}
 
 	/*
 	 * Walk the affected clock bases and check whether the first expiring
@@ -1141,7 +1143,12 @@ static void __remove_hrtimer(struct hrti
 	 * a local timer is removed to be immediately restarted. That's handled
 	 * at the call site.
 	 */
-	if (reprogram && timer == cpu_base->next_timer && !timer->is_lazy)
+	if (!reprogram || timer != cpu_base->next_timer || timer->is_lazy)
+		return;
+
+	if (cpu_base->deferred_rearm)
+		cpu_base->deferred_needs_update = true;
+	else
 		hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
 }
 
@@ -1328,8 +1335,10 @@ static bool __hrtimer_start_range_ns(str
 	}
 
 	/* If a deferred rearm is pending skip reprogramming the device */
-	if (cpu_base->deferred_rearm)
+	if (cpu_base->deferred_rearm) {
+		cpu_base->deferred_needs_update = true;
 		return false;
+	}
 
 	if (!was_first || cpu_base != this_cpu_base) {
 		/*
@@ -1939,8 +1948,7 @@ static __latent_entropy void hrtimer_run
  * Very similar to hrtimer_force_reprogram(), except it deals with
  * deferred_rearm and hang_detected.
  */
-static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now,
-			  ktime_t expires_next, bool deferred)
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next, bool deferred)
 {
 	cpu_base->expires_next = expires_next;
 	cpu_base->deferred_rearm = false;
@@ -1950,7 +1958,7 @@ static void hrtimer_rearm(struct hrtimer
 		 * Give the system a chance to do something else than looping
 		 * on hrtimer interrupts.
 		 */
-		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
+		expires_next = ktime_add_ns(ktime_get(), 100 * NSEC_PER_MSEC);
 		cpu_base->hang_detected = false;
 	}
 	hrtimer_rearm_event(expires_next, deferred);
@@ -1960,27 +1968,36 @@ static void hrtimer_rearm(struct hrtimer
 void __hrtimer_rearm_deferred(void)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
-	ktime_t now, expires_next;
+	ktime_t expires_next;
 
 	if (!cpu_base->deferred_rearm)
 		return;
 
 	guard(raw_spinlock)(&cpu_base->lock);
-	now = hrtimer_update_base(cpu_base);
-	expires_next = hrtimer_update_next_event(cpu_base);
-	hrtimer_rearm(cpu_base, now, expires_next, true);
+	if (cpu_base->deferred_needs_update) {
+		hrtimer_update_base(cpu_base);
+		expires_next = hrtimer_update_next_event(cpu_base);
+	} else {
+		/* No timer added/removed. Use the cached value */
+		expires_next = cpu_base->deferred_expires_next;
+	}
+	hrtimer_rearm(cpu_base, expires_next, true);
 }
 
 static __always_inline void
-hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
 {
+	/* hrtimer_interrupt() just re-evaluated the first expiring timer */
+	cpu_base->deferred_needs_update = false;
+	/* Cache the expiry time */
+	cpu_base->deferred_expires_next = expires_next;
 	set_thread_flag(TIF_HRTIMER_REARM);
 }
 #else  /* CONFIG_HRTIMER_REARM_DEFERRED */
 static __always_inline void
-hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
 {
-	hrtimer_rearm(cpu_base, now, expires_next, false);
+	hrtimer_rearm(cpu_base, expires_next, false);
 }
 #endif  /* !CONFIG_HRTIMER_REARM_DEFERRED */
 
@@ -2041,7 +2058,7 @@ void hrtimer_interrupt(struct clock_even
 		cpu_base->hang_detected = true;
 	}
 
-	hrtimer_interrupt_rearm(cpu_base, now, expires_next);
+	hrtimer_interrupt_rearm(cpu_base, expires_next);
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 }
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 40/48] hrtimer: Keep track of first expiring timer per clock base
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (38 preceding siblings ...)
  2026-02-24 16:38 ` [patch 39/48] hrtimer: Avoid re-evaluation when nothing changed Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:38 ` [patch 41/48] hrtimer: Rework next event evaluation Thomas Gleixner
                   ` (9 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Evaluating the next expiry time of all clock bases is cache expensive: the
expiry time of the first expiring timer is not cached in the base, so it
requires accessing the timer itself, which is definitely in a different
cache line.

It's way more efficient to keep track of the expiry time on enqueue and
dequeue operations as the relevant data is already in the cache at that
point.
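
The bookkeeping boils down to: update the cached value when the new timer
becomes the first one on enqueue, and re-evaluate only when the first one
is removed. A userspace sketch of that scheme (an unsorted array stands in
for the RB tree timerqueue; all names are invented for illustration):

```c
#include <assert.h>
#include <limits.h>

typedef long long ktime_t;
#define KTIME_MAX LLONG_MAX

struct mini_base {
	ktime_t expires[8];	/* active timers, unsorted */
	int nr;
	ktime_t expires_next;	/* cached earliest expiry */
};

static void mini_enqueue(struct mini_base *b, ktime_t exp)
{
	b->expires[b->nr++] = exp;
	/* The relevant data is hot in the cache right here */
	if (exp < b->expires_next)
		b->expires_next = exp;
}

static void mini_dequeue(struct mini_base *b, int idx)
{
	ktime_t removed = b->expires[idx];

	b->expires[idx] = b->expires[--b->nr];
	/* Nothing to update if this was not the first timer */
	if (removed != b->expires_next)
		return;
	b->expires_next = KTIME_MAX;
	for (int i = 0; i < b->nr; i++) {
		if (b->expires[i] < b->expires_next)
			b->expires_next = b->expires[i];
	}
}
```

Readers of expires_next then never have to chase the leftmost tree node.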

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer_defs.h |    2 ++
 kernel/time/hrtimer.c        |   37 ++++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 3 deletions(-)

--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -19,6 +19,7 @@
  *			timer to a base on another cpu.
  * @clockid:		clock id for per_cpu support
  * @seq:		seqcount around __run_hrtimer
+ * @expires_next:	Absolute time of the next event in this clock base
  * @running:		pointer to the currently running hrtimer
  * @active:		red black tree root node for the active timers
  * @offset:		offset of this clock to the monotonic base
@@ -28,6 +29,7 @@ struct hrtimer_clock_base {
 	unsigned int		index;
 	clockid_t		clockid;
 	seqcount_raw_spinlock_t	seq;
+	ktime_t			expires_next;
 	struct hrtimer		*running;
 	struct timerqueue_head	active;
 	ktime_t			offset;
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1107,7 +1107,18 @@ static bool enqueue_hrtimer(struct hrtim
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
-	return timerqueue_add(&base->active, &timer->node);
+	if (!timerqueue_add(&base->active, &timer->node))
+		return false;
+
+	base->expires_next = hrtimer_get_expires(timer);
+	return true;
+}
+
+static inline void base_update_next_timer(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+
+	base->expires_next = next ? next->expires : KTIME_MAX;
 }
 
 /*
@@ -1122,6 +1133,7 @@ static void __remove_hrtimer(struct hrti
 			     bool newstate, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
+	bool was_first;
 
 	lockdep_assert_held(&cpu_base->lock);
 
@@ -1131,9 +1143,17 @@ static void __remove_hrtimer(struct hrti
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, newstate);
 
+	was_first = &timer->node == timerqueue_getnext(&base->active);
+
 	if (!timerqueue_del(&base->active, &timer->node))
 		cpu_base->active_bases &= ~(1 << base->index);
 
+	/* Nothing to update if this was not the first timer in the base */
+	if (!was_first)
+		return;
+
+	base_update_next_timer(base);
+
 	/*
 	 * If reprogram is false don't update cpu_base->next_timer and do not
 	 * touch the clock event device.
@@ -1182,9 +1202,12 @@ static inline bool
 remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 			     const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
 {
+	bool was_first = false;
+
 	/* Remove it from the timer queue if active */
 	if (timer->is_queued) {
 		debug_hrtimer_deactivate(timer);
+		was_first = &timer->node == timerqueue_getnext(&base->active);
 		timerqueue_del(&base->active, &timer->node);
 	}
 
@@ -1197,8 +1220,16 @@ remove_and_enqueue_same_base(struct hrti
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
-	/* Returns true if this is the first expiring timer */
-	return timerqueue_add(&base->active, &timer->node);
+	/* If it's the first expiring timer now or again, update base */
+	if (timerqueue_add(&base->active, &timer->node)) {
+		base->expires_next = expires;
+		return true;
+	}
+
+	if (was_first)
+		base_update_next_timer(base);
+
+	return false;
 }
 
 static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,



* [patch 41/48] hrtimer: Rework next event evaluation
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (39 preceding siblings ...)
  2026-02-24 16:38 ` [patch 40/48] hrtimer: Keep track of first expiring timer per clock base Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:38 ` [patch 42/48] hrtimer: Simplify run_hrtimer_queues() Thomas Gleixner
                   ` (8 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The per clock base cached expiry time allows a more efficient evaluation of
the next expiry on a CPU.

Separate the reprogramming evaluation from the NOHZ idle evaluation, which
needs to exclude the NOHZ timer, to keep the reprogramming path lean and
clean.
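
With the cached expiry, the reprogramming-side evaluation reduces to a
subtraction and a compare per active base. A standalone sketch under
invented names (structs and values are illustrative only):

```c
#include <assert.h>
#include <limits.h>

typedef long long ktime_t;
#define KTIME_MAX LLONG_MAX

struct mini_clock_base {
	ktime_t expires_next;	/* cached first expiry, in base time */
	ktime_t offset;		/* offset of this clock to the monotonic base */
};

static ktime_t bases_next_event(struct mini_clock_base *bases, int nr)
{
	ktime_t expires_next = KTIME_MAX;

	for (int i = 0; i < nr; i++) {
		ktime_t expires = bases[i].expires_next - bases[i].offset;

		if (expires < expires_next)
			expires_next = expires;
	}
	/* An offset change might yield a negative result; clamp it */
	return expires_next > 0 ? expires_next : 0;
}
```

No timer structure is touched; only the per-base cache lines are read.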

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |  120 ++++++++++++++++++++++++++++----------------------
 1 file changed, 69 insertions(+), 51 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,49 +546,67 @@ static struct hrtimer_clock_base *
 #define for_each_active_base(base, cpu_base, active)		\
 	while ((base = __next_base((cpu_base), &(active))))
 
-static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
-					 const struct hrtimer *exclude,
-					 unsigned int active, ktime_t expires_next)
+#if defined(CONFIG_NO_HZ_COMMON)
+/*
+ * Same as hrtimer_bases_next_event() below, but skips the excluded timer and
+ * does not update cpu_base->next_timer/expires.
+ */
+static ktime_t hrtimer_bases_next_event_without(struct hrtimer_cpu_base *cpu_base,
+						const struct hrtimer *exclude,
+						unsigned int active, ktime_t expires_next)
 {
 	struct hrtimer_clock_base *base;
 	ktime_t expires;
 
+	lockdep_assert_held(&cpu_base->lock);
+
 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *next;
-		struct hrtimer *timer;
+		expires = ktime_sub(base->expires_next, base->offset);
+		if (expires >= expires_next)
+			continue;
 
-		next = timerqueue_getnext(&base->active);
-		timer = container_of(next, struct hrtimer, node);
-		if (timer == exclude) {
-			/* Get to the next timer in the queue. */
-			next = timerqueue_iterate_next(next);
-			if (!next)
-				continue;
+		/*
+		 * If the excluded timer is the first on this base evaluate the
+		 * next timer.
+		 */
+		struct timerqueue_node *node = timerqueue_getnext(&base->active);
 
-			timer = container_of(next, struct hrtimer, node);
+		if (unlikely(&exclude->node == node)) {
+			node = timerqueue_iterate_next(node);
+			if (!node)
+				continue;
+			expires = ktime_sub(node->expires, base->offset);
+			if (expires >= expires_next)
+				continue;
 		}
-		expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
-		if (expires < expires_next) {
-			expires_next = expires;
+		expires_next = expires;
+	}
+	/* If base->offset changed, the result might be negative */
+	return max(expires_next, 0);
+}
+#endif
 
-			/* Skip cpu_base update if a timer is being excluded. */
-			if (exclude)
-				continue;
+static __always_inline struct hrtimer *clock_base_next_timer(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+
+	return container_of(next, struct hrtimer, node);
+}
 
-			if (timer->is_soft)
-				cpu_base->softirq_next_timer = timer;
-			else
-				cpu_base->next_timer = timer;
+/* Find the base with the earliest expiry */
+static void hrtimer_bases_first(struct hrtimer_cpu_base *cpu_base, unsigned int active,
+				ktime_t *expires_next, struct hrtimer **next_timer)
+{
+	struct hrtimer_clock_base *base;
+	ktime_t expires;
+
+	for_each_active_base(base, cpu_base, active) {
+		expires = ktime_sub(base->expires_next, base->offset);
+		if (expires < *expires_next) {
+			*expires_next = expires;
+			*next_timer = clock_base_next_timer(base);
 		}
 	}
-	/*
-	 * clock_was_set() might have changed base->offset of any of
-	 * the clock bases so the result might be negative. Fix it up
-	 * to prevent a false positive in clockevents_program_event().
-	 */
-	if (expires_next < 0)
-		expires_next = 0;
-	return expires_next;
 }
 
 /*
@@ -617,19 +635,22 @@ static ktime_t __hrtimer_get_next_event(
 	ktime_t expires_next = KTIME_MAX;
 	unsigned int active;
 
+	lockdep_assert_held(&cpu_base->lock);
+
 	if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-		cpu_base->softirq_next_timer = NULL;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, KTIME_MAX);
-		next_timer = cpu_base->softirq_next_timer;
+		if (active)
+			hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
+		cpu_base->softirq_next_timer = next_timer;
 	}
 
 	if (active_mask & HRTIMER_ACTIVE_HARD) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+		if (active)
+			hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
 		cpu_base->next_timer = next_timer;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, expires_next);
 	}
-	return expires_next;
+	return max(expires_next, 0);
 }
 
 static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
@@ -724,11 +745,7 @@ static void __hrtimer_reprogram(struct h
 	hrtimer_rearm_event(expires_next, false);
 }
 
-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
+/* Reprogram the event source with an evaluation of all clock bases */
 static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, bool skip_equal)
 {
 	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
@@ -1662,19 +1679,20 @@ u64 hrtimer_next_event_without(const str
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
+	unsigned int active;
 
 	guard(raw_spinlock_irqsave)(&cpu_base->lock);
-	if (hrtimer_hres_active(cpu_base)) {
-		unsigned int active;
+	if (!hrtimer_hres_active(cpu_base))
+		return expires;
 
-		if (!cpu_base->softirq_activated) {
-			active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-			expires = __hrtimer_next_event_base(cpu_base, exclude, active, KTIME_MAX);
-		}
-		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
-		expires = __hrtimer_next_event_base(cpu_base, exclude, active, expires);
-	}
-	return expires;
+	active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+	if (active && !cpu_base->softirq_activated)
+		expires = hrtimer_bases_next_event_without(cpu_base, exclude, active, KTIME_MAX);
+
+	active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+	if (!active)
+		return expires;
+	return hrtimer_bases_next_event_without(cpu_base, exclude, active, expires);
 }
 #endif
 



* [patch 42/48] hrtimer: Simplify run_hrtimer_queues()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (40 preceding siblings ...)
  2026-02-24 16:38 ` [patch 41/48] hrtimer: Rework next event evaluation Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:38 ` [patch 43/48] hrtimer: Optimize for_each_active_base() Thomas Gleixner
                   ` (7 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Replace the open coded container_of() orgy with a trivial
clock_base_next_timer() helper.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1933,6 +1933,13 @@ static void __run_hrtimer(struct hrtimer
 	base->running = NULL;
 }
 
+static __always_inline struct hrtimer *clock_base_next_timer_safe(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+
+	return next ? container_of(next, struct hrtimer, node) : NULL;
+}
+
 static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 				 unsigned long flags, unsigned int active_mask)
 {
@@ -1940,16 +1947,10 @@ static void __hrtimer_run_queues(struct
 	struct hrtimer_clock_base *base;
 
 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *node;
-		ktime_t basenow;
-
-		basenow = ktime_add(now, base->offset);
-
-		while ((node = timerqueue_getnext(&base->active))) {
-			struct hrtimer *timer;
-
-			timer = container_of(node, struct hrtimer, node);
+		ktime_t basenow = ktime_add(now, base->offset);
+		struct hrtimer *timer;
 
+		while ((timer = clock_base_next_timer(base))) {
 			/*
 			 * The immediate goal for using the softexpires is
 			 * minimizing wakeups, not running timers at the



* [patch 43/48] hrtimer: Optimize for_each_active_base()
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (41 preceding siblings ...)
  2026-02-24 16:38 ` [patch 42/48] hrtimer: Simplify run_hrtimer_queues() Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:38 ` [patch 44/48] rbtree: Provide rbtree with links Thomas Gleixner
                   ` (6 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Give the compiler some help to emit way better code.
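
A userspace rendering of the reworked iterator shows the iteration order
over the set bits. __builtin_ffs() is used here as a stand-in for the
kernel's 1-based ffs(), so idx-- both terminates the loop when no bits are
left and converts to a 0-based array index; the structs are invented for
illustration:

```c
#include <assert.h>
#include <stdbool.h>

struct clk { int id; };
struct cb { struct clk clock_base[8]; };

#define for_each_active_base(base, cpu_base, active)				\
	for (unsigned int idx = __builtin_ffs(active); idx--;			\
	     idx = __builtin_ffs(active))					\
		for (bool done = false; !done; (active) &= ~(1U << idx))	\
			for (base = &(cpu_base)->clock_base[idx]; !done; done = true)

/* Collect the visited base indices for a given active mask */
static int collect(unsigned int active, int *out)
{
	struct cb cb;
	struct clk *base;
	int n = 0;

	for (int i = 0; i < 8; i++)
		cb.clock_base[i].id = i;
	for_each_active_base(base, &cb, active)
		out[n++] = base->id;
	return n;
}
```

The nested for() construct gives the loop body a plain pointer variable
without an out-of-line helper or a pointer-to-mask indirection.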

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |   20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -529,22 +529,10 @@ static inline void debug_activate(struct
 	trace_hrtimer_start(timer, mode, was_armed);
 }
 
-static struct hrtimer_clock_base *
-__next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
-{
-	unsigned int idx;
-
-	if (!*active)
-		return NULL;
-
-	idx = __ffs(*active);
-	*active &= ~(1U << idx);
-
-	return &cpu_base->clock_base[idx];
-}
-
-#define for_each_active_base(base, cpu_base, active)		\
-	while ((base = __next_base((cpu_base), &(active))))
+#define for_each_active_base(base, cpu_base, active)					\
+	for (unsigned int idx = ffs(active); idx--; idx = ffs((active)))		\
+		for (bool done = false; !done; active &= ~(1U << idx))			\
+			for (base = &cpu_base->clock_base[idx]; !done; done = true)
 
 #if defined(CONFIG_NO_HZ_COMMON)
 /*



* [patch 44/48] rbtree: Provide rbtree with links
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (42 preceding siblings ...)
  2026-02-24 16:38 ` [patch 43/48] hrtimer: Optimize for_each_active_base() Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:38 ` [patch 45/48] timerqueue: Provide linked timerqueue Thomas Gleixner
                   ` (5 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

Some RB tree users require quick access to the next and the previous node,
e.g. to check whether a modification of a node results in a change of the
node's position in the tree. If the position does not change, then the
modification can happen in place without going through a full dequeue and
enqueue cycle. An upcoming use case for this is the timer queues of the
hrtimer subsystem, which can optimize for timers that are frequently
rearmed while enqueued.

This can be obviously achieved with rb_next() and rb_prev(), but those
turned out to be quite expensive for hotpath operations depending on the
tree depth.

Add a linked RB tree variant where add() and erase() maintain the links
between the nodes. Like the cached variant, it provides a pointer to the
leftmost node in the root.

It intentionally does not use a [h]list head as there is no real need for
true list operations: the list is strictly coupled to the tree and cannot
be manipulated independently.

It sets the node's previous pointer to NULL for the leftmost node and the
next pointer to NULL for the rightmost node. This allows a quick check,
especially for the leftmost node, without consulting the list head address,
which creates better code.

Aside of the rb_leftmost cached pointer this could trivially provide a
rb_rightmost pointer as well, but there is no usage for that (yet).
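
The link maintenance on erase is the interesting part. A standalone sketch
of just that bookkeeping (the RB tree itself is elided and the names are
illustrative; this mirrors what rb_erase_linked() below does around the
actual rb_erase() call):

```c
#include <assert.h>
#include <stddef.h>

struct lnode {
	struct lnode *prev, *next;
};

struct lroot {
	struct lnode *leftmost;
};

/* Unsplice @node from the prev/next chain and keep leftmost current */
static int lerase(struct lnode *node, struct lroot *root)
{
	if (node->prev)
		node->prev->next = node->next;
	else
		root->leftmost = node->next;	/* erased the leftmost node */

	if (node->next)
		node->next->prev = node->prev;

	node->prev = node->next = NULL;
	return root->leftmost != NULL;		/* still populated? */
}
```

Both neighbours are reachable in O(1), independent of the tree depth.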

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
---
 include/linux/rbtree.h       |   81 ++++++++++++++++++++++++++++++++++++++-----
 include/linux/rbtree_types.h |   16 ++++++++
 lib/rbtree.c                 |   17 +++++++++
 3 files changed, 105 insertions(+), 9 deletions(-)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -35,10 +35,15 @@
 #define RB_CLEAR_NODE(node)  \
 	((node)->__rb_parent_color = (unsigned long)(node))
 
+#define RB_EMPTY_LINKED_NODE(lnode)  RB_EMPTY_NODE(&(lnode)->node)
+#define RB_CLEAR_LINKED_NODE(lnode)  ({					\
+	RB_CLEAR_NODE(&(lnode)->node);					\
+	(lnode)->prev = (lnode)->next = NULL;				\
+})
 
 extern void rb_insert_color(struct rb_node *, struct rb_root *);
 extern void rb_erase(struct rb_node *, struct rb_root *);
-
+extern bool rb_erase_linked(struct rb_node_linked *, struct rb_root_linked *);
 
 /* Find logical next and previous nodes in a tree */
 extern struct rb_node *rb_next(const struct rb_node *);
@@ -213,15 +218,10 @@ rb_add_cached(struct rb_node *node, stru
 	return leftmost ? node : NULL;
 }
 
-/**
- * rb_add() - insert @node into @tree
- * @node: node to insert
- * @tree: tree to insert @node into
- * @less: operator defining the (partial) node order
- */
 static __always_inline void
-rb_add(struct rb_node *node, struct rb_root *tree,
-       bool (*less)(struct rb_node *, const struct rb_node *))
+__rb_add(struct rb_node *node, struct rb_root *tree,
+	 bool (*less)(struct rb_node *, const struct rb_node *),
+	 void (*linkop)(struct rb_node *, struct rb_node *, struct rb_node **))
 {
 	struct rb_node **link = &tree->rb_node;
 	struct rb_node *parent = NULL;
@@ -234,10 +234,73 @@ rb_add(struct rb_node *node, struct rb_r
 			link = &parent->rb_right;
 	}
 
+	linkop(node, parent, link);
 	rb_link_node(node, parent, link);
 	rb_insert_color(node, tree);
 }
 
+#define __node_2_linked_node(_n) \
+	rb_entry((_n), struct rb_node_linked, node)
+
+static inline void
+rb_link_linked_node(struct rb_node *node, struct rb_node *parent, struct rb_node **link)
+{
+	if (!parent)
+		return;
+
+	struct rb_node_linked *nnew = __node_2_linked_node(node);
+	struct rb_node_linked *npar = __node_2_linked_node(parent);
+
+	if (link == &parent->rb_left) {
+		nnew->prev = npar->prev;
+		nnew->next = npar;
+		npar->prev = nnew;
+		if (nnew->prev)
+			nnew->prev->next = nnew;
+	} else {
+		nnew->next = npar->next;
+		nnew->prev = npar;
+		npar->next = nnew;
+		if (nnew->next)
+			nnew->next->prev = nnew;
+	}
+}
+
+/**
+ * rb_add_linked() - insert @node into the leftmost linked tree @tree
+ * @node: node to insert
+ * @tree: linked tree to insert @node into
+ * @less: operator defining the (partial) node order
+ *
+ * Returns @true when @node is the new leftmost, @false otherwise.
+ */
+static __always_inline bool
+rb_add_linked(struct rb_node_linked *node, struct rb_root_linked *tree,
+	      bool (*less)(struct rb_node *, const struct rb_node *))
+{
+	__rb_add(&node->node, &tree->rb_root, less, rb_link_linked_node);
+	if (!node->prev)
+		tree->rb_leftmost = node;
+	return !node->prev;
+}
+
+/* Empty linkop function which is optimized away by the compiler */
+static __always_inline void
+rb_link_noop(struct rb_node *n, struct rb_node *p, struct rb_node **l) { }
+
+/**
+ * rb_add() - insert @node into @tree
+ * @node: node to insert
+ * @tree: tree to insert @node into
+ * @less: operator defining the (partial) node order
+ */
+static __always_inline void
+rb_add(struct rb_node *node, struct rb_root *tree,
+       bool (*less)(struct rb_node *, const struct rb_node *))
+{
+	__rb_add(node, tree, less, rb_link_noop);
+}
+
 /**
  * rb_find_add_cached() - find equivalent @node in @tree, or add @node
  * @node: node to look-for / insert
--- a/include/linux/rbtree_types.h
+++ b/include/linux/rbtree_types.h
@@ -9,6 +9,12 @@ struct rb_node {
 } __attribute__((aligned(sizeof(long))));
 /* The alignment might seem pointless, but allegedly CRIS needs it */
 
+struct rb_node_linked {
+	struct rb_node		node;
+	struct rb_node_linked	*prev;
+	struct rb_node_linked	*next;
+};
+
 struct rb_root {
 	struct rb_node *rb_node;
 };
@@ -28,7 +34,17 @@ struct rb_root_cached {
 	struct rb_node *rb_leftmost;
 };
 
+/*
+ * Leftmost tree with links. This would allow a trivial rb_rightmost update,
+ * but that has been omitted due to the lack of users.
+ */
+struct rb_root_linked {
+	struct rb_root		rb_root;
+	struct rb_node_linked	*rb_leftmost;
+};
+
 #define RB_ROOT (struct rb_root) { NULL, }
 #define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }
+#define RB_ROOT_LINKED (struct rb_root_linked) { {NULL, }, NULL }
 
 #endif
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -446,6 +446,23 @@ void rb_erase(struct rb_node *node, stru
 }
 EXPORT_SYMBOL(rb_erase);
 
+bool rb_erase_linked(struct rb_node_linked *node, struct rb_root_linked *root)
+{
+	if (node->prev)
+		node->prev->next = node->next;
+	else
+		root->rb_leftmost = node->next;
+
+	if (node->next)
+		node->next->prev = node->prev;
+
+	rb_erase(&node->node, &root->rb_root);
+	RB_CLEAR_LINKED_NODE(node);
+
+	return !!root->rb_leftmost;
+}
+EXPORT_SYMBOL_GPL(rb_erase_linked);
+
 /*
  * Augmented rbtree manipulation functions.
  *



* [patch 45/48] timerqueue: Provide linked timerqueue
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (43 preceding siblings ...)
  2026-02-24 16:38 ` [patch 44/48] rbtree: Provide rbtree with links Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:38 ` [patch 46/48] hrtimer: Use " Thomas Gleixner
                   ` (4 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

The hrtimer subsystem wants to peek ahead at the next and previous timer to
evaluate whether a to-be-rearmed timer can stay at the same position in the
RB tree with the new expiry time.

The linked RB tree provides the infrastructure for this as it maintains
links to the previous and next nodes for each entry in the tree.

Provide timerqueue wrappers around that.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/timerqueue.h       |   56 +++++++++++++++++++++++++++++++++------
 include/linux/timerqueue_types.h |   15 ++++++++--
 lib/timerqueue.c                 |   14 +++++++++
 3 files changed, 74 insertions(+), 11 deletions(-)

--- a/include/linux/timerqueue.h
+++ b/include/linux/timerqueue.h
@@ -5,12 +5,11 @@
 #include <linux/rbtree.h>
 #include <linux/timerqueue_types.h>
 
-extern bool timerqueue_add(struct timerqueue_head *head,
-			   struct timerqueue_node *node);
-extern bool timerqueue_del(struct timerqueue_head *head,
-			   struct timerqueue_node *node);
-extern struct timerqueue_node *timerqueue_iterate_next(
-						struct timerqueue_node *node);
+bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node);
+bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node);
+struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node);
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node);
 
 /**
  * timerqueue_getnext - Returns the timer with the earliest expiration time
@@ -19,8 +18,7 @@ extern struct timerqueue_node *timerqueu
  *
  * Returns a pointer to the timer node that has the earliest expiration time.
  */
-static inline
-struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
+static inline struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
 {
 	struct rb_node *leftmost = rb_first_cached(&head->rb_root);
 
@@ -41,4 +39,46 @@ static inline void timerqueue_init_head(
 {
 	head->rb_root = RB_ROOT_CACHED;
 }
+
+/* Timer queues with linked nodes */
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_first(struct timerqueue_linked_head *head)
+{
+	return rb_entry_safe(head->rb_root.rb_leftmost, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_next(struct timerqueue_linked_node *node)
+{
+	return rb_entry_safe(node->node.next, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_prev(struct timerqueue_linked_node *node)
+{
+	return rb_entry_safe(node->node.prev, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+bool timerqueue_linked_del(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+	return rb_erase_linked(&node->node, &head->rb_root);
+}
+
+static __always_inline void timerqueue_linked_init(struct timerqueue_linked_node *node)
+{
+	RB_CLEAR_LINKED_NODE(&node->node);
+}
+
+static __always_inline bool timerqueue_linked_node_queued(struct timerqueue_linked_node *node)
+{
+	return !RB_EMPTY_LINKED_NODE(&node->node);
+}
+
+static __always_inline void timerqueue_linked_init_head(struct timerqueue_linked_head *head)
+{
+	head->rb_root = RB_ROOT_LINKED;
+}
+
 #endif /* _LINUX_TIMERQUEUE_H */
--- a/include/linux/timerqueue_types.h
+++ b/include/linux/timerqueue_types.h
@@ -6,12 +6,21 @@
 #include <linux/types.h>
 
 struct timerqueue_node {
-	struct rb_node node;
-	ktime_t expires;
+	struct rb_node		node;
+	ktime_t			expires;
 };
 
 struct timerqueue_head {
-	struct rb_root_cached rb_root;
+	struct rb_root_cached	rb_root;
+};
+
+struct timerqueue_linked_node {
+	struct rb_node_linked		node;
+	ktime_t				expires;
+};
+
+struct timerqueue_linked_head {
+	struct rb_root_linked		rb_root;
 };
 
 #endif /* _LINUX_TIMERQUEUE_TYPES_H */
--- a/lib/timerqueue.c
+++ b/lib/timerqueue.c
@@ -82,3 +82,17 @@ struct timerqueue_node *timerqueue_itera
 	return container_of(next, struct timerqueue_node, node);
 }
 EXPORT_SYMBOL_GPL(timerqueue_iterate_next);
+
+#define __node_2_tq_linked(_n) \
+	container_of(rb_entry((_n), struct rb_node_linked, node), struct timerqueue_linked_node, node)
+
+static __always_inline bool __tq_linked_less(struct rb_node *a, const struct rb_node *b)
+{
+	return __node_2_tq_linked(a)->expires < __node_2_tq_linked(b)->expires;
+}
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+	return rb_add_linked(&node->node, &head->rb_root, __tq_linked_less);
+}
+EXPORT_SYMBOL_GPL(timerqueue_linked_add);



* [patch 46/48] hrtimer: Use linked timerqueue
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (44 preceding siblings ...)
  2026-02-24 16:38 ` [patch 45/48] timerqueue: Provide linked timerqueue Thomas Gleixner
@ 2026-02-24 16:38 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:39 ` [patch 47/48] hrtimer: Try to modify timers in place Thomas Gleixner
                   ` (3 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:38 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

To prepare for optimizing the rearming of enqueued timers, switch to the
linked timerqueue. That allows checking whether the new expiry time changes
the position of the timer in the RB tree by comparing it against the expiry
of the previous and the next timer.
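
A hypothetical sketch of the stability check this enables (the actual
rearm optimization lands in a later patch; the struct and the exact
comparison here are illustrative only):

```c
#include <assert.h>
#include <stdbool.h>

typedef long long ktime_t;

struct tqnode {
	struct tqnode *prev, *next;
	ktime_t expires;
};

/*
 * A queued timer can take a new expiry in place iff it stays ordered
 * between its neighbours; then no tree rebalancing is required.
 */
static bool can_update_in_place(struct tqnode *node, ktime_t new_expires)
{
	if (node->prev && new_expires < node->prev->expires)
		return false;	/* would have to move left */
	if (node->next && new_expires > node->next->expires)
		return false;	/* would have to move right */
	return true;
}
```

With the prev/next links this is two loads and two compares, instead of
rb_next()/rb_prev() walks proportional to the tree depth.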

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/hrtimer_defs.h  |   16 ++++++++--------
 include/linux/hrtimer_types.h |    8 ++++----
 kernel/time/hrtimer.c         |   34 +++++++++++++++++-----------------
 kernel/time/timer_list.c      |   10 ++++------
 4 files changed, 33 insertions(+), 35 deletions(-)

--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -25,14 +25,14 @@
  * @offset:		offset of this clock to the monotonic base
  */
 struct hrtimer_clock_base {
-	struct hrtimer_cpu_base	*cpu_base;
-	unsigned int		index;
-	clockid_t		clockid;
-	seqcount_raw_spinlock_t	seq;
-	ktime_t			expires_next;
-	struct hrtimer		*running;
-	struct timerqueue_head	active;
-	ktime_t			offset;
+	struct hrtimer_cpu_base		*cpu_base;
+	unsigned int			index;
+	clockid_t			clockid;
+	seqcount_raw_spinlock_t		seq;
+	ktime_t				expires_next;
+	struct hrtimer			*running;
+	struct timerqueue_linked_head	active;
+	ktime_t				offset;
 } __hrtimer_clock_base_align;
 
 enum  hrtimer_base_type {
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -17,7 +17,7 @@ enum hrtimer_restart {
 
 /**
  * struct hrtimer - the basic hrtimer structure
- * @node:	timerqueue node, which also manages node.expires,
+ * @node:	Linked timerqueue node, which also manages node.expires,
  *		the absolute expiry time in the hrtimers internal
  *		representation. The time is related to the clock on
  *		which the timer is based. Is setup by adding
@@ -39,15 +39,15 @@ enum hrtimer_restart {
  * The hrtimer structure must be initialized by hrtimer_setup()
  */
 struct hrtimer {
-	struct timerqueue_node		node;
-	ktime_t				_softexpires;
-	enum hrtimer_restart		(*__private function)(struct hrtimer *);
+	struct timerqueue_linked_node	node;
 	struct hrtimer_clock_base	*base;
 	bool				is_queued;
 	bool				is_rel;
 	bool				is_soft;
 	bool				is_hard;
 	bool				is_lazy;
+	ktime_t				_softexpires;
+	enum hrtimer_restart		(*__private function)(struct hrtimer *);
 };
 
 #endif /* _LINUX_HRTIMER_TYPES_H */
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -557,10 +557,10 @@ static ktime_t hrtimer_bases_next_event_
 		 * If the excluded timer is the first on this base evaluate the
 		 * next timer.
 		 */
-		struct timerqueue_node *node = timerqueue_getnext(&base->active);
+		struct timerqueue_linked_node *node = timerqueue_linked_first(&base->active);
 
 		if (unlikely(&exclude->node == node)) {
-			node = timerqueue_iterate_next(node);
+			node = timerqueue_linked_next(node);
 			if (!node)
 				continue;
 			expires = ktime_sub(node->expires, base->offset);
@@ -576,7 +576,7 @@ static ktime_t hrtimer_bases_next_event_
 
 static __always_inline struct hrtimer *clock_base_next_timer(struct hrtimer_clock_base *base)
 {
-	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
 
 	return container_of(next, struct hrtimer, node);
 }
@@ -938,9 +938,9 @@ static bool update_needs_ipi(struct hrti
 	active &= cpu_base->active_bases;
 
 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *next;
+		struct timerqueue_linked_node *next;
 
-		next = timerqueue_getnext(&base->active);
+		next = timerqueue_linked_first(&base->active);
 		expires = ktime_sub(next->expires, base->offset);
 		if (expires < cpu_base->expires_next)
 			return true;
@@ -1112,7 +1112,7 @@ static bool enqueue_hrtimer(struct hrtim
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
-	if (!timerqueue_add(&base->active, &timer->node))
+	if (!timerqueue_linked_add(&base->active, &timer->node))
 		return false;
 
 	base->expires_next = hrtimer_get_expires(timer);
@@ -1121,7 +1121,7 @@ static bool enqueue_hrtimer(struct hrtim
 
 static inline void base_update_next_timer(struct hrtimer_clock_base *base)
 {
-	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
 
 	base->expires_next = next ? next->expires : KTIME_MAX;
 }
@@ -1148,9 +1148,9 @@ static void __remove_hrtimer(struct hrti
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, newstate);
 
-	was_first = &timer->node == timerqueue_getnext(&base->active);
+	was_first = !timerqueue_linked_prev(&timer->node);
 
-	if (!timerqueue_del(&base->active, &timer->node))
+	if (!timerqueue_linked_del(&base->active, &timer->node))
 		cpu_base->active_bases &= ~(1 << base->index);
 
 	/* Nothing to update if this was not the first timer in the base */
@@ -1212,8 +1212,8 @@ remove_and_enqueue_same_base(struct hrti
 	/* Remove it from the timer queue if active */
 	if (timer->is_queued) {
 		debug_hrtimer_deactivate(timer);
-		was_first = &timer->node == timerqueue_getnext(&base->active);
-		timerqueue_del(&base->active, &timer->node);
+		was_first = !timerqueue_linked_prev(&timer->node);
+		timerqueue_linked_del(&base->active, &timer->node);
 	}
 
 	/* Set the new expiry time */
@@ -1226,7 +1226,7 @@ remove_and_enqueue_same_base(struct hrti
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
 	/* If it's the first expiring timer now or again, update base */
-	if (timerqueue_add(&base->active, &timer->node)) {
+	if (timerqueue_linked_add(&base->active, &timer->node)) {
 		base->expires_next = expires;
 		return true;
 	}
@@ -1758,7 +1758,7 @@ static void __hrtimer_setup(struct hrtim
 	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
 	timer->is_lazy = !!(mode & HRTIMER_MODE_LAZY_REARM);
 	timer->base = &cpu_base->clock_base[base];
-	timerqueue_init(&timer->node);
+	timerqueue_linked_init(&timer->node);
 
 	if (WARN_ON_ONCE(!fn))
 		ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
@@ -1923,7 +1923,7 @@ static void __run_hrtimer(struct hrtimer
 
 static __always_inline struct hrtimer *clock_base_next_timer_safe(struct hrtimer_clock_base *base)
 {
-	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
 
 	return next ? container_of(next, struct hrtimer, node) : NULL;
 }
@@ -2369,7 +2369,7 @@ int hrtimers_prepare_cpu(unsigned int cp
 
 		clock_b->cpu_base = cpu_base;
 		seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock);
-		timerqueue_init_head(&clock_b->active);
+		timerqueue_linked_init_head(&clock_b->active);
 	}
 
 	cpu_base->cpu = cpu;
@@ -2399,10 +2399,10 @@ int hrtimers_cpu_starting(unsigned int c
 static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 				struct hrtimer_clock_base *new_base)
 {
-	struct timerqueue_node *node;
+	struct timerqueue_linked_node *node;
 	struct hrtimer *timer;
 
-	while ((node = timerqueue_getnext(&old_base->active))) {
+	while ((node = timerqueue_linked_first(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
 		BUG_ON(hrtimer_callback_running(timer));
 		debug_hrtimer_deactivate(timer);
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -56,13 +56,11 @@ print_timer(struct seq_file *m, struct h
 		(long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now));
 }
 
-static void
-print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
-		    u64 now)
+static void print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
 {
+	struct timerqueue_linked_node *curr;
 	struct hrtimer *timer, tmp;
 	unsigned long next = 0, i;
-	struct timerqueue_node *curr;
 	unsigned long flags;
 
 next_one:
@@ -72,13 +70,13 @@ print_active_timers(struct seq_file *m,
 
 	raw_spin_lock_irqsave(&base->cpu_base->lock, flags);
 
-	curr = timerqueue_getnext(&base->active);
+	curr = timerqueue_linked_first(&base->active);
 	/*
 	 * Crude but we have to do this O(N*N) thing, because
 	 * we have to unlock the base when printing:
 	 */
 	while (curr && i < next) {
-		curr = timerqueue_iterate_next(curr);
+		curr = timerqueue_linked_next(curr);
 		i++;
 	}
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 47/48] hrtimer: Try to modify timers in place
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (45 preceding siblings ...)
  2026-02-24 16:38 ` [patch 46/48] hrtimer: Use " Thomas Gleixner
@ 2026-02-24 16:39 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2026-02-24 16:39 ` [patch 48/48] sched: Default enable HRTICK when deferred rearming is enabled Thomas Gleixner
                   ` (2 subsequent siblings)
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

When modifying the expiry of an armed timer, it is first dequeued, then the
expiry value is updated and then it is queued again.

This can be avoided when the new expiry value is within the range of the
previous and the next timer as that does not change the position in the RB
tree.

The linked timerqueue allows peeking ahead at the neighbours and checking
whether the new expiry time is within the range of the previous and next
timers. If so, just modify the timer in place and spare the dequeue and
enqueue effort, which might end up rotating the RB tree twice for nothing.

This significantly speeds up the handling of frequently rearmed hrtimers,
like the hrtick scheduler timer.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/hrtimer.c |   37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1203,6 +1203,31 @@ static inline bool remove_hrtimer(struct
 	return false;
 }
 
+/*
+ * Update in place has to retrieve the expiry times of the neighbour nodes
+ * if they exist. That is cache line neutral because the dequeue/enqueue
+ * operation is going to need the same cache lines. But there is a big win
+ * when the dequeue/enqueue can be avoided because the RB tree does not
+ * have to be rebalanced twice.
+ */
+static inline bool
+hrtimer_can_update_in_place(struct hrtimer *timer, struct hrtimer_clock_base *base, ktime_t expires)
+{
+	struct timerqueue_linked_node *next = timerqueue_linked_next(&timer->node);
+	struct timerqueue_linked_node *prev = timerqueue_linked_prev(&timer->node);
+
+	/* If the new expiry goes behind the next timer, requeue is required */
+	if (next && expires > next->expires)
+		return false;
+
+	/* If this is the first timer, update in place */
+	if (!prev)
+		return true;
+
+	/* Update in place when it does not go ahead of the previous one */
+	return expires >= prev->expires;
+}
+
 static inline bool
 remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 			     const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
@@ -1211,8 +1236,18 @@ remove_and_enqueue_same_base(struct hrti
 
 	/* Remove it from the timer queue if active */
 	if (timer->is_queued) {
-		debug_hrtimer_deactivate(timer);
 		was_first = !timerqueue_linked_prev(&timer->node);
+
+		/* Try to update in place to avoid the de/enqueue dance */
+		if (hrtimer_can_update_in_place(timer, base, expires)) {
+			hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+			trace_hrtimer_start(timer, mode, true);
+			if (was_first)
+				base->expires_next = expires;
+			return was_first;
+		}
+
+		debug_hrtimer_deactivate(timer);
 		timerqueue_linked_del(&base->active, &timer->node);
 	}
 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 48/48] sched: Default enable HRTICK when deferred rearming is enabled
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (46 preceding siblings ...)
  2026-02-24 16:39 ` [patch 47/48] hrtimer: Try to modify timers in place Thomas Gleixner
@ 2026-02-24 16:39 ` Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  2026-02-25 15:25 ` [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Peter Zijlstra
  2026-03-04 15:59 ` Christian Loehle
  49 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-24 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

From: Peter Zijlstra <peterz@infradead.org>

The deferred rearm of the clock event device after an interrupt and other
hrtimer optimizations now allow enabling HRTICK for generic entry
architectures.

This decouples preemption from CONFIG_HZ, leaving only the periodic
load-balancer and various accounting things relying on the tick.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/sched/features.h |    5 +++++
 1 file changed, 5 insertions(+)

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
+#endif
 
 /*
  * Decrement CPU capacity based on time not spent running tasks


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (47 preceding siblings ...)
  2026-02-24 16:39 ` [patch 48/48] sched: Default enable HRTICK when deferred rearming is enabled Thomas Gleixner
@ 2026-02-25 15:25 ` Peter Zijlstra
  2026-02-25 16:02   ` Thomas Gleixner
  2026-03-04 15:59 ` Christian Loehle
  49 siblings, 1 reply; 128+ messages in thread
From: Peter Zijlstra @ 2026-02-25 15:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Frederic Weisbecker, Eric Dumazet

On Tue, Feb 24, 2026 at 05:35:12PM +0100, Thomas Gleixner wrote:

> The series applies on v7.0-rc1 and is also available from git:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick

If you'd have added the shortlog, you'd have made nearly 350 lines :-)

Peter Zijlstra (11):
      sched/eevdf: Fix HRTICK duration
      hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()
      hrtimer: Provide LAZY_REARM mode
      sched/hrtick: Mark hrtick timer LAZY_REARM
      hrtimer: Re-arrange hrtimer_interrupt()
      hrtimer: Prepare stubs for deferred rearming
      entry: Prepare for deferred hrtimer rearming
      softirq: Prepare for deferred hrtimer rearming
      sched/core: Prepare for deferred hrtimer rearming
      hrtimer: Push reprogramming timers into the interrupt return path
      sched: Default enable HRTICK when deferred rearming is enabled

Peter Zijlstra (Intel) (2):
      sched/fair: Simplify hrtick_update()
      sched/fair: Make hrtick resched hard

Thomas Gleixner (35):
      sched: Avoid ktime_get() indirection
      hrtimer: Provide a static branch based hrtimer_hres_enabled()
      sched: Use hrtimer_highres_enabled()
      sched: Optimize hrtimer handling
      sched/hrtick: Avoid tiny hrtick rearms
      tick/sched: Avoid hrtimer_cancel/start() sequence
      clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME
      timekeeping: Allow inlining clocksource::read()
      x86: Inline TSC reads in timekeeping
      x86/apic: Remove pointless fence in lapic_next_deadline()
      x86/apic: Avoid the PVOPS indirection for the TSC deadline timer
      timekeeping: Provide infrastructure for coupled clockevents
      clockevents: Provide support for clocksource coupled comparators
      x86/apic: Enable TSC coupled programming mode
      hrtimer: Add debug object init assertion
      hrtimer: Reduce trace noise in hrtimer_start()
      hrtimer: Use guards where appropriate
      hrtimer: Cleanup coding style and comments
      hrtimer: Evaluate timer expiry only once
      hrtimer: Replace the bitfield in hrtimer_cpu_base
      hrtimer: Convert state and properties to boolean
      hrtimer: Optimize for local timers
      hrtimer: Use NOHZ information for locality
      hrtimer: Separate remove/enqueue handling for local timers
      hrtimer: Add hrtimer_rearm tracepoint
      hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm
      hrtimer: Avoid re-evaluation when nothing changed
      hrtimer: Keep track of first expiring timer per clock base
      hrtimer: Rework next event evaluation
      hrtimer: Simplify run_hrtimer_queues()
      hrtimer: Optimize for_each_active_base()
      rbtree: Provide rbtree with links
      timerqueue: Provide linked timerqueue
      hrtimer: Use linked timerqueue
      hrtimer: Try to modify timers in place


Anyway, since I've been staring at these patches for over a week now:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

You want me to go queue them in tip/sched/hrtick, tip/timer/hrtick and
then merge both into tip/sched/core and have tip/timer/core only include
tip/timer/hrtick or something?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement
  2026-02-25 15:25 ` [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Peter Zijlstra
@ 2026-02-25 16:02   ` Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-02-25 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Frederic Weisbecker, Eric Dumazet

On Wed, Feb 25 2026 at 16:25, Peter Zijlstra wrote:
> You want me to go queue them in tip/sched/hrtick, tip/timer/hrtick and
> then merge both into tip/sched/core and have tip/timer/core only include
> tip/timer/hrtick or something?

I'd like to split them up and only pull the minimal stuff into the
subsystem branches. I made a plan already, but I can't find the notes
right now. I'll dig them out later.

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 35/48] entry: Prepare for deferred hrtimer rearming
  2026-02-24 16:38 ` [patch 35/48] entry: Prepare for deferred hrtimer rearming Thomas Gleixner
@ 2026-02-27 15:57   ` Christian Loehle
  2026-02-27 16:25     ` Peter Zijlstra
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 128+ messages in thread
From: Christian Loehle @ 2026-02-27 15:57 UTC (permalink / raw)
  To: Thomas Gleixner, LKML, Peter Zijlstra
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

On 2/24/26 16:38, Thomas Gleixner wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> The hrtimer interrupt expires timers and at the end of the interrupt it
> rearms the clockevent device for the next expiring timer.
> 
> That's obviously correct, but in the case that an expired timer sets
> NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is
> enabled then schedule() will modify the hrtick timer, which causes another
> reprogramming of the hardware.
> 
> That can be avoided by deferring the rearming to the return from interrupt
> path and if the return results in an immediate schedule() invocation then it
> can be deferred until the end of schedule(), which avoids multiple rearms
> and re-evaluation of the timer wheel.
> 
> As this is only relevant for interrupt to user return split the work masks
> up and hand them in as arguments from the relevant exit to user functions,
> which allows the compiler to optimize the deferred handling out for the
> syscall exit to user case.
> 
> Add the rearm checks to the appropriate places in the exit to user loop and
> the interrupt return to kernel path, so that the rearming is always
> guaranteed.
> 
> In the return to user space path this is handled in the same way as
> TIF_RSEQ to avoid extra instructions in the fast path, which are truly
> hurtful for device interrupt heavy workloads as the extra instructions and
> conditionals while benign at first sight accumulate quickly into measurable
> regressions. The return from syscall path is completely unaffected due to
> the above mentioned split so syscall heavy workloads won't have any extra
> burden.
> 
> For now this is just placing empty stubs at the right places which are all
> optimized out by the compiler until the actual functionality is in place.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> ---
> tglx: Split out to make it simpler to review and to make cross subsystem
>       merge logistics trivial.
> ---
>  include/linux/irq-entry-common.h |   25 +++++++++++++++++++------
>  include/linux/rseq_entry.h       |   16 +++++++++++++---
>  kernel/entry/common.c            |    4 +++-
>  3 files changed, 35 insertions(+), 10 deletions(-)
> 
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_IRQENTRYCOMMON_H
>  
>  #include <linux/context_tracking.h>
> +#include <linux/hrtimer_rearm.h>
>  #include <linux/kmsan.h>
>  #include <linux/rseq_entry.h>
>  #include <linux/static_call_types.h>
> @@ -33,6 +34,14 @@
>  	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ |		\
>  	 ARCH_EXIT_TO_USER_MODE_WORK)
>  
> +#ifdef CONFIG_HRTIMER_REARM_DEFERRED
> +# define EXIT_TO_USER_MODE_WORK_SYSCALL	(EXIT_TO_USER_MODE_WORK)
> +# define EXIT_TO_USER_MODE_WORK_IRQ	(EXIT_TO_USER_MODE_WORK | _TIF_HRTIMER_REARM)
> +#else
> +# define EXIT_TO_USER_MODE_WORK_SYSCALL	(EXIT_TO_USER_MODE_WORK)
> +# define EXIT_TO_USER_MODE_WORK_IRQ	(EXIT_TO_USER_MODE_WORK)
> +#endif
> +
>  /**
>   * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
>   * @regs:	Pointer to currents pt_regs
> @@ -203,6 +212,7 @@ unsigned long exit_to_user_mode_loop(str
>  /**
>   * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
>   * @regs:	Pointer to pt_regs on entry stack
> + * @work_mask:	Which TIF bits need to be evaluated
>   *
>   * 1) check that interrupts are disabled
>   * 2) call tick_nohz_user_enter_prepare()
> @@ -212,7 +222,8 @@ unsigned long exit_to_user_mode_loop(str
>   *
>   * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
>   */
> -static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
> +static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs,
> +							const unsigned long work_mask)
>  {
>  	unsigned long ti_work;
>  
> @@ -222,8 +233,10 @@ static __always_inline void __exit_to_us
>  	tick_nohz_user_enter_prepare();
>  
>  	ti_work = read_thread_flags();
> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> -		ti_work = exit_to_user_mode_loop(regs, ti_work);
> +	if (unlikely(ti_work & work_mask)) {
> +		if (!hrtimer_rearm_deferred_user_irq(&ti_work, work_mask))
> +			ti_work = exit_to_user_mode_loop(regs, ti_work);
> +	}
>  
>  	arch_exit_to_user_mode_prepare(regs, ti_work);
>  }
> @@ -239,7 +252,7 @@ static __always_inline void __exit_to_us
>  /* Temporary workaround to keep ARM64 alive */
>  static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
>  {
> -	__exit_to_user_mode_prepare(regs);
> +	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);

Should this also be EXIT_TO_USER_MODE_WORK_IRQ?
I guess it doesn't really matter for now (since arm64 doesn't have the generic entry
path and generic TIF bits yet and therefore HRTIMER_REARM_DEFERRED=n), but I've been
playing around with this series, the generic entry series
https://lore.kernel.org/lkml/20260203133728.848283-1-ruanjinjie@huawei.com
(and using generic TIF bits) and noticed this.


>  	rseq_exit_to_user_mode_legacy();
>  	__exit_to_user_mode_validate();
>  }
> @@ -253,7 +266,7 @@ static __always_inline void exit_to_user
>   */
>  static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
>  {
> -	__exit_to_user_mode_prepare(regs);
> +	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_SYSCALL);
>  	rseq_syscall_exit_to_user_mode();
>  	__exit_to_user_mode_validate();
>  }
> @@ -267,7 +280,7 @@ static __always_inline void syscall_exit
>   */
>  static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
>  {
> -	__exit_to_user_mode_prepare(regs);
> +	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_IRQ);
>  	rseq_irqentry_exit_to_user_mode();
>  	__exit_to_user_mode_validate();
>  [snip]

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 35/48] entry: Prepare for deferred hrtimer rearming
  2026-02-27 15:57   ` Christian Loehle
@ 2026-02-27 16:25     ` Peter Zijlstra
  2026-02-27 16:32       ` Christian Loehle
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Zijlstra @ 2026-02-27 16:25 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Thomas Gleixner, LKML, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Daniel Lezcano, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, x86, Frederic Weisbecker, Eric Dumazet

On Fri, Feb 27, 2026 at 03:57:55PM +0000, Christian Loehle wrote:

> > @@ -239,7 +252,7 @@ static __always_inline void __exit_to_us
> >  /* Temporary workaround to keep ARM64 alive */
> >  static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
> >  {
> > -	__exit_to_user_mode_prepare(regs);
> > +	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);
> 
> Should this also be EXIT_TO_USER_MODE_WORK_IRQ?
> I guess it doesn't really matter for now (since arm64 doesn't have the generic entry
> path and generic TIF bits yet and therefore HRTIMER_REARM_DEFERRED=n), but I've been
> playing around with this series, the generic entry series
> https://lore.kernel.org/lkml/20260203133728.848283-1-ruanjinjie@huawei.com
> (and using generic TIF bits) and noticed this.

I'm confused; if ARM64 goes GENERIC_ENTRY, its use of legacy should go
away and we can delete that whole thing, no?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 35/48] entry: Prepare for deferred hrtimer rearming
  2026-02-27 16:25     ` Peter Zijlstra
@ 2026-02-27 16:32       ` Christian Loehle
  0 siblings, 0 replies; 128+ messages in thread
From: Christian Loehle @ 2026-02-27 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Daniel Lezcano, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, x86, Frederic Weisbecker, Eric Dumazet

On 2/27/26 16:25, Peter Zijlstra wrote:
> On Fri, Feb 27, 2026 at 03:57:55PM +0000, Christian Loehle wrote:
> 
>>> @@ -239,7 +252,7 @@ static __always_inline void __exit_to_us
>>>  /* Temporary workaround to keep ARM64 alive */
>>>  static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
>>>  {
>>> -	__exit_to_user_mode_prepare(regs);
>>> +	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);
>>
>> Should this also be EXIT_TO_USER_MODE_WORK_IRQ?
>> I guess it doesn't really matter for now (since arm64 doesn't have the generic entry
>> path and generic TIF bits yet and therefore HRTIMER_REARM_DEFERRED=n), but I've been
>> playing around with the this series, the generic entry series
>> https://lore.kernel.org/lkml/20260203133728.848283-1-ruanjinjie@huawei.com
>> (and using generic TIF bits) and noticed this.
> 
> I'm confused; if ARM64 goes GENERIC_ENTRY, its use of legacy should go
> away and we can delete that whole thing, no?

Duh, the confusion was on my side.
Let me check the conversion again and see why it wouldn't, if there's a reason...

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] sched: Default enable HRTICK when deferred rearming is enabled
  2026-02-24 16:39 ` [patch 48/48] sched: Default enable HRTICK when deferred rearming is enabled Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     9213aa4784cf4e63e6d8d30ba71fd61c3d110247
Gitweb:        https://git.kernel.org/tip/9213aa4784cf4e63e6d8d30ba71fd61c3d110247
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:39:08 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:17 +01:00

sched: Default enable HRTICK when deferred rearming is enabled

The deferred rearm of the clock event device after an interrupt and other
hrtimer optimizations now allow enabling HRTICK for generic entry
architectures.

This decouples preemption from CONFIG_HZ, leaving only the periodic
load-balancer and various accounting things relying on the tick.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.937531564@kernel.org
---
 kernel/sched/features.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 136a658..d062284 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
+#endif
 
 /*
  * Decrement CPU capacity based on time not spent running tasks

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Try to modify timers in place
  2026-02-24 16:39 ` [patch 47/48] hrtimer: Try to modify timers in place Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     343f2f4dc5425107d509d29e26ef59c2053aeaa4
Gitweb:        https://git.kernel.org/tip/343f2f4dc5425107d509d29e26ef59c2053aeaa4
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:39:02 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:17 +01:00

hrtimer: Try to modify timers in place

When modifying the expiry of an armed timer, it is first dequeued, then the
expiry value is updated and then it is queued again.

This can be avoided when the new expiry value is within the range of the
previous and the next timer as that does not change the position in the RB
tree.

The linked timerqueue allows peeking ahead at the neighbours and checking
whether the new expiry time is within the range of the previous and next
timers. If so, just modify the timer in place and spare the dequeue and
enqueue effort, which might end up rotating the RB tree twice for nothing.

This significantly speeds up the handling of frequently rearmed hrtimers,
like the hrtick scheduler timer.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.873359816@kernel.org
---
 kernel/time/hrtimer.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5e45982..b94bd56 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1203,6 +1203,31 @@ static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_ba
 	return false;
 }
 
+/*
+ * Update in place has to retrieve the expiry times of the neighbour nodes
+ * if they exist. That is cache line neutral because the dequeue/enqueue
+ * operation is going to need the same cache lines. But there is a big win
+ * when the dequeue/enqueue can be avoided because the RB tree does not
+ * have to be rebalanced twice.
+ */
+static inline bool
+hrtimer_can_update_in_place(struct hrtimer *timer, struct hrtimer_clock_base *base, ktime_t expires)
+{
+	struct timerqueue_linked_node *next = timerqueue_linked_next(&timer->node);
+	struct timerqueue_linked_node *prev = timerqueue_linked_prev(&timer->node);
+
+	/* If the new expiry goes behind the next timer, requeue is required */
+	if (next && expires > next->expires)
+		return false;
+
+	/* If this is the first timer, update in place */
+	if (!prev)
+		return true;
+
+	/* Update in place when it does not go ahead of the previous one */
+	return expires >= prev->expires;
+}
+
 static inline bool
 remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 			     const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
@@ -1211,8 +1236,18 @@ remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *b
 
 	/* Remove it from the timer queue if active */
 	if (timer->is_queued) {
-		debug_hrtimer_deactivate(timer);
 		was_first = !timerqueue_linked_prev(&timer->node);
+
+		/* Try to update in place to avoid the de/enqueue dance */
+		if (hrtimer_can_update_in_place(timer, base, expires)) {
+			hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+			trace_hrtimer_start(timer, mode, true);
+			if (was_first)
+				base->expires_next = expires;
+			return was_first;
+		}
+
+		debug_hrtimer_deactivate(timer);
 		timerqueue_linked_del(&base->active, &timer->node);
 	}
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Use linked timerqueue
  2026-02-24 16:38 ` [patch 46/48] hrtimer: Use " Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     b7418e6e9b87b849af4df93d527ff83498d1e4c3
Gitweb:        https://git.kernel.org/tip/b7418e6e9b87b849af4df93d527ff83498d1e4c3
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:57 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:16 +01:00

hrtimer: Use linked timerqueue

To prepare for optimizing the rearming of enqueued timers, switch to the
linked timerqueue. That allows checking whether the new expiry time changes
the position of the timer in the RB tree, by comparing the new expiry time
against the expiry of the previous and the next timer.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.806643179@kernel.org
---
 include/linux/hrtimer_defs.h  | 16 ++++++++--------
 include/linux/hrtimer_types.h |  8 ++++----
 kernel/time/hrtimer.c         | 34 +++++++++++++++++-----------------
 kernel/time/timer_list.c      | 10 ++++------
 4 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index fb38df4..0f851b2 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -25,14 +25,14 @@
  * @offset:		offset of this clock to the monotonic base
  */
 struct hrtimer_clock_base {
-	struct hrtimer_cpu_base	*cpu_base;
-	unsigned int		index;
-	clockid_t		clockid;
-	seqcount_raw_spinlock_t	seq;
-	ktime_t			expires_next;
-	struct hrtimer		*running;
-	struct timerqueue_head	active;
-	ktime_t			offset;
+	struct hrtimer_cpu_base		*cpu_base;
+	unsigned int			index;
+	clockid_t			clockid;
+	seqcount_raw_spinlock_t		seq;
+	ktime_t				expires_next;
+	struct hrtimer			*running;
+	struct timerqueue_linked_head	active;
+	ktime_t				offset;
 } __hrtimer_clock_base_align;
 
 enum  hrtimer_base_type {
diff --git a/include/linux/hrtimer_types.h b/include/linux/hrtimer_types.h
index 0e22bc9..b5dacc8 100644
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -17,7 +17,7 @@ enum hrtimer_restart {
 
 /**
  * struct hrtimer - the basic hrtimer structure
- * @node:	timerqueue node, which also manages node.expires,
+ * @node:	Linked timerqueue node, which also manages node.expires,
  *		the absolute expiry time in the hrtimers internal
  *		representation. The time is related to the clock on
  *		which the timer is based. Is setup by adding
@@ -39,15 +39,15 @@ enum hrtimer_restart {
  * The hrtimer structure must be initialized by hrtimer_setup()
  */
 struct hrtimer {
-	struct timerqueue_node		node;
-	ktime_t				_softexpires;
-	enum hrtimer_restart		(*__private function)(struct hrtimer *);
+	struct timerqueue_linked_node	node;
 	struct hrtimer_clock_base	*base;
 	bool				is_queued;
 	bool				is_rel;
 	bool				is_soft;
 	bool				is_hard;
 	bool				is_lazy;
+	ktime_t				_softexpires;
+	enum hrtimer_restart		(*__private function)(struct hrtimer *);
 };
 
 #endif /* _LINUX_HRTIMER_TYPES_H */
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index d1e5848..5e45982 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -557,10 +557,10 @@ static ktime_t hrtimer_bases_next_event_without(struct hrtimer_cpu_base *cpu_bas
 		 * If the excluded timer is the first on this base evaluate the
 		 * next timer.
 		 */
-		struct timerqueue_node *node = timerqueue_getnext(&base->active);
+		struct timerqueue_linked_node *node = timerqueue_linked_first(&base->active);
 
 		if (unlikely(&exclude->node == node)) {
-			node = timerqueue_iterate_next(node);
+			node = timerqueue_linked_next(node);
 			if (!node)
 				continue;
 			expires = ktime_sub(node->expires, base->offset);
@@ -576,7 +576,7 @@ static ktime_t hrtimer_bases_next_event_without(struct hrtimer_cpu_base *cpu_bas
 
 static __always_inline struct hrtimer *clock_base_next_timer(struct hrtimer_clock_base *base)
 {
-	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
 
 	return container_of(next, struct hrtimer, node);
 }
@@ -938,9 +938,9 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int act
 	active &= cpu_base->active_bases;
 
 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *next;
+		struct timerqueue_linked_node *next;
 
-		next = timerqueue_getnext(&base->active);
+		next = timerqueue_linked_first(&base->active);
 		expires = ktime_sub(next->expires, base->offset);
 		if (expires < cpu_base->expires_next)
 			return true;
@@ -1112,7 +1112,7 @@ static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *ba
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
-	if (!timerqueue_add(&base->active, &timer->node))
+	if (!timerqueue_linked_add(&base->active, &timer->node))
 		return false;
 
 	base->expires_next = hrtimer_get_expires(timer);
@@ -1121,7 +1121,7 @@ static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *ba
 
 static inline void base_update_next_timer(struct hrtimer_clock_base *base)
 {
-	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
 
 	base->expires_next = next ? next->expires : KTIME_MAX;
 }
@@ -1148,9 +1148,9 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *b
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, newstate);
 
-	was_first = &timer->node == timerqueue_getnext(&base->active);
+	was_first = !timerqueue_linked_prev(&timer->node);
 
-	if (!timerqueue_del(&base->active, &timer->node))
+	if (!timerqueue_linked_del(&base->active, &timer->node))
 		cpu_base->active_bases &= ~(1 << base->index);
 
 	/* Nothing to update if this was not the first timer in the base */
@@ -1212,8 +1212,8 @@ remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *b
 	/* Remove it from the timer queue if active */
 	if (timer->is_queued) {
 		debug_hrtimer_deactivate(timer);
-		was_first = &timer->node == timerqueue_getnext(&base->active);
-		timerqueue_del(&base->active, &timer->node);
+		was_first = !timerqueue_linked_prev(&timer->node);
+		timerqueue_linked_del(&base->active, &timer->node);
 	}
 
 	/* Set the new expiry time */
@@ -1226,7 +1226,7 @@ remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *b
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
 	/* If it's the first expiring timer now or again, update base */
-	if (timerqueue_add(&base->active, &timer->node)) {
+	if (timerqueue_linked_add(&base->active, &timer->node)) {
 		base->expires_next = expires;
 		return true;
 	}
@@ -1758,7 +1758,7 @@ static void __hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*fn)(st
 	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
 	timer->is_lazy = !!(mode & HRTIMER_MODE_LAZY_REARM);
 	timer->base = &cpu_base->clock_base[base];
-	timerqueue_init(&timer->node);
+	timerqueue_linked_init(&timer->node);
 
 	if (WARN_ON_ONCE(!fn))
 		ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
@@ -1923,7 +1923,7 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_cloc
 
 static __always_inline struct hrtimer *clock_base_next_timer_safe(struct hrtimer_clock_base *base)
 {
-	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+	struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
 
 	return next ? container_of(next, struct hrtimer, node) : NULL;
 }
@@ -2369,7 +2369,7 @@ int hrtimers_prepare_cpu(unsigned int cpu)
 
 		clock_b->cpu_base = cpu_base;
 		seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock);
-		timerqueue_init_head(&clock_b->active);
+		timerqueue_linked_init_head(&clock_b->active);
 	}
 
 	cpu_base->cpu = cpu;
@@ -2399,10 +2399,10 @@ int hrtimers_cpu_starting(unsigned int cpu)
 static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 				struct hrtimer_clock_base *new_base)
 {
-	struct timerqueue_node *node;
+	struct timerqueue_linked_node *node;
 	struct hrtimer *timer;
 
-	while ((node = timerqueue_getnext(&old_base->active))) {
+	while ((node = timerqueue_linked_first(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
 		BUG_ON(hrtimer_callback_running(timer));
 		debug_hrtimer_deactivate(timer);
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 19e6182..e2e14fd 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -56,13 +56,11 @@ print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer,
 		(long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now));
 }
 
-static void
-print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
-		    u64 now)
+static void print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
 {
+	struct timerqueue_linked_node *curr;
 	struct hrtimer *timer, tmp;
 	unsigned long next = 0, i;
-	struct timerqueue_node *curr;
 	unsigned long flags;
 
 next_one:
@@ -72,13 +70,13 @@ next_one:
 
 	raw_spin_lock_irqsave(&base->cpu_base->lock, flags);
 
-	curr = timerqueue_getnext(&base->active);
+	curr = timerqueue_linked_first(&base->active);
 	/*
 	 * Crude but we have to do this O(N*N) thing, because
 	 * we have to unlock the base when printing:
 	 */
 	while (curr && i < next) {
-		curr = timerqueue_iterate_next(curr);
+		curr = timerqueue_linked_next(curr);
 		i++;
 	}
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] timerqueue: Provide linked timerqueue
  2026-02-24 16:38 ` [patch 45/48] timerqueue: Provide linked timerqueue Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     1339eeb73d6b99cf3aa9981f3f91d6ac4a49c72e
Gitweb:        https://git.kernel.org/tip/1339eeb73d6b99cf3aa9981f3f91d6ac4a49c72e
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:52 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:16 +01:00

timerqueue: Provide linked timerqueue

The hrtimer subsystem wants to peek ahead at the next and previous timer to
evaluate whether a to-be-rearmed timer can stay at the same position in
the RB tree with the new expiry time.

The linked RB tree provides the infrastructure for this as it maintains
links to the previous and next nodes for each entry in the tree.

Provide timerqueue wrappers around that.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.734827095@kernel.org
---
 include/linux/timerqueue.h       | 56 ++++++++++++++++++++++++++-----
 include/linux/timerqueue_types.h | 15 ++++++--
 lib/timerqueue.c                 | 14 ++++++++-
 3 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/timerqueue.h b/include/linux/timerqueue.h
index d306d9d..7d0aaa7 100644
--- a/include/linux/timerqueue.h
+++ b/include/linux/timerqueue.h
@@ -5,12 +5,11 @@
 #include <linux/rbtree.h>
 #include <linux/timerqueue_types.h>
 
-extern bool timerqueue_add(struct timerqueue_head *head,
-			   struct timerqueue_node *node);
-extern bool timerqueue_del(struct timerqueue_head *head,
-			   struct timerqueue_node *node);
-extern struct timerqueue_node *timerqueue_iterate_next(
-						struct timerqueue_node *node);
+bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node);
+bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node);
+struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node);
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node);
 
 /**
  * timerqueue_getnext - Returns the timer with the earliest expiration time
@@ -19,8 +18,7 @@ extern struct timerqueue_node *timerqueue_iterate_next(
  *
  * Returns a pointer to the timer node that has the earliest expiration time.
  */
-static inline
-struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
+static inline struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
 {
 	struct rb_node *leftmost = rb_first_cached(&head->rb_root);
 
@@ -41,4 +39,46 @@ static inline void timerqueue_init_head(struct timerqueue_head *head)
 {
 	head->rb_root = RB_ROOT_CACHED;
 }
+
+/* Timer queues with linked nodes */
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_first(struct timerqueue_linked_head *head)
+{
+	return rb_entry_safe(head->rb_root.rb_leftmost, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_next(struct timerqueue_linked_node *node)
+{
+	return rb_entry_safe(node->node.next, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_prev(struct timerqueue_linked_node *node)
+{
+	return rb_entry_safe(node->node.prev, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+bool timerqueue_linked_del(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+	return rb_erase_linked(&node->node, &head->rb_root);
+}
+
+static __always_inline void timerqueue_linked_init(struct timerqueue_linked_node *node)
+{
+	RB_CLEAR_LINKED_NODE(&node->node);
+}
+
+static __always_inline bool timerqueue_linked_node_queued(struct timerqueue_linked_node *node)
+{
+	return !RB_EMPTY_LINKED_NODE(&node->node);
+}
+
+static __always_inline void timerqueue_linked_init_head(struct timerqueue_linked_head *head)
+{
+	head->rb_root = RB_ROOT_LINKED;
+}
+
 #endif /* _LINUX_TIMERQUEUE_H */
diff --git a/include/linux/timerqueue_types.h b/include/linux/timerqueue_types.h
index dc298d0..be2218b 100644
--- a/include/linux/timerqueue_types.h
+++ b/include/linux/timerqueue_types.h
@@ -6,12 +6,21 @@
 #include <linux/types.h>
 
 struct timerqueue_node {
-	struct rb_node node;
-	ktime_t expires;
+	struct rb_node		node;
+	ktime_t			expires;
 };
 
 struct timerqueue_head {
-	struct rb_root_cached rb_root;
+	struct rb_root_cached	rb_root;
+};
+
+struct timerqueue_linked_node {
+	struct rb_node_linked		node;
+	ktime_t				expires;
+};
+
+struct timerqueue_linked_head {
+	struct rb_root_linked		rb_root;
 };
 
 #endif /* _LINUX_TIMERQUEUE_TYPES_H */
diff --git a/lib/timerqueue.c b/lib/timerqueue.c
index cdb9c76..e2a1e08 100644
--- a/lib/timerqueue.c
+++ b/lib/timerqueue.c
@@ -82,3 +82,17 @@ struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node)
 	return container_of(next, struct timerqueue_node, node);
 }
 EXPORT_SYMBOL_GPL(timerqueue_iterate_next);
+
+#define __node_2_tq_linked(_n) \
+	container_of(rb_entry((_n), struct rb_node_linked, node), struct timerqueue_linked_node, node)
+
+static __always_inline bool __tq_linked_less(struct rb_node *a, const struct rb_node *b)
+{
+	return __node_2_tq_linked(a)->expires < __node_2_tq_linked(b)->expires;
+}
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+	return rb_add_linked(&node->node, &head->rb_root, __tq_linked_less);
+}
+EXPORT_SYMBOL_GPL(timerqueue_linked_add);

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] rbtree: Provide rbtree with links
  2026-02-24 16:38 ` [patch 44/48] rbtree: Provide rbtree with links Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     671047943dce5af24e023bca3c5cc244d7565f5a
Gitweb:        https://git.kernel.org/tip/671047943dce5af24e023bca3c5cc244d7565f5a
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:47 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:16 +01:00

rbtree: Provide rbtree with links

Some RB tree users require quick access to the next and the previous node,
e.g. to check whether a modification of a node results in a change of the
node's position in the tree. If the position does not change, then the
modification can happen in place without going through a full dequeue
requeue cycle. An upcoming use case for this is the timer queues of the
hrtimer subsystem, as they can optimize for timers which are frequently
rearmed while enqueued.

This can obviously be achieved with rb_next() and rb_prev(), but those
turned out to be quite expensive for hotpath operations, depending on the
tree depth.

Add a linked RB tree variant where add() and erase() maintain the links
between the nodes. Like the cached variant, it provides a pointer to the
leftmost node in the root.

It intentionally does not use a [h]list head as there is no real need for
true list operations: the list is strictly coupled to the tree and cannot
be manipulated independently.

It sets the node's previous pointer to NULL for the leftmost node and the
next pointer to NULL for the rightmost node. This allows a quick check,
especially for the leftmost node, without consulting the list head address,
which creates better code.

Aside from the cached rb_leftmost pointer this could trivially provide an
rb_rightmost pointer as well, but there is no usage for that (yet).
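The link maintenance idea can be sketched with a plain sorted chain. This is a userspace illustration only, not the patch's rb_link_linked_node(); lnode and linked_insert are invented names. The point it shows: after insertion, the leftmost element is recognizable by prev == NULL and the rightmost by next == NULL, without consulting a list head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sorted doubly linked chain: insertion wires up prev/next, leftmost has
 * prev == NULL, rightmost has next == NULL. */
struct lnode {
	struct lnode *prev, *next;
	long key;
};

/* Insert in ascending key order; returns true when @n became leftmost */
static bool linked_insert(struct lnode **head, struct lnode *n)
{
	struct lnode *cur = *head, *prev = NULL;

	while (cur && cur->key <= n->key) {
		prev = cur;
		cur = cur->next;
	}
	n->prev = prev;
	n->next = cur;
	if (cur)
		cur->prev = n;
	if (prev)
		prev->next = n;
	else
		*head = n;
	return !n->prev;
}
```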

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.668401024@kernel.org
---
 include/linux/rbtree.h       | 81 +++++++++++++++++++++++++++++++----
 include/linux/rbtree_types.h | 16 +++++++-
 lib/rbtree.c                 | 17 +++++++-
 3 files changed, 105 insertions(+), 9 deletions(-)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 4091e97..48acdc3 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -35,10 +35,15 @@
 #define RB_CLEAR_NODE(node)  \
 	((node)->__rb_parent_color = (unsigned long)(node))
 
+#define RB_EMPTY_LINKED_NODE(lnode)  RB_EMPTY_NODE(&(lnode)->node)
+#define RB_CLEAR_LINKED_NODE(lnode)  ({					\
+	RB_CLEAR_NODE(&(lnode)->node);					\
+	(lnode)->prev = (lnode)->next = NULL;				\
+})
 
 extern void rb_insert_color(struct rb_node *, struct rb_root *);
 extern void rb_erase(struct rb_node *, struct rb_root *);
-
+extern bool rb_erase_linked(struct rb_node_linked *, struct rb_root_linked *);
 
 /* Find logical next and previous nodes in a tree */
 extern struct rb_node *rb_next(const struct rb_node *);
@@ -213,15 +218,10 @@ rb_add_cached(struct rb_node *node, struct rb_root_cached *tree,
 	return leftmost ? node : NULL;
 }
 
-/**
- * rb_add() - insert @node into @tree
- * @node: node to insert
- * @tree: tree to insert @node into
- * @less: operator defining the (partial) node order
- */
 static __always_inline void
-rb_add(struct rb_node *node, struct rb_root *tree,
-       bool (*less)(struct rb_node *, const struct rb_node *))
+__rb_add(struct rb_node *node, struct rb_root *tree,
+	 bool (*less)(struct rb_node *, const struct rb_node *),
+	 void (*linkop)(struct rb_node *, struct rb_node *, struct rb_node **))
 {
 	struct rb_node **link = &tree->rb_node;
 	struct rb_node *parent = NULL;
@@ -234,10 +234,73 @@ rb_add(struct rb_node *node, struct rb_root *tree,
 			link = &parent->rb_right;
 	}
 
+	linkop(node, parent, link);
 	rb_link_node(node, parent, link);
 	rb_insert_color(node, tree);
 }
 
+#define __node_2_linked_node(_n) \
+	rb_entry((_n), struct rb_node_linked, node)
+
+static inline void
+rb_link_linked_node(struct rb_node *node, struct rb_node *parent, struct rb_node **link)
+{
+	if (!parent)
+		return;
+
+	struct rb_node_linked *nnew = __node_2_linked_node(node);
+	struct rb_node_linked *npar = __node_2_linked_node(parent);
+
+	if (link == &parent->rb_left) {
+		nnew->prev = npar->prev;
+		nnew->next = npar;
+		npar->prev = nnew;
+		if (nnew->prev)
+			nnew->prev->next = nnew;
+	} else {
+		nnew->next = npar->next;
+		nnew->prev = npar;
+		npar->next = nnew;
+		if (nnew->next)
+			nnew->next->prev = nnew;
+	}
+}
+
+/**
+ * rb_add_linked() - insert @node into the leftmost linked tree @tree
+ * @node: node to insert
+ * @tree: linked tree to insert @node into
+ * @less: operator defining the (partial) node order
+ *
+ * Returns @true when @node is the new leftmost, @false otherwise.
+ */
+static __always_inline bool
+rb_add_linked(struct rb_node_linked *node, struct rb_root_linked *tree,
+	      bool (*less)(struct rb_node *, const struct rb_node *))
+{
+	__rb_add(&node->node, &tree->rb_root, less, rb_link_linked_node);
+	if (!node->prev)
+		tree->rb_leftmost = node;
+	return !node->prev;
+}
+
+/* Empty linkop function which is optimized away by the compiler */
+static __always_inline void
+rb_link_noop(struct rb_node *n, struct rb_node *p, struct rb_node **l) { }
+
+/**
+ * rb_add() - insert @node into @tree
+ * @node: node to insert
+ * @tree: tree to insert @node into
+ * @less: operator defining the (partial) node order
+ */
+static __always_inline void
+rb_add(struct rb_node *node, struct rb_root *tree,
+       bool (*less)(struct rb_node *, const struct rb_node *))
+{
+	__rb_add(node, tree, less, rb_link_noop);
+}
+
 /**
  * rb_find_add_cached() - find equivalent @node in @tree, or add @node
  * @node: node to look-for / insert
diff --git a/include/linux/rbtree_types.h b/include/linux/rbtree_types.h
index 45b6ecd..3c7ae53 100644
--- a/include/linux/rbtree_types.h
+++ b/include/linux/rbtree_types.h
@@ -9,6 +9,12 @@ struct rb_node {
 } __attribute__((aligned(sizeof(long))));
 /* The alignment might seem pointless, but allegedly CRIS needs it */
 
+struct rb_node_linked {
+	struct rb_node		node;
+	struct rb_node_linked	*prev;
+	struct rb_node_linked	*next;
+};
+
 struct rb_root {
 	struct rb_node *rb_node;
 };
@@ -28,7 +34,17 @@ struct rb_root_cached {
 	struct rb_node *rb_leftmost;
 };
 
+/*
+ * Leftmost tree with links. This would allow a trivial rb_rightmost update,
+ * but that has been omitted due to the lack of users.
+ */
+struct rb_root_linked {
+	struct rb_root		rb_root;
+	struct rb_node_linked	*rb_leftmost;
+};
+
 #define RB_ROOT (struct rb_root) { NULL, }
 #define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }
+#define RB_ROOT_LINKED (struct rb_root_linked) { {NULL, }, NULL }
 
 #endif
diff --git a/lib/rbtree.c b/lib/rbtree.c
index 18d42bc..5790d6e 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -446,6 +446,23 @@ void rb_erase(struct rb_node *node, struct rb_root *root)
 }
 EXPORT_SYMBOL(rb_erase);
 
+bool rb_erase_linked(struct rb_node_linked *node, struct rb_root_linked *root)
+{
+	if (node->prev)
+		node->prev->next = node->next;
+	else
+		root->rb_leftmost = node->next;
+
+	if (node->next)
+		node->next->prev = node->prev;
+
+	rb_erase(&node->node, &root->rb_root);
+	RB_CLEAR_LINKED_NODE(node);
+
+	return !!root->rb_leftmost;
+}
+EXPORT_SYMBOL_GPL(rb_erase_linked);
+
 /*
  * Augmented rbtree manipulation functions.
  *

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Optimize for_each_active_base()
  2026-02-24 16:38 ` [patch 43/48] hrtimer: Optimize for_each_active_base() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     3601a1d85028d7d479e1571419174fc3334f58f5
Gitweb:        https://git.kernel.org/tip/3601a1d85028d7d479e1571419174fc3334f58f5
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:42 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:15 +01:00

hrtimer: Optimize for_each_active_base()

Give the compiler some help to emit way better code.
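The replacement macro in the patch walks the set bits of the active mask with ffs(). The same bit-walking pattern can be sketched in userspace C; collect_active is a made-up name for illustration, and glibc's ffs() from <strings.h> is 1-based like the kernel's ffs().

```c
#include <strings.h>	/* ffs(): 1-based index of lowest set bit, 0 if none */

/* Sketch of the pattern behind for_each_active_base(): visit each set bit
 * of @active from lowest to highest, clearing bits as we go. */
static int collect_active(unsigned int active, unsigned int *out)
{
	int n = 0;

	/* ffs() returns idx+1, so the idx-- condition both tests for
	 * "no bit left" and converts to a 0-based index for the body. */
	for (unsigned int idx = ffs(active); idx--; idx = ffs(active)) {
		out[n++] = idx;
		active &= ~(1U << idx);
	}
	return n;
}
```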

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.599804894@kernel.org
---
 kernel/time/hrtimer.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index b0e7e29..d1e5848 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -529,22 +529,10 @@ static inline void debug_activate(struct hrtimer *timer, enum hrtimer_mode mode,
 	trace_hrtimer_start(timer, mode, was_armed);
 }
 
-static struct hrtimer_clock_base *
-__next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
-{
-	unsigned int idx;
-
-	if (!*active)
-		return NULL;
-
-	idx = __ffs(*active);
-	*active &= ~(1U << idx);
-
-	return &cpu_base->clock_base[idx];
-}
-
-#define for_each_active_base(base, cpu_base, active)		\
-	while ((base = __next_base((cpu_base), &(active))))
+#define for_each_active_base(base, cpu_base, active)					\
+	for (unsigned int idx = ffs(active); idx--; idx = ffs((active)))		\
+		for (bool done = false; !done; active &= ~(1U << idx))			\
+			for (base = &cpu_base->clock_base[idx]; !done; done = true)
 
 #if defined(CONFIG_NO_HZ_COMMON)
 /*

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Simplify run_hrtimer_queues()
  2026-02-24 16:38 ` [patch 42/48] hrtimer: Simplify run_hrtimer_queues() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     a64ad57e41c7e3daadbc2c1bc252d9a90c87222f
Gitweb:        https://git.kernel.org/tip/a64ad57e41c7e3daadbc2c1bc252d9a90c87222f
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:37 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:15 +01:00

hrtimer: Simplify run_hrtimer_queues()

Replace the open coded container_of() orgy with a trivial
clock_base_next_timer() helper.
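As a stand-alone illustration of what such a helper wraps (userspace sketch, not the kernel's definitions): container_of() recovers a pointer to the enclosing structure from a pointer to one of its embedded members. demo_timer and node_to_timer are invented names.

```c
#include <stddef.h>

/* Minimal userspace rendition of the kernel's container_of(): recover a
 * pointer to the enclosing structure from a pointer to a member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct tq_node { long expires; };

struct demo_timer {
	int id;
	struct tq_node node;	/* embedded, like hrtimer::node */
};

static struct demo_timer *node_to_timer(struct tq_node *n)
{
	return container_of(n, struct demo_timer, node);
}
```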

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.532927977@kernel.org
---
 kernel/time/hrtimer.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index aa1cb4f..b0e7e29 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1933,6 +1933,13 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_cloc
 	base->running = NULL;
 }
 
+static __always_inline struct hrtimer *clock_base_next_timer_safe(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+
+	return next ? container_of(next, struct hrtimer, node) : NULL;
+}
+
 static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 				 unsigned long flags, unsigned int active_mask)
 {
@@ -1940,16 +1947,10 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 	struct hrtimer_clock_base *base;
 
 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *node;
-		ktime_t basenow;
-
-		basenow = ktime_add(now, base->offset);
-
-		while ((node = timerqueue_getnext(&base->active))) {
-			struct hrtimer *timer;
-
-			timer = container_of(node, struct hrtimer, node);
+		ktime_t basenow = ktime_add(now, base->offset);
+		struct hrtimer *timer;
 
+		while ((timer = clock_base_next_timer(base))) {
 			/*
 			 * The immediate goal for using the softexpires is
 			 * minimizing wakeups, not running timers at the

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Rework next event evaluation
  2026-02-24 16:38 ` [patch 41/48] hrtimer: Rework next event evaluation Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     2bd1cc24fafc84be844c9ef66aa819d7dec285bf
Gitweb:        https://git.kernel.org/tip/2bd1cc24fafc84be844c9ef66aa819d7dec285bf
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:33 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:15 +01:00

hrtimer: Rework next event evaluation

The per clock base cached expiry time allows a more efficient evaluation of
the next expiry on a CPU.

Separate the reprogramming evaluation from the NOHZ idle evaluation, which
needs to exclude the NOHZ timer, to keep the reprogramming path lean and
clean.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.468186893@kernel.org
---
 kernel/time/hrtimer.c | 120 +++++++++++++++++++++++------------------
 1 file changed, 69 insertions(+), 51 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index d70899a..aa1cb4f 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,49 +546,67 @@ __next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
 #define for_each_active_base(base, cpu_base, active)		\
 	while ((base = __next_base((cpu_base), &(active))))
 
-static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
-					 const struct hrtimer *exclude,
-					 unsigned int active, ktime_t expires_next)
+#if defined(CONFIG_NO_HZ_COMMON)
+/*
+ * Same as hrtimer_bases_next_event() below, but skips the excluded timer and
+ * does not update cpu_base->next_timer/expires.
+ */
+static ktime_t hrtimer_bases_next_event_without(struct hrtimer_cpu_base *cpu_base,
+						const struct hrtimer *exclude,
+						unsigned int active, ktime_t expires_next)
 {
 	struct hrtimer_clock_base *base;
 	ktime_t expires;
 
+	lockdep_assert_held(&cpu_base->lock);
+
 	for_each_active_base(base, cpu_base, active) {
-		struct timerqueue_node *next;
-		struct hrtimer *timer;
+		expires = ktime_sub(base->expires_next, base->offset);
+		if (expires >= expires_next)
+			continue;
 
-		next = timerqueue_getnext(&base->active);
-		timer = container_of(next, struct hrtimer, node);
-		if (timer == exclude) {
-			/* Get to the next timer in the queue. */
-			next = timerqueue_iterate_next(next);
-			if (!next)
-				continue;
+		/*
+		 * If the excluded timer is the first on this base evaluate the
+		 * next timer.
+		 */
+		struct timerqueue_node *node = timerqueue_getnext(&base->active);
 
-			timer = container_of(next, struct hrtimer, node);
+		if (unlikely(&exclude->node == node)) {
+			node = timerqueue_iterate_next(node);
+			if (!node)
+				continue;
+			expires = ktime_sub(node->expires, base->offset);
+			if (expires >= expires_next)
+				continue;
 		}
-		expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
-		if (expires < expires_next) {
-			expires_next = expires;
+		expires_next = expires;
+	}
+	/* If base->offset changed, the result might be negative */
+	return max(expires_next, 0);
+}
+#endif
 
-			/* Skip cpu_base update if a timer is being excluded. */
-			if (exclude)
-				continue;
+static __always_inline struct hrtimer *clock_base_next_timer(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+
+	return container_of(next, struct hrtimer, node);
+}
 
-			if (timer->is_soft)
-				cpu_base->softirq_next_timer = timer;
-			else
-				cpu_base->next_timer = timer;
+/* Find the base with the earliest expiry */
+static void hrtimer_bases_first(struct hrtimer_cpu_base *cpu_base, unsigned int active,
+				ktime_t *expires_next, struct hrtimer **next_timer)
+{
+	struct hrtimer_clock_base *base;
+	ktime_t expires;
+
+	for_each_active_base(base, cpu_base, active) {
+		expires = ktime_sub(base->expires_next, base->offset);
+		if (expires < *expires_next) {
+			*expires_next = expires;
+			*next_timer = clock_base_next_timer(base);
 		}
 	}
-	/*
-	 * clock_was_set() might have changed base->offset of any of
-	 * the clock bases so the result might be negative. Fix it up
-	 * to prevent a false positive in clockevents_program_event().
-	 */
-	if (expires_next < 0)
-		expires_next = 0;
-	return expires_next;
 }
 
 /*
@@ -617,19 +635,22 @@ static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsig
 	ktime_t expires_next = KTIME_MAX;
 	unsigned int active;
 
+	lockdep_assert_held(&cpu_base->lock);
+
 	if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-		cpu_base->softirq_next_timer = NULL;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, KTIME_MAX);
-		next_timer = cpu_base->softirq_next_timer;
+		if (active)
+			hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
+		cpu_base->softirq_next_timer = next_timer;
 	}
 
 	if (active_mask & HRTIMER_ACTIVE_HARD) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+		if (active)
+			hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
 		cpu_base->next_timer = next_timer;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, expires_next);
 	}
-	return expires_next;
+	return max(expires_next, 0);
 }
 
 static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
@@ -724,11 +745,7 @@ static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtime
 	hrtimer_rearm_event(expires_next, false);
 }
 
-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
+/* Reprogram the event source with an evaluation of all clock bases */
 static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, bool skip_equal)
 {
 	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
@@ -1662,19 +1679,20 @@ u64 hrtimer_next_event_without(const struct hrtimer *exclude)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
+	unsigned int active;
 
 	guard(raw_spinlock_irqsave)(&cpu_base->lock);
-	if (hrtimer_hres_active(cpu_base)) {
-		unsigned int active;
+	if (!hrtimer_hres_active(cpu_base))
+		return expires;
 
-		if (!cpu_base->softirq_activated) {
-			active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-			expires = __hrtimer_next_event_base(cpu_base, exclude, active, KTIME_MAX);
-		}
-		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
-		expires = __hrtimer_next_event_base(cpu_base, exclude, active, expires);
-	}
-	return expires;
+	active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+	if (active && !cpu_base->softirq_activated)
+		expires = hrtimer_bases_next_event_without(cpu_base, exclude, active, KTIME_MAX);
+
+	active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+	if (!active)
+		return expires;
+	return hrtimer_bases_next_event_without(cpu_base, exclude, active, expires);
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Keep track of first expiring timer per clock base
  2026-02-24 16:38 ` [patch 40/48] hrtimer: Keep track of first expiring timer per clock base Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     eddffab8282e388dddf032f3295fcec87eb08095
Gitweb:        https://git.kernel.org/tip/eddffab8282e388dddf032f3295fcec87eb08095
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:28 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:14 +01:00

hrtimer: Keep track of first expiring timer per clock base

Evaluating the next expiry time of all clock bases is cache-line expensive,
as the expiry time of the first expiring timer is not cached in the base
and requires accessing the timer itself, which is definitely in a different
cache line.

It's way more efficient to keep track of the expiry time on enqueue and
dequeue operations as the relevant data is already in the cache at that
point.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.404839710@kernel.org
---
 include/linux/hrtimer_defs.h |  2 ++-
 kernel/time/hrtimer.c        | 37 ++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index b6846ef..fb38df4 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -19,6 +19,7 @@
  *			timer to a base on another cpu.
  * @clockid:		clock id for per_cpu support
  * @seq:		seqcount around __run_hrtimer
+ * @expires_next:	Absolute time of the next event in this clock base
  * @running:		pointer to the currently running hrtimer
  * @active:		red black tree root node for the active timers
  * @offset:		offset of this clock to the monotonic base
@@ -28,6 +29,7 @@ struct hrtimer_clock_base {
 	unsigned int		index;
 	clockid_t		clockid;
 	seqcount_raw_spinlock_t	seq;
+	ktime_t			expires_next;
 	struct hrtimer		*running;
 	struct timerqueue_head	active;
 	ktime_t			offset;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index e9592cb..d70899a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1107,7 +1107,18 @@ static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *ba
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
-	return timerqueue_add(&base->active, &timer->node);
+	if (!timerqueue_add(&base->active, &timer->node))
+		return false;
+
+	base->expires_next = hrtimer_get_expires(timer);
+	return true;
+}
+
+static inline void base_update_next_timer(struct hrtimer_clock_base *base)
+{
+	struct timerqueue_node *next = timerqueue_getnext(&base->active);
+
+	base->expires_next = next ? next->expires : KTIME_MAX;
 }
 
 /*
@@ -1122,6 +1133,7 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *b
 			     bool newstate, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
+	bool was_first;
 
 	lockdep_assert_held(&cpu_base->lock);
 
@@ -1131,9 +1143,17 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *b
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, newstate);
 
+	was_first = &timer->node == timerqueue_getnext(&base->active);
+
 	if (!timerqueue_del(&base->active, &timer->node))
 		cpu_base->active_bases &= ~(1 << base->index);
 
+	/* Nothing to update if this was not the first timer in the base */
+	if (!was_first)
+		return;
+
+	base_update_next_timer(base);
+
 	/*
 	 * If reprogram is false don't update cpu_base->next_timer and do not
 	 * touch the clock event device.
@@ -1182,9 +1202,12 @@ static inline bool
 remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 			     const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
 {
+	bool was_first = false;
+
 	/* Remove it from the timer queue if active */
 	if (timer->is_queued) {
 		debug_hrtimer_deactivate(timer);
+		was_first = &timer->node == timerqueue_getnext(&base->active);
 		timerqueue_del(&base->active, &timer->node);
 	}
 
@@ -1197,8 +1220,16 @@ remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *b
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
-	/* Returns true if this is the first expiring timer */
-	return timerqueue_add(&base->active, &timer->node);
+	/* If it's the first expiring timer now or again, update base */
+	if (timerqueue_add(&base->active, &timer->node)) {
+		base->expires_next = expires;
+		return true;
+	}
+
+	if (was_first)
+		base_update_next_timer(base);
+
+	return false;
 }
 
 static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Avoid re-evaluation when nothing changed
  2026-02-24 16:38 ` [patch 39/48] hrtimer: Avoid re-evaluation when nothing changed Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     b95c4442b02162904e9012e670b602ebeb3c6c1b
Gitweb:        https://git.kernel.org/tip/b95c4442b02162904e9012e670b602ebeb3c6c1b
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:23 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:14 +01:00

hrtimer: Avoid re-evaluation when nothing changed

Most of the time there is no change between hrtimer_interrupt() deferring the
rearm and the invocation of hrtimer_rearm_deferred(). In those cases it is a
pointless exercise to re-evaluate the next expiring timer.

Cache the required data and use it if nothing changed.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.338569372@kernel.org
---
 include/linux/hrtimer_defs.h | 53 +++++++++++++++++------------------
 kernel/time/hrtimer.c        | 45 ++++++++++++++++++++----------
 2 files changed, 58 insertions(+), 40 deletions(-)

diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index 2c3bdbd..b6846ef 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -47,32 +47,31 @@ enum  hrtimer_base_type {
 
 /**
  * struct hrtimer_cpu_base - the per cpu clock bases
- * @lock:		lock protecting the base and associated clock bases
- *			and timers
- * @cpu:		cpu number
- * @active_bases:	Bitfield to mark bases with active timers
- * @clock_was_set_seq:	Sequence counter of clock was set events
- * @hres_active:	State of high resolution mode
- * @deferred_rearm:	A deferred rearm is pending
- * @hang_detected:	The last hrtimer interrupt detected a hang
- * @softirq_activated:	displays, if the softirq is raised - update of softirq
- *			related settings is not required then.
- * @nr_events:		Total number of hrtimer interrupt events
- * @nr_retries:		Total number of hrtimer interrupt retries
- * @nr_hangs:		Total number of hrtimer interrupt hangs
- * @max_hang_time:	Maximum time spent in hrtimer_interrupt
- * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
- *			 expired
- * @online:		CPU is online from an hrtimers point of view
- * @timer_waiters:	A hrtimer_cancel() invocation waits for the timer
- *			callback to finish.
- * @expires_next:	absolute time of the next event, is required for remote
- *			hrtimer enqueue; it is the total first expiry time (hard
- *			and soft hrtimer are taken into account)
- * @next_timer:		Pointer to the first expiring timer
- * @softirq_expires_next: Time to check, if soft queues needs also to be expired
- * @softirq_next_timer: Pointer to the first expiring softirq based timer
- * @clock_base:		array of clock bases for this cpu
+ * @lock:			lock protecting the base and associated clock bases and timers
+ * @cpu:			cpu number
+ * @active_bases:		Bitfield to mark bases with active timers
+ * @clock_was_set_seq:		Sequence counter of clock was set events
+ * @hres_active:		State of high resolution mode
+ * @deferred_rearm:		A deferred rearm is pending
+ * @deferred_needs_update:	The deferred rearm must re-evaluate the first timer
+ * @hang_detected:		The last hrtimer interrupt detected a hang
+ * @softirq_activated:		displays, if the softirq is raised - update of softirq
+ *				related settings is not required then.
+ * @nr_events:			Total number of hrtimer interrupt events
+ * @nr_retries:			Total number of hrtimer interrupt retries
+ * @nr_hangs:			Total number of hrtimer interrupt hangs
+ * @max_hang_time:		Maximum time spent in hrtimer_interrupt
+ * @softirq_expiry_lock:	Lock which is taken while softirq based hrtimer are expired
+ * @online:			CPU is online from an hrtimers point of view
+ * @timer_waiters:		A hrtimer_cancel() invocation waits for the timer callback to finish.
+ * @expires_next:		Absolute time of the next event, is required for remote
+ *				hrtimer enqueue; it is the total first expiry time (hard
+ *				and soft hrtimer are taken into account)
+ * @next_timer:			Pointer to the first expiring timer
+ * @softirq_expires_next:	Time to check, if soft queues needs also to be expired
+ * @softirq_next_timer:		Pointer to the first expiring softirq based timer
+ * @deferred_expires_next:	Cached expires next value for deferred rearm
+ * @clock_base:			Array of clock bases for this cpu
  *
  * Note: next_timer is just an optimization for __remove_hrtimer().
  *	 Do not dereference the pointer because it is not reliable on
@@ -85,6 +84,7 @@ struct hrtimer_cpu_base {
 	unsigned int			clock_was_set_seq;
 	bool				hres_active;
 	bool				deferred_rearm;
+	bool				deferred_needs_update;
 	bool				hang_detected;
 	bool				softirq_activated;
 	bool				online;
@@ -102,6 +102,7 @@ struct hrtimer_cpu_base {
 	struct hrtimer			*next_timer;
 	ktime_t				softirq_expires_next;
 	struct hrtimer			*softirq_next_timer;
+	ktime_t				deferred_expires_next;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
 	call_single_data_t		csd;
 } ____cacheline_aligned;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 2e5f0e2..e9592cb 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -919,8 +919,10 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int act
 		return false;
 
 	/* If a deferred rearm is pending the remote CPU will take care of it */
-	if (cpu_base->deferred_rearm)
+	if (cpu_base->deferred_rearm) {
+		cpu_base->deferred_needs_update = true;
 		return false;
+	}
 
 	/*
 	 * Walk the affected clock bases and check whether the first expiring
@@ -1141,7 +1143,12 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *b
 	 * a local timer is removed to be immediately restarted. That's handled
 	 * at the call site.
 	 */
-	if (reprogram && timer == cpu_base->next_timer && !timer->is_lazy)
+	if (!reprogram || timer != cpu_base->next_timer || timer->is_lazy)
+		return;
+
+	if (cpu_base->deferred_rearm)
+		cpu_base->deferred_needs_update = true;
+	else
 		hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
 }
 
@@ -1328,8 +1335,10 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 	}
 
 	/* If a deferred rearm is pending skip reprogramming the device */
-	if (cpu_base->deferred_rearm)
+	if (cpu_base->deferred_rearm) {
+		cpu_base->deferred_needs_update = true;
 		return false;
+	}
 
 	if (!was_first || cpu_base != this_cpu_base) {
 		/*
@@ -1939,8 +1948,7 @@ static __latent_entropy void hrtimer_run_softirq(void)
  * Very similar to hrtimer_force_reprogram(), except it deals with
  * deferred_rearm and hang_detected.
  */
-static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now,
-			  ktime_t expires_next, bool deferred)
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next, bool deferred)
 {
 	cpu_base->expires_next = expires_next;
 	cpu_base->deferred_rearm = false;
@@ -1950,7 +1958,7 @@ static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 		 * Give the system a chance to do something else than looping
 		 * on hrtimer interrupts.
 		 */
-		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
+		expires_next = ktime_add_ns(ktime_get(), 100 * NSEC_PER_MSEC);
 		cpu_base->hang_detected = false;
 	}
 	hrtimer_rearm_event(expires_next, deferred);
@@ -1960,27 +1968,36 @@ static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 void __hrtimer_rearm_deferred(void)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
-	ktime_t now, expires_next;
+	ktime_t expires_next;
 
 	if (!cpu_base->deferred_rearm)
 		return;
 
 	guard(raw_spinlock)(&cpu_base->lock);
-	now = hrtimer_update_base(cpu_base);
-	expires_next = hrtimer_update_next_event(cpu_base);
-	hrtimer_rearm(cpu_base, now, expires_next, true);
+	if (cpu_base->deferred_needs_update) {
+		hrtimer_update_base(cpu_base);
+		expires_next = hrtimer_update_next_event(cpu_base);
+	} else {
+		/* No timer added/removed. Use the cached value */
+		expires_next = cpu_base->deferred_expires_next;
+	}
+	hrtimer_rearm(cpu_base, expires_next, true);
 }
 
 static __always_inline void
-hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
 {
+	/* hrtimer_interrupt() just re-evaluated the first expiring timer */
+	cpu_base->deferred_needs_update = false;
+	/* Cache the expiry time */
+	cpu_base->deferred_expires_next = expires_next;
 	set_thread_flag(TIF_HRTIMER_REARM);
 }
 #else  /* CONFIG_HRTIMER_REARM_DEFERRED */
 static __always_inline void
-hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
 {
-	hrtimer_rearm(cpu_base, now, expires_next, false);
+	hrtimer_rearm(cpu_base, expires_next, false);
 }
 #endif  /* !CONFIG_HRTIMER_REARM_DEFERRED */
 
@@ -2041,7 +2058,7 @@ retry:
 		cpu_base->hang_detected = true;
 	}
 
-	hrtimer_interrupt_rearm(cpu_base, now, expires_next);
+	hrtimer_interrupt_rearm(cpu_base, expires_next);
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 }
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Push reprogramming timers into the interrupt return path
  2026-02-24 16:38 ` [patch 38/48] hrtimer: Push reprogramming timers into the interrupt return path Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     15dd3a9488557d3e6ebcecacab79f4e56b69ab54
Gitweb:        https://git.kernel.org/tip/15dd3a9488557d3e6ebcecacab79f4e56b69ab54
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:18 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:14 +01:00

hrtimer: Push reprogramming timers into the interrupt return path

Currently hrtimer_interrupt() runs expired timers, which can re-arm
themselves, after which it computes the next expiration time and
re-programs the hardware.

However, things like HRTICK, a highres timer driving preemption, cannot
re-arm itself at the point of running, since the next task has not been
determined yet. The schedule() in the interrupt return path will switch to
the next task, which then causes a new hrtimer to be programmed.

This then results in reprogramming the hardware at least twice, once after
running the timers, and once upon selecting the new task.

Notably, *both* events happen in the interrupt.

By pushing the hrtimer reprogram all the way into the interrupt return
path, it runs after schedule() picks the new task and the double reprogram
can be avoided.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.273488269@kernel.org
---
 include/asm-generic/thread_info_tif.h |  5 +-
 include/linux/hrtimer_rearm.h         | 72 ++++++++++++++++++++++++--
 kernel/time/Kconfig                   |  4 +-
 kernel/time/hrtimer.c                 | 38 ++++++++++++--
 4 files changed, 107 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h
index da1610a..528e6fc 100644
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -41,11 +41,14 @@
 #define _TIF_PATCH_PENDING	BIT(TIF_PATCH_PENDING)
 
 #ifdef HAVE_TIF_RESTORE_SIGMASK
-# define TIF_RESTORE_SIGMASK	10	// Restore signal mask in do_signal() */
+# define TIF_RESTORE_SIGMASK	10	// Restore signal mask in do_signal()
 # define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #endif
 
 #define TIF_RSEQ		11	// Run RSEQ fast path
 #define _TIF_RSEQ		BIT(TIF_RSEQ)
 
+#define TIF_HRTIMER_REARM	12       // re-arm the timer
+#define _TIF_HRTIMER_REARM	BIT(TIF_HRTIMER_REARM)
+
 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
diff --git a/include/linux/hrtimer_rearm.h b/include/linux/hrtimer_rearm.h
index 6293076..a6f2e5d 100644
--- a/include/linux/hrtimer_rearm.h
+++ b/include/linux/hrtimer_rearm.h
@@ -3,12 +3,74 @@
 #define _LINUX_HRTIMER_REARM_H
 
 #ifdef CONFIG_HRTIMER_REARM_DEFERRED
-static __always_inline void __hrtimer_rearm_deferred(void) { }
-static __always_inline void hrtimer_rearm_deferred(void) { }
-static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+#include <linux/thread_info.h>
+
+void __hrtimer_rearm_deferred(void);
+
+/*
+ * This is purely CPU local, so check the TIF bit first to avoid the overhead of
+ * the atomic test_and_clear_bit() operation for the common case where the bit
+ * is not set.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred_tif(unsigned long tif_work)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (unlikely(tif_work & _TIF_HRTIMER_REARM)) {
+		clear_thread_flag(TIF_HRTIMER_REARM);
+		return true;
+	}
+	return false;
+}
+
+#define TIF_REARM_MASK	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_HRTIMER_REARM)
+
+/* Invoked from the exit to user before invoking exit_to_user_mode_loop() */
 static __always_inline bool
-hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
-static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask)
+{
+	/* Help the compiler to optimize the function out for syscall returns */
+	if (!(tif_mask & _TIF_HRTIMER_REARM))
+		return false;
+	/*
+	 * Rearm the timer if none of the resched flags is set before going into
+	 * the loop which re-enables interrupts.
+	 */
+	if (unlikely((*tif_work & TIF_REARM_MASK) == _TIF_HRTIMER_REARM)) {
+		clear_thread_flag(TIF_HRTIMER_REARM);
+		__hrtimer_rearm_deferred();
+		/* Don't go into the loop if HRTIMER_REARM was the only flag */
+		*tif_work &= ~_TIF_HRTIMER_REARM;
+		return !*tif_work;
+	}
+	return false;
+}
+
+/* Invoked from the time slice extension decision function */
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work)
+{
+	if (hrtimer_test_and_clear_rearm_deferred_tif(tif_work))
+		__hrtimer_rearm_deferred();
+}
+
+/*
+ * This is to be called on all irqentry_exit() paths that will enable
+ * interrupts.
+ */
+static __always_inline void hrtimer_rearm_deferred(void)
+{
+	hrtimer_rearm_deferred_tif(read_thread_flags());
+}
+
+/*
+ * Invoked from the scheduler on entry to __schedule() so it can defer
+ * rearming after the load balancing callbacks which might change hrtick.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void)
+{
+	return hrtimer_test_and_clear_rearm_deferred_tif(read_thread_flags());
+}
+
 #else  /* CONFIG_HRTIMER_REARM_DEFERRED */
 static __always_inline void __hrtimer_rearm_deferred(void) { }
 static __always_inline void hrtimer_rearm_deferred(void) { }
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b95bfee..6d6aace 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -60,7 +60,9 @@ config GENERIC_CMOS_UPDATE
 
 # Deferred rearming of the hrtimer interrupt
 config HRTIMER_REARM_DEFERRED
-       def_bool n
+       def_bool y
+       depends on GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+       depends on HIGH_RES_TIMERS && SCHED_HRTICK
 
 # Select to handle posix CPU timers from task_work
 # and not from the timer interrupt context
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 6f05d25..2e5f0e2 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1939,10 +1939,9 @@ static __latent_entropy void hrtimer_run_softirq(void)
  * Very similar to hrtimer_force_reprogram(), except it deals with
  * deferred_rearm and hang_detected.
  */
-static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now,
+			  ktime_t expires_next, bool deferred)
 {
-	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
-
 	cpu_base->expires_next = expires_next;
 	cpu_base->deferred_rearm = false;
 
@@ -1954,9 +1953,37 @@ static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
 		cpu_base->hang_detected = false;
 	}
-	hrtimer_rearm_event(expires_next, false);
+	hrtimer_rearm_event(expires_next, deferred);
+}
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+void __hrtimer_rearm_deferred(void)
+{
+	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+	ktime_t now, expires_next;
+
+	if (!cpu_base->deferred_rearm)
+		return;
+
+	guard(raw_spinlock)(&cpu_base->lock);
+	now = hrtimer_update_base(cpu_base);
+	expires_next = hrtimer_update_next_event(cpu_base);
+	hrtimer_rearm(cpu_base, now, expires_next, true);
 }
 
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+{
+	set_thread_flag(TIF_HRTIMER_REARM);
+}
+#else  /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now, ktime_t expires_next)
+{
+	hrtimer_rearm(cpu_base, now, expires_next, false);
+}
+#endif  /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
 /*
  * High resolution timer interrupt
  * Called with interrupts disabled
@@ -2014,9 +2041,10 @@ retry:
 		cpu_base->hang_detected = true;
 	}
 
-	hrtimer_rearm(cpu_base, now);
+	hrtimer_interrupt_rearm(cpu_base, now, expires_next);
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 }
+
 #endif /* !CONFIG_HIGH_RES_TIMERS */
 
 /*

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] sched/core: Prepare for deferred hrtimer rearming
  2026-02-24 16:38 ` [patch 37/48] sched/core: " Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     b0a44fa5e2a22ff67752bbc08c651a2efac3e5fe
Gitweb:        https://git.kernel.org/tip/b0a44fa5e2a22ff67752bbc08c651a2efac3e5fe
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:12 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:13 +01:00

sched/core: Prepare for deferred hrtimer rearming

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK is
enabled then schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, the rearm
can be deferred further until the end of schedule(), which avoids multiple
rearms and re-evaluations of the timer queues.

Add the rearm checks to the existing hrtick_schedule_enter/exit() functions,
which already handle the batched rearm of the hrtick timer.

For now this is just placing empty stubs at the right places which are all
optimized out by the compiler until the guard condition becomes true.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.208580085@kernel.org
---
 kernel/sched/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d1239a..49a64b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -876,6 +876,7 @@ enum {
 	HRTICK_SCHED_NONE		= 0,
 	HRTICK_SCHED_DEFER		= BIT(1),
 	HRTICK_SCHED_START		= BIT(2),
+	HRTICK_SCHED_REARM_HRTIMER	= BIT(3)
 };
 
 static void hrtick_clear(struct rq *rq)
@@ -974,6 +975,8 @@ void hrtick_start(struct rq *rq, u64 delay)
 static inline void hrtick_schedule_enter(struct rq *rq)
 {
 	rq->hrtick_sched = HRTICK_SCHED_DEFER;
+	if (hrtimer_test_and_clear_rearm_deferred())
+		rq->hrtick_sched |= HRTICK_SCHED_REARM_HRTIMER;
 }
 
 static inline void hrtick_schedule_exit(struct rq *rq)
@@ -991,6 +994,9 @@ static inline void hrtick_schedule_exit(struct rq *rq)
 			hrtimer_cancel(&rq->hrtick_timer);
 	}
 
+	if (rq->hrtick_sched & HRTICK_SCHED_REARM_HRTIMER)
+		__hrtimer_rearm_deferred();
+
 	rq->hrtick_sched = HRTICK_SCHED_NONE;
 }
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] softirq: Prepare for deferred hrtimer rearming
  2026-02-24 16:38 ` [patch 36/48] softirq: " Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     7e641e52cf5f284706514f789df8c497aea984e1
Gitweb:        https://git.kernel.org/tip/7e641e52cf5f284706514f789df8c497aea984e1
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:07 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:13 +01:00

softirq: Prepare for deferred hrtimer rearming

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but when an expired timer sets NEED_RESCHED, the
return from interrupt ends up in schedule(). If HRTICK is enabled then
schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, the
rearming can be deferred further until the end of schedule(), which avoids
multiple rearms and re-evaluation of the timer wheel.

If the return from interrupt ends up handling softirqs before reaching the
rearm points in the return-to-user entry code, a deferred rearm has to be
performed before softirq handling enables interrupts: softirq processing
can run for a long time and would otherwise introduce hard to diagnose
latencies into the timer interrupt.

Place the stub call, which is empty for now, right before invoking the
softirq handling routine.
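The ordering constraint above can be expressed as a tiny stand-alone model
(all names here are illustrative, not the kernel's): on irq exit with
softirqs pending, the deferred rearm must be observed strictly before
softirq processing starts, so a long softirq run cannot delay the next
timer event:

```c
#include <assert.h>
#include <string.h>

static char event_log[64];	/* records the order of operations */

static void record(const char *ev)
{
	strcat(event_log, ev);
	strcat(event_log, ";");
}

/* Stand-ins for hrtimer_rearm_deferred() and invoke_softirq() */
static void model_rearm_deferred(void) { record("rearm"); }
static void model_invoke_softirq(void) { record("softirq"); }

static void model_irq_exit(int softirq_pending)
{
	if (softirq_pending) {
		/* Arm the hardware first, then let softirqs run long */
		model_rearm_deferred();
		model_invoke_softirq();
	}
}
```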

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.142854488@kernel.org
---
 kernel/softirq.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7719891..4425d8d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -663,6 +663,13 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
+	/*
+	 * If this is a nested interrupt that hits the exit_to_user_mode_loop
+	 * where it has enabled interrupts but before it has hit schedule() we
+	 * could have hrtimers in an undefined state. Fix it up here.
+	 */
+	hrtimer_rearm_deferred();
+
 	if (tick_nohz_full_cpu(smp_processor_id()) ||
 	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
@@ -719,8 +726,14 @@ static inline void __irq_exit_rcu(void)
 #endif
 	account_hardirq_exit(current);
 	preempt_count_sub(HARDIRQ_OFFSET);
-	if (!in_interrupt() && local_softirq_pending())
+	if (!in_interrupt() && local_softirq_pending()) {
+		/*
+		 * If we left hrtimers unarmed, make sure to arm them now,
+		 * before enabling interrupts to run SoftIRQ.
+		 */
+		hrtimer_rearm_deferred();
 		invoke_softirq();
+	}
 
 	if (IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) && force_irqthreads() &&
 	    local_timers_pending_force_th() && !(in_nmi() | in_hardirq()))


* [tip: sched/hrtick] entry: Prepare for deferred hrtimer rearming
  2026-02-24 16:38 ` [patch 35/48] entry: Prepare for deferred hrtimer rearming Thomas Gleixner
  2026-02-27 15:57   ` Christian Loehle
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     0e98eb14814ef669e07ca6effaa03df2e57ef956
Gitweb:        https://git.kernel.org/tip/0e98eb14814ef669e07ca6effaa03df2e57ef956
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:38:03 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:13 +01:00

entry: Prepare for deferred hrtimer rearming

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but when an expired timer sets NEED_RESCHED, the
return from interrupt ends up in schedule(). If HRTICK is enabled then
schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, the
rearming can be deferred further until the end of schedule(), which avoids
multiple rearms and re-evaluation of the timer wheel.

As this is only relevant for the interrupt-to-user return path, split the
work masks up and hand them in as arguments from the relevant exit-to-user
functions. This allows the compiler to optimize the deferred handling out
for the syscall exit-to-user case.

Add the rearm checks to the appropriate places in the exit-to-user loop and
the interrupt return-to-kernel path, so that the rearming is always
guaranteed.

In the return to user space path this is handled in the same way as
TIF_RSEQ to avoid extra instructions in the fast path. Those are truly
hurtful for device-interrupt-heavy workloads: the extra instructions and
conditionals, while benign at first sight, quickly accumulate into
measurable regressions. The return from syscall path is completely
unaffected due to the above mentioned split, so syscall-heavy workloads
won't carry any extra burden.

For now this just places empty stubs at the right places; they are all
optimized out by the compiler until the actual functionality is in place.
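The effect of the work-mask split can be illustrated with a stand-alone
sketch (the bit positions below are made up; only the set relationship
between the masks matters): because the syscall mask never contains the
rearm bit and the mask is a compile-time constant, the corresponding
check folds away entirely on the syscall exit path:

```c
#include <assert.h>

/* Illustrative TIF bits; real positions are architecture specific */
#define TIF_NEED_RESCHED	(1UL << 0)
#define TIF_SIGPENDING		(1UL << 1)
#define TIF_HRTIMER_REARM	(1UL << 7)

#define EXIT_WORK_COMMON	(TIF_NEED_RESCHED | TIF_SIGPENDING)
/* Syscall exit never sees the rearm bit ... */
#define EXIT_WORK_SYSCALL	(EXIT_WORK_COMMON)
/* ... interrupt exit does */
#define EXIT_WORK_IRQ		(EXIT_WORK_COMMON | TIF_HRTIMER_REARM)

/* With a constant mask the compiler can prove this is 0 on one path */
static unsigned long pending_work(unsigned long ti_work, unsigned long mask)
{
	return ti_work & mask;
}
```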

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.066469985@kernel.org
---
 include/linux/irq-entry-common.h | 25 +++++++++++++++++++------
 include/linux/rseq_entry.h       | 16 +++++++++++++---
 kernel/entry/common.c            |  4 +++-
 3 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index d26d1b1..b976946 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -3,6 +3,7 @@
 #define __LINUX_IRQENTRYCOMMON_H
 
 #include <linux/context_tracking.h>
+#include <linux/hrtimer_rearm.h>
 #include <linux/kmsan.h>
 #include <linux/rseq_entry.h>
 #include <linux/static_call_types.h>
@@ -33,6 +34,14 @@
 	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ |		\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+# define EXIT_TO_USER_MODE_WORK_SYSCALL	(EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ	(EXIT_TO_USER_MODE_WORK | _TIF_HRTIMER_REARM)
+#else
+# define EXIT_TO_USER_MODE_WORK_SYSCALL	(EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ	(EXIT_TO_USER_MODE_WORK)
+#endif
+
 /**
  * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
  * @regs:	Pointer to currents pt_regs
@@ -203,6 +212,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
 /**
  * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
+ * @work_mask:	Which TIF bits need to be evaluated
  *
  * 1) check that interrupts are disabled
  * 2) call tick_nohz_user_enter_prepare()
@@ -212,7 +222,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
  *
  * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
  */
-static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs,
+							const unsigned long work_mask)
 {
 	unsigned long ti_work;
 
@@ -222,8 +233,10 @@ static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
 	tick_nohz_user_enter_prepare();
 
 	ti_work = read_thread_flags();
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+	if (unlikely(ti_work & work_mask)) {
+		if (!hrtimer_rearm_deferred_user_irq(&ti_work, work_mask))
+			ti_work = exit_to_user_mode_loop(regs, ti_work);
+	}
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 }
@@ -239,7 +252,7 @@ static __always_inline void __exit_to_user_mode_validate(void)
 /* Temporary workaround to keep ARM64 alive */
 static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
 {
-	__exit_to_user_mode_prepare(regs);
+	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);
 	rseq_exit_to_user_mode_legacy();
 	__exit_to_user_mode_validate();
 }
@@ -253,7 +266,7 @@ static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *reg
  */
 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	__exit_to_user_mode_prepare(regs);
+	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_SYSCALL);
 	rseq_syscall_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }
@@ -267,7 +280,7 @@ static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *re
  */
 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	__exit_to_user_mode_prepare(regs);
+	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_IRQ);
 	rseq_irqentry_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index cbc4a79..17956e1 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -40,6 +40,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
 #endif /* !CONFIG_RSEQ_STATS */
 
 #ifdef CONFIG_RSEQ
+#include <linux/hrtimer_rearm.h>
 #include <linux/jump_label.h>
 #include <linux/rseq.h>
 #include <linux/sched/signal.h>
@@ -110,7 +111,7 @@ static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
 	t->rseq.slice.state.granted = false;
 }
 
-static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
 {
 	struct task_struct *curr = current;
 	struct rseq_slice_ctrl usr_ctrl;
@@ -215,11 +216,20 @@ efault:
 	return false;
 }
 
+static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask)
+{
+	if (unlikely(__rseq_grant_slice_extension(ti_work & mask))) {
+		hrtimer_rearm_deferred_tif(ti_work);
+		return true;
+	}
+	return false;
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
 static inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static inline void rseq_slice_clear_grant(struct task_struct *t) { }
-static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -778,7 +788,7 @@ static inline void rseq_syscall_exit_to_user_mode(void) { }
 static inline void rseq_irqentry_exit_to_user_mode(void) { }
 static inline void rseq_exit_to_user_mode_legacy(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
-static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
 #endif /* !CONFIG_RSEQ */
 
 #endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 9ef63e4..9e1a6af 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
-			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+			if (!rseq_grant_slice_extension(ti_work, TIF_SLICE_EXT_DENY))
 				schedule();
 		}
 
@@ -225,6 +225,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		 */
 		if (state.exit_rcu) {
 			instrumentation_begin();
+			hrtimer_rearm_deferred();
 			/* Tell the tracer that IRET will enable interrupts */
 			trace_hardirqs_on_prepare();
 			lockdep_hardirqs_on_prepare();
@@ -238,6 +239,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
 
+		hrtimer_rearm_deferred();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();


* [tip: sched/hrtick] hrtimer: Prepare stubs for deferred rearming
  2026-02-24 16:37 ` [patch 34/48] hrtimer: Prepare stubs for deferred rearming Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     a43b4856bc039675165a50d9ef5f41b28520f0f4
Gitweb:        https://git.kernel.org/tip/a43b4856bc039675165a50d9ef5f41b28520f0f4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:58 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:13 +01:00

hrtimer: Prepare stubs for deferred rearming

The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.

That's obviously correct, but when an expired timer sets NEED_RESCHED, the
return from interrupt ends up in schedule(). If HRTICK is enabled then
schedule() will modify the hrtick timer, which causes another
reprogramming of the hardware.

That can be avoided by deferring the rearming to the return from interrupt
path. If the return results in an immediate schedule() invocation, the
rearming can be deferred further until the end of schedule().

To make this work correctly, the affected code paths need to be made aware
of the deferral.

Provide empty stubs for the deferred rearming mechanism, so that the
relevant code changes for entry, softirq and scheduler can be split up into
separate changes independent of the actual enablement in the hrtimer code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.000891171@kernel.org
---
 include/linux/hrtimer.h       |  1 +
 include/linux/hrtimer_rearm.h | 21 +++++++++++++++++++++
 kernel/time/Kconfig           |  4 ++++
 3 files changed, 26 insertions(+)
 create mode 100644 include/linux/hrtimer_rearm.h

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 4ad4a45..c087b71 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -13,6 +13,7 @@
 #define _LINUX_HRTIMER_H
 
 #include <linux/hrtimer_defs.h>
+#include <linux/hrtimer_rearm.h>
 #include <linux/hrtimer_types.h>
 #include <linux/init.h>
 #include <linux/list.h>
diff --git a/include/linux/hrtimer_rearm.h b/include/linux/hrtimer_rearm.h
new file mode 100644
index 0000000..6293076
--- /dev/null
+++ b/include/linux/hrtimer_rearm.h
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _LINUX_HRTIMER_REARM_H
+#define _LINUX_HRTIMER_REARM_H
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+static __always_inline void __hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+#else  /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void __hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+#endif  /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
+#endif
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index e1968ab..b95bfee 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -58,6 +58,10 @@ config GENERIC_CLOCKEVENTS_COUPLED_INLINE
 config GENERIC_CMOS_UPDATE
 	bool
 
+# Deferred rearming of the hrtimer interrupt
+config HRTIMER_REARM_DEFERRED
+       def_bool n
+
 # Select to handle posix CPU timers from task_work
 # and not from the timer interrupt context
 config HAVE_POSIX_CPU_TIMERS_TASK_WORK


* [tip: sched/hrtick] hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm
  2026-02-24 16:37 ` [patch 33/48] hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     9e07a9c980eaa93fd1bba722d31eeb4bf0cbbfb4
Gitweb:        https://git.kernel.org/tip/9e07a9c980eaa93fd1bba722d31eeb4bf0cbbfb4
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:53 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:12 +01:00

hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm

The upcoming deferred rearming scheme has the same effect as the implicit
deferral while the hrtimer interrupt is executing, so it can reuse the
in_hrtirq flag. But once the rearm can be deferred beyond the hrtimer
interrupt path, the name no longer makes sense.

Rename it to deferred_rearm upfront to keep the actual functional change
separate from the mechanical rename churn.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.935623347@kernel.org
---
 include/linux/hrtimer_defs.h |  4 ++--
 kernel/time/hrtimer.c        | 28 +++++++++-------------------
 2 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index f9fbf9a..2c3bdbd 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -53,7 +53,7 @@ enum  hrtimer_base_type {
  * @active_bases:	Bitfield to mark bases with active timers
  * @clock_was_set_seq:	Sequence counter of clock was set events
  * @hres_active:	State of high resolution mode
- * @in_hrtirq:		hrtimer_interrupt() is currently executing
+ * @deferred_rearm:	A deferred rearm is pending
  * @hang_detected:	The last hrtimer interrupt detected a hang
  * @softirq_activated:	displays, if the softirq is raised - update of softirq
  *			related settings is not required then.
@@ -84,7 +84,7 @@ struct hrtimer_cpu_base {
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
 	bool				hres_active;
-	bool				in_hrtirq;
+	bool				deferred_rearm;
 	bool				hang_detected;
 	bool				softirq_activated;
 	bool				online;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 2e05a18..6f05d25 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -883,11 +883,8 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 	if (expires >= cpu_base->expires_next)
 		return;
 
-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending skip reprogramming the device */
+	if (cpu_base->deferred_rearm)
 		return;
 
 	cpu_base->next_timer = timer;
@@ -921,12 +918,8 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int act
 	if (seq == cpu_base->clock_was_set_seq)
 		return false;
 
-	/*
-	 * If the remote CPU is currently handling an hrtimer interrupt, it
-	 * will reevaluate the first expiring timer of all clock bases
-	 * before reprogramming. Nothing to do here.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending the remote CPU will take care of it */
+	if (cpu_base->deferred_rearm)
 		return false;
 
 	/*
@@ -1334,11 +1327,8 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 		first = enqueue_hrtimer(timer, base, mode, was_armed);
 	}
 
-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending skip reprogramming the device */
+	if (cpu_base->deferred_rearm)
 		return false;
 
 	if (!was_first || cpu_base != this_cpu_base) {
@@ -1947,14 +1937,14 @@ static __latent_entropy void hrtimer_run_softirq(void)
 
 /*
  * Very similar to hrtimer_force_reprogram(), except it deals with
- * in_hrtirq and hang_detected.
+ * deferred_rearm and hang_detected.
  */
 static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 {
 	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
 
 	cpu_base->expires_next = expires_next;
-	cpu_base->in_hrtirq = false;
+	cpu_base->deferred_rearm = false;
 
 	if (unlikely(cpu_base->hang_detected)) {
 		/*
@@ -1985,7 +1975,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	entry_time = now = hrtimer_update_base(cpu_base);
 retry:
-	cpu_base->in_hrtirq = true;
+	cpu_base->deferred_rearm = true;
 	/*
 	 * Set expires_next to KTIME_MAX, which prevents that remote CPUs queue
 	 * timers while __hrtimer_run_queues() is expiring the clock bases.


* [tip: sched/hrtick] hrtimer: Re-arrange hrtimer_interrupt()
  2026-02-24 16:37 ` [patch 32/48] hrtimer: Re-arrange hrtimer_interrupt() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     2889243848560b6b0211aba401d2fc122070ba2f
Gitweb:        https://git.kernel.org/tip/2889243848560b6b0211aba401d2fc122070ba2f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:48 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:12 +01:00

hrtimer: Re-arrange hrtimer_interrupt()

Rework hrtimer_interrupt() such that reprogramming is split out into an
independent function at the end of the interrupt.

This prepares for reprogramming getting delayed beyond the end of
hrtimer_interrupt().

Notably, this changes the hang handling to always wait 100ms instead of
trying to keep the delay proportional to the actual overrun. This
simplifies the state tracking, and hangs really shouldn't be happening in
the first place.
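The retry-then-back-off flow can be modelled outside the kernel as follows.
This is a toy model under stated assumptions: next_expiry() stands in for
hrtimer_update_next_event(), the struct is a cut-down stand-in for
hrtimer_cpu_base, and the returned value is what would be handed to
tick_program_event():

```c
#include <assert.h>
#include <stdbool.h>

#define NSEC_PER_MSEC 1000000LL

struct toy_base {
	long long now;		/* current time in ns */
	bool hang_detected;
	int nr_hangs;
};

/* Mirrors hrtimer_rearm(): flat 100ms back-off when a hang was detected */
static long long toy_rearm(struct toy_base *b, long long expires_next)
{
	if (b->hang_detected) {
		expires_next = b->now + 100 * NSEC_PER_MSEC;
		b->hang_detected = false;
	}
	return expires_next;	/* handed to the clockevent device */
}

static long long toy_interrupt(struct toy_base *b,
			       long long (*next_expiry)(struct toy_base *))
{
	long long expires_next;
	int retries = 0;

retry:
	expires_next = next_expiry(b);
	if (expires_next < b->now) {
		/* Next timer already expired: retry up to 3 times */
		if (++retries < 3)
			goto retry;
		b->nr_hangs++;
		b->hang_detected = true;
	}
	return toy_rearm(b, expires_next);
}
```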

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.870639266@kernel.org
---
 kernel/time/hrtimer.c | 93 +++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 49 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index c6fc164..2e05a18 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -690,6 +690,12 @@ static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
 		cpu_base->hres_active : 0;
 }
 
+static inline void hrtimer_rearm_event(ktime_t expires_next, bool deferred)
+{
+	trace_hrtimer_rearm(expires_next, deferred);
+	tick_program_event(expires_next, 1);
+}
+
 static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtimer *next_timer,
 				ktime_t expires_next)
 {
@@ -715,7 +721,7 @@ static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtime
 	if (!hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
 		return;
 
-	tick_program_event(expires_next, 1);
+	hrtimer_rearm_event(expires_next, false);
 }
 
 /*
@@ -1940,6 +1946,28 @@ static __latent_entropy void hrtimer_run_softirq(void)
 #ifdef CONFIG_HIGH_RES_TIMERS
 
 /*
+ * Very similar to hrtimer_force_reprogram(), except it deals with
+ * in_hrtirq and hang_detected.
+ */
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+{
+	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
+
+	cpu_base->expires_next = expires_next;
+	cpu_base->in_hrtirq = false;
+
+	if (unlikely(cpu_base->hang_detected)) {
+		/*
+		 * Give the system a chance to do something else than looping
+		 * on hrtimer interrupts.
+		 */
+		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
+		cpu_base->hang_detected = false;
+	}
+	hrtimer_rearm_event(expires_next, false);
+}
+
+/*
  * High resolution timer interrupt
  * Called with interrupts disabled
  */
@@ -1974,63 +2002,30 @@ retry:
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
 
-	/* Reevaluate the clock bases for the [soft] next expiry */
-	expires_next = hrtimer_update_next_event(cpu_base);
-	/*
-	 * Store the new expiry value so the migration code can verify
-	 * against it.
-	 */
-	cpu_base->expires_next = expires_next;
-	cpu_base->in_hrtirq = false;
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
-	/* Reprogramming necessary ? */
-	if (!tick_program_event(expires_next, 0)) {
-		cpu_base->hang_detected = false;
-		return;
-	}
-
 	/*
 	 * The next timer was already expired due to:
 	 * - tracing
 	 * - long lasting callbacks
 	 * - being scheduled away when running in a VM
 	 *
-	 * We need to prevent that we loop forever in the hrtimer
-	 * interrupt routine. We give it 3 attempts to avoid
-	 * overreacting on some spurious event.
-	 *
-	 * Acquire base lock for updating the offsets and retrieving
-	 * the current time.
+	 * We need to prevent that we loop forever in the hrtimer interrupt
+	 * routine. We give it 3 attempts to avoid overreacting on some
+	 * spurious event.
 	 */
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	now = hrtimer_update_base(cpu_base);
-	cpu_base->nr_retries++;
-	if (++retries < 3)
-		goto retry;
-	/*
-	 * Give the system a chance to do something else than looping
-	 * here. We stored the entry time, so we know exactly how long
-	 * we spent here. We schedule the next event this amount of
-	 * time away.
-	 */
-	cpu_base->nr_hangs++;
-	cpu_base->hang_detected = true;
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+	expires_next = hrtimer_update_next_event(cpu_base);
+	if (expires_next < now) {
+		if (++retries < 3)
+			goto retry;
+
+		delta = ktime_sub(now, entry_time);
+		cpu_base->max_hang_time = max_t(unsigned int, cpu_base->max_hang_time, delta);
+		cpu_base->nr_hangs++;
+		cpu_base->hang_detected = true;
+	}
 
-	delta = ktime_sub(now, entry_time);
-	if ((unsigned int)delta > cpu_base->max_hang_time)
-		cpu_base->max_hang_time = (unsigned int) delta;
-	/*
-	 * Limit it to a sensible value as we enforce a longer
-	 * delay. Give the CPU at least 100ms to catch up.
-	 */
-	if (delta > 100 * NSEC_PER_MSEC)
-		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
-	else
-		expires_next = ktime_add(now, delta);
-	tick_program_event(expires_next, 1);
-	pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+	hrtimer_rearm(cpu_base, now);
+	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 }
 #endif /* !CONFIG_HIGH_RES_TIMERS */
 


* [tip: sched/hrtick] hrtimer: Add hrtimer_rearm tracepoint
  2026-02-24 16:37 ` [patch 31/48] hrtimer: Add hrtimer_rearm tracepoint Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     8e10f6b81afbf60e48bb4a71676ede4c7e374e79
Gitweb:        https://git.kernel.org/tip/8e10f6b81afbf60e48bb4a71676ede4c7e374e79
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:43 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:12 +01:00

hrtimer: Add hrtimer_rearm tracepoint

Analyzing the reprogramming of the clock event device is essential to debug
the behaviour of the hrtimer subsystem especially with the upcoming
deferred rearming scheme.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.803669745@kernel.org
---
 include/trace/events/timer.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index ab9a938..a54613f 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -325,6 +325,30 @@ DEFINE_EVENT(hrtimer_class, hrtimer_cancel,
 );
 
 /**
+ * hrtimer_rearm - Invoked when the clockevent device is rearmed
+ * @next_event:	The next expiry time (CLOCK_MONOTONIC)
+ */
+TRACE_EVENT(hrtimer_rearm,
+
+	TP_PROTO(ktime_t next_event, bool deferred),
+
+	TP_ARGS(next_event, deferred),
+
+	TP_STRUCT__entry(
+		__field( s64,		next_event	)
+		__field( bool,		deferred	)
+	),
+
+	TP_fast_assign(
+		__entry->next_event	= next_event;
+		__entry->deferred	= deferred;
+	),
+
+	TP_printk("next_event=%llu deferred=%d",
+		  (unsigned long long) __entry->next_event, __entry->deferred)
+);
+
+/**
  * itimer_state - called when itimer is started or canceled
  * @which:	name of the interval timer
  * @value:	the itimers value, itimer is canceled if value->it_value is


* [tip: sched/hrtick] hrtimer: Separate remove/enqueue handling for local timers
  2026-02-24 16:37 ` [patch 30/48] hrtimer: Separate remove/enqueue handling for local timers Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     85a690d1c19cc266eed74ec3fcdaacadc03ed1b2
Gitweb:        https://git.kernel.org/tip/85a690d1c19cc266eed74ec3fcdaacadc03ed1b2
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:38 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:11 +01:00

hrtimer: Separate remove/enqueue handling for local timers

As the base switch can be avoided completely when the base stays the same,
the remove/enqueue handling can be streamlined.

Split it out into a separate function which handles both in one go, which
is more efficient and makes the code simpler to follow.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.737600486@kernel.org
---
 kernel/time/hrtimer.c | 72 +++++++++++++++++++++++++-----------------
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 4caf2df..c6fc164 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1147,13 +1147,11 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *b
 }
 
 static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-				 bool restart, bool keep_base)
+				  bool newstate)
 {
-	bool queued_state = timer->is_queued;
-
 	lockdep_assert_held(&base->cpu_base->lock);
 
-	if (queued_state) {
+	if (timer->is_queued) {
 		bool reprogram;
 
 		debug_hrtimer_deactivate(timer);
@@ -1168,23 +1166,35 @@ static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_ba
 		 */
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
-		/*
-		 * If the timer is not restarted then reprogramming is
-		 * required if the timer is local. If it is local and about
-		 * to be restarted, avoid programming it twice (on removal
-		 * and a moment later when it's requeued).
-		 */
-		if (!restart)
-			queued_state = HRTIMER_STATE_INACTIVE;
-		else
-			reprogram &= !keep_base;
-
-		__remove_hrtimer(timer, base, queued_state, reprogram);
+		__remove_hrtimer(timer, base, newstate, reprogram);
 		return true;
 	}
 	return false;
 }
 
+static inline bool
+remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+			     const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
+{
+	/* Remove it from the timer queue if active */
+	if (timer->is_queued) {
+		debug_hrtimer_deactivate(timer);
+		timerqueue_del(&base->active, &timer->node);
+	}
+
+	/* Set the new expiry time */
+	hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+
+	debug_activate(timer, mode, timer->is_queued);
+	base->cpu_base->active_bases |= 1 << base->index;
+
+	/* Pairs with the lockless read in hrtimer_is_queued() */
+	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
+
+	/* Returns true if this is the first expiring timer */
+	return timerqueue_add(&base->active, &timer->node);
+}
+
 static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
 					    const enum hrtimer_mode mode)
 {
@@ -1267,7 +1277,7 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
-	bool is_pinned, first, was_first, was_armed, keep_base = false;
+	bool is_pinned, first, was_first, keep_base = false;
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
 
 	was_first = cpu_base->next_timer == timer;
@@ -1283,6 +1293,12 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 		keep_base = hrtimer_keep_base(timer, is_local, was_first, is_pinned);
 	}
 
+	/* Calculate absolute expiry time for relative timers */
+	if (mode & HRTIMER_MODE_REL)
+		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
+	/* Compensate for low resolution granularity */
+	tim = hrtimer_update_lowres(timer, tim, mode);
+
 	/*
 	 * Remove an active timer from the queue. In case it is not queued
 	 * on the current CPU, make sure that remove_hrtimer() updates the
@@ -1297,22 +1313,20 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 	 * @keep_base is also true if the timer callback is running on a
 	 * remote CPU and for local pinned timers.
 	 */
-	was_armed = remove_hrtimer(timer, base, true, keep_base);
-
-	if (mode & HRTIMER_MODE_REL)
-		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
-
-	tim = hrtimer_update_lowres(timer, tim, mode);
+	if (likely(keep_base)) {
+		first = remove_and_enqueue_same_base(timer, base, mode, tim, delta_ns);
+	} else {
+		/* Keep the ENQUEUED state in case it is queued */
+		bool was_armed = remove_hrtimer(timer, base, HRTIMER_STATE_ENQUEUED);
 
-	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
+		hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
-	/* Switch the timer base, if necessary: */
-	if (!keep_base) {
+		/* Switch the timer base, if necessary: */
 		base = switch_hrtimer_base(timer, base, is_pinned);
 		cpu_base = base->cpu_base;
-	}
 
-	first = enqueue_hrtimer(timer, base, mode, was_armed);
+		first = enqueue_hrtimer(timer, base, mode, was_armed);
+	}
 
 	/*
 	 * If the hrtimer interrupt is running, then it will reevaluate the
@@ -1432,7 +1446,7 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
 	base = lock_hrtimer_base(timer, &flags);
 
 	if (!hrtimer_callback_running(timer)) {
-		ret = remove_hrtimer(timer, base, false, false);
+		ret = remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
 		if (ret)
 			trace_hrtimer_cancel(timer);
 	}

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Use NOHZ information for locality
  2026-02-24 16:37 ` [patch 29/48] hrtimer: Use NOHZ information for locality Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     c939191457fead7bce2f991fe5bf39d4d5dde90f
Gitweb:        https://git.kernel.org/tip/c939191457fead7bce2f991fe5bf39d4d5dde90f
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:33 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:11 +01:00

hrtimer: Use NOHZ information for locality

The decision to keep a timer which is associated with the local CPU on that
CPU does not take NOHZ information into account. As a result there are a
lot of hrtimer base switch invocations which end up not switching the base
and keep the timer on the local CPU anyway. That is wasted work which can
be avoided.

If the local CPU is part of the HK_TYPE_KERNEL_NOISE housekeeping mask,
then keep the timer local when either:

  1) The local CPU has the tick running, which means it is either not
     idle or already expecting a timer soon, or

  2) The tick is stopped but need_resched() is set, which means the CPU
     is about to exit idle.

This significantly reduces the number of hrtimer base switch attempts
which would end up on the local CPU anyway, and prepares for further
optimizations.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.673473029@kernel.org
---
 kernel/time/hrtimer.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index b87995f..4caf2df 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1231,7 +1231,18 @@ static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, b
 		 */
 		if (!is_local)
 			return false;
-		return is_first || is_pinned;
+		if (is_first || is_pinned)
+			return true;
+
+		/* Honour the NOHZ full restrictions */
+		if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+			return false;
+
+		/*
+		 * If the tick is not stopped or need_resched() is set, then
+		 * there is no point in moving the timer somewhere else.
+		 */
+		return !tick_nohz_tick_stopped() || need_resched();
 	}
 	return is_local;
 }

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Optimize for local timers
  2026-02-24 16:37 ` [patch 28/48] hrtimer: Optimize for local timers Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     3288cd486376b322868c9fb41f10e35916e7e989
Gitweb:        https://git.kernel.org/tip/3288cd486376b322868c9fb41f10e35916e7e989
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:28 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:11 +01:00

hrtimer: Optimize for local timers

The decision whether to keep timers on the local CPU or on the CPU they are
associated with is suboptimal and causes the expensive switch_hrtimer_base()
mechanism to be invoked more often than necessary. This is especially true
for pinned timers.

Rewrite the decision logic so that the current base is kept if:

   1) The callback is running on the base

   2) The timer is associated with the local CPU and is the first expiring
      timer, which allows avoiding a clockevent reprogramming cycle

   3) The timer is associated with the local CPU and pinned

   4) The timer is associated with the local CPU and timer migration is
      disabled.

Only #2 was covered by the original code, but especially #3 makes a
difference for high frequency rearming timers like the scheduler hrtick
timer. If timer migration is disabled, then #4 avoids most of the base
switches.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.607935269@kernel.org
---
 kernel/time/hrtimer.c | 101 ++++++++++++++++++++++++++---------------
 1 file changed, 65 insertions(+), 36 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 6bab3b7..b87995f 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1147,7 +1147,7 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *b
 }
 
 static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-				 bool restart, bool keep_local)
+				 bool restart, bool keep_base)
 {
 	bool queued_state = timer->is_queued;
 
@@ -1177,7 +1177,7 @@ static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_ba
 		if (!restart)
 			queued_state = HRTIMER_STATE_INACTIVE;
 		else
-			reprogram &= !keep_local;
+			reprogram &= !keep_base;
 
 		__remove_hrtimer(timer, base, queued_state, reprogram);
 		return true;
@@ -1220,29 +1220,57 @@ static void hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool
 	hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+	if (static_branch_likely(&timers_migration_enabled)) {
+		/*
+		 * If it is local and the first expiring timer keep it on the local
+		 * CPU to optimize reprogramming of the clockevent device. Also
+		 * avoid switch_hrtimer_base() overhead when local and pinned.
+		 */
+		if (!is_local)
+			return false;
+		return is_first || is_pinned;
+	}
+	return is_local;
+}
+#else
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+	return is_local;
+}
+#endif
+
+static inline bool hrtimer_keep_base(struct hrtimer *timer, bool is_local, bool is_first,
+				     bool is_pinned)
+{
+	/* If the timer is running the callback it has to stay on its CPU base. */
+	if (unlikely(timer->base->running == timer))
+		return true;
+
+	return hrtimer_prefer_local(is_local, is_first, is_pinned);
+}
+
 static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
 				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
-	struct hrtimer_clock_base *new_base;
-	bool force_local, first, was_armed;
+	bool is_pinned, first, was_first, was_armed, keep_base = false;
+	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
 
-	/*
-	 * If the timer is on the local cpu base and is the first expiring
-	 * timer then this might end up reprogramming the hardware twice
-	 * (on removal and on enqueue). To avoid that prevent the reprogram
-	 * on removal, keep the timer local to the current CPU and enforce
-	 * reprogramming after it is queued no matter whether it is the new
-	 * first expiring timer again or not.
-	 */
-	force_local = base->cpu_base == this_cpu_base;
-	force_local &= base->cpu_base->next_timer == timer;
+	was_first = cpu_base->next_timer == timer;
+	is_pinned = !!(mode & HRTIMER_MODE_PINNED);
 
 	/*
-	 * Don't force local queuing if this enqueue happens on a unplugged
-	 * CPU after hrtimer_cpu_dying() has been invoked.
+	 * Don't keep it local if this enqueue happens on an unplugged CPU
+	 * after hrtimer_cpu_dying() has been invoked.
 	 */
-	force_local &= this_cpu_base->online;
+	if (likely(this_cpu_base->online)) {
+		bool is_local = cpu_base == this_cpu_base;
+
+		keep_base = hrtimer_keep_base(timer, is_local, was_first, is_pinned);
+	}
 
 	/*
 	 * Remove an active timer from the queue. In case it is not queued
@@ -1254,8 +1282,11 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 	 * reprogramming later if it was the first expiring timer.  This
 	 * avoids programming the underlying clock event twice (once at
 	 * removal and once after enqueue).
+	 *
+	 * @keep_base is also true if the timer callback is running on a
+	 * remote CPU and for local pinned timers.
 	 */
-	was_armed = remove_hrtimer(timer, base, true, force_local);
+	was_armed = remove_hrtimer(timer, base, true, keep_base);
 
 	if (mode & HRTIMER_MODE_REL)
 		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
@@ -1265,21 +1296,21 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
 	/* Switch the timer base, if necessary: */
-	if (!force_local)
-		new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
-	else
-		new_base = base;
+	if (!keep_base) {
+		base = switch_hrtimer_base(timer, base, is_pinned);
+		cpu_base = base->cpu_base;
+	}
 
-	first = enqueue_hrtimer(timer, new_base, mode, was_armed);
+	first = enqueue_hrtimer(timer, base, mode, was_armed);
 
 	/*
 	 * If the hrtimer interrupt is running, then it will reevaluate the
 	 * clock bases and reprogram the clock event device.
 	 */
-	if (new_base->cpu_base->in_hrtirq)
+	if (cpu_base->in_hrtirq)
 		return false;
 
-	if (!force_local) {
+	if (!was_first || cpu_base != this_cpu_base) {
 		/*
 		 * If the current CPU base is online, then the timer is never
 		 * queued on a remote CPU if it would be the first expiring
@@ -1288,7 +1319,7 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 		 * re-evaluate the first expiring timer after completing the
 		 * callbacks.
 		 */
-		if (hrtimer_base_is_online(this_cpu_base))
+		if (likely(hrtimer_base_is_online(this_cpu_base)))
 			return first;
 
 		/*
@@ -1296,11 +1327,8 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 		 * already offline. If the timer is the first to expire,
 		 * kick the remote CPU to reprogram the clock event.
 		 */
-		if (first) {
-			struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
-
-			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
-		}
+		if (first)
+			smp_call_function_single_async(cpu_base->cpu, &cpu_base->csd);
 		return false;
 	}
 
@@ -1314,16 +1342,17 @@ static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 del
 	 * required.
 	 */
 	if (timer->is_lazy) {
-		if (new_base->cpu_base->expires_next <= hrtimer_get_expires(timer))
+		if (cpu_base->expires_next <= hrtimer_get_expires(timer))
 			return false;
 	}
 
 	/*
-	 * Timer was forced to stay on the current CPU to avoid
-	 * reprogramming on removal and enqueue. Force reprogram the
-	 * hardware by evaluating the new first expiring timer.
+	 * Timer was the first expiring timer and forced to stay on the
+	 * current CPU to avoid reprogramming on removal and enqueue. Force
+	 * reprogram the hardware by evaluating the new first expiring
+	 * timer.
 	 */
-	hrtimer_force_reprogram(new_base->cpu_base, /* skip_equal */ true);
+	hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
 	return false;
 }
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Convert state and properties to boolean
  2026-02-24 16:37 ` [patch 27/48] hrtimer: Convert state and properties to boolean Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     22f011be7aaa77ca8f502b9dd07b7334f9965d18
Gitweb:        https://git.kernel.org/tip/22f011be7aaa77ca8f502b9dd07b7334f9965d18
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:23 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:11 +01:00

hrtimer: Convert state and properties to boolean

All 'u8' flags are true booleans, so make it entirely clear that these can
only contain true or false.

This is especially true for hrtimer::state, which has a historical leftover
of being manipulated with bitwise operations. The early hrtimer
implementation used several state bits, which were later collapsed into a
boolean state. But that conversion failed to replace the bit OR and bit
check operations all over the place, which creates suboptimal code. As of
today 'state' is a misnomer because its only purpose is to reflect whether
the timer is enqueued into the RB-tree or not. Rename it to 'is_queued' and
make all operations on it boolean.

This reduces text size from 8926 to 8732 bytes.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.542427240@kernel.org
---
 include/linux/hrtimer.h       | 31 +-----------------
 include/linux/hrtimer_types.h | 12 +++----
 kernel/time/hrtimer.c         | 58 +++++++++++++++++++++++-----------
 kernel/time/timer_list.c      |  2 +-
 4 files changed, 49 insertions(+), 54 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index c924bb2..4ad4a45 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -63,33 +63,6 @@ enum hrtimer_mode {
 	HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
 };
 
-/*
- * Values to track state of the timer
- *
- * Possible states:
- *
- * 0x00		inactive
- * 0x01		enqueued into rbtree
- *
- * The callback state is not part of the timer->state because clearing it would
- * mean touching the timer after the callback, this makes it impossible to free
- * the timer from the callback function.
- *
- * Therefore we track the callback state in:
- *
- *	timer->base->cpu_base->running == timer
- *
- * On SMP it is possible to have a "callback function running and enqueued"
- * status. It happens for example when a posix timer expired and the callback
- * queued a signal. Between dropping the lock which protects the posix timer
- * and reacquiring the base lock of the hrtimer, another CPU can deliver the
- * signal and rearm the timer.
- *
- * All state transitions are protected by cpu_base->lock.
- */
-#define HRTIMER_STATE_INACTIVE	0x00
-#define HRTIMER_STATE_ENQUEUED	0x01
-
 /**
  * struct hrtimer_sleeper - simple sleeper structure
  * @timer:	embedded timer structure
@@ -300,8 +273,8 @@ extern bool hrtimer_active(const struct hrtimer *timer);
  */
 static inline bool hrtimer_is_queued(struct hrtimer *timer)
 {
-	/* The READ_ONCE pairs with the update functions of timer->state */
-	return !!(READ_ONCE(timer->state) & HRTIMER_STATE_ENQUEUED);
+	/* The READ_ONCE pairs with the update functions of timer->is_queued */
+	return READ_ONCE(timer->is_queued);
 }
 
 /*
diff --git a/include/linux/hrtimer_types.h b/include/linux/hrtimer_types.h
index 64381c6..0e22bc9 100644
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -28,7 +28,7 @@ enum hrtimer_restart {
  *		was armed.
  * @function:	timer expiry callback function
  * @base:	pointer to the timer base (per cpu and per clock)
- * @state:	state information (See bit values above)
+ * @is_queued:	Indicates whether a timer is enqueued or not
  * @is_rel:	Set if the timer was armed relative
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
  * @is_hard:	Set if hrtimer will be expired in hard interrupt context
@@ -43,11 +43,11 @@ struct hrtimer {
 	ktime_t				_softexpires;
 	enum hrtimer_restart		(*__private function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
-	u8				state;
-	u8				is_rel;
-	u8				is_soft;
-	u8				is_hard;
-	u8				is_lazy;
+	bool				is_queued;
+	bool				is_rel;
+	bool				is_soft;
+	bool				is_hard;
+	bool				is_lazy;
 };
 
 #endif /* _LINUX_HRTIMER_TYPES_H */
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3b80a44..6bab3b7 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -50,6 +50,28 @@
 #include "tick-internal.h"
 
 /*
+ * Constants to set the queued state of the timer (INACTIVE, ENQUEUED)
+ *
+ * The callback state is kept separate in the CPU base because having it in
+ * the timer would require touching the timer after the callback, which
+ * makes it impossible to free the timer from the callback function.
+ *
+ * Therefore we track the callback state in:
+ *
+ *	timer->base->cpu_base->running == timer
+ *
+ * On SMP it is possible to have a "callback function running and enqueued"
+ * status. It happens for example when a posix timer expired and the callback
+ * queued a signal. Between dropping the lock which protects the posix timer
+ * and reacquiring the base lock of the hrtimer, another CPU can deliver the
+ * signal and rearm the timer.
+ *
+ * All state transitions are protected by cpu_base->lock.
+ */
+#define HRTIMER_STATE_INACTIVE	false
+#define HRTIMER_STATE_ENQUEUED	true
+
+/*
  * The resolution of the clocks. The resolution value is returned in
  * the clock_getres() system call to give application programmers an
  * idea of the (in)accuracy of timers. Timer values are rounded up to
@@ -1038,7 +1060,7 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 	if (delta < 0)
 		return 0;
 
-	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+	if (WARN_ON(timer->is_queued))
 		return 0;
 
 	if (interval < hrtimer_resolution)
@@ -1082,7 +1104,7 @@ static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *ba
 	base->cpu_base->active_bases |= 1 << base->index;
 
 	/* Pairs with the lockless read in hrtimer_is_queued() */
-	WRITE_ONCE(timer->state, HRTIMER_STATE_ENQUEUED);
+	WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
 
 	return timerqueue_add(&base->active, &timer->node);
 }
@@ -1096,18 +1118,18 @@ static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *ba
  * anyway (e.g. timer interrupt)
  */
 static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-			     u8 newstate, bool reprogram)
+			     bool newstate, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
-	u8 state = timer->state;
 
 	lockdep_assert_held(&cpu_base->lock);
 
-	/* Pairs with the lockless read in hrtimer_is_queued() */
-	WRITE_ONCE(timer->state, newstate);
-	if (!(state & HRTIMER_STATE_ENQUEUED))
+	if (!timer->is_queued)
 		return;
 
+	/* Pairs with the lockless read in hrtimer_is_queued() */
+	WRITE_ONCE(timer->is_queued, newstate);
+
 	if (!timerqueue_del(&base->active, &timer->node))
 		cpu_base->active_bases &= ~(1 << base->index);
 
@@ -1127,11 +1149,11 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *b
 static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
 				 bool restart, bool keep_local)
 {
-	u8 state = timer->state;
+	bool queued_state = timer->is_queued;
 
 	lockdep_assert_held(&base->cpu_base->lock);
 
-	if (state & HRTIMER_STATE_ENQUEUED) {
+	if (queued_state) {
 		bool reprogram;
 
 		debug_hrtimer_deactivate(timer);
@@ -1153,11 +1175,11 @@ static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_ba
 		 * and a moment later when it's requeued).
 		 */
 		if (!restart)
-			state = HRTIMER_STATE_INACTIVE;
+			queued_state = HRTIMER_STATE_INACTIVE;
 		else
 			reprogram &= !keep_local;
 
-		__remove_hrtimer(timer, base, state, reprogram);
+		__remove_hrtimer(timer, base, queued_state, reprogram);
 		return true;
 	}
 	return false;
@@ -1704,7 +1726,7 @@ bool hrtimer_active(const struct hrtimer *timer)
 		base = READ_ONCE(timer->base);
 		seq = raw_read_seqcount_begin(&base->seq);
 
-		if (timer->state != HRTIMER_STATE_INACTIVE || base->running == timer)
+		if (timer->is_queued || base->running == timer)
 			return true;
 
 	} while (read_seqcount_retry(&base->seq, seq) || base != READ_ONCE(timer->base));
@@ -1721,7 +1743,7 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
  *  - callback:	the timer is being ran
  *  - post:	the timer is inactive or (re)queued
  *
- * On the read side we ensure we observe timer->state and cpu_base->running
+ * On the read side we ensure we observe timer->is_queued and cpu_base->running
  * from the same section, if anything changed while we looked at it, we retry.
  * This includes timer->base changing because sequence numbers alone are
  * insufficient for that.
@@ -1744,11 +1766,11 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_cloc
 	base->running = timer;
 
 	/*
-	 * Separate the ->running assignment from the ->state assignment.
+	 * Separate the ->running assignment from the ->is_queued assignment.
 	 *
 	 * As with a regular write barrier, this ensures the read side in
 	 * hrtimer_active() cannot observe base->running == NULL &&
-	 * timer->state == INACTIVE.
+	 * timer->is_queued == INACTIVE.
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
@@ -1787,15 +1809,15 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_cloc
 	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
 	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART && !(timer->state & HRTIMER_STATE_ENQUEUED))
+	if (restart == HRTIMER_RESTART && !timer->is_queued)
 		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
 
 	/*
-	 * Separate the ->running assignment from the ->state assignment.
+	 * Separate the ->running assignment from the ->is_queued assignment.
 	 *
 	 * As with a regular write barrier, this ensures the read side in
 	 * hrtimer_active() cannot observe base->running.timer == NULL &&
-	 * timer->state == INACTIVE.
+	 * timer->is_queued == INACTIVE.
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 488e47e..19e6182 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -47,7 +47,7 @@ print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer,
 	    int idx, u64 now)
 {
 	SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function));
-	SEQ_printf(m, ", S:%02x", timer->state);
+	SEQ_printf(m, ", S:%02x", timer->is_queued);
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n",
 		(unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)),

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Replace the bitfield in hrtimer_cpu_base
  2026-02-24 16:37 ` [patch 26/48] hrtimer: Replace the bitfield in hrtimer_cpu_base Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     7d27eafe54659d19cef10dab4520cbcdfb17b0e3
Gitweb:        https://git.kernel.org/tip/7d27eafe54659d19cef10dab4520cbcdfb17b0e3
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:18 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:10 +01:00

hrtimer: Replace the bitfield in hrtimer_cpu_base

Use bool for the various flags as that creates better code in the hot path.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.475262618@kernel.org
---
 include/linux/hrtimer_defs.h | 10 +++++-----
 kernel/time/hrtimer.c        | 25 +++++++++++++------------
 2 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index 02b010d..f9fbf9a 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -83,11 +83,11 @@ struct hrtimer_cpu_base {
 	unsigned int			cpu;
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
-	unsigned int			hres_active		: 1,
-					in_hrtirq		: 1,
-					hang_detected		: 1,
-					softirq_activated       : 1,
-					online			: 1;
+	bool				hres_active;
+	bool				in_hrtirq;
+	bool				hang_detected;
+	bool				softirq_activated;
+	bool				online;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	unsigned int			nr_events;
 	unsigned short			nr_retries;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index e6f02e9..3b80a44 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -741,7 +741,7 @@ static void hrtimer_switch_to_hres(void)
 		pr_warn("Could not switch to high resolution mode on CPU %u\n",	base->cpu);
 		return;
 	}
-	base->hres_active = 1;
+	base->hres_active = true;
 	hrtimer_resolution = HIGH_RES_NSEC;
 
 	tick_setup_sched_timer(true);
@@ -1854,7 +1854,7 @@ static __latent_entropy void hrtimer_run_softirq(void)
 	now = hrtimer_update_base(cpu_base);
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);
 
-	cpu_base->softirq_activated = 0;
+	cpu_base->softirq_activated = false;
 	hrtimer_update_softirq_timer(cpu_base, true);
 
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1881,7 +1881,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 	entry_time = now = hrtimer_update_base(cpu_base);
 retry:
-	cpu_base->in_hrtirq = 1;
+	cpu_base->in_hrtirq = true;
 	/*
 	 * Set expires_next to KTIME_MAX, which prevents that remote CPUs queue
 	 * timers while __hrtimer_run_queues() is expiring the clock bases.
@@ -1892,7 +1892,7 @@ retry:
 
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
-		cpu_base->softirq_activated = 1;
+		cpu_base->softirq_activated = true;
 		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
@@ -1905,12 +1905,12 @@ retry:
 	 * against it.
 	 */
 	cpu_base->expires_next = expires_next;
-	cpu_base->in_hrtirq = 0;
+	cpu_base->in_hrtirq = false;
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 
 	/* Reprogramming necessary ? */
 	if (!tick_program_event(expires_next, 0)) {
-		cpu_base->hang_detected = 0;
+		cpu_base->hang_detected = false;
 		return;
 	}
 
@@ -1939,7 +1939,7 @@ retry:
 	 * time away.
 	 */
 	cpu_base->nr_hangs++;
-	cpu_base->hang_detected = 1;
+	cpu_base->hang_detected = true;
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 
 	delta = ktime_sub(now, entry_time);
@@ -1987,7 +1987,7 @@ void hrtimer_run_queues(void)
 
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
-		cpu_base->softirq_activated = 1;
+		cpu_base->softirq_activated = true;
 		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
@@ -2239,13 +2239,14 @@ int hrtimers_cpu_starting(unsigned int cpu)
 
 	/* Clear out any left over state from a CPU down operation */
 	cpu_base->active_bases = 0;
-	cpu_base->hres_active = 0;
-	cpu_base->hang_detected = 0;
+	cpu_base->hres_active = false;
+	cpu_base->hang_detected = false;
 	cpu_base->next_timer = NULL;
 	cpu_base->softirq_next_timer = NULL;
 	cpu_base->expires_next = KTIME_MAX;
 	cpu_base->softirq_expires_next = KTIME_MAX;
-	cpu_base->online = 1;
+	cpu_base->softirq_activated = false;
+	cpu_base->online = true;
 	return 0;
 }
 
@@ -2303,7 +2304,7 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
 	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
 
 	raw_spin_unlock(&new_base->lock);
-	old_base->online = 0;
+	old_base->online = false;
 	raw_spin_unlock(&old_base->lock);
 
 	return 0;


* [tip: sched/hrtick] hrtimer: Evaluate timer expiry only once
  2026-02-24 16:37 ` [patch 25/48] hrtimer: Evaluate timer expiry only once Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     8ffc9ea88136903812448a04127e1ee2c0460f24
Gitweb:        https://git.kernel.org/tip/8ffc9ea88136903812448a04127e1ee2c0460f24
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:14 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:10 +01:00

hrtimer: Evaluate timer expiry only once

No point in accessing the timer twice.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.409352042@kernel.org
---
 kernel/time/hrtimer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 0448ba9..e6f02e9 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -810,10 +810,11 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *base = timer->base;
-	ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
+	ktime_t expires = hrtimer_get_expires(timer);
 
-	WARN_ON_ONCE(hrtimer_get_expires(timer) < 0);
+	WARN_ON_ONCE(expires < 0);
 
+	expires = ktime_sub(expires, base->offset);
 	/*
 	 * CLOCK_REALTIME timer might be requested with an absolute
 	 * expiry time which is less than base->offset. Set it to 0.


* [tip: sched/hrtick] hrtimer: Cleanup coding style and comments
  2026-02-24 16:37 ` [patch 24/48] hrtimer: Cleanup coding style and comments Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     0c6af0ea51bd2774f41a00a81ac276800975c3cc
Gitweb:        https://git.kernel.org/tip/0c6af0ea51bd2774f41a00a81ac276800975c3cc
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:09 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:10 +01:00

hrtimer: Cleanup coding style and comments

As this code has some major surgery ahead, clean up coding style and bring
comments up to date.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.342740952@kernel.org
---
 kernel/time/hrtimer.c | 364 ++++++++++++++++-------------------------
 1 file changed, 143 insertions(+), 221 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index a5df3c4..0448ba9 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -77,43 +77,22 @@ static ktime_t __hrtimer_cb_get_time(clockid_t clock_id);
  * to reach a base using a clockid, hrtimer_clockid_to_base()
  * is used to convert from clockid to the proper hrtimer_base_type.
  */
+
+#define BASE_INIT(idx, cid)			\
+	[idx] = { .index = idx, .clockid = cid }
+
 DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
 	.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
-	.clock_base =
-	{
-		{
-			.index = HRTIMER_BASE_MONOTONIC,
-			.clockid = CLOCK_MONOTONIC,
-		},
-		{
-			.index = HRTIMER_BASE_REALTIME,
-			.clockid = CLOCK_REALTIME,
-		},
-		{
-			.index = HRTIMER_BASE_BOOTTIME,
-			.clockid = CLOCK_BOOTTIME,
-		},
-		{
-			.index = HRTIMER_BASE_TAI,
-			.clockid = CLOCK_TAI,
-		},
-		{
-			.index = HRTIMER_BASE_MONOTONIC_SOFT,
-			.clockid = CLOCK_MONOTONIC,
-		},
-		{
-			.index = HRTIMER_BASE_REALTIME_SOFT,
-			.clockid = CLOCK_REALTIME,
-		},
-		{
-			.index = HRTIMER_BASE_BOOTTIME_SOFT,
-			.clockid = CLOCK_BOOTTIME,
-		},
-		{
-			.index = HRTIMER_BASE_TAI_SOFT,
-			.clockid = CLOCK_TAI,
-		},
+	.clock_base = {
+		BASE_INIT(HRTIMER_BASE_MONOTONIC,	CLOCK_MONOTONIC),
+		BASE_INIT(HRTIMER_BASE_REALTIME,	CLOCK_REALTIME),
+		BASE_INIT(HRTIMER_BASE_BOOTTIME,	CLOCK_BOOTTIME),
+		BASE_INIT(HRTIMER_BASE_TAI,		CLOCK_TAI),
+		BASE_INIT(HRTIMER_BASE_MONOTONIC_SOFT,	CLOCK_MONOTONIC),
+		BASE_INIT(HRTIMER_BASE_REALTIME_SOFT,	CLOCK_REALTIME),
+		BASE_INIT(HRTIMER_BASE_BOOTTIME_SOFT,	CLOCK_BOOTTIME),
+		BASE_INIT(HRTIMER_BASE_TAI_SOFT,	CLOCK_TAI),
 	},
 	.csd = CSD_INIT(retrigger_next_event, NULL)
 };
@@ -150,18 +129,19 @@ static inline void hrtimer_schedule_hres_work(void) { }
  * single place
  */
 #ifdef CONFIG_SMP
-
 /*
  * We require the migration_base for lock_hrtimer_base()/switch_hrtimer_base()
  * such that hrtimer_callback_running() can unconditionally dereference
  * timer->base->cpu_base
  */
 static struct hrtimer_cpu_base migration_cpu_base = {
-	.clock_base = { {
-		.cpu_base = &migration_cpu_base,
-		.seq      = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
-						     &migration_cpu_base.lock),
-	}, },
+	.clock_base = {
+		[0] = {
+			.cpu_base = &migration_cpu_base,
+			.seq      = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
+							     &migration_cpu_base.lock),
+		},
+	},
 };
 
 #define migration_base	migration_cpu_base.clock_base[0]
@@ -178,15 +158,13 @@ static struct hrtimer_cpu_base migration_cpu_base = {
  * possible to set timer->base = &migration_base and drop the lock: the timer
  * remains locked.
  */
-static
-struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
-					     unsigned long *flags)
+static struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+						    unsigned long *flags)
 	__acquires(&timer->base->lock)
 {
-	struct hrtimer_clock_base *base;
-
 	for (;;) {
-		base = READ_ONCE(timer->base);
+		struct hrtimer_clock_base *base = READ_ONCE(timer->base);
+
 		if (likely(base != &migration_base)) {
 			raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
 			if (likely(base == timer->base))
@@ -239,7 +217,7 @@ static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_
 	return expires >= new_base->cpu_base->expires_next;
 }
 
-static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned)
+static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
 {
 	if (!hrtimer_base_is_online(base)) {
 		int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
@@ -267,8 +245,7 @@ static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *
  * the timer callback is currently running.
  */
 static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
-		    int pinned)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base, bool pinned)
 {
 	struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
 	struct hrtimer_clock_base *new_base;
@@ -281,13 +258,12 @@ again:
 
 	if (base != new_base) {
 		/*
-		 * We are trying to move timer to new_base.
-		 * However we can't change timer's base while it is running,
-		 * so we keep it on the same CPU. No hassle vs. reprogramming
-		 * the event source in the high resolution case. The softirq
-		 * code will take care of this when the timer function has
-		 * completed. There is no conflict as we hold the lock until
-		 * the timer is enqueued.
+		 * We are trying to move timer to new_base. However we can't
+		 * change timer's base while it is running, so we keep it on
+		 * the same CPU. No hassle vs. reprogramming the event source
+		 * in the high resolution case. The remote CPU will take care
+		 * of this when the timer function has completed. There is no
+		 * conflict as we hold the lock until the timer is enqueued.
 		 */
 		if (unlikely(hrtimer_callback_running(timer)))
 			return base;
@@ -297,8 +273,7 @@ again:
 		raw_spin_unlock(&base->cpu_base->lock);
 		raw_spin_lock(&new_base->cpu_base->lock);
 
-		if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
-					     this_cpu_base)) {
+		if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) {
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
 			new_cpu_base = this_cpu_base;
@@ -317,14 +292,13 @@ again:
 
 #else /* CONFIG_SMP */
 
-static inline struct hrtimer_clock_base *
-lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+static inline struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+							   unsigned long *flags)
 	__acquires(&timer->base->cpu_base->lock)
 {
 	struct hrtimer_clock_base *base = timer->base;
 
 	raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
-
 	return base;
 }
 
@@ -484,8 +458,7 @@ static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer)
 	debug_object_init_on_stack(timer, &hrtimer_debug_descr);
 }
 
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
-					  enum hrtimer_mode mode)
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode)
 {
 	debug_object_activate(timer, &hrtimer_debug_descr);
 }
@@ -510,8 +483,7 @@ EXPORT_SYMBOL_GPL(destroy_hrtimer_on_stack);
 
 static inline void debug_hrtimer_init(struct hrtimer *timer) { }
 static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer) { }
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
-					  enum hrtimer_mode mode) { }
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode) { }
 static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
 static inline void debug_hrtimer_assert_init(struct hrtimer *timer) { }
 #endif
@@ -549,13 +521,12 @@ __next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
 	return &cpu_base->clock_base[idx];
 }
 
-#define for_each_active_base(base, cpu_base, active)	\
+#define for_each_active_base(base, cpu_base, active)		\
 	while ((base = __next_base((cpu_base), &(active))))
 
 static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
 					 const struct hrtimer *exclude,
-					 unsigned int active,
-					 ktime_t expires_next)
+					 unsigned int active, ktime_t expires_next)
 {
 	struct hrtimer_clock_base *base;
 	ktime_t expires;
@@ -618,29 +589,24 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
  *  - HRTIMER_ACTIVE_SOFT, or
  *  - HRTIMER_ACTIVE_HARD.
  */
-static ktime_t
-__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
+static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
 {
-	unsigned int active;
 	struct hrtimer *next_timer = NULL;
 	ktime_t expires_next = KTIME_MAX;
+	unsigned int active;
 
 	if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
 		cpu_base->softirq_next_timer = NULL;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL,
-							 active, KTIME_MAX);
-
+		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, KTIME_MAX);
 		next_timer = cpu_base->softirq_next_timer;
 	}
 
 	if (active_mask & HRTIMER_ACTIVE_HARD) {
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
 		cpu_base->next_timer = next_timer;
-		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
-							 expires_next);
+		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active, expires_next);
 	}
-
 	return expires_next;
 }
 
@@ -681,8 +647,8 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
 	ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
 	ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;
 
-	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
-					    offs_real, offs_boot, offs_tai);
+	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq, offs_real,
+						   offs_boot, offs_tai);
 
 	base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
 	base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
@@ -702,8 +668,7 @@ static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
 		cpu_base->hres_active : 0;
 }
 
-static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
-				struct hrtimer *next_timer,
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtimer *next_timer,
 				ktime_t expires_next)
 {
 	cpu_base->expires_next = expires_next;
@@ -736,12 +701,9 @@ static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
  * next event
  * Called with interrupts disabled and base->lock held
  */
-static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, bool skip_equal)
 {
-	ktime_t expires_next;
-
-	expires_next = hrtimer_update_next_event(cpu_base);
+	ktime_t expires_next = hrtimer_update_next_event(cpu_base);
 
 	if (skip_equal && expires_next == cpu_base->expires_next)
 		return;
@@ -752,41 +714,31 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
 /* High resolution timer related functions */
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-/*
- * High resolution timer enabled ?
- */
+/* High resolution timer enabled ? */
 static bool hrtimer_hres_enabled __read_mostly  = true;
 unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
 EXPORT_SYMBOL_GPL(hrtimer_resolution);
 
-/*
- * Enable / Disable high resolution mode
- */
+/* Enable / Disable high resolution mode */
 static int __init setup_hrtimer_hres(char *str)
 {
 	return (kstrtobool(str, &hrtimer_hres_enabled) == 0);
 }
-
 __setup("highres=", setup_hrtimer_hres);
 
-/*
- * hrtimer_high_res_enabled - query, if the highres mode is enabled
- */
-static inline int hrtimer_is_hres_enabled(void)
+/* hrtimer_high_res_enabled - query, if the highres mode is enabled */
+static inline bool hrtimer_is_hres_enabled(void)
 {
 	return hrtimer_hres_enabled;
 }
 
-/*
- * Switch to high resolution mode
- */
+/* Switch to high resolution mode */
 static void hrtimer_switch_to_hres(void)
 {
 	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
 
 	if (tick_init_highres()) {
-		pr_warn("Could not switch to high resolution mode on CPU %u\n",
-			base->cpu);
+		pr_warn("Could not switch to high resolution mode on CPU %u\n",	base->cpu);
 		return;
 	}
 	base->hres_active = 1;
@@ -800,10 +752,11 @@ static void hrtimer_switch_to_hres(void)
 
 #else
 
-static inline int hrtimer_is_hres_enabled(void) { return 0; }
+static inline bool hrtimer_is_hres_enabled(void) { return 0; }
 static inline void hrtimer_switch_to_hres(void) { }
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Retrigger next event is called after clock was set with interrupts
  * disabled through an SMP function call or directly from low level
@@ -841,7 +794,7 @@ static void retrigger_next_event(void *arg)
 	guard(raw_spinlock)(&base->lock);
 	hrtimer_update_base(base);
 	if (hrtimer_hres_active(base))
-		hrtimer_force_reprogram(base, 0);
+		hrtimer_force_reprogram(base, /* skip_equal */ false);
 	else
 		hrtimer_update_next_event(base);
 }
@@ -887,8 +840,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 		timer_cpu_base->softirq_next_timer = timer;
 		timer_cpu_base->softirq_expires_next = expires;
 
-		if (!ktime_before(expires, timer_cpu_base->expires_next) ||
-		    !reprogram)
+		if (!ktime_before(expires, timer_cpu_base->expires_next) || !reprogram)
 			return;
 	}
 
@@ -914,8 +866,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 	__hrtimer_reprogram(cpu_base, timer, expires);
 }
 
-static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
-			     unsigned int active)
+static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int active)
 {
 	struct hrtimer_clock_base *base;
 	unsigned int seq;
@@ -1050,11 +1001,8 @@ void hrtimers_resume_local(void)
 	retrigger_next_event(NULL);
 }
 
-/*
- * Counterpart to lock_hrtimer_base above:
- */
-static inline
-void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+/* Counterpart to lock_hrtimer_base above */
+static inline void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 	__releases(&timer->base->cpu_base->lock)
 {
 	raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
@@ -1071,7 +1019,7 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
  * .. note::
  *  This only updates the timer expiry value and does not requeue the timer.
  *
- * There is also a variant of the function hrtimer_forward_now().
+ * There is also a variant of this function: hrtimer_forward_now().
  *
  * Context: Can be safely called from the callback function of @timer. If called
  *          from other contexts @timer must neither be enqueued nor running the
@@ -1081,8 +1029,8 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
  */
 u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-	u64 orun = 1;
 	ktime_t delta;
+	u64 orun = 1;
 
 	delta = ktime_sub(now, hrtimer_get_expires(timer));
 
@@ -1118,13 +1066,15 @@ EXPORT_SYMBOL_GPL(hrtimer_forward);
  * enqueue_hrtimer - internal function to (re)start a timer
  *
  * The timer is inserted in expiry order. Insertion into the
- * red black tree is O(log(n)). Must hold the base lock.
+ * red black tree is O(log(n)).
  *
  * Returns true when the new timer is the leftmost timer in the tree.
  */
 static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
 			    enum hrtimer_mode mode, bool was_armed)
 {
+	lockdep_assert_held(&base->cpu_base->lock);
+
 	debug_activate(timer, mode, was_armed);
 	WARN_ON_ONCE(!base->cpu_base->online);
 
@@ -1139,20 +1089,19 @@ static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *ba
 /*
  * __remove_hrtimer - internal function to remove a timer
  *
- * Caller must hold the base lock.
- *
  * High resolution timer mode reprograms the clock event device when the
  * timer is the one which expires next. The caller can disable this by setting
  * reprogram to zero. This is useful, when the context does a reprogramming
  * anyway (e.g. timer interrupt)
  */
-static void __remove_hrtimer(struct hrtimer *timer,
-			     struct hrtimer_clock_base *base,
-			     u8 newstate, int reprogram)
+static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+			     u8 newstate, bool reprogram)
 {
 	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
 	u8 state = timer->state;
 
+	lockdep_assert_held(&cpu_base->lock);
+
 	/* Pairs with the lockless read in hrtimer_is_queued() */
 	WRITE_ONCE(timer->state, newstate);
 	if (!(state & HRTIMER_STATE_ENQUEUED))
@@ -1162,26 +1111,25 @@ static void __remove_hrtimer(struct hrtimer *timer,
 		cpu_base->active_bases &= ~(1 << base->index);
 
 	/*
-	 * Note: If reprogram is false we do not update
-	 * cpu_base->next_timer. This happens when we remove the first
-	 * timer on a remote cpu. No harm as we never dereference
-	 * cpu_base->next_timer. So the worst thing what can happen is
-	 * an superfluous call to hrtimer_force_reprogram() on the
-	 * remote cpu later on if the same timer gets enqueued again.
+	 * If reprogram is false don't update cpu_base->next_timer and do not
+	 * touch the clock event device.
+	 *
+	 * This happens when removing the first timer on a remote CPU, which
+	 * will be handled by the remote CPU's interrupt. It also happens when
+	 * a local timer is removed to be immediately restarted. That's handled
+	 * at the call site.
 	 */
 	if (reprogram && timer == cpu_base->next_timer && !timer->is_lazy)
-		hrtimer_force_reprogram(cpu_base, 1);
+		hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
 }
 
-/*
- * remove hrtimer, called with base lock held
- */
-static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-	       bool restart, bool keep_local)
+static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+				 bool restart, bool keep_local)
 {
 	u8 state = timer->state;
 
+	lockdep_assert_held(&base->cpu_base->lock);
+
 	if (state & HRTIMER_STATE_ENQUEUED) {
 		bool reprogram;
 
@@ -1209,9 +1157,9 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
 			reprogram &= !keep_local;
 
 		__remove_hrtimer(timer, base, state, reprogram);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
@@ -1230,34 +1178,27 @@ static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
 	return tim;
 }
 
-static void
-hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
+static void hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
 {
-	ktime_t expires;
+	ktime_t expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
 
 	/*
-	 * Find the next SOFT expiration.
-	 */
-	expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
-
-	/*
-	 * reprogramming needs to be triggered, even if the next soft
-	 * hrtimer expires at the same time than the next hard
+	 * Reprogramming needs to be triggered, even if the next soft
+	 * hrtimer expires at the same time as the next hard
 	 * hrtimer. cpu_base->softirq_expires_next needs to be updated!
 	 */
 	if (expires == KTIME_MAX)
 		return;
 
 	/*
-	 * cpu_base->*next_timer is recomputed by __hrtimer_get_next_event()
-	 * cpu_base->*expires_next is only set by hrtimer_reprogram()
+	 * cpu_base->next_timer is recomputed by __hrtimer_get_next_event()
+	 * cpu_base->expires_next is only set by hrtimer_reprogram()
 	 */
 	hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
 }
 
-static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-				    u64 delta_ns, const enum hrtimer_mode mode,
-				    struct hrtimer_clock_base *base)
+static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *new_base;
@@ -1301,12 +1242,10 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
 	/* Switch the timer base, if necessary: */
-	if (!force_local) {
-		new_base = switch_hrtimer_base(timer, base,
-					       mode & HRTIMER_MODE_PINNED);
-	} else {
+	if (!force_local)
+		new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
+	else
 		new_base = base;
-	}
 
 	first = enqueue_hrtimer(timer, new_base, mode, was_armed);
 
@@ -1319,9 +1258,12 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 
 	if (!force_local) {
 		/*
-		 * If the current CPU base is online, then the timer is
-		 * never queued on a remote CPU if it would be the first
-		 * expiring timer there.
+		 * If the current CPU base is online, then the timer is never
+		 * queued on a remote CPU if it would be the first expiring
+		 * timer there unless the timer callback is currently executed
+		 * on the remote CPU. In the latter case the remote CPU will
+		 * re-evaluate the first expiring timer after completing the
+		 * callbacks.
 		 */
 		if (hrtimer_base_is_online(this_cpu_base))
 			return first;
@@ -1336,7 +1278,7 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 
 			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
 		}
-		return 0;
+		return false;
 	}
 
 	/*
@@ -1350,7 +1292,7 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	 */
 	if (timer->is_lazy) {
 		if (new_base->cpu_base->expires_next <= hrtimer_get_expires(timer))
-			return 0;
+			return false;
 	}
 
 	/*
@@ -1358,8 +1300,8 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	 * reprogramming on removal and enqueue. Force reprogram the
 	 * hardware by evaluating the new first expiring timer.
 	 */
-	hrtimer_force_reprogram(new_base->cpu_base, 1);
-	return 0;
+	hrtimer_force_reprogram(new_base->cpu_base, /* skip_equal */ true);
+	return false;
 }
 
 /**
@@ -1371,8 +1313,8 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
  *		relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
  *		softirq based mode is considered for debug purpose only!
  */
-void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-			    u64 delta_ns, const enum hrtimer_mode mode)
+void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+			    const enum hrtimer_mode mode)
 {
 	struct hrtimer_clock_base *base;
 	unsigned long flags;
@@ -1464,8 +1406,7 @@ static void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base)
  * the timer callback to finish. Drop expiry_lock and reacquire it. That
  * allows the waiter to acquire the lock and make progress.
  */
-static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
-				      unsigned long flags)
+static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base, unsigned long flags)
 {
 	if (atomic_read(&cpu_base->timer_waiters)) {
 		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1530,14 +1471,10 @@ void hrtimer_cancel_wait_running(const struct hrtimer *timer)
 	spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
 }
 #else
-static inline void
-hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base,
-					     unsigned long flags) { }
+static inline void hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base, unsigned long fl) { }
 #endif
 
 /**
@@ -1668,8 +1605,7 @@ ktime_t hrtimer_cb_get_time(const struct hrtimer *timer)
 }
 EXPORT_SYMBOL_GPL(hrtimer_cb_get_time);
 
-static void __hrtimer_setup(struct hrtimer *timer,
-			    enum hrtimer_restart (*function)(struct hrtimer *),
+static void __hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*fn)(struct hrtimer *),
 			    clockid_t clock_id, enum hrtimer_mode mode)
 {
 	bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
@@ -1705,10 +1641,10 @@ static void __hrtimer_setup(struct hrtimer *timer,
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
 
-	if (WARN_ON_ONCE(!function))
+	if (WARN_ON_ONCE(!fn))
 		ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
 	else
-		ACCESS_PRIVATE(timer, function) = function;
+		ACCESS_PRIVATE(timer, function) = fn;
 }
 
 /**
@@ -1767,12 +1703,10 @@ bool hrtimer_active(const struct hrtimer *timer)
 		base = READ_ONCE(timer->base);
 		seq = raw_read_seqcount_begin(&base->seq);
 
-		if (timer->state != HRTIMER_STATE_INACTIVE ||
-		    base->running == timer)
+		if (timer->state != HRTIMER_STATE_INACTIVE || base->running == timer)
 			return true;
 
-	} while (read_seqcount_retry(&base->seq, seq) ||
-		 base != READ_ONCE(timer->base));
+	} while (read_seqcount_retry(&base->seq, seq) || base != READ_ONCE(timer->base));
 
 	return false;
 }
@@ -1795,11 +1729,9 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
  * a false negative if the read side got smeared over multiple consecutive
  * __run_hrtimer() invocations.
  */
-
-static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
-			  struct hrtimer_clock_base *base,
-			  struct hrtimer *timer, ktime_t *now,
-			  unsigned long flags) __must_hold(&cpu_base->lock)
+static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_clock_base *base,
+			  struct hrtimer *timer, ktime_t *now, unsigned long flags)
+	__must_hold(&cpu_base->lock)
 {
 	enum hrtimer_restart (*fn)(struct hrtimer *);
 	bool expires_in_hardirq;
@@ -1819,7 +1751,7 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	 */
 	raw_write_seqcount_barrier(&base->seq);
 
-	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
+	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, false);
 	fn = ACCESS_PRIVATE(timer, function);
 
 	/*
@@ -1854,8 +1786,7 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
 	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART &&
-	    !(timer->state & HRTIMER_STATE_ENQUEUED))
+	if (restart != HRTIMER_NORESTART && !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
 
 	/*
@@ -1874,8 +1805,8 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
 				 unsigned long flags, unsigned int active_mask)
 {
-	struct hrtimer_clock_base *base;
 	unsigned int active = cpu_base->active_bases & active_mask;
+	struct hrtimer_clock_base *base;
 
 	for_each_active_base(base, cpu_base, active) {
 		struct timerqueue_node *node;
@@ -1951,11 +1882,10 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 retry:
 	cpu_base->in_hrtirq = 1;
 	/*
-	 * We set expires_next to KTIME_MAX here with cpu_base->lock
-	 * held to prevent that a timer is enqueued in our queue via
-	 * the migration code. This does not affect enqueueing of
-	 * timers which run their callback and need to be requeued on
-	 * this CPU.
+	 * Set expires_next to KTIME_MAX, which prevents that remote CPUs queue
+	 * timers while __hrtimer_run_queues() is expiring the clock bases.
+	 * Timers which are re/enqueued on the local CPU are not affected by
+	 * this.
 	 */
 	cpu_base->expires_next = KTIME_MAX;
 
@@ -2069,8 +1999,7 @@ void hrtimer_run_queues(void)
  */
 static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
 {
-	struct hrtimer_sleeper *t =
-		container_of(timer, struct hrtimer_sleeper, timer);
+	struct hrtimer_sleeper *t = container_of(timer, struct hrtimer_sleeper, timer);
 	struct task_struct *task = t->task;
 
 	t->task = NULL;
@@ -2088,8 +2017,7 @@ static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
  * Wrapper around hrtimer_start_expires() for hrtimer_sleeper based timers
  * to allow PREEMPT_RT to tweak the delivery mode (soft/hardirq context)
  */
-void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
-				   enum hrtimer_mode mode)
+void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl, enum hrtimer_mode mode)
 {
 	/*
 	 * Make the enqueue delivery mode check work on RT. If the sleeper
@@ -2105,8 +2033,8 @@ void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
 }
 EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires);
 
-static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
-				    clockid_t clock_id, enum hrtimer_mode mode)
+static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl, clockid_t clock_id,
+				    enum hrtimer_mode mode)
 {
 	/*
 	 * On PREEMPT_RT enabled kernels hrtimers which are not explicitly
@@ -2142,8 +2070,8 @@ static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
  * @clock_id:	the clock to be used
  * @mode:	timer mode abs/rel
  */
-void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl,
-				    clockid_t clock_id, enum hrtimer_mode mode)
+void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl, clockid_t clock_id,
+				    enum hrtimer_mode mode)
 {
 	debug_setup_on_stack(&sl->timer, clock_id, mode);
 	__hrtimer_setup_sleeper(sl, clock_id, mode);
@@ -2216,8 +2144,7 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
 	return ret;
 }
 
-long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
-		       const clockid_t clockid)
+long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode, const clockid_t clockid)
 {
 	struct restart_block *restart;
 	struct hrtimer_sleeper t;
@@ -2260,8 +2187,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
 	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
 	current->restart_block.nanosleep.rmtp = rmtp;
-	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
-				 CLOCK_MONOTONIC);
+	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
 }
 
 #endif
@@ -2269,7 +2195,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
 #ifdef CONFIG_COMPAT_32BIT_TIME
 
 SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
-		       struct old_timespec32 __user *, rmtp)
+		struct old_timespec32 __user *, rmtp)
 {
 	struct timespec64 tu;
 
@@ -2282,8 +2208,7 @@ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
 	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
 	current->restart_block.nanosleep.compat_rmtp = rmtp;
-	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
-				 CLOCK_MONOTONIC);
+	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
 }
 #endif
 
@@ -2293,9 +2218,8 @@ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
 int hrtimers_prepare_cpu(unsigned int cpu)
 {
 	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
-	int i;
 
-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+	for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		struct hrtimer_clock_base *clock_b = &cpu_base->clock_base[i];
 
 		clock_b->cpu_base = cpu_base;
@@ -2329,8 +2253,8 @@ int hrtimers_cpu_starting(unsigned int cpu)
 static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 				struct hrtimer_clock_base *new_base)
 {
-	struct hrtimer *timer;
 	struct timerqueue_node *node;
+	struct hrtimer *timer;
 
 	while ((node = timerqueue_getnext(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
@@ -2342,7 +2266,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 		 * timer could be seen as !active and just vanish away
 		 * under us on another CPU
 		 */
-		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, false);
 		timer->base = new_base;
 		/*
 		 * Enqueue the timers on the new cpu. This does not
@@ -2358,7 +2282,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 
 int hrtimers_cpu_dying(unsigned int dying_cpu)
 {
-	int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+	int ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
 	struct hrtimer_cpu_base *old_base, *new_base;
 
 	old_base = this_cpu_ptr(&hrtimer_bases);
@@ -2371,10 +2295,8 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
 	raw_spin_lock(&old_base->lock);
 	raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
 
-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		migrate_hrtimer_list(&old_base->clock_base[i],
-				     &new_base->clock_base[i]);
-	}
+	for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+		migrate_hrtimer_list(&old_base->clock_base[i], &new_base->clock_base[i]);
 
 	/* Tell the other CPU to retrigger the next event */
 	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Use guards where appropriate
  2026-02-24 16:37 ` [patch 23/48] hrtimer: Use guards where appropriate Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     6abfc2bd5b0cff70db99a273f2a161e2273eae6d
Gitweb:        https://git.kernel.org/tip/6abfc2bd5b0cff70db99a273f2a161e2273eae6d
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:37:04 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:09 +01:00

hrtimer: Use guards where appropriate

Simplify and tidy up the code where possible.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.275551488@kernel.org
---
 kernel/time/hrtimer.c | 48 +++++++++++++-----------------------------
 1 file changed, 15 insertions(+), 33 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 6e4ac8d..a5df3c4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -838,13 +838,12 @@ static void retrigger_next_event(void *arg)
 	 * In periodic low resolution mode, the next softirq expiration
 	 * must also be updated.
 	 */
-	raw_spin_lock(&base->lock);
+	guard(raw_spinlock)(&base->lock);
 	hrtimer_update_base(base);
 	if (hrtimer_hres_active(base))
 		hrtimer_force_reprogram(base, 0);
 	else
 		hrtimer_update_next_event(base);
-	raw_spin_unlock(&base->lock);
 }
 
 /*
@@ -994,7 +993,6 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
 void clock_was_set(unsigned int bases)
 {
 	cpumask_var_t mask;
-	int cpu;
 
 	if (!hrtimer_highres_enabled() && !tick_nohz_is_active())
 		goto out_timerfd;
@@ -1005,24 +1003,19 @@ void clock_was_set(unsigned int bases)
 	}
 
 	/* Avoid interrupting CPUs if possible */
-	cpus_read_lock();
-	for_each_online_cpu(cpu) {
-		struct hrtimer_cpu_base *cpu_base;
-		unsigned long flags;
+	scoped_guard(cpus_read_lock) {
+		int cpu;
 
-		cpu_base = &per_cpu(hrtimer_bases, cpu);
-		raw_spin_lock_irqsave(&cpu_base->lock, flags);
+		for_each_online_cpu(cpu) {
+			struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 
-		if (update_needs_ipi(cpu_base, bases))
-			cpumask_set_cpu(cpu, mask);
-
-		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+			guard(raw_spinlock_irqsave)(&cpu_base->lock);
+			if (update_needs_ipi(cpu_base, bases))
+				cpumask_set_cpu(cpu, mask);
+		}
+		scoped_guard(preempt)
+			smp_call_function_many(mask, retrigger_next_event, NULL, 1);
 	}
-
-	preempt_disable();
-	smp_call_function_many(mask, retrigger_next_event, NULL, 1);
-	preempt_enable();
-	cpus_read_unlock();
 	free_cpumask_var(mask);
 
 out_timerfd:
@@ -1600,15 +1593,11 @@ u64 hrtimer_get_next_event(void)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 
+	guard(raw_spinlock_irqsave)(&cpu_base->lock);
 	if (!hrtimer_hres_active(cpu_base))
 		expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
 
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
 	return expires;
 }
 
@@ -1623,25 +1612,18 @@ u64 hrtimer_next_event_without(const struct hrtimer *exclude)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	u64 expires = KTIME_MAX;
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&cpu_base->lock, flags);
 
+	guard(raw_spinlock_irqsave)(&cpu_base->lock);
 	if (hrtimer_hres_active(cpu_base)) {
 		unsigned int active;
 
 		if (!cpu_base->softirq_activated) {
 			active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
-			expires = __hrtimer_next_event_base(cpu_base, exclude,
-							    active, KTIME_MAX);
+			expires = __hrtimer_next_event_base(cpu_base, exclude, active, KTIME_MAX);
 		}
 		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
-		expires = __hrtimer_next_event_base(cpu_base, exclude, active,
-						    expires);
+		expires = __hrtimer_next_event_base(cpu_base, exclude, active, expires);
 	}
-
-	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
 	return expires;
 }
 #endif

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Reduce trace noise in hrtimer_start()
  2026-02-24 16:36 ` [patch 22/48] hrtimer: Reduce trace noise in hrtimer_start() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     f2e388a019e4cf83a15883a3d1f1384298e9a6aa
Gitweb:        https://git.kernel.org/tip/f2e388a019e4cf83a15883a3d1f1384298e9a6aa
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:59 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:09 +01:00

hrtimer: Reduce trace noise in hrtimer_start()

hrtimer_start() when invoked with an already armed timer traces like:

 <comm>-..   [032] d.h2. 5.002263: hrtimer_cancel: hrtimer= ....
 <comm>-..   [032] d.h1. 5.002263: hrtimer_start: hrtimer= ....

This is misleading, as the timer does not get canceled; just the expiry time
changes. The internal dequeue operation which is required for that is not
really interesting for trace analysis. But it makes it tedious to keep real
cancellations and the above case apart.

Remove the cancel tracing in hrtimer_start() and add a 'was_armed'
indicator to the hrtimer start tracepoint, which clearly indicates what the
state of the hrtimer is when hrtimer_start() is invoked:

 <comm>-..   [032] d.h1. 6.200103: hrtimer_start: hrtimer= .... was_armed=0
 <comm>-..   [032] d.h1. 6.200558: hrtimer_start: hrtimer= .... was_armed=1

Fixes: c6a2a1770245 ("hrtimer: Add tracepoint for hrtimers")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.208491877@kernel.org
---
 include/trace/events/timer.h | 11 +++++----
 kernel/time/hrtimer.c        | 43 ++++++++++++++++-------------------
 2 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 1641ae3..ab9a938 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -218,12 +218,13 @@ TRACE_EVENT(hrtimer_setup,
  * hrtimer_start - called when the hrtimer is started
  * @hrtimer:	pointer to struct hrtimer
  * @mode:	the hrtimers mode
+ * @was_armed:	Was armed when hrtimer_start*() was invoked
  */
 TRACE_EVENT(hrtimer_start,
 
-	TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode),
+	TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode, bool was_armed),
 
-	TP_ARGS(hrtimer, mode),
+	TP_ARGS(hrtimer, mode, was_armed),
 
 	TP_STRUCT__entry(
 		__field( void *,	hrtimer		)
@@ -231,6 +232,7 @@ TRACE_EVENT(hrtimer_start,
 		__field( s64,		expires		)
 		__field( s64,		softexpires	)
 		__field( enum hrtimer_mode,	mode	)
+		__field( bool,		was_armed	)
 	),
 
 	TP_fast_assign(
@@ -239,13 +241,14 @@ TRACE_EVENT(hrtimer_start,
 		__entry->expires	= hrtimer_get_expires(hrtimer);
 		__entry->softexpires	= hrtimer_get_softexpires(hrtimer);
 		__entry->mode		= mode;
+		__entry->was_armed	= was_armed;
 	),
 
 	TP_printk("hrtimer=%p function=%ps expires=%llu softexpires=%llu "
-		  "mode=%s", __entry->hrtimer, __entry->function,
+		  "mode=%s was_armed=%d", __entry->hrtimer, __entry->function,
 		  (unsigned long long) __entry->expires,
 		  (unsigned long long) __entry->softexpires,
-		  decode_hrtimer_mode(__entry->mode))
+		  decode_hrtimer_mode(__entry->mode), __entry->was_armed)
 );
 
 /**
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index fa63e0b..6e4ac8d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -529,17 +529,10 @@ static inline void debug_setup_on_stack(struct hrtimer *timer, clockid_t clockid
 	trace_hrtimer_setup(timer, clockid, mode);
 }
 
-static inline void debug_activate(struct hrtimer *timer,
-				  enum hrtimer_mode mode)
+static inline void debug_activate(struct hrtimer *timer, enum hrtimer_mode mode, bool was_armed)
 {
 	debug_hrtimer_activate(timer, mode);
-	trace_hrtimer_start(timer, mode);
-}
-
-static inline void debug_deactivate(struct hrtimer *timer)
-{
-	debug_hrtimer_deactivate(timer);
-	trace_hrtimer_cancel(timer);
+	trace_hrtimer_start(timer, mode, was_armed);
 }
 
 static struct hrtimer_clock_base *
@@ -1137,9 +1130,9 @@ EXPORT_SYMBOL_GPL(hrtimer_forward);
  * Returns true when the new timer is the leftmost timer in the tree.
  */
 static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
-			    enum hrtimer_mode mode)
+			    enum hrtimer_mode mode, bool was_armed)
 {
-	debug_activate(timer, mode);
+	debug_activate(timer, mode, was_armed);
 	WARN_ON_ONCE(!base->cpu_base->online);
 
 	base->cpu_base->active_bases |= 1 << base->index;
@@ -1199,6 +1192,8 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
 	if (state & HRTIMER_STATE_ENQUEUED) {
 		bool reprogram;
 
+		debug_hrtimer_deactivate(timer);
+
 		/*
 		 * Remove the timer and force reprogramming when high
 		 * resolution mode is active and the timer is on the current
@@ -1207,7 +1202,6 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
 		 * reprogramming happens in the interrupt handler. This is a
 		 * rare case and less expensive than a smp call.
 		 */
-		debug_deactivate(timer);
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
 		/*
@@ -1274,15 +1268,15 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 {
 	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *new_base;
-	bool force_local, first;
+	bool force_local, first, was_armed;
 
 	/*
 	 * If the timer is on the local cpu base and is the first expiring
 	 * timer then this might end up reprogramming the hardware twice
-	 * (on removal and on enqueue). To avoid that by prevent the
-	 * reprogram on removal, keep the timer local to the current CPU
-	 * and enforce reprogramming after it is queued no matter whether
-	 * it is the new first expiring timer again or not.
+	 * (on removal and on enqueue). To avoid that prevent the reprogram
+	 * on removal, keep the timer local to the current CPU and enforce
+	 * reprogramming after it is queued no matter whether it is the new
+	 * first expiring timer again or not.
 	 */
 	force_local = base->cpu_base == this_cpu_base;
 	force_local &= base->cpu_base->next_timer == timer;
@@ -1304,7 +1298,7 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	 * avoids programming the underlying clock event twice (once at
 	 * removal and once after enqueue).
 	 */
-	remove_hrtimer(timer, base, true, force_local);
+	was_armed = remove_hrtimer(timer, base, true, force_local);
 
 	if (mode & HRTIMER_MODE_REL)
 		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
@@ -1321,7 +1315,7 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 		new_base = base;
 	}
 
-	first = enqueue_hrtimer(timer, new_base, mode);
+	first = enqueue_hrtimer(timer, new_base, mode, was_armed);
 
 	/*
 	 * If the hrtimer interrupt is running, then it will reevaluate the
@@ -1439,8 +1433,11 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (!hrtimer_callback_running(timer))
+	if (!hrtimer_callback_running(timer)) {
 		ret = remove_hrtimer(timer, base, false, false);
+		if (ret)
+			trace_hrtimer_cancel(timer);
+	}
 
 	unlock_hrtimer_base(timer, &flags);
 
@@ -1877,7 +1874,7 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	 */
 	if (restart != HRTIMER_NORESTART &&
 	    !(timer->state & HRTIMER_STATE_ENQUEUED))
-		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);
+		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
 
 	/*
 	 * Separate the ->running assignment from the ->state assignment.
@@ -2356,7 +2353,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 	while ((node = timerqueue_getnext(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
 		BUG_ON(hrtimer_callback_running(timer));
-		debug_deactivate(timer);
+		debug_hrtimer_deactivate(timer);
 
 		/*
 		 * Mark it as ENQUEUED not INACTIVE otherwise the
@@ -2373,7 +2370,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 		 * sort out already expired timers and reprogram the
 		 * event device.
 		 */
-		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
+		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS, true);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] hrtimer: Add debug object init assertion
  2026-02-24 16:36 ` [patch 21/48] hrtimer: Add debug object init assertion Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     513e744a0a4a70ebdb155611b897e9ed4d83831c
Gitweb:        https://git.kernel.org/tip/513e744a0a4a70ebdb155611b897e9ed4d83831c
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:54 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:09 +01:00

hrtimer: Add debug object init assertion

The debug object coverage in hrtimer_start_range_ns() happens too late to
do anything useful. Implement the assert_init fixup part and invoke it
early in hrtimer_start_range_ns().

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.143098153@kernel.org
---
 kernel/time/hrtimer.c | 43 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index e54f8b5..fa63e0b 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -441,12 +441,37 @@ static bool hrtimer_fixup_free(void *addr, enum debug_obj_state state)
 	}
 }
 
+/* Stub timer callback for improperly used timers. */
+static enum hrtimer_restart stub_timer(struct hrtimer *unused)
+{
+	WARN_ON_ONCE(1);
+	return HRTIMER_NORESTART;
+}
+
+/*
+ * hrtimer_fixup_assert_init is called when:
+ * - an untracked/uninit-ed object is found
+ */
+static bool hrtimer_fixup_assert_init(void *addr, enum debug_obj_state state)
+{
+	struct hrtimer *timer = addr;
+
+	switch (state) {
+	case ODEBUG_STATE_NOTAVAILABLE:
+		hrtimer_setup(timer, stub_timer, CLOCK_MONOTONIC, 0);
+		return true;
+	default:
+		return false;
+	}
+}
+
 static const struct debug_obj_descr hrtimer_debug_descr = {
-	.name		= "hrtimer",
-	.debug_hint	= hrtimer_debug_hint,
-	.fixup_init	= hrtimer_fixup_init,
-	.fixup_activate	= hrtimer_fixup_activate,
-	.fixup_free	= hrtimer_fixup_free,
+	.name			= "hrtimer",
+	.debug_hint		= hrtimer_debug_hint,
+	.fixup_init		= hrtimer_fixup_init,
+	.fixup_activate		= hrtimer_fixup_activate,
+	.fixup_free		= hrtimer_fixup_free,
+	.fixup_assert_init	= hrtimer_fixup_assert_init,
 };
 
 static inline void debug_hrtimer_init(struct hrtimer *timer)
@@ -470,6 +495,11 @@ static inline void debug_hrtimer_deactivate(struct hrtimer *timer)
 	debug_object_deactivate(timer, &hrtimer_debug_descr);
 }
 
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer)
+{
+	debug_object_assert_init(timer, &hrtimer_debug_descr);
+}
+
 void destroy_hrtimer_on_stack(struct hrtimer *timer)
 {
 	debug_object_free(timer, &hrtimer_debug_descr);
@@ -483,6 +513,7 @@ static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer) { }
 static inline void debug_hrtimer_activate(struct hrtimer *timer,
 					  enum hrtimer_mode mode) { }
 static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer) { }
 #endif
 
 static inline void debug_setup(struct hrtimer *timer, clockid_t clockid, enum hrtimer_mode mode)
@@ -1359,6 +1390,8 @@ void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	struct hrtimer_clock_base *base;
 	unsigned long flags;
 
+	debug_hrtimer_assert_init(timer);
+
 	/*
 	 * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
 	 * match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] x86/apic: Enable TSC coupled programming mode
  2026-02-24 16:36 ` [patch 20/48] x86/apic: Enable TSC coupled programming mode Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  2026-03-03  1:29   ` [patch 20/48] " Nathan Chancellor
  1 sibling, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     f246ec3478cfdab830ee0815209f48923e7ee5e2
Gitweb:        https://git.kernel.org/tip/f246ec3478cfdab830ee0815209f48923e7ee5e2
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:49 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:09 +01:00

x86/apic: Enable TSC coupled programming mode

The TSC deadline timer is directly coupled to the TSC and setting the next
deadline is tedious as the clockevents core code converts the
CLOCK_MONOTONIC based absolute expiry time to a relative expiry by reading
the current time from the TSC. It converts that delta to cycles and hands
the result to lapic_next_deadline(), which then has to read the TSC and add
the delta to program the timer.

The core code now supports coupled clock event devices and can provide the
expiry time in TSC cycles directly without reading the TSC at all.

This obviously works only when the TSC is the current clocksource, but
that's the default for all modern CPUs which implement the TSC deadline
timer. If the TSC is not the current clocksource (e.g. early boot) then the
core code falls back to the relative set_next_event() callback as before.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.076565985@kernel.org
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/clock_inlined.h |  8 ++++++++
 arch/x86/kernel/apic/apic.c          | 12 ++++++------
 arch/x86/kernel/tsc.c                |  3 ++-
 4 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d337d8d..560d2ce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -164,6 +164,7 @@ config X86
 	select EDAC_SUPPORT
 	select GENERIC_CLOCKEVENTS_BROADCAST	if X86_64 || (X86_32 && X86_LOCAL_APIC)
 	select GENERIC_CLOCKEVENTS_BROADCAST_IDLE	if GENERIC_CLOCKEVENTS_BROADCAST
+	select GENERIC_CLOCKEVENTS_COUPLED_INLINE	if X86_64
 	select GENERIC_CLOCKEVENTS_MIN_ADJUST
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_CPU_AUTOPROBE
diff --git a/arch/x86/include/asm/clock_inlined.h b/arch/x86/include/asm/clock_inlined.h
index 29902c5..b2dee8d 100644
--- a/arch/x86/include/asm/clock_inlined.h
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -11,4 +11,12 @@ static __always_inline u64 arch_inlined_clocksource_read(struct clocksource *cs)
 	return (u64)rdtsc_ordered();
 }
 
+struct clock_event_device;
+
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *evt)
+{
+	native_wrmsrq(MSR_IA32_TSC_DEADLINE, cycles);
+}
+
 #endif
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 5bb5b39..60cab20 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -591,14 +591,14 @@ static void setup_APIC_timer(void)
 
 	if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
 		levt->name = "lapic-deadline";
-		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
-				    CLOCK_EVT_FEAT_DUMMY);
+		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_DUMMY);
+		levt->features |= CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED;
+		levt->cs_id = CSID_X86_TSC;
 		levt->set_next_event = lapic_next_deadline;
-		clockevents_config_and_register(levt,
-						tsc_khz * (1000 / TSC_DIVISOR),
-						0xF, ~0UL);
-	} else
+		clockevents_config_and_register(levt, tsc_khz * (1000 / TSC_DIVISOR), 0xF, ~0UL);
+	} else {
 		clockevents_register_device(levt);
+	}
 
 	apic_update_vector(smp_processor_id(), LOCAL_TIMER_VECTOR, true);
 }
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74a26fb..f31046f 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1203,7 +1203,8 @@ static struct clocksource clocksource_tsc = {
 				  CLOCK_SOURCE_VALID_FOR_HRES |
 				  CLOCK_SOURCE_CAN_INLINE_READ |
 				  CLOCK_SOURCE_MUST_VERIFY |
-				  CLOCK_SOURCE_VERIFY_PERCPU,
+				  CLOCK_SOURCE_VERIFY_PERCPU |
+				  CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT,
 	.id			= CSID_X86_TSC,
 	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
 	.enable			= tsc_cs_enable,

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] clockevents: Provide support for clocksource coupled comparators
  2026-02-24 16:36 ` [patch 19/48] clockevents: Provide support for clocksource coupled comparators Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  2026-03-03 18:44   ` [patch 19/48] " Michael Kelley
  1 sibling, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     89f951a1e8ad781e7ac70eccddab0e0c270485f9
Gitweb:        https://git.kernel.org/tip/89f951a1e8ad781e7ac70eccddab0e0c270485f9
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:45 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:08 +01:00

clockevents: Provide support for clocksource coupled comparators

Some clockevent devices are coupled to the system clocksource by
implementing a less than or equal comparator which compares the programmed
absolute expiry time against the underlying time counter.

The timekeeping core provides a function to convert an absolute
CLOCK_MONOTONIC based expiry time to an absolute clock cycle value which can
be directly fed into the comparator. That spares two time reads in the next
event programming path: one to convert the absolute nanoseconds time to a
delta value and the other to convert the delta value back to an absolute
time value suitable for the comparator.

Provide a new clock event device callback which takes the absolute cycle
value and wire it up in clockevents_program_event(). As with clocksource
reads, allow architectures to inline the rearm operation.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.010425428@kernel.org
---
 include/linux/clockchips.h |  7 ++++--
 kernel/time/Kconfig        |  4 +++-
 kernel/time/clockevents.c  | 44 ++++++++++++++++++++++++++++++++-----
 3 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 5e8f781..92d9022 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -43,8 +43,9 @@ enum clock_event_state {
 /*
  * Clock event features
  */
-# define CLOCK_EVT_FEAT_PERIODIC	0x000001
-# define CLOCK_EVT_FEAT_ONESHOT		0x000002
+# define CLOCK_EVT_FEAT_PERIODIC		0x000001
+# define CLOCK_EVT_FEAT_ONESHOT			0x000002
+# define CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED	0x000004
 
 /*
  * x86(64) specific (mis)features:
@@ -100,6 +101,7 @@ struct clock_event_device {
 	void			(*event_handler)(struct clock_event_device *);
 	int			(*set_next_event)(unsigned long evt, struct clock_event_device *);
 	int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
+	void			(*set_next_coupled)(u64 cycles, struct clock_event_device *);
 	ktime_t			next_event;
 	u64			max_delta_ns;
 	u64			min_delta_ns;
@@ -107,6 +109,7 @@ struct clock_event_device {
 	u32			shift;
 	enum clock_event_state	state_use_accessors;
 	unsigned int		features;
+	enum clocksource_ids	cs_id;
 	unsigned long		retries;
 
 	int			(*set_state_periodic)(struct clock_event_device *);
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b51bc56..e1968ab 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -50,6 +50,10 @@ config GENERIC_CLOCKEVENTS_MIN_ADJUST
 config GENERIC_CLOCKEVENTS_COUPLED
 	bool
 
+config GENERIC_CLOCKEVENTS_COUPLED_INLINE
+	select GENERIC_CLOCKEVENTS_COUPLED
+	bool
+
 # Generic update of CMOS clock
 config GENERIC_CMOS_UPDATE
 	bool
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5abaeef..83712aa 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -292,6 +292,38 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
 
 #endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE
+#include <asm/clock_inlined.h>
+#else
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *dev) { }
+#endif
+
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+	u64 cycles;
+
+	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
+		return false;
+
+	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
+		return false;
+
+	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
+		arch_inlined_clockevent_set_next_coupled(cycles, dev);
+	else
+		dev->set_next_coupled(cycles, dev);
+	return true;
+}
+
+#else
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+	return false;
+}
+#endif
+
 /**
  * clockevents_program_event - Reprogram the clock event device.
  * @dev:	device to program
@@ -300,11 +332,10 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
  *
  * Returns 0 on success, -ETIME when the event is in the past.
  */
-int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
-			      bool force)
+int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, bool force)
 {
-	unsigned long long clc;
 	int64_t delta;
+	u64 cycles;
 	int rc;
 
 	if (WARN_ON_ONCE(expires < 0))
@@ -323,6 +354,9 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 	if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
 		return dev->set_next_ktime(expires, dev);
 
+	if (likely(clockevent_set_next_coupled(dev, expires)))
+		return 0;
+
 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
 	if (delta <= 0)
 		return force ? clockevents_program_min_delta(dev) : -ETIME;
@@ -330,8 +364,8 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 	delta = min(delta, (int64_t) dev->max_delta_ns);
 	delta = max(delta, (int64_t) dev->min_delta_ns);
 
-	clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
-	rc = dev->set_next_event((unsigned long) clc, dev);
+	cycles = ((u64)delta * dev->mult) >> dev->shift;
+	rc = dev->set_next_event((unsigned long) cycles, dev);
 
 	return (rc && force) ? clockevents_program_min_delta(dev) : rc;
 }

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] timekeeping: Provide infrastructure for coupled clockevents
  2026-02-24 16:36 ` [patch 18/48] timekeeping: Provide infrastructure for coupled clockevents Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     cd38bdb8e696a1a1eb12fc6662a6e420977aacfd
Gitweb:        https://git.kernel.org/tip/cd38bdb8e696a1a1eb12fc6662a6e420977aacfd
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:40 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:08 +01:00

timekeeping: Provide infrastructure for coupled clockevents

Some architectures have clockevent devices which are coupled to the system
clocksource by implementing a less than or equal comparator which compares
the programmed absolute expiry time against the underlying time
counter. Well known examples are TSC/TSC deadline timer and the S390 TOD
clocksource/comparator.

While the concept is nice it has some downsides:

  1) The clockevents core code is strictly based on relative expiry times
     as that's the most common case for clockevent device hardware. That
     requires converting the absolute expiry time provided by the caller
     (hrtimers, NOHZ code) to a relative expiry time by reading and
     subtracting the current time.

     The clockevent::set_next_event() callback must then read the counter
     again to convert the relative expiry back into an absolute one.

  2) The conversion factors from nanoseconds to counter clock cycles are
     set up when the clockevent is registered. When NTP applies corrections,
     the clockevent conversion factors can deviate substantially from the
     clocksource conversion, which results in timers firing late or, in the
     worst case, early. An early expiry then requires a reprogram with a
     short delta.

     In most cases this is papered over by the fact that the read in the
     set_next_event() callback happens after the read which is used to
     calculate the delta. So the tendency is that timers expire mostly
     late.

All of this can be avoided by providing support for these devices in the
core code:

  1) The timekeeping core keeps track of the last update to the clocksource
     by storing the base nanoseconds and the corresponding clocksource
     counter value. That's used to keep the conversion math for reading the
     time within 64-bit in the common case.

     This information can be used to avoid both reads of the underlying
     clocksource in the clockevents reprogramming path:

     delta = expiry - base_ns;
     cycles = base_cycles + ((delta * clockevent::mult) >> clockevent::shift);

     The resulting cycles value can be directly used to program the
     comparator.

  2) As #1 no longer provides the "compensation" through the second
     read, the deviation between the clocksource and clockevent
     conversions caused by NTP becomes more prominent.

     This can be cured by letting the timekeeping core compute and store
     the reverse conversion factors when the clocksource cycles to
     nanoseconds factors are modified by NTP:

         CS::MULT       (1 << NS_TO_CYC_SHIFT)
     ---------------- = ----------------------
     (1 << CS::SHIFT)       NS_TO_CYC_MULT

     Ergo: NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT

     The NS_TO_CYC_SHIFT value is calculated when the clocksource is
     installed so that it aims for a one hour maximum sleep time.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.944763521@kernel.org
---
 include/linux/clocksource.h         |   1 +-
 include/linux/timekeeper_internal.h |   8 ++-
 kernel/time/Kconfig                 |   3 +-
 kernel/time/timekeeping.c           | 110 +++++++++++++++++++++++++++-
 kernel/time/timekeeping.h           |   2 +-
 5 files changed, 124 insertions(+)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 54366d5..25774fc 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -150,6 +150,7 @@ struct clocksource {
 #define CLOCK_SOURCE_RESELECT			0x100
 #define CLOCK_SOURCE_VERIFY_PERCPU		0x200
 #define CLOCK_SOURCE_CAN_INLINE_READ		0x400
+#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT	0x800
 
 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index b8ae89e..e36d11e 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -72,6 +72,10 @@ struct tk_read_base {
  * @id:				The timekeeper ID
  * @tkr_raw:			The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec:			CLOCK_MONOTONIC_RAW  time in seconds
+ * @cs_id:			The ID of the current clocksource
+ * @cs_ns_to_cyc_mult:		Multiplier for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_shift:		Shift value for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_maxns:		Maximum nanoseconds to cycles conversion range
  * @clock_was_set_seq:		The sequence number of clock was set events
  * @cs_was_changed_seq:		The sequence number of clocksource change events
  * @clock_valid:		Indicator for valid clock
@@ -159,6 +163,10 @@ struct timekeeper {
 	u64			raw_sec;
 
 	/* Cachline 3 and 4 (timekeeping internal variables): */
+	enum clocksource_ids	cs_id;
+	u32			cs_ns_to_cyc_mult;
+	u32			cs_ns_to_cyc_shift;
+	u64			cs_ns_to_cyc_maxns;
 	unsigned int		clock_was_set_seq;
 	u8			cs_was_changed_seq;
 	u8			clock_valid;
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 07b048b..b51bc56 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -47,6 +47,9 @@ config GENERIC_CLOCKEVENTS_BROADCAST_IDLE
 config GENERIC_CLOCKEVENTS_MIN_ADJUST
 	bool
 
+config GENERIC_CLOCKEVENTS_COUPLED
+	bool
+
 # Generic update of CMOS clock
 config GENERIC_CMOS_UPDATE
 	bool
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 63aa31f..b7a0f93 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -391,6 +391,20 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr_raw.mult = clock->mult;
 	tk->ntp_err_mult = 0;
 	tk->skip_second_overflow = 0;
+
+	tk->cs_id = clock->id;
+
+	/* Coupled clockevent data */
+	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) &&
+	    clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT) {
+		/*
+		 * Aim for a one hour maximum delta and use KHz to handle
+		 * clocksources with a frequency above 4GHz correctly as
+		 * the frequency argument of clocks_calc_mult_shift() is u32.
+		 */
+		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
+				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+	}
 }
 
 /* Timekeeper helper functions. */
@@ -720,6 +734,36 @@ static inline void tk_update_ktime_data(struct timekeeper *tk)
 	tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);
 }
 
+static inline void tk_update_ns_to_cyc(struct timekeeper *tks, struct timekeeper *tkc)
+{
+	struct tk_read_base *tkrs = &tks->tkr_mono;
+	struct tk_read_base *tkrc = &tkc->tkr_mono;
+	unsigned int shift;
+
+	if (!IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) ||
+	    !(tkrs->clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+		return;
+
+	if (tkrs->mult == tkrc->mult && tkrs->shift == tkrc->shift)
+		return;
+	/*
+	 * The conversion math is simple:
+	 *
+	 *       CS::MULT       (1 << NS_TO_CYC_SHIFT)
+	 *   ---------------- = ----------------------
+	 *   (1 << CS::SHIFT)       NS_TO_CYC_MULT
+	 *
+	 * Ergo:
+	 *
+	 *   NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
+	 *
+	 * NS_TO_CYC_SHIFT has been set up in tk_setup_internals()
+	 */
+	shift = tkrs->shift + tks->cs_ns_to_cyc_shift;
+	tks->cs_ns_to_cyc_mult = (u32)div_u64(1ULL << shift, tkrs->mult);
+	tks->cs_ns_to_cyc_maxns = div_u64(tkrs->clock->mask, tks->cs_ns_to_cyc_mult);
+}
+
 /*
  * Restore the shadow timekeeper from the real timekeeper.
  */
@@ -754,6 +798,7 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act
 	tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real;
 
 	if (tk->id == TIMEKEEPER_CORE) {
+		tk_update_ns_to_cyc(tk, &tkd->timekeeper);
 		update_vsyscall(tk);
 		update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
 
@@ -808,6 +853,71 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 	tk_update_coarse_nsecs(tk);
 }
 
+/*
+ * ktime_expiry_to_cycles - Convert an expiry time to clocksource cycles
+ * @id:		Clocksource ID which is required for validity
+ * @expires_ns:	Absolute CLOCK_MONOTONIC expiry time (nsecs) to be converted
+ * @cycles:	Pointer to storage for corresponding absolute cycles value
+ *
+ * Convert a CLOCK_MONOTONIC based absolute expiry time to a cycles value
+ * based on the correlated clocksource of the clockevent device by using
+ * the base nanoseconds and cycles values of the last timekeeper update and
+ * converting the delta between @expires_ns and base nanoseconds to cycles.
+ *
+ * This only works for clockevent devices which are using a less than or
+ * equal comparator against the clocksource.
+ *
+ * Utilizing this avoids two clocksource reads for such devices, the
+ * ktime_get() in clockevents_program_event() to calculate the delta expiry
+ * value and the readout in the device::set_next_event() callback to
+ * convert the delta back to an absolute comparator value.
+ *
+ * Returns: True if @id matches the current clocksource ID, false otherwise
+ */
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	struct tk_read_base *tkrm = &tk->tkr_mono;
+	ktime_t base_ns, delta_ns, max_ns;
+	u64 base_cycles, delta_cycles;
+	unsigned int seq;
+	u32 mult, shift;
+
+	/*
+	 * Racy check to avoid the seqcount overhead when ID does not match. If
+	 * the relevant clocksource is installed concurrently, then this will
+	 * just delay the switch over to this mechanism until the next event is
+	 * programmed. If the ID does not match, the clockevents code will use
+	 * the regular relative set_next_event() callback as before.
+	 */
+	if (data_race(tk->cs_id) != id)
+		return false;
+
+	do {
+		seq = read_seqcount_begin(&tk_core.seq);
+
+		if (tk->cs_id != id)
+			return false;
+
+		base_cycles = tkrm->cycle_last;
+		base_ns = tkrm->base + (tkrm->xtime_nsec >> tkrm->shift);
+
+		mult = tk->cs_ns_to_cyc_mult;
+		shift = tk->cs_ns_to_cyc_shift;
+		max_ns = tk->cs_ns_to_cyc_maxns;
+
+	} while (read_seqcount_retry(&tk_core.seq, seq));
+
+	/* Prevent negative deltas and multiplication overflows */
+	delta_ns = min(expires_ns - base_ns, max_ns);
+	delta_ns = max(delta_ns, 0);
+
+	/* Convert to cycles */
+	delta_cycles = ((u64)delta_ns * mult) >> shift;
+	*cycles = base_cycles + delta_cycles;
+	return true;
+}
+
 /**
  * ktime_get_real_ts64 - Returns the time of day in a timespec64.
  * @ts:		pointer to the timespec to be set
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 543beba..198d060 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -9,6 +9,8 @@ extern ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq,
 					    ktime_t *offs_boot,
 					    ktime_t *offs_tai);
 
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles);
+
 extern int timekeeping_valid_for_hres(void);
 extern u64 timekeeping_max_deferment(void);
 extern void timekeeping_warp_clock(void);

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] x86/apic: Avoid the PVOPS indirection for the TSC deadline timer
  2026-02-24 16:36 ` [patch 17/48] x86/apic: Avoid the PVOPS indirection for the TSC deadline timer Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     23028286128d817a414eee0c0a2c6cdc57a83e6f
Gitweb:        https://git.kernel.org/tip/23028286128d817a414eee0c0a2c6cdc57a83e6f
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:34 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:08 +01:00

x86/apic: Avoid the PVOPS indirection for the TSC deadline timer

XEN PV does not emulate the TSC deadline timer, so the PVOPS indirection
for writing the deadline MSR can be avoided completely.

Use native_wrmsrq() instead.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.877429827@kernel.org
---
 arch/x86/kernel/apic/apic.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 18208be..5bb5b39 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -426,7 +426,7 @@ static int lapic_next_deadline(unsigned long delta, struct clock_event_device *e
 	 */
 	u64 tsc = rdtsc();
 
-	wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+	native_wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
 	return 0;
 }
 
@@ -450,7 +450,7 @@ static int lapic_timer_shutdown(struct clock_event_device *evt)
 	 * the timer _and_ zero the counter registers:
 	 */
 	if (v & APIC_LVT_TIMER_TSCDEADLINE)
-		wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
+		native_wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
 	else
 		apic_write(APIC_TMICT, 0);
 
@@ -547,6 +547,11 @@ static __init bool apic_validate_deadline_timer(void)
 
 	if (!boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
 		return false;
+
+	/* XEN_PV does not support it, but be paranoid about it anyway */
+	if (boot_cpu_has(X86_FEATURE_XENPV))
+		goto clear;
+
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return true;
 
@@ -559,9 +564,11 @@ static __init bool apic_validate_deadline_timer(void)
 	if (boot_cpu_data.microcode >= rev)
 		return true;
 
-	setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 	pr_err(FW_BUG "TSC_DEADLINE disabled due to Errata; "
 	       "please update microcode to version: 0x%x (or later)\n", rev);
+
+clear:
+	setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 	return false;
 }
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] x86/apic: Remove pointless fence in lapic_next_deadline()
  2026-02-24 16:36 ` [patch 16/48] x86/apic: Remove pointless fence in lapic_next_deadline() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     92d0e753d57ec581a424d9903afff5e17bd1e6e4
Gitweb:        https://git.kernel.org/tip/92d0e753d57ec581a424d9903afff5e17bd1e6e4
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:29 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:07 +01:00

x86/apic: Remove pointless fence in lapic_next_deadline()

lapic_next_deadline() contains a fence before the TSC read and the write to
the TSC_DEADLINE MSR, with a content-free and therefore useless comment:

    /* This MSR is special and need a special fence: */

The MSR is not really special. It is just not a serializing MSR, but that
does not matter at all in this context as all of these operations are
strictly CPU local.

The only thing the fence prevents is that the RDTSC is speculated ahead,
but that's not really relevant as the delta is calculated way before based
on a previous TSC read and therefore inaccurate by definition.

So removing the fence just makes it slightly more inaccurate in the worst
case, but that is irrelevant as it's way below the inherent system
latencies and variations.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.809059527@kernel.org
---
 arch/x86/kernel/apic/apic.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d93f87f..18208be 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -412,22 +412,20 @@ EXPORT_SYMBOL_GPL(setup_APIC_eilvt);
 /*
  * Program the next event, relative to now
  */
-static int lapic_next_event(unsigned long delta,
-			    struct clock_event_device *evt)
+static int lapic_next_event(unsigned long delta, struct clock_event_device *evt)
 {
 	apic_write(APIC_TMICT, delta);
 	return 0;
 }
 
-static int lapic_next_deadline(unsigned long delta,
-			       struct clock_event_device *evt)
+static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt)
 {
-	u64 tsc;
-
-	/* This MSR is special and need a special fence: */
-	weak_wrmsr_fence();
+	/*
+	 * There is no weak_wrmsr_fence() required here as all of this is purely
+	 * CPU local. Avoid the [ml]fence overhead.
+	 */
+	u64 tsc = rdtsc();
 
-	tsc = rdtsc();
 	wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] x86: Inline TSC reads in timekeeping
  2026-02-24 16:36 ` [patch 15/48] x86: Inline TSC reads in timekeeping Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     b27801189f7fc97a960a96a63b78dcabbb67a52f
Gitweb:        https://git.kernel.org/tip/b27801189f7fc97a960a96a63b78dcabbb67a52f
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:24 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:07 +01:00

x86: Inline TSC reads in timekeeping

Avoid the overhead of the indirect call for a single instruction to read
the TSC.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.741886362@kernel.org
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/clock_inlined.h | 14 ++++++++++++++
 arch/x86/kernel/tsc.c                |  1 +
 3 files changed, 16 insertions(+)
 create mode 100644 arch/x86/include/asm/clock_inlined.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b1..d337d8d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -141,6 +141,7 @@ config X86
 	select ARCH_USE_SYM_ANNOTATIONS
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_DEFAULT_BPF_JIT	if X86_64
+	select ARCH_WANTS_CLOCKSOURCE_READ_INLINE	if X86_64
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANT_GENERAL_HUGETLB
diff --git a/arch/x86/include/asm/clock_inlined.h b/arch/x86/include/asm/clock_inlined.h
new file mode 100644
index 0000000..29902c5
--- /dev/null
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CLOCK_INLINED_H
+#define _ASM_X86_CLOCK_INLINED_H
+
+#include <asm/tsc.h>
+
+struct clocksource;
+
+static __always_inline u64 arch_inlined_clocksource_read(struct clocksource *cs)
+{
+	return (u64)rdtsc_ordered();
+}
+
+#endif
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d9aa694..74a26fb 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1201,6 +1201,7 @@ static struct clocksource clocksource_tsc = {
 	.mask			= CLOCKSOURCE_MASK(64),
 	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
 				  CLOCK_SOURCE_VALID_FOR_HRES |
+				  CLOCK_SOURCE_CAN_INLINE_READ |
 				  CLOCK_SOURCE_MUST_VERIFY |
 				  CLOCK_SOURCE_VERIFY_PERCPU,
 	.id			= CSID_X86_TSC,

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] timekeeping: Allow inlining clocksource::read()
  2026-02-24 16:36 ` [patch 14/48] timekeeping: Allow inlining clocksource::read() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     2e27beeb66e43f3b84aef5a07e486a5d50695c06
Gitweb:        https://git.kernel.org/tip/2e27beeb66e43f3b84aef5a07e486a5d50695c06
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:20 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:07 +01:00

timekeeping: Allow inlining clocksource::read()

On some architectures clocksource::read() boils down to a single
instruction, so the indirect function call is just massive overhead,
especially with speculative execution mitigations in effect.

Allow architectures to enable conditional inlining of that read to avoid
that by:

   - providing a static branch to switch to the inlined variant

   - disabling the branch before clocksource changes

   - enabling the branch after a clocksource change, when the clocksource
     indicates in a feature flag that it is the one which provides the
     inlined variant

This is intentionally not a static call as that would only remove the
indirect call, but not the rest of the overhead.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.675151545@kernel.org
---
 include/linux/clocksource.h |  2 +-
 kernel/time/Kconfig         |  3 +-
 kernel/time/timekeeping.c   | 74 ++++++++++++++++++++++++++----------
 3 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 65b7c41..54366d5 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -149,6 +149,8 @@ struct clocksource {
 #define CLOCK_SOURCE_SUSPEND_NONSTOP		0x80
 #define CLOCK_SOURCE_RESELECT			0x100
 #define CLOCK_SOURCE_VERIFY_PERCPU		0x200
+#define CLOCK_SOURCE_CAN_INLINE_READ		0x400
+
 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
 
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7c6a52f..07b048b 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -17,6 +17,9 @@ config ARCH_CLOCKSOURCE_DATA
 config ARCH_CLOCKSOURCE_INIT
 	bool
 
+config ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+	bool
+
 # Timekeeping vsyscall support
 config GENERIC_TIME_VSYSCALL
 	bool
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 91fa200..63aa31f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -3,34 +3,30 @@
  *  Kernel timekeeping code and accessor functions. Based on code from
  *  timer.c, moved in commit 8524070b7982.
  */
-#include <linux/timekeeper_internal.h>
-#include <linux/module.h>
-#include <linux/interrupt.h>
+#include <linux/audit.h>
+#include <linux/clocksource.h>
+#include <linux/compiler.h>
+#include <linux/jiffies.h>
 #include <linux/kobject.h>
-#include <linux/percpu.h>
-#include <linux/init.h>
-#include <linux/mm.h>
+#include <linux/module.h>
 #include <linux/nmi.h>
-#include <linux/sched.h>
-#include <linux/sched/loadavg.h>
+#include <linux/pvclock_gtod.h>
+#include <linux/random.h>
 #include <linux/sched/clock.h>
+#include <linux/sched/loadavg.h>
+#include <linux/static_key.h>
+#include <linux/stop_machine.h>
 #include <linux/syscore_ops.h>
-#include <linux/clocksource.h>
-#include <linux/jiffies.h>
+#include <linux/tick.h>
 #include <linux/time.h>
 #include <linux/timex.h>
-#include <linux/tick.h>
-#include <linux/stop_machine.h>
-#include <linux/pvclock_gtod.h>
-#include <linux/compiler.h>
-#include <linux/audit.h>
-#include <linux/random.h>
+#include <linux/timekeeper_internal.h>
 
 #include <vdso/auxclock.h>
 
 #include "tick-internal.h"
-#include "ntp_internal.h"
 #include "timekeeping_internal.h"
+#include "ntp_internal.h"
 
 #define TK_CLEAR_NTP		(1 << 0)
 #define TK_CLOCK_WAS_SET	(1 << 1)
@@ -275,6 +271,11 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
 	tk->monotonic_to_boot = ktime_to_timespec64(tk->offs_boot);
 }
 
+#ifdef CONFIG_ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+#include <asm/clock_inlined.h>
+
+static DEFINE_STATIC_KEY_FALSE(clocksource_read_inlined);
+
 /*
  * tk_clock_read - atomic clocksource read() helper
  *
@@ -288,13 +289,36 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
  * a read of the fast-timekeeper tkrs (which is protected by its own locking
  * and update logic).
  */
-static inline u64 tk_clock_read(const struct tk_read_base *tkr)
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
 {
 	struct clocksource *clock = READ_ONCE(tkr->clock);
 
+	if (static_branch_likely(&clocksource_read_inlined))
+		return arch_inlined_clocksource_read(clock);
+
 	return clock->read(clock);
 }
 
+static inline void clocksource_disable_inline_read(void)
+{
+	static_branch_disable(&clocksource_read_inlined);
+}
+
+static inline void clocksource_enable_inline_read(void)
+{
+	static_branch_enable(&clocksource_read_inlined);
+}
+#else
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
+{
+	struct clocksource *clock = READ_ONCE(tkr->clock);
+
+	return clock->read(clock);
+}
+static inline void clocksource_disable_inline_read(void) { }
+static inline void clocksource_enable_inline_read(void) { }
+#endif
+
 /**
  * tk_setup_internals - Set up internals to use clocksource clock.
  *
@@ -375,7 +399,7 @@ static noinline u64 delta_to_ns_safe(const struct tk_read_base *tkr, u64 delta)
 	return mul_u64_u32_add_u64_shr(delta, tkr->mult, tkr->xtime_nsec, tkr->shift);
 }
 
-static inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
+static __always_inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
 {
 	/* Calculate the delta since the last update_wall_time() */
 	u64 mask = tkr->mask, delta = (cycles - tkr->cycle_last) & mask;
@@ -1631,7 +1655,19 @@ int timekeeping_notify(struct clocksource *clock)
 
 	if (tk->tkr_mono.clock == clock)
 		return 0;
+
+	/* Disable inlined reads across the clocksource switch */
+	clocksource_disable_inline_read();
+
 	stop_machine(change_clocksource, clock, NULL);
+
+	/*
+	 * If the clocksource has been selected and supports inlined reads
+	 * enable the branch.
+	 */
+	if (tk->tkr_mono.clock == clock && clock->flags & CLOCK_SOURCE_CAN_INLINE_READ)
+		clocksource_enable_inline_read();
+
 	tick_clock_notify();
 	return tk->tkr_mono.clock == clock ? 0 : -1;
 }

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME
  2026-02-24 16:36 ` [patch 13/48] clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     70802807398c65f5a49b2baec87e1f6c8db43de6
Gitweb:        https://git.kernel.org/tip/70802807398c65f5a49b2baec87e1f6c8db43de6
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:15 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:06 +01:00

clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME

The only real use case for this is the hrtimer based broadcast device.
No point in using two different feature flags for this.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.609049777@kernel.org
---
 include/linux/clockchips.h           | 1 -
 kernel/time/clockevents.c            | 4 ++--
 kernel/time/tick-broadcast-hrtimer.c | 1 -
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index b0df28d..5e8f781 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -45,7 +45,6 @@ enum clock_event_state {
  */
 # define CLOCK_EVT_FEAT_PERIODIC	0x000001
 # define CLOCK_EVT_FEAT_ONESHOT		0x000002
-# define CLOCK_EVT_FEAT_KTIME		0x000004
 
 /*
  * x86(64) specific (mis)features:
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index eaae1ce..5abaeef 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -319,8 +319,8 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 	WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n",
 		  clockevent_get_state(dev));
 
-	/* Shortcut for clockevent devices that can deal with ktime. */
-	if (dev->features & CLOCK_EVT_FEAT_KTIME)
+	/* ktime_t based reprogramming for the broadcast hrtimer device */
+	if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
 		return dev->set_next_ktime(expires, dev);
 
 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index a88b72b..51f6a10 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -78,7 +78,6 @@ static struct clock_event_device ce_broadcast_hrtimer = {
 	.set_state_shutdown	= bc_shutdown,
 	.set_next_ktime		= bc_set_next,
 	.features		= CLOCK_EVT_FEAT_ONESHOT |
-				  CLOCK_EVT_FEAT_KTIME |
 				  CLOCK_EVT_FEAT_HRTIMER,
 	.rating			= 0,
 	.bound_on		= -1,


* [tip: sched/hrtick] tick/sched: Avoid hrtimer_cancel/start() sequence
  2026-02-24 16:36 ` [patch 12/48] tick/sched: Avoid hrtimer_cancel/start() sequence Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     adcec6a7f566aa237db211f2947b039418450b92
Gitweb:        https://git.kernel.org/tip/adcec6a7f566aa237db211f2947b039418450b92
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:10 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:06 +01:00

tick/sched: Avoid hrtimer_cancel/start() sequence

The sequence of cancel and start is inefficient. It has to do the timer
lock/unlock twice and in the worst case has to reprogram the underlying
clock event device twice.

The reason it is done this way is the use of hrtimer_forward_now(),
which requires the timer to be inactive.

But that can be completely avoided as the forward can be done on a variable
and does not need any of the overrun accounting provided by
hrtimer_forward_now().

Implement a trivial forwarding mechanism and replace the cancel/reprogram
sequence with hrtimer_start(..., new_expiry).

In the non-high-resolution case the timer is not actually armed, but used
for storage so that code checking for expiry times can unconditionally look
it up in the timer. So it is safe for that case to set the new expiry time
directly.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.542178086@kernel.org
---
 kernel/time/tick-sched.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f7907fa..9e52644 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -864,19 +864,32 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/* Simplified variant of hrtimer_forward_now() */
+static ktime_t tick_forward_now(ktime_t expires, ktime_t now)
+{
+	ktime_t delta = now - expires;
+
+	if (likely(delta < TICK_NSEC))
+		return expires + TICK_NSEC;
+
+	expires += TICK_NSEC * ktime_divns(delta, TICK_NSEC);
+	if (expires > now)
+		return expires;
+	return expires + TICK_NSEC;
+}
+
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
-	hrtimer_cancel(&ts->sched_timer);
-	hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
+	ktime_t expires = ts->last_tick;
 
-	/* Forward the time to expire in the future */
-	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
+	if (now >= expires)
+		expires = tick_forward_now(expires, now);
 
 	if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
-		hrtimer_start_expires(&ts->sched_timer,
-				      HRTIMER_MODE_ABS_PINNED_HARD);
+		hrtimer_start(&ts->sched_timer,	expires, HRTIMER_MODE_ABS_PINNED_HARD);
 	} else {
-		tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
+		hrtimer_set_expires(&ts->sched_timer, expires);
+		tick_program_event(expires, 1);
 	}
 
 	/*


* [tip: sched/hrtick] sched/hrtick: Mark hrtick timer LAZY_REARM
  2026-02-24 16:36 ` [patch 11/48] sched/hrtick: Mark hrtick timer LAZY_REARM Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     0abec32a6836eca6b61ae81e4829f94abd4647c7
Gitweb:        https://git.kernel.org/tip/0abec32a6836eca6b61ae81e4829f94abd4647c7
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:06 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:06 +01:00

sched/hrtick: Mark hrtick timer LAZY_REARM

The hrtick timer is frequently rearmed before expiry and most of the time
the new expiry is past the armed one. As this happens on every context
switch it becomes expensive with scheduling heavy workloads, especially in
virtual machines, as the "hardware" reprogramming implies a VM exit.

hrtimer now provides a lazy rearm mode flag which skips the reprogramming if:

    1) The timer was the first expiring timer before the rearm

    2) The new expiry time is farther out than the armed time

This avoids a massive amount of reprogramming operations of the hrtick
timer for the price of eventually taking the already armed interrupt for
nothing.

Mark the hrtick timer accordingly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.475409346@kernel.org
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5bc446e..2d1239a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -998,7 +998,8 @@ static void hrtick_rq_init(struct rq *rq)
 {
 	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
 	rq->hrtick_sched = HRTICK_SCHED_NONE;
-	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+		      HRTIMER_MODE_REL_HARD | HRTIMER_MODE_LAZY_REARM);
 }
 #else /* !CONFIG_SCHED_HRTICK: */
 static inline void hrtick_clear(struct rq *rq) { }


* [tip: sched/hrtick] hrtimer: Provide LAZY_REARM mode
  2026-02-24 16:36 ` [patch 10/48] hrtimer: Provide LAZY_REARM mode Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     b7dd64778aa3f89de9afa1e81171cfe110ddc525
Gitweb:        https://git.kernel.org/tip/b7dd64778aa3f89de9afa1e81171cfe110ddc525
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:36:01 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:06 +01:00

hrtimer: Provide LAZY_REARM mode

The hrtick timer is frequently rearmed before expiry and most of the time
the new expiry is past the armed one. As this happens on every context
switch it becomes expensive with scheduling heavy workloads, especially in
virtual machines, as the "hardware" reprogramming implies a VM exit.

Add a lazy rearm mode flag which skips the reprogramming if:

    1) The timer was the first expiring timer before the rearm

    2) The new expiry time is farther out than the armed time

This avoids a massive amount of reprogramming operations of the hrtick
timer for the price of eventually taking the already armed interrupt for
nothing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.408524456@kernel.org
---
 include/linux/hrtimer.h       |  8 ++++++++
 include/linux/hrtimer_types.h |  3 +++
 kernel/time/hrtimer.c         | 17 ++++++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index b500385..c924bb2 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -31,6 +31,13 @@
  *				  soft irq context
  * HRTIMER_MODE_HARD		- Timer callback function will be executed in
  *				  hard irq context even on PREEMPT_RT.
+ * HRTIMER_MODE_LAZY_REARM	- Avoid reprogramming if the timer was the
+ *				  first expiring timer and is moved into the
+ *				  future. Special mode for the HRTICK timer to
+ *				  avoid extensive reprogramming of the hardware,
+ *				  which is expensive in virtual machines. Risks
+ *				  a pointless expiry, but that's better than
+ *				  reprogramming on every context switch.
  */
 enum hrtimer_mode {
 	HRTIMER_MODE_ABS	= 0x00,
@@ -38,6 +45,7 @@ enum hrtimer_mode {
 	HRTIMER_MODE_PINNED	= 0x02,
 	HRTIMER_MODE_SOFT	= 0x04,
 	HRTIMER_MODE_HARD	= 0x08,
+	HRTIMER_MODE_LAZY_REARM	= 0x10,
 
 	HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
 	HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
diff --git a/include/linux/hrtimer_types.h b/include/linux/hrtimer_types.h
index 8fbbb6b..64381c6 100644
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -33,6 +33,8 @@ enum hrtimer_restart {
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
  * @is_hard:	Set if hrtimer will be expired in hard interrupt context
  *		even on RT.
+ * @is_lazy:	Set if the timer is frequently rearmed to avoid updates
+ *		of the clock event device
  *
  * The hrtimer structure must be initialized by hrtimer_setup()
  */
@@ -45,6 +47,7 @@ struct hrtimer {
 	u8				is_rel;
 	u8				is_soft;
 	u8				is_hard;
+	u8				is_lazy;
 };
 
 #endif /* _LINUX_HRTIMER_TYPES_H */
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 67917ce..e54f8b5 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1152,7 +1152,7 @@ static void __remove_hrtimer(struct hrtimer *timer,
 	 * an superfluous call to hrtimer_force_reprogram() on the
 	 * remote cpu later on if the same timer gets enqueued again.
 	 */
-	if (reprogram && timer == cpu_base->next_timer)
+	if (reprogram && timer == cpu_base->next_timer && !timer->is_lazy)
 		hrtimer_force_reprogram(cpu_base, 1);
 }
 
@@ -1322,6 +1322,20 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	}
 
 	/*
+	 * Special case for the HRTICK timer. It is frequently rearmed and most
+	 * of the time moves the expiry into the future. That's expensive in
+	 * virtual machines and it's better to take the pointless already armed
+	 * interrupt than reprogramming the hardware on every context switch.
+	 *
+	 * If the new expiry is before the armed time, then reprogramming is
+	 * required.
+	 */
+	if (timer->is_lazy) {
+		if (new_base->cpu_base->expires_next <= hrtimer_get_expires(timer))
+			return 0;
+	}
+
+	/*
 	 * Timer was forced to stay on the current CPU to avoid
 	 * reprogramming on removal and enqueue. Force reprogram the
 	 * hardware by evaluating the new first expiring timer.
@@ -1675,6 +1689,7 @@ static void __hrtimer_setup(struct hrtimer *timer,
 	base += hrtimer_clockid_to_base(clock_id);
 	timer->is_soft = softtimer;
 	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
+	timer->is_lazy = !!(mode & HRTIMER_MODE_LAZY_REARM);
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
 


* [tip: sched/hrtick] sched/hrtick: Avoid tiny hrtick rearms
  2026-02-24 16:35 ` [patch 09/48] sched/hrtick: Avoid tiny hrtick rearms Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     c8cdb9b516407a0b8c653c9c1d6f0931c3864384
Gitweb:        https://git.kernel.org/tip/c8cdb9b516407a0b8c653c9c1d6f0931c3864384
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:56 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:05 +01:00

sched/hrtick: Avoid tiny hrtick rearms

Tiny adjustments to the hrtick expiry time below 5 microseconds just cause
extra work for no real value. Filter them out when restarting the hrtick.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.340593047@kernel.org
---
 kernel/sched/core.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a868f0a..5bc446e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -903,12 +903,24 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void __hrtick_restart(struct rq *rq)
+static inline bool hrtick_needs_rearm(struct hrtimer *timer, ktime_t expires)
+{
+	/*
+	 * Queued is false when the timer is not started or currently
+	 * running the callback. In both cases, restart. If queued check
+	 * whether the expiry time actually changes substantially.
+	 */
+	return !hrtimer_is_queued(timer) ||
+		abs(expires - hrtimer_get_expires(timer)) > 5000;
+}
+
+static void hrtick_cond_restart(struct rq *rq)
 {
 	struct hrtimer *timer = &rq->hrtick_timer;
 	ktime_t time = rq->hrtick_time;
 
-	hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
+	if (hrtick_needs_rearm(timer, time))
+		hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
 }
 
 /*
@@ -920,7 +932,7 @@ static void __hrtick_start(void *arg)
 	struct rq_flags rf;
 
 	rq_lock(rq, &rf);
-	__hrtick_restart(rq);
+	hrtick_cond_restart(rq);
 	rq_unlock(rq, &rf);
 }
 
@@ -950,9 +962,11 @@ void hrtick_start(struct rq *rq, u64 delay)
 	}
 
 	rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
+	if (!hrtick_needs_rearm(&rq->hrtick_timer, rq->hrtick_time))
+		return;
 
 	if (rq == this_rq())
-		__hrtick_restart(rq);
+		hrtimer_start(&rq->hrtick_timer, rq->hrtick_time, HRTIMER_MODE_ABS_PINNED_HARD);
 	else
 		smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
 }
@@ -966,7 +980,7 @@ static inline void hrtick_schedule_exit(struct rq *rq)
 {
 	if (rq->hrtick_sched & HRTICK_SCHED_START) {
 		rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
-		__hrtick_restart(rq);
+		hrtick_cond_restart(rq);
 	} else if (idle_rq(rq)) {
 		/*
 		 * No need for using hrtimer_is_active(). The timer is CPU local


* [tip: sched/hrtick] sched: Optimize hrtimer handling
  2026-02-24 16:35 ` [patch 08/48] sched: Optimize hrtimer handling Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     96d1610e0b20b5a627773874b4514ae922ad98f6
Gitweb:        https://git.kernel.org/tip/96d1610e0b20b5a627773874b4514ae922ad98f6
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:52 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:05 +01:00

sched: Optimize hrtimer handling

schedule() provides several mechanisms to update the hrtick timer:

  1) When the next task is picked

  2) When the balance callbacks are invoked before rq::lock is released

Each of them can result in a new first expiring timer and cause a reprogram
of the clock event device.

Solve this by deferring the rearm to the end of schedule(), right before
rq::lock is released. A flag set on entry tells hrtick_start() to cache the
runtime constraint in rq::hrtick_delay without touching the timer itself.

Right before releasing rq::lock evaluate the flags and either rearm or
cancel the hrtick timer.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.273068659@kernel.org
---
 kernel/sched/core.c  | 57 ++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  2 ++-
 2 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a716cc6..a868f0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -872,6 +872,12 @@ void update_rq_clock(struct rq *rq)
  * Use HR-timers to deliver accurate preemption points.
  */
 
+enum {
+	HRTICK_SCHED_NONE		= 0,
+	HRTICK_SCHED_DEFER		= BIT(1),
+	HRTICK_SCHED_START		= BIT(2),
+};
+
 static void hrtick_clear(struct rq *rq)
 {
 	if (hrtimer_active(&rq->hrtick_timer))
@@ -932,6 +938,17 @@ void hrtick_start(struct rq *rq, u64 delay)
 	 * doesn't make sense and can cause timer DoS.
 	 */
 	delta = max_t(s64, delay, 10000LL);
+
+	/*
+	 * If this is in the middle of schedule() only note the delay
+	 * and let hrtick_schedule_exit() deal with it.
+	 */
+	if (rq->hrtick_sched) {
+		rq->hrtick_sched |= HRTICK_SCHED_START;
+		rq->hrtick_delay = delta;
+		return;
+	}
+
 	rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
 
 	if (rq == this_rq())
@@ -940,19 +957,40 @@ void hrtick_start(struct rq *rq, u64 delay)
 		smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
 }
 
-static void hrtick_rq_init(struct rq *rq)
+static inline void hrtick_schedule_enter(struct rq *rq)
 {
-	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
-	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+	rq->hrtick_sched = HRTICK_SCHED_DEFER;
 }
-#else /* !CONFIG_SCHED_HRTICK: */
-static inline void hrtick_clear(struct rq *rq)
+
+static inline void hrtick_schedule_exit(struct rq *rq)
 {
+	if (rq->hrtick_sched & HRTICK_SCHED_START) {
+		rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
+		__hrtick_restart(rq);
+	} else if (idle_rq(rq)) {
+		/*
+		 * No need for using hrtimer_is_active(). The timer is CPU local
+		 * and interrupts are disabled, so the callback cannot be
+		 * running and the queued state is valid.
+		 */
+		if (hrtimer_is_queued(&rq->hrtick_timer))
+			hrtimer_cancel(&rq->hrtick_timer);
+	}
+
+	rq->hrtick_sched = HRTICK_SCHED_NONE;
 }
 
-static inline void hrtick_rq_init(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
 {
+	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
+	rq->hrtick_sched = HRTICK_SCHED_NONE;
+	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
 }
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline void hrtick_clear(struct rq *rq) { }
+static inline void hrtick_rq_init(struct rq *rq) { }
+static inline void hrtick_schedule_enter(struct rq *rq) { }
+static inline void hrtick_schedule_exit(struct rq *rq) { }
 #endif /* !CONFIG_SCHED_HRTICK */
 
 /*
@@ -5028,6 +5066,7 @@ static inline void finish_lock_switch(struct rq *rq)
 	 */
 	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq, NULL);
+	hrtick_schedule_exit(rq);
 	raw_spin_rq_unlock_irq(rq);
 }
 
@@ -6781,9 +6820,6 @@ static void __sched notrace __schedule(int sched_mode)
 
 	schedule_debug(prev, preempt);
 
-	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
-		hrtick_clear(rq);
-
 	klp_sched_try_switch(prev);
 
 	local_irq_disable();
@@ -6810,6 +6846,8 @@ static void __sched notrace __schedule(int sched_mode)
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
 
+	hrtick_schedule_enter(rq);
+
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
@@ -6911,6 +6949,7 @@ keep_resched:
 
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq, NULL);
+		hrtick_schedule_exit(rq);
 		raw_spin_rq_unlock_irq(rq);
 	}
 	trace_sched_exit_tp(is_switch);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0aa089d..6774fb5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1285,6 +1285,8 @@ struct rq {
 	call_single_data_t	hrtick_csd;
 	struct hrtimer		hrtick_timer;
 	ktime_t			hrtick_time;
+	ktime_t			hrtick_delay;
+	unsigned int		hrtick_sched;
 #endif
 
 #ifdef CONFIG_SCHEDSTATS


* [tip: sched/hrtick] sched: Use hrtimer_highres_enabled()
  2026-02-24 16:35 ` [patch 07/48] sched: Use hrtimer_highres_enabled() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     c3a92213eb3dd8ea6f664d16a08eda800e34eaad
Gitweb:        https://git.kernel.org/tip/c3a92213eb3dd8ea6f664d16a08eda800e34eaad
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:47 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:05 +01:00

sched: Use hrtimer_highres_enabled()

Use the static branch based variant and thereby avoid following three
pointers.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.203610956@kernel.org
---
 include/linux/hrtimer.h |  6 ------
 kernel/sched/sched.h    | 37 +++++++++----------------------------
 2 files changed, 9 insertions(+), 34 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index c9ca105..b500385 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -146,12 +146,6 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
 	return ktime_sub(timer->node.expires, hrtimer_cb_get_time(timer));
 }
 
-static inline int hrtimer_is_hres_active(struct hrtimer *timer)
-{
-	return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
-		timer->base->cpu_base->hres_active : 0;
-}
-
 #ifdef CONFIG_HIGH_RES_TIMERS
 extern unsigned int hrtimer_resolution;
 struct clock_event_device;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 73bc20c..0aa089d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3019,25 +3019,19 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
  *  - enabled by features
  *  - hrtimer is actually high res
  */
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_enabled(struct rq *rq)
 {
-	if (!cpu_active(cpu_of(rq)))
-		return 0;
-	return hrtimer_is_hres_active(&rq->hrtick_timer);
+	return cpu_active(cpu_of(rq)) && hrtimer_highres_enabled();
 }
 
-static inline int hrtick_enabled_fair(struct rq *rq)
+static inline bool hrtick_enabled_fair(struct rq *rq)
 {
-	if (!sched_feat(HRTICK))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK) && hrtick_enabled(rq);
 }
 
-static inline int hrtick_enabled_dl(struct rq *rq)
+static inline bool hrtick_enabled_dl(struct rq *rq)
 {
-	if (!sched_feat(HRTICK_DL))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK_DL) && hrtick_enabled(rq);
 }
 
 extern void hrtick_start(struct rq *rq, u64 delay);
@@ -3047,22 +3041,9 @@ static inline bool hrtick_active(struct rq *rq)
 }
 
 #else /* !CONFIG_SCHED_HRTICK: */
-
-static inline int hrtick_enabled_fair(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled_dl(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled(struct rq *rq)
-{
-	return 0;
-}
-
+static inline bool hrtick_enabled_fair(struct rq *rq) { return false; }
+static inline bool hrtick_enabled_dl(struct rq *rq) { return false; }
+static inline bool hrtick_enabled(struct rq *rq) { return false; }
 #endif /* !CONFIG_SCHED_HRTICK */
 
 #ifndef arch_scale_freq_tick


* [tip: sched/hrtick] hrtimer: Provide a static branch based hrtimer_hres_enabled()
  2026-02-24 16:35 ` [patch 06/48] hrtimer: Provide a static branch based hrtimer_hres_enabled() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     0a93d30861617ecf207dcc4c6c736435fac36dae
Gitweb:        https://git.kernel.org/tip/0a93d30861617ecf207dcc4c6c736435fac36dae
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:42 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:04 +01:00

hrtimer: Provide a static branch based hrtimer_hres_enabled()

The scheduler evaluates this via hrtimer_is_hres_active() every time it has
to update HRTICK. This needs to follow three pointers, which is expensive.

Provide a static branch based mechanism to avoid that.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.136503358@kernel.org
---
 include/linux/hrtimer.h | 13 +++++++++----
 kernel/time/hrtimer.c   | 28 +++++++++++++++++++++++++---
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 74adbd4..c9ca105 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -153,17 +153,22 @@ static inline int hrtimer_is_hres_active(struct hrtimer *timer)
 }
 
 #ifdef CONFIG_HIGH_RES_TIMERS
+extern unsigned int hrtimer_resolution;
 struct clock_event_device;
 
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
-extern unsigned int hrtimer_resolution;
+extern struct static_key_false hrtimer_highres_enabled_key;
 
-#else
+static inline bool hrtimer_highres_enabled(void)
+{
+	return static_branch_likely(&hrtimer_highres_enabled_key);
+}
 
+#else  /* CONFIG_HIGH_RES_TIMERS */
 #define hrtimer_resolution	(unsigned int)LOW_RES_NSEC
-
-#endif
+static inline bool hrtimer_highres_enabled(void) { return false; }
+#endif  /* !CONFIG_HIGH_RES_TIMERS */
 
 static inline ktime_t
 __hrtimer_expires_remaining_adjusted(const struct hrtimer *timer, ktime_t now)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3088db4..67917ce 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -126,6 +126,25 @@ static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base *base)
 		return likely(base->online);
 }
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+DEFINE_STATIC_KEY_FALSE(hrtimer_highres_enabled_key);
+
+static void hrtimer_hres_workfn(struct work_struct *work)
+{
+	static_branch_enable(&hrtimer_highres_enabled_key);
+}
+
+static DECLARE_WORK(hrtimer_hres_work, hrtimer_hres_workfn);
+
+static inline void hrtimer_schedule_hres_work(void)
+{
+	if (!hrtimer_highres_enabled())
+		schedule_work(&hrtimer_hres_work);
+}
+#else
+static inline void hrtimer_schedule_hres_work(void) { }
+#endif
+
 /*
  * Functions and macros which are different for UP/SMP systems are kept in a
  * single place
@@ -649,7 +668,9 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
 }
 
 /*
- * Is the high resolution mode active ?
+ * Is the high resolution mode active in the CPU base? This cannot use the
+ * static key as the CPUs are switched to high resolution mode
+ * asynchronously.
  */
 static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
 {
@@ -750,6 +771,7 @@ static void hrtimer_switch_to_hres(void)
 	tick_setup_sched_timer(true);
 	/* "Retrigger" the interrupt to get things going */
 	retrigger_next_event(NULL);
+	hrtimer_schedule_hres_work();
 }
 
 #else
@@ -947,11 +969,10 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
  */
 void clock_was_set(unsigned int bases)
 {
-	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
 	cpumask_var_t mask;
 	int cpu;
 
-	if (!hrtimer_hres_active(cpu_base) && !tick_nohz_is_active())
+	if (!hrtimer_highres_enabled() && !tick_nohz_is_active())
 		goto out_timerfd;
 
 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
@@ -962,6 +983,7 @@ void clock_was_set(unsigned int bases)
 	/* Avoid interrupting CPUs if possible */
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
+		struct hrtimer_cpu_base *cpu_base;
 		unsigned long flags;
 
 		cpu_base = &per_cpu(hrtimer_bases, cpu);


* [tip: sched/hrtick] hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()
  2026-02-24 16:35 ` [patch 05/48] hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
@ 2026-02-28 15:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, Juri Lelli, x86,
	linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     d19ff16c11db38f3ee179d72751fb9b340174330
Gitweb:        https://git.kernel.org/tip/d19ff16c11db38f3ee179d72751fb9b340174330
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:37 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:04 +01:00

hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()

Much like hrtimer_reprogram(), skip programming if the cpu_base is running
the hrtimer interrupt.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260224163429.069535561@kernel.org
---
 kernel/time/hrtimer.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 860af7a..3088db4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1269,6 +1269,14 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	}
 
 	first = enqueue_hrtimer(timer, new_base, mode);
+
+	/*
+	 * If the hrtimer interrupt is running, then it will reevaluate the
+	 * clock bases and reprogram the clock event device.
+	 */
+	if (new_base->cpu_base->in_hrtirq)
+		return false;
+
 	if (!force_local) {
 		/*
 		 * If the current CPU base is online, then the timer is

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] sched: Avoid ktime_get() indirection
  2026-02-24 16:35 ` [patch 04/48] sched: Avoid ktime_get() indirection Thomas Gleixner
@ 2026-02-28 15:37   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-02-28 15:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     d70c1080a957a5144e6c40e95bcbe04ab542fe05
Gitweb:        https://git.kernel.org/tip/d70c1080a957a5144e6c40e95bcbe04ab542fe05
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:32 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:04 +01:00

sched: Avoid ktime_get() indirection

The clock of the hrtick and deadline timers is known to be CLOCK_MONOTONIC.
No point in looking it up via hrtimer_cb_get_time().

Just use ktime_get() directly.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.001511662@kernel.org
---
 kernel/sched/core.c     | 3 +--
 kernel/sched/deadline.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7597776..a716cc6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -925,7 +925,6 @@ static void __hrtick_start(void *arg)
  */
 void hrtick_start(struct rq *rq, u64 delay)
 {
-	struct hrtimer *timer = &rq->hrtick_timer;
 	s64 delta;
 
 	/*
@@ -933,7 +932,7 @@ void hrtick_start(struct rq *rq, u64 delay)
 	 * doesn't make sense and can cause timer DoS.
 	 */
 	delta = max_t(s64, delay, 10000LL);
-	rq->hrtick_time = ktime_add_ns(hrtimer_cb_get_time(timer), delta);
+	rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
 
 	if (rq == this_rq())
 		__hrtick_restart(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b004..9d619a4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,7 +1097,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 		act = ns_to_ktime(dl_next_period(dl_se));
 	}
 
-	now = hrtimer_cb_get_time(timer);
+	now = ktime_get();
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] sched/fair: Make hrtick resched hard
  2026-02-24 16:35 ` [patch 03/48] sched/fair: Make hrtick resched hard Thomas Gleixner
@ 2026-02-28 15:37   ` tip-bot2 for Peter Zijlstra (Intel)
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra (Intel) @ 2026-02-28 15:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     5d88e424ec1b3ea7f552bd14d932f510146c45c7
Gitweb:        https://git.kernel.org/tip/5d88e424ec1b3ea7f552bd14d932f510146c45c7
Author:        Peter Zijlstra (Intel) <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:27 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:04 +01:00

sched/fair: Make hrtick resched hard

Since the tick causes hard preemption, the hrtick should too.

Letting the hrtick do lazy preemption completely defeats the purpose, since
it will then still be delayed until the next regular tick and remain
dependent on CONFIG_HZ.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163428.933894105@kernel.org
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b6ce88..e9e5fe4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5530,7 +5530,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr_lazy(rq_of(cfs_rq));
+		resched_curr(rq_of(cfs_rq));
 		return;
 	}
 #endif

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] sched/fair: Simplify hrtick_update()
  2026-02-24 16:35 ` [patch 02/48] sched/fair: Simplify hrtick_update() Thomas Gleixner
@ 2026-02-28 15:37   ` tip-bot2 for Peter Zijlstra (Intel)
  0 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra (Intel) @ 2026-02-28 15:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     97015376642f3cb7aa5c3cdb13bf094e94fbcd81
Gitweb:        https://git.kernel.org/tip/97015376642f3cb7aa5c3cdb13bf094e94fbcd81
Author:        Peter Zijlstra (Intel) <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:22 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:03 +01:00

sched/fair: Simplify hrtick_update()

hrtick_update() was needed when the slice depended on nr_running; all that
code is gone. All that remains is starting the hrtick when nr_running
becomes more than 1.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163428.866374835@kernel.org
---
 kernel/sched/fair.c  | 12 ++++--------
 kernel/sched/sched.h |  4 ++++
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 247fecd..0b6ce88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6769,9 +6769,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 }
 
 /*
- * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * Called on enqueue to start the hrtick when h_nr_queued becomes more than 1.
  */
 static void hrtick_update(struct rq *rq)
 {
@@ -6780,6 +6778,9 @@ static void hrtick_update(struct rq *rq)
 	if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
 		return;
 
+	if (hrtick_active(rq))
+		return;
+
 	hrtick_start_fair(rq, donor);
 }
 #else /* !CONFIG_SCHED_HRTICK: */
@@ -7102,9 +7103,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
 
-		/* Fix-up what dequeue_task_fair() skipped */
-		hrtick_update(rq);
-
 		/*
 		 * Fix-up what block_task() skipped.
 		 *
@@ -7138,8 +7136,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	/*
 	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
 	 */
-
-	hrtick_update(rq);
 	return true;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b82fb70..73bc20c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3041,6 +3041,10 @@ static inline int hrtick_enabled_dl(struct rq *rq)
 }
 
 extern void hrtick_start(struct rq *rq, u64 delay);
+static inline bool hrtick_active(struct rq *rq)
+{
+	return hrtimer_active(&rq->hrtick_timer);
+}
 
 #else /* !CONFIG_SCHED_HRTICK: */
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] sched/eevdf: Fix HRTICK duration
  2026-02-24 16:35 ` [patch 01/48] sched/eevdf: Fix HRTICK duration Thomas Gleixner
@ 2026-02-28 15:37   ` tip-bot2 for Peter Zijlstra
  2026-03-20 14:59     ` Shrikanth Hegde
  0 siblings, 1 reply; 128+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-28 15:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, Juri Lelli, x86,
	linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     558c18d3fbb6c5b9c0b42629d7fe34476363ac00
Gitweb:        https://git.kernel.org/tip/558c18d3fbb6c5b9c0b42629d7fe34476363ac00
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Feb 2026 17:35:17 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 27 Feb 2026 16:40:03 +01:00

sched/eevdf: Fix HRTICK duration

The nominal duration for an EEVDF task to run is until its deadline, at
which point the deadline is moved ahead and a new task selection is done.

Try and predict the time 'lost' to higher scheduling classes. Since this is
an estimate, the timer can fire either early or late. In case it fires
early, task_tick_fair() takes the !need_resched() path and restarts the timer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260224163428.798198874@kernel.org
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eea99ec..247fecd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6735,21 +6735,37 @@ static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	unsigned long scale = 1024;
+	unsigned long util = 0;
+	u64 vdelta;
+	u64 delta;
 
 	WARN_ON_ONCE(task_rq(p) != rq);
 
-	if (rq->cfs.h_nr_queued > 1) {
-		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-		u64 slice = se->slice;
-		s64 delta = slice - ran;
+	if (rq->cfs.h_nr_queued <= 1)
+		return;
 
-		if (delta < 0) {
-			if (task_current_donor(rq, p))
-				resched_curr(rq);
-			return;
-		}
-		hrtick_start(rq, delta);
+	/*
+	 * Compute time until virtual deadline
+	 */
+	vdelta = se->deadline - se->vruntime;
+	if ((s64)vdelta < 0) {
+		if (task_current_donor(rq, p))
+			resched_curr(rq);
+		return;
 	}
+	delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+	/*
+	 * Correct for instantaneous load of other classes.
+	 */
+	util += cpu_util_irq(rq);
+	if (util && util < 1024) {
+		scale *= 1024;
+		scale /= (1024 - util);
+	}
+
+	hrtick_start(rq, (scale * delta) / 1024);
 }
 
 /*
@@ -13365,11 +13381,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
-	if (queued) {
-		if (!need_resched())
-			hrtick_start_fair(rq, curr);
+	if (queued)
 		return;
-	}
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-02-24 16:36 ` [patch 20/48] x86/apic: Enable TSC coupled programming mode Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
@ 2026-03-03  1:29   ` Nathan Chancellor
  2026-03-03 14:37     ` Thomas Gleixner
  1 sibling, 1 reply; 128+ messages in thread
From: Nathan Chancellor @ 2026-03-03  1:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

[-- Attachment #1: Type: text/plain, Size: 3765 bytes --]

Hi Thomas,

On Tue, Feb 24, 2026 at 05:36:49PM +0100, Thomas Gleixner wrote:
> The TSC deadline timer is directly coupled to the TSC and setting the next
> deadline is tedious as the clockevents core code converts the
> CLOCK_MONOTONIC based absolute expiry time to a relative expiry by reading
> the current time from the TSC. It converts that delta to cycles and hands
> the result to lapic_next_deadline(), which then has read to the TSC and add
> the delta to program the timer.
> 
> The core code now supports coupled clock event devices and can provide the
> expiry time in TSC cycles directly without reading the TSC at all.
> 
> This obviously works only when the TSC is the current clocksource, but
> that's the default for all modern CPUs which implement the TSC deadline
> timer. If the TSC is not the current clocksource (e.g. early boot) then the
> core code falls back to the relative set_next_event() callback as before.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: x86@kernel.org

After this change landed in -next as commit f246ec3478cf ("x86/apic:
Enable TSC coupled programming mode"), two of my Intel-based test
machines fail to boot. Unfortunately, I do not think I have any serial
access on these, so I have little introspective ability. Is there any
information I can provide or patches I can test to try and help figure
out what is going on here? I have attached the output of lscpu of both
machines, in case there is some common thread there.

Cheers,
Nathan

# bad: [d517cb8cea012f43b069617fc8179b45404f8018] Add linux-next specific files for 20260302
# good: [11439c4635edd669ae435eec308f4ab8a0804808] Linux 7.0-rc2
git bisect start 'd517cb8cea012f43b069617fc8179b45404f8018' '11439c4635edd669ae435eec308f4ab8a0804808'
# good: [30cad5d4db9212a3e9bb99be1d99c4fbc17966c7] Merge branch 'master' of https://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan-next.git
git bisect good 30cad5d4db9212a3e9bb99be1d99c4fbc17966c7
# good: [5add127981db7fda704fb251de1a3a77e3282e37] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
git bisect good 5add127981db7fda704fb251de1a3a77e3282e37
# bad: [49b56e8a6ca8c20f7d9bb8904d4b2a5bf032554a] Merge branch 'char-misc-next' of https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
git bisect bad 49b56e8a6ca8c20f7d9bb8904d4b2a5bf032554a
# good: [1695e1c2db4247e8428badf667900beec51c5174] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git
git bisect good 1695e1c2db4247e8428badf667900beec51c5174
# bad: [c50f05bd3c4e992c1dfb61b14d6f7d999f1381f9] Merge branch into tip/master: 'sched/hrtick'
git bisect bad c50f05bd3c4e992c1dfb61b14d6f7d999f1381f9
# bad: [343f2f4dc5425107d509d29e26ef59c2053aeaa4] hrtimer: Try to modify timers in place
git bisect bad 343f2f4dc5425107d509d29e26ef59c2053aeaa4
# bad: [6abfc2bd5b0cff70db99a273f2a161e2273eae6d] hrtimer: Use guards where appropriate
git bisect bad 6abfc2bd5b0cff70db99a273f2a161e2273eae6d
# good: [0abec32a6836eca6b61ae81e4829f94abd4647c7] sched/hrtick: Mark hrtick timer LAZY_REARM
git bisect good 0abec32a6836eca6b61ae81e4829f94abd4647c7
# good: [23028286128d817a414eee0c0a2c6cdc57a83e6f] x86/apic: Avoid the PVOPS indirection for the TSC deadline timer
git bisect good 23028286128d817a414eee0c0a2c6cdc57a83e6f
# bad: [f246ec3478cfdab830ee0815209f48923e7ee5e2] x86/apic: Enable TSC coupled programming mode
git bisect bad f246ec3478cfdab830ee0815209f48923e7ee5e2
# good: [89f951a1e8ad781e7ac70eccddab0e0c270485f9] clockevents: Provide support for clocksource coupled comparators
git bisect good 89f951a1e8ad781e7ac70eccddab0e0c270485f9
# first bad commit: [f246ec3478cfdab830ee0815209f48923e7ee5e2] x86/apic: Enable TSC coupled programming mode

[-- Attachment #2: lscpu-i7-11700 --]
[-- Type: text/plain, Size: 3646 bytes --]

Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           39 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               GenuineIntel
Model name:                              11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
CPU family:                              6
Model:                                   167
Thread(s) per core:                      2
Core(s) per socket:                      8
Socket(s):                               1
Stepping:                                1
CPU(s) scaling MHz:                      30%
CPU max MHz:                             4900.0000
CPU min MHz:                             800.0000
BogoMIPS:                                4992.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap avx512ifma clflushopt intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               384 KiB (8 instances)
L1i cache:                               256 KiB (8 instances)
L2 cache:                                4 MiB (8 instances)
L3 cache:                                16 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Mitigation; Microcode
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

[-- Attachment #3: lscpu-n100 --]
[-- Type: text/plain, Size: 3501 bytes --]

Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           39 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  4
On-line CPU(s) list:                     0-3
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) N100
CPU family:                              6
Model:                                   190
Thread(s) per core:                      1
Core(s) per socket:                      4
Socket(s):                               1
Stepping:                                0
CPU(s) scaling MHz:                      41%
CPU max MHz:                             3400.0000
CPU min MHz:                             700.0000
BogoMIPS:                                1612.80
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               128 KiB (4 instances)
L1i cache:                               256 KiB (4 instances)
L2 cache:                                2 MiB (1 instance)
L3 cache:                                6 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-3
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Mitigation; Clear Register File
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS Not affected; BHI BHI_DIS_S
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-03-03  1:29   ` [patch 20/48] " Nathan Chancellor
@ 2026-03-03 14:37     ` Thomas Gleixner
  2026-03-03 14:45       ` Thomas Gleixner
  2026-03-03 17:38       ` Nathan Chancellor
  0 siblings, 2 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-03 14:37 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Mon, Mar 02 2026 at 18:29, Nathan Chancellor wrote:
>
> After this change landed in -next as commit f246ec3478cf ("x86/apic:
> Enable TSC coupled programming mode"), two of my Intel-based test
> machines fail to boot. Unfortunately, I do not think I have any serial
> access on these, so I have little introspective ability. Is there any
> information I can provide or patches I can test to try and help figure
> out what is going on here? I have attached the output of lscpu of both
> machines, in case there is some common thread there.

Grmbl. I stared at it for a while and I have a suspicion. Can you try
the patch below and also provide from one of the machines the output of

  dmesg | grep -i tsc

In case that does not work, I'll send a debug patch in reply to this
mail.

Thanks,

        tglx
---
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -404,6 +404,7 @@ static void tk_setup_internals(struct ti
 		 */
 		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
 				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+		tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
 	}
 }
 

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-03-03 14:37     ` Thomas Gleixner
@ 2026-03-03 14:45       ` Thomas Gleixner
  2026-03-03 17:38       ` Nathan Chancellor
  1 sibling, 0 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-03 14:45 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Tue, Mar 03 2026 at 15:37, Thomas Gleixner wrote:
> On Mon, Mar 02 2026 at 18:29, Nathan Chancellor wrote:
>>
>> After this change landed in -next as commit f246ec3478cf ("x86/apic:
>> Enable TSC coupled programming mode"), two of my Intel-based test
>> machines fail to boot. Unfortunately, I do not think I have any serial
>> access on these, so I have little introspective ability. Is there any
>> information I can provide or patches I can test to try and help figure
>> out what is going on here? I have attached the output of lscpu of both
>> machines, in case there is some common thread there.
>
> Grmbl. I stared at it for a while and I have a suspicion. Can you try
> the patch below and also provide from one of the machines the output of
>
>   dmesg | grep -i tsc
>
> In case that does not work, I'll send a debug patch in reply to this
> mail.

Here you go. It fails the coupled mode, but emits all relevant data via
trace_printk. Just collect it right after booting, and please tell me where
in the boot process the boot stops so I can narrow down the search.

Thanks,

        tglx
---
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -424,9 +424,10 @@ static int lapic_next_deadline(unsigned
 	 * There is no weak_wrmsr_fence() required here as all of this is purely
 	 * CPU local. Avoid the [ml]fence overhead.
 	 */
-	u64 tsc = rdtsc();
+	u64 dl = rdtsc() + (((u64) delta) * TSC_DIVISOR);
 
-	native_wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+	native_wrmsrq(MSR_IA32_TSC_DEADLINE, dl);
+	trace_printk("APIC    deadline: %16llu\n", dl);
 	return 0;
 }
 
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -310,11 +310,9 @@ static inline bool clockevent_set_next_c
 	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
 		return false;
 
-	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
-		arch_inlined_clockevent_set_next_coupled(cycles, dev);
-	else
-		dev->set_next_coupled(cycles, dev);
-	return true;
+	trace_printk("Coupled deadline: %16llu Exp: %16lld\n", cycles, expires);
+
+	return false;
 }
 
 #else
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -404,6 +404,7 @@ static void tk_setup_internals(struct ti
 		 */
 		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
 				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+		tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
 	}
 }
 
@@ -762,6 +763,11 @@ static inline void tk_update_ns_to_cyc(s
 	shift = tkrs->shift + tks->cs_ns_to_cyc_shift;
 	tks->cs_ns_to_cyc_mult = (u32)div_u64(1ULL << shift, tkrs->mult);
 	tks->cs_ns_to_cyc_maxns = div_u64(tkrs->clock->mask, tks->cs_ns_to_cyc_mult);
+	trace_printk("CSM: %8u CSS: %8u CEM: %8u CES: %8u MNS: %16llu BNS: %16lld BCY: %16llu\n",
+		     tkrs->shift, tkrs->mult, tks->cs_ns_to_cyc_mult,
+		     tks->cs_ns_to_cyc_shift, tks->cs_ns_to_cyc_maxns,
+		     tkrs->base + (tkrs->xtime_nsec >> tkrs->shift),
+		     tkrs->cycle_last);
 }
 
 /*

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-03-03 14:37     ` Thomas Gleixner
  2026-03-03 14:45       ` Thomas Gleixner
@ 2026-03-03 17:38       ` Nathan Chancellor
  2026-03-03 20:21         ` Thomas Gleixner
  2026-03-03 21:56         ` [PATCH] Subject: timekeeping: Initialize the coupled clocksource conversion completely Thomas Gleixner
  1 sibling, 2 replies; 128+ messages in thread
From: Nathan Chancellor @ 2026-03-03 17:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Tue, Mar 03, 2026 at 03:37:03PM +0100, Thomas Gleixner wrote:
> On Mon, Mar 02 2026 at 18:29, Nathan Chancellor wrote:
> >
> > After this change landed in -next as commit f246ec3478cf ("x86/apic:
> > Enable TSC coupled programming mode"), two of my Intel-based test
> > machines fail to boot. Unfortunately, I do not think I have any serial
> > access on these, so I have little introspective ability. Is there any
> > information I can provide or patches I can test to try and help figure
> > out what is going on here? I have attached the output of lscpu of both
> > machines, in case there is some common thread there.
> 
> Grmbl. I stared at it for a while and I have a suspicion. Can you try
> the patch below and also provide from one of the machines the output of
> 
>   dmesg | grep -i tsc

This patch works on both machines, so your suspicion seemed spot on.

Output of that dmesg command appears to be the same between
89f951a1e8ad and f246ec3478cf with that diff applied:

  [    0.000000] tsc: Detected 2500.000 MHz processor
  [    0.000000] tsc: Detected 2496.000 MHz TSC
  [    0.008989] TSC deadline timer available
  [    0.119139] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
  [    0.312141] clocksource: Switched to clocksource tsc-early
  [    0.322686] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
  [    0.322951] clocksource: Switched to clocksource tsc

If there is anything else I can provide, let me know. If it becomes a
formal patch:

Tested-by: Nathan Chancellor <nathan@kernel.org>

> ---
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -404,6 +404,7 @@ static void tk_setup_internals(struct ti
>  		 */
>  		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
>  				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
> +		tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
>  	}
>  }
>  

^ permalink raw reply	[flat|nested] 128+ messages in thread

* RE: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-02-24 16:36 ` [patch 19/48] clockevents: Provide support for clocksource coupled comparators Thomas Gleixner
  2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
@ 2026-03-03 18:44   ` Michael Kelley
  2026-03-03 19:14     ` Peter Zijlstra
  2026-03-23  4:24     ` Michael Kelley
  1 sibling, 2 replies; 128+ messages in thread
From: Michael Kelley @ 2026-03-03 18:44 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86@kernel.org,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

From: Thomas Gleixner <tglx@kernel.org> Sent: Tuesday, February 24, 2026 8:37 AM
> 
> Some clockevent devices are coupled to the system clocksource by
> implementing a less than or equal comparator which compares the programmed
> absolute expiry time against the underlying time counter.

I've been playing with this in linux-next, and particularly to set up the Hyper-V
TSC page clocksource and Hyper-V timer as coupled. Most Hyper-V guests these days
are running on hardware that allows using the TSC directly as the clocksource. But
even if the Hyper-V TSC page clocksource isn't used, the timer is still the Hyper-V
timer, so the coupling isn't active. However, SEV-SNP CoCo VMs on Hyper-V must
use both the Hyper-V TSC page clocksource and the Hyper-V timer, so they would
benefit from coupling. It's a nice idea!

In doing the Hyper-V clocksource and timer coupling, I encountered two issues as
noted below.

> 
> The timekeeping core provides a function to convert an absolute
> CLOCK_MONOTONIC based expiry time to an absolute clock cycles time which can
> be directly fed into the comparator. That spares two time reads in the next
> event programming path, one to convert the absolute nanoseconds time to a
> delta value and the other to convert the delta value back to an absolute
> time value suitable for the comparator.
> 
> Provide a new clocksource callback which takes the absolute cycle value and
> wire it up in clockevents_program_event(). Similar to clocksources, allow
> architectures to inline the rearm operation.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> ---
>  include/linux/clockchips.h |    7 +++++--
>  kernel/time/Kconfig        |    4 ++++
>  kernel/time/clockevents.c  |   44 +++++++++++++++++++++++++++++++++++++++---
> --
>  3 files changed, 48 insertions(+), 7 deletions(-)
> 
> --- a/include/linux/clockchips.h
> +++ b/include/linux/clockchips.h
> @@ -43,8 +43,9 @@ enum clock_event_state {
>  /*
>   * Clock event features
>   */
> -# define CLOCK_EVT_FEAT_PERIODIC	0x000001
> -# define CLOCK_EVT_FEAT_ONESHOT		0x000002
> +# define CLOCK_EVT_FEAT_PERIODIC		0x000001
> +# define CLOCK_EVT_FEAT_ONESHOT			0x000002
> +# define CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED	0x000004
> 
>  /*
>   * x86(64) specific (mis)features:
> @@ -100,6 +101,7 @@ struct clock_event_device {
>  	void			(*event_handler)(struct clock_event_device *);
>  	int			(*set_next_event)(unsigned long evt, struct clock_event_device *);
>  	int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
> +	void			(*set_next_coupled)(u64 cycles, struct clock_event_device *);
>  	ktime_t			next_event;
>  	u64			max_delta_ns;
>  	u64			min_delta_ns;
> @@ -107,6 +109,7 @@ struct clock_event_device {
>  	u32			shift;
>  	enum clock_event_state	state_use_accessors;
>  	unsigned int		features;
> +	enum clocksource_ids	cs_id;
>  	unsigned long		retries;
> 
>  	int			(*set_state_periodic)(struct clock_event_device *);
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -50,6 +50,10 @@ config GENERIC_CLOCKEVENTS_MIN_ADJUST
>  config GENERIC_CLOCKEVENTS_COUPLED
>  	bool
> 
> +config GENERIC_CLOCKEVENTS_COUPLED_INLINE
> +	select GENERIC_CLOCKEVENTS_COUPLED
> +	bool
> +
>  # Generic update of CMOS clock
>  config GENERIC_CMOS_UPDATE
>  	bool
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -292,6 +292,38 @@ static int clockevents_program_min_delta
> 
>  #endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */
> 
> +#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED
> +#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE
> +#include <asm/clock_inlined.h>
> +#else
> +static __always_inline void
> +arch_inlined_clockevent_set_next_coupled(u64 u64 cycles, struct clock_event_device *dev) { }

Typo -- there are two "u64" in a row, so it doesn't compile if COUPLED is selected
but COUPLED_INLINE is not.

> +#endif
> +
> +static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
> +{
> +	u64 cycles;
> +
> +	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
> +		return false;
> +
> +	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
> +		return false;
> +
> +	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))

Since COUPLED_INLINE is always selected for x64, there's no way to add the Hyper-V
clockevent that is coupled but not inline. Adding the machinery to allow a second
inline clockevent type may not be worth it, but adding a second coupled but not
inline clockevent type on x64 should be supported. Thoughts?

After fixing the u64 typo, and temporarily not always selecting COUPLED_INLINE in
arch/x86/Kconfig, the coupled Hyper-V TSC page clocksource and timer seem to work
correctly, though I'm still doing some testing. I'm also working on counting the number
of time reads to confirm the expected benefit.

Michael

> +		arch_inlined_clockevent_set_next_coupled(cycles, dev);
> +	else
> +		dev->set_next_coupled(cycles, dev);
> +	return true;
> +}
> +
> +#else
> +static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
> +{
> +	return false;
> +}
> +#endif
> +
>  /**
>   * clockevents_program_event - Reprogram the clock event device.
>   * @dev:	device to program
> @@ -300,11 +332,10 @@ static int clockevents_program_min_delta
>   *
>   * Returns 0 on success, -ETIME when the event is in the past.
>   */
> -int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
> -			      bool force)
> +int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, bool force)
>  {
> -	unsigned long long clc;
>  	int64_t delta;
> +	u64 cycles;
>  	int rc;
> 
>  	if (WARN_ON_ONCE(expires < 0))
> @@ -323,6 +354,9 @@ int clockevents_program_event(struct clo
>  	if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
>  		return dev->set_next_ktime(expires, dev);
> 
> +	if (likely(clockevent_set_next_coupled(dev, expires)))
> +		return 0;
> +
>  	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
>  	if (delta <= 0)
>  		return force ? clockevents_program_min_delta(dev) : -ETIME;
> @@ -330,8 +364,8 @@ int clockevents_program_event(struct clo
>  	delta = min(delta, (int64_t) dev->max_delta_ns);
>  	delta = max(delta, (int64_t) dev->min_delta_ns);
> 
> -	clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
> -	rc = dev->set_next_event((unsigned long) clc, dev);
> +	cycles = ((u64)delta * dev->mult) >> dev->shift;
> +	rc = dev->set_next_event((unsigned long) cycles, dev);
> 
>  	return (rc && force) ? clockevents_program_min_delta(dev) : rc;
>  }
> 


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-03-03 18:44   ` [patch 19/48] " Michael Kelley
@ 2026-03-03 19:14     ` Peter Zijlstra
  2026-03-23  4:24     ` Michael Kelley
  1 sibling, 0 replies; 128+ messages in thread
From: Peter Zijlstra @ 2026-03-03 19:14 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Thomas Gleixner, LKML, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Daniel Lezcano, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, x86@kernel.org, Frederic Weisbecker,
	Eric Dumazet

On Tue, Mar 03, 2026 at 06:44:59PM +0000, Michael Kelley wrote:

> > +static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
> > +{
> > +	u64 cycles;
> > +
> > +	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
> > +		return false;
> > +
> > +	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
> > +		return false;
> > +
> > +	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
> 
> Since COUPLED_INLINE is always selected for x64, there's no way to add the Hyper-V
> clockevent that is coupled but not inline. Adding the machinery to allow a second
> inline clockevent type may not be worth it, but adding a second coupled but not
> inline clockevent type on x64 should be supported. Thoughts?
> 
> After fixing the u64 typo, and temporarily not always selecting COUPLED_INLINE in
> arch/x86/Kconfig, the coupled Hyper-V TSC page clocksource and timer seem to work
> correctly, though I'm still doing some testing. I'm also working on counting the number
> of time reads to confirm the expected benefit.
> 
> Michael
> 
> > +		arch_inlined_clockevent_set_next_coupled(cycles, dev);

How about something deliciously insane like this? :-)

Then you can update the static_call to point to an asm function of your
choice that pretends to be WRMSR, while the 'native' case replaces the
CALL with CS CS CS WRMSR.

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42447b1e1dff..5426c6fd8ec8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1568,3 +1568,12 @@ SYM_FUNC_START(clear_bhb_loop)
 SYM_FUNC_END(clear_bhb_loop)
 EXPORT_SYMBOL_FOR_KVM(clear_bhb_loop)
 STACK_FRAME_NON_STANDARD(clear_bhb_loop)
+
+.pushsection .text, "ax"
+SYM_CODE_START(x86_clockevent_set_next_coupled_thunk)
+	ANNOTATE_NOENDBR
+	UNWIND_HINT_FUNC
+	wrmsr
+	RET
+SYM_CODE_END(x86_clockevent_set_next_coupled_thunk)
+.popsection
diff --git a/arch/x86/include/asm/clock_inlined.h b/arch/x86/include/asm/clock_inlined.h
index b2dee8db2fb9..587f2005ef60 100644
--- a/arch/x86/include/asm/clock_inlined.h
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -2,6 +2,9 @@
 #ifndef _ASM_X86_CLOCK_INLINED_H
 #define _ASM_X86_CLOCK_INLINED_H
 
+#include <linux/static_call_types.h>
+#include <asm/msr-index.h>
+#include <asm/asm.h>
 #include <asm/tsc.h>
 
 struct clocksource;
@@ -13,10 +16,18 @@ static __always_inline u64 arch_inlined_clocksource_read(struct clocksource *cs)
 
 struct clock_event_device;
 
+extern void x86_clockevent_set_next_coupled_thunk(void);
+
+DECLARE_STATIC_CALL(x86_clockevent_set_next_coupled, x86_clockevent_set_next_coupled_thunk);
+
 static __always_inline void
 arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *evt)
 {
-	native_wrmsrq(MSR_IA32_TSC_DEADLINE, cycles);
+	asm volatile("1: call " STATIC_CALL_TRAMP_STR(x86_clockevent_set_next_coupled) " \n"
+		     "2:\n"
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
+		     : ASM_CALL_CONSTRAINT
+		     : "c" (MSR_IA32_TSC_DEADLINE), "a" ((u32)cycles), "d" ((u32)(cycles >> 32)) : "memory");
 }
 
 #endif
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 60cab20b7901..194209f857b0 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -67,6 +67,7 @@
 #include <asm/intel-family.h>
 #include <asm/irq_regs.h>
 #include <asm/cpu.h>
+#include <asm/clock_inlined.h>
 
 #include "local.h"
 
@@ -430,6 +431,8 @@ static int lapic_next_deadline(unsigned long delta, struct clock_event_device *e
 	return 0;
 }
 
+DEFINE_STATIC_CALL(x86_clockevent_set_next_coupled, x86_clockevent_set_next_coupled_thunk);
+
 static int lapic_timer_shutdown(struct clock_event_device *evt)
 {
 	unsigned int v;
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index 61592e41a6b1..4821d155102f 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -3,6 +3,7 @@
 #include <linux/memory.h>
 #include <linux/bug.h>
 #include <asm/text-patching.h>
+#include <asm/clock_inlined.h>
 
 enum insn_type {
 	CALL = 0, /* site call */
@@ -31,6 +32,11 @@ static const u8 retinsn[] = { RET_INSN_OPCODE, 0xcc, 0xcc, 0xcc, 0xcc };
  */
 static const u8 warninsn[] = { 0x67, 0x48, 0x0f, 0xb9, 0x3a };
 
+/*
+ * cs cs cs wrmsr
+ */
+static const u8 wrmsrinsn[] = { 0x2e, 0x2e, 0x2e, 0x0f, 0x30 };
+
 static u8 __is_Jcc(u8 *insn) /* Jcc.d32 */
 {
 	u8 ret = 0;
@@ -78,6 +84,10 @@ static void __ref __static_call_transform(void *insn, enum insn_type type,
 			emulate = code;
 			code = &warninsn;
 		}
+		if (func == x86_clockevent_set_next_coupled_thunk) {
+			emulate = code;
+			code = &wrmsrinsn;
+		}
 		break;
 
 	case NOP:

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-03-03 17:38       ` Nathan Chancellor
@ 2026-03-03 20:21         ` Thomas Gleixner
  2026-03-03 21:30           ` Nathan Chancellor
  2026-03-03 21:56         ` [PATCH] Subject: timekeeping: Initialize the coupled clocksource conversion completely Thomas Gleixner
  1 sibling, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-03 20:21 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Tue, Mar 03 2026 at 10:38, Nathan Chancellor wrote:
> On Tue, Mar 03, 2026 at 03:37:03PM +0100, Thomas Gleixner wrote:
>> On Mon, Mar 02 2026 at 18:29, Nathan Chancellor wrote:
>> >
>> > After this change landed in -next as commit f246ec3478cf ("x86/apic:
>> > Enable TSC coupled programming mode"), two of my Intel-based test
>> > machines fail to boot. Unfortunately, I do not think I have any serial
>> > access on these, so I have little introspective ability. Is there any
>> > information I can provide or patches I can test to try and help figure
>> > out what is going on here? I have attached the output of lscpu of both
>> > machines, in case there is some common thread there.
>> 
>> Grmbl. I stared at it for a while and I have a suspicion. Can you try
>> the patch below and also provide from one of the machines the output of
>> 
>>   dmesg | grep -i tsc
>
> This patch works on both machines, so your suspicion seemed spot on.
>
> Output of that dmesg command appears to be the same between
> 89f951a1e8ad and f246ec3478cf with that diff applied:
>
>   [    0.000000] tsc: Detected 2500.000 MHz processor
>   [    0.000000] tsc: Detected 2496.000 MHz TSC
>   [    0.008989] TSC deadline timer available
>   [    0.119139] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
>   [    0.312141] clocksource: Switched to clocksource tsc-early
>   [    0.322686] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
>   [    0.322951] clocksource: Switched to clocksource tsc

Ha! That's exactly what I suspected. What happens is:

TSC-early is installed, which is neither valid for high resolution
timers nor for coupled mode. A bit later TSC is installed with the same
frequency as TSC-early, which means the shift/mult pair does not change,
so the update of maxns is never invoked. It simply stays 0, the timer is
therefore always armed for an event in the past, and the machine dies
from a TSC deadline timer interrupt storm.

On all my test machines TSC frequency is refined against HPET and
installed late and that refinement always changes the shift/mult pair so
I never ran into this situation and obviously did not think about it
either.

Let me write a proper change log and get this into the tip tree.

Thanks for testing!

       tglx

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-03-03 20:21         ` Thomas Gleixner
@ 2026-03-03 21:30           ` Nathan Chancellor
  2026-03-04 18:40             ` Thomas Gleixner
  0 siblings, 1 reply; 128+ messages in thread
From: Nathan Chancellor @ 2026-03-03 21:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Tue, Mar 03, 2026 at 09:21:52PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 03 2026 at 10:38, Nathan Chancellor wrote:
> > On Tue, Mar 03, 2026 at 03:37:03PM +0100, Thomas Gleixner wrote:
> >> On Mon, Mar 02 2026 at 18:29, Nathan Chancellor wrote:
> >> >
> >> > After this change landed in -next as commit f246ec3478cf ("x86/apic:
> >> > Enable TSC coupled programming mode"), two of my Intel-based test
> >> > machines fail to boot. Unfortunately, I do not think I have any serial
> >> > access on these, so I have little introspective ability. Is there any
> >> > information I can provide or patches I can test to try and help figure
> >> > out what is going on here? I have attached the output of lscpu of both
> >> > machines, in case there is some common thread there.
> >> 
> >> Grmbl. I stared at it for a while and I have a suspicion. Can you try
> >> the patch below and also provide from one of the machines the output of
> >> 
> >>   dmesg | grep -i tsc
> >
> > This patch works on both machines, so your suspicion seemed spot on.
> >
> > Output of that dmesg command appears to be the same between
> > 89f951a1e8ad and f246ec3478cf with that diff applied:
> >
> >   [    0.000000] tsc: Detected 2500.000 MHz processor
> >   [    0.000000] tsc: Detected 2496.000 MHz TSC
> >   [    0.008989] TSC deadline timer available
> >   [    0.119139] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
> >   [    0.312141] clocksource: Switched to clocksource tsc-early
> >   [    0.322686] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
> >   [    0.322951] clocksource: Switched to clocksource tsc
> 
> Ha! That's exactly what I suspected. What happens is:
> 
> TSC-early is installed, which is neither valid for high resolution
> timers nor for coupled mode. A bit later TSC is installed with the same
> frequency as TSC early. Which means the shift mult pair is not changing,
> which then fails to invoke the update of maxns. That stays simply 0, so
> the time is always armed for an event in the past and the machine dies
> from TSC deadline timer interrupt storm.
> 
> On all my test machines TSC frequency is refined against HPET and
> installed late and that refinement always changes the shift/mult pair so
> I never ran into this situation and obviously did not think about it
> either.
> 
> Let me write a proper change log and get this into the tip tree.
> 
> Thanks for testing!

No problem and thanks for the explanation! Unfortunately, in further
testing, that diff appears to break booting my two AMD test systems,
which had no problems with the current series. The output from that
previous dmesg command from both systems on a vanilla next-20260303:

  [    0.000000] tsc: Fast TSC calibration using PIT
  [    0.000000] tsc: Detected 3792.761 MHz processor
  [    0.061853] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6d574392f2f, max_idle_ns: 881590904565 ns
  [    0.332910] clocksource: Switched to clocksource tsc-early
  [    1.368506] tsc: Refined TSC clocksource calibration: 3792.899 MHz
  [    1.368521] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6d584a46b0d, max_idle_ns: 881590977212 ns
  [    1.368849] clocksource: Switched to clocksource tsc
  [    4.497901] kvm_amd: TSC scaling supported

  [    0.000000] tsc: Fast TSC calibration using PIT
  [    0.000000] tsc: Detected 2994.309 MHz processor
  [    0.179828] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2b29459826f, max_idle_ns: 440795319985 ns
  [    0.452947] clocksource: Switched to clocksource tsc-early
  [    1.485796] tsc: Refined TSC clocksource calibration: 2994.372 MHz
  [    1.485810] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2b29812ce43, max_idle_ns: 440795323173 ns
  [    1.486231] clocksource: Switched to clocksource tsc
  [    7.870821] kvm_amd: TSC scaling supported

Does it need to be conditionalized somehow? If there is any other
information I can provide about these systems, please let me know.

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [PATCH] Subject: timekeeping: Initialize the coupled clocksource conversion completely
  2026-03-03 17:38       ` Nathan Chancellor
  2026-03-03 20:21         ` Thomas Gleixner
@ 2026-03-03 21:56         ` Thomas Gleixner
  2026-03-03 23:16           ` John Stultz
  2026-03-05 16:47           ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  1 sibling, 2 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-03 21:56 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

Nathan reported a boot failure after the coupled clocksource/event support
was enabled for the TSC deadline timer. It turns out that on the affected
test systems the TSC frequency is not refined against HPET, so it is
registered with the same frequency as the TSC-early clocksource.

As a consequence the update function which checks for a change of the
shift/mult pair of the clocksource fails to compute the conversion
limit, which is zero initialized. This check is there to avoid pointless
computations on every timekeeping update cycle (tick).

So the actual clockevent conversion function limits the delta expiry to
zero, which means the timer is always programmed to expire in the
past. This obviously results in a spectacular timer interrupt storm,
which goes unnoticed because the per CPU interrupts on x86 are not
exposed to the runaway detection mechanism and the NMI watchdog is not
yet functional. So the machine simply stops booting.

That did not show up in testing. All test machines refine the TSC frequency
so TSC has a different shift/mult pair than TSC-early and the conversion
limit is properly initialized.

Cure that by setting the conversion limit right at the point where the new
clocksource is installed.

Fixes: cd38bdb8e696 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/20260303012905.GA978396@ax162
---
 kernel/time/timekeeping.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -404,6 +404,13 @@ static void tk_setup_internals(struct ti
 		 */
 		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
 				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+		/*
+		 * Initialize the conversion limit as the previous clocksource
+		 * might have the same shift/mult pair so the quick check in
+		 * tk_update_ns_to_cyc() fails to update it after a clocksource
+		 * change, leaving it effectively zero.
+		 */
+		tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
 	}
 }
 

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH] Subject: timekeeping: Initialize the coupled clocksource conversion completely
  2026-03-03 21:56         ` [PATCH] Subject: timekeeping: Initialize the coupled clocksource conversion completely Thomas Gleixner
@ 2026-03-03 23:16           ` John Stultz
  2026-03-05 16:47           ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 128+ messages in thread
From: John Stultz @ 2026-03-03 23:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nathan Chancellor, LKML, Anna-Maria Behnsen, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Tue, Mar 3, 2026 at 1:56 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
> Nathan reported a boot failure after the coupled clocksource/event support
> was enabled for the TSC deadline timer. It turns out that on the affected
> test systems the TSC frequency is not refined against HPET, so it is
> registered with the same frequency as the TSC-early clocksource.
>
> As a consequence the update function which checks for a change of the
> shift/mult pair of the clocksource fails to compute the conversion
> limit, which is zero initialized. This check is there to avoid pointless
> computations on every timekeeping update cycle (tick).
>
> So the actual clockevent conversion function limits the delta expiry to
> zero, which means the timer is always programmed to expire in the
> past. This obviously results in a spectacular timer interrupt storm,
> which goes unnoticed because the per CPU interrupts on x86 are not
> exposed to the runaway detection mechanism and the NMI watchdog is not
> yet functional. So the machine simply stops booting.
>
> That did not show up in testing. All test machines refine the TSC frequency
> so TSC has a different shift/mult pair than TSC-early and the conversion
> limit is properly initialized.
>
> Cure that by setting the conversion limit right at the point where the new
> clocksource is installed.
>
> Fixes: cd38bdb8e696 ("timekeeping: Provide infrastructure for coupled clockevents")
> Reported-by: Nathan Chancellor <nathan@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Tested-by: Nathan Chancellor <nathan@kernel.org>
> Closes: https://lore.kernel.org/20260303012905.GA978396@ax162

Acked-by: John Stultz <jstultz@google.com>

thanks
-john

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement
  2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
                   ` (48 preceding siblings ...)
  2026-02-25 15:25 ` [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Peter Zijlstra
@ 2026-03-04 15:59 ` Christian Loehle
  49 siblings, 0 replies; 128+ messages in thread
From: Christian Loehle @ 2026-03-04 15:59 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86, Peter Zijlstra,
	Frederic Weisbecker, Eric Dumazet

On 2/24/26 16:35, Thomas Gleixner wrote:
> Peter recently posted a series tweaking the hrtimer subsystem to reduce the
> overhead of the scheduler hrtick timer so it can be enabled by default:
> 
>    https://lore.kernel.org/20260121162010.647043073@infradead.org
> 
> That turned out to be incomplete and led to a deeper investigation of the
> related bits and pieces.
> 
> The problem is that the hrtick deadline changes on every context switch and
> is also modified by wakeups and balancing. On a hackbench run this results
> in about 2500 clockevent reprogramming cycles per second, which is
> especially hurtful in a VM as accessing the clockevent device implies a
> VM-Exit.
> 
> The following series addresses various aspects of the overall related
> problem space:
> 
>     1) Scheduler
> 
>        Aside of some trivial fixes the handling of the hrtick timer in
>        the scheduler is suboptimal:
> 
>         - schedule() modifies the hrtick when picking the next task
> 
> 	- schedule() can modify the hrtick when the balance callback runs
>           before releasing rq:lock
> 
> 	- the expiry time is unfiltered and can result in really tiny
>           changes of the expiry time, which are functionally completely
>           irrelevant
> 
>        Solve this by deferring the hrtick update to the end of schedule()
>        and filtering out tiny changes.
> 
> 
>     2) Clocksource, clockevents, timekeeping
> 
>         - Reading the current clocksource involves an indirect call, which
>           is expensive especially for clocksources where the actual read is
>           a single instruction like the TSC read on x86.
> 
> 	  This could be solved with a static call, but the architecture
> 	  coverage for static calls is meager and that still has the
> 	  overhead of a function call and in the worst case a return
> 	  speculation mitigation.
> 
> 	  As x86 and other architectures like S390 have one preferred
> 	  clocksource which is normally used on all contemporary systems,
> 	  this begs for a fully inlined solution.
> 
> 	  This is achieved by a config option which tells the core code to
> 	  use the architecture provided inline guarded by a static branch.
> 
> 	  If the branch is disabled, the indirect function call is used as
> 	  before. If enabled the inlined read is utilized.
> 
> 	  The branch is disabled by default and only enabled after a
> 	  clocksource is installed which has the INLINE feature flag
> 	  set. When the clocksource is replaced the branch is disabled
> 	  before the clocksource change happens.
> 
> 
>         - Programming clock events is based on calculating a relative
>           expiry time, converting it to the clock cycles corresponding to
>           the clockevent device frequency and invoking the set_next_event()
>           callback of the clockevent device.
> 
> 	  That works perfectly fine as most hardware timers are count down
> 	  implementations which require a relative time for programming.
> 
> 	  But clockevent devices which are coupled to the clocksource and
> 	  provide a less than or equal comparator suffer from this scheme. The
> 	  core calculates the relative expiry time based on a clock read,
> 	  and the set_next_event() callback has to read the same clock
> 	  again to convert it back to an absolute time which can be
> 	  programmed into the comparator.
> 
> 	  The other issue is that the conversion factor of the clockevent
> 	  device is calculated at boot time and does not take the NTP/PTP
> 	  adjustments of the clocksource frequency into account. Depending
> 	  on the direction of the adjustment this can cause timers to fire
> 	  early or late. Early is the more problematic case as the timer
> 	  interrupt has to reprogram the device with a very short delta as
> 	  it can't expire timers early.
> 
> 	  This can be optimized by introducing a 'coupled' mode for the
> 	  clocksource and the clockevent device.
> 
> 	    A) If the clocksource indicates support for 'coupled' mode, the
> 	       timekeeping core calculates a (NTP adjusted) reverse
> 	       conversion factor from the clocksource to nanoseconds
> 	       conversion. This takes NTP adjustments into account and
> 	       keeps the conversion in sync.
> 
> 	    B) The timekeeping core provides a function to convert an
> 	       absolute CLOCK_MONOTONIC expiry time into an absolute time in
> 	       clocksource cycles which can be programmed directly into the
> 	       comparator without reading the clocksource at all.
> 
> 	       This is possible because timekeeping keeps a time pair of
> 	       the base cycle count and the corresponding CLOCK_MONOTONIC base
> 	       time at the last update of the timekeeper.
> 
> 	       So the absolute cycle time can be calculated by computing
> 	       the relative time to the CLOCK_MONOTONIC base time,
> 	       converting the delta into cycles with the help of #A and
> 	       adding the base cycle count. Pure math, no hardware access.
> 
> 	    C) The clockevent reprogramming code invokes this conversion
> 	       function when the clockevent device indicates 'coupled'
> 	       mode.  The function returns false when the corresponding
> 	       clocksource is not the current system clocksource (based on
> 	       a clocksource ID check) and true if the clocksource matches
> 	       and the conversion is successful.
> 
> 	       If false, the regular relative set_next_event() mechanism is
> 	       used; otherwise the new set_next_coupled() callback is invoked,
> 	       which takes the calculated absolute expiry time as argument.
> 
> 	       Similar to the clocksource, this new callback can optionally
> 	       be inlined.
> 
> 
>     3) hrtimers
> 
>        It turned out that the hrtimer code needed a long overdue spring
>        cleaning independent of the problem at hand. That was conducted
>        before tackling the actual performance issues:
> 
>        - Timer locality
> 
> 	 The handling of timer locality is suboptimal and often results
> 	 in pointless invocations of switch_hrtimer_base() which end up
> 	 leaving the CPU base unchanged.
> 
> 	 Aside from the pointless overhead, this prevents further
> 	 optimizations for the common local case.
> 
> 	 Address this by improving the decision logic for keeping the clock
> 	 base local and splitting out the (re)arm handling into a unified
> 	 operation.
> 
> 
>        - Evaluation of the clock base expiries
> 
> 	 The clock bases (MONOTONIC, REALTIME, BOOT, TAI) cache the first
> 	 expiring timer, but not the corresponding expiry time, which means
> 	 a re-evaluation of the clock bases for the next expiring timer on
> 	 the CPU requires touching up to four extra cache lines.
> 
> 	 Trivial to solve by caching the earliest expiry time in the clock
> 	 base itself.
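[Editor's sketch] With the expiry cached per clock base, finding the next expiring timer on a CPU becomes a scan over four cached fields. A toy illustration with made-up names (the real per-CPU and per-base structures differ):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: four clock bases, each caching its earliest expiry */
enum { BASE_MONOTONIC, BASE_REALTIME, BASE_BOOTTIME, BASE_TAI, NR_BASES };

struct clock_base {
	/* RB tree of timers omitted; only the cached expiry matters here */
	uint64_t next_expiry;	/* earliest expiry, UINT64_MAX when empty */
};

struct cpu_base {
	struct clock_base bases[NR_BASES];
};

/*
 * Re-evaluation now only reads the cached per-base expiries instead of
 * dereferencing the first timer of every base (extra cache lines).
 */
static uint64_t cpu_base_next_expiry(const struct cpu_base *cb)
{
	uint64_t min = UINT64_MAX;

	for (int i = 0; i < NR_BASES; i++) {
		if (cb->bases[i].next_expiry < min)
			min = cb->bases[i].next_expiry;
	}
	return min;
}
```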
> 
> 
>        - Reprogramming of the clock event device
> 
> 	 The hrtimer interrupt already defers reprogramming until the
>        	 interrupt handler completes, but in case of the hrtick timer
>        	 that's not sufficient because the hrtick timer callback only sets
>        	 the NEED_RESCHED flag but has no information about the next hrtick
>        	 timer expiry time, which can only be determined in the scheduler.
> 
> 	 Expand the deferred reprogramming so it can ideally be handled in
> 	 the subsequent schedule() after the new hrtick value has been
> 	 established. If there is no schedule(), if soft interrupts have to
> 	 be processed on return from interrupt, or if a nested interrupt
> 	 hits before schedule() is reached, the deferred reprogramming is
> 	 handled in those contexts instead.
> 
> 
>        - Modification of queued timers
> 
> 	 If a timer is already queued, modifying the expiry time requires
> 	 dequeueing it from the RB tree and requeueing it after the new
> 	 expiry value has been set. It turned out that hrtick timer
> 	 modifications very often end up at the same spot in the RB tree
> 	 as before, which means the dequeue/enqueue cycle along with the
> 	 related rebalancing could have been avoided. The timer wheel
> 	 timers have a similar mechanism, which checks upfront whether
> 	 the resulting expiry time keeps them in the same hash bucket.
> 
> 	 An attempt was made to check this by using rb_prev() and rb_next()
> 	 to evaluate whether the modification keeps the timer in the same
> 	 spot, but that turned out to be really inefficient.
> 
> 	 Solve this by providing a RB tree variant which extends the node
> 	 with links to the previous and next nodes. These links are
> 	 established when the node is linked into the tree and adjusted
> 	 when it is removed. They allow a quick peek at the previous and
> 	 next expiry times, and if the new expiry stays within these
> 	 boundaries the whole RB tree operation can be avoided.
> 
> 	 This also simplifies the caching and update of the leftmost node,
> 	 as on removal the rb_next() walk can be completely avoided. The
> 	 variant would obviously provide a cached rightmost pointer too,
> 	 but there is no use case for that (yet).
> 
> 	 On a hackbench run this results in about 35% of the updates being
> 	 handled that way, which cuts the execution time of
> 	 hrtimer_start_range_ns() down to 50ns on a 2GHz machine.
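[Editor's sketch] The fast path enabled by the prev/next links can be sketched as follows. The structures and names here are illustrative, not the actual kernel implementation (which keeps the full RB tree linkage as well):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative node of the extended RB tree variant. The tree linkage
 * itself is omitted; only the extra links are shown. They are
 * established on insert and adjusted on removal.
 */
struct tq_node {
	uint64_t expires;
	struct tq_node *prev;	/* node with the next smaller expiry */
	struct tq_node *next;	/* node with the next larger expiry */
};

/*
 * Peek at the neighbours: if the new expiry stays between them, the
 * value can be updated in place and the dequeue/enqueue cycle plus
 * the related RB tree rebalancing is avoided entirely.
 */
static bool tq_update_in_place(struct tq_node *node, uint64_t new_expires)
{
	if (node->prev && new_expires < node->prev->expires)
		return false;
	if (node->next && new_expires > node->next->expires)
		return false;

	node->expires = new_expires;
	return true;
}
```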
> 
> 
>        - Cancellation of queued timers
> 
> 	 Cancelling a timer or moving its expiry time past the programmed
> 	 time can result in reprogramming the clock event device. With
> 	 frequent modifications of a queued timer this causes substantial
> 	 overhead, especially in VMs.
> 
> 	 Provide an option for hrtimers to tell the core to handle
> 	 reprogramming lazily in those cases, which trades frequent
> 	 reprogramming against an occasional pointless hrtimer interrupt.
> 
> 	 For the hrtick timer this turned out to be a reasonable
> 	 tradeoff. It's especially valuable when transitioning to idle,
> 	 where the timer has to be cancelled but the NOHZ idle code will
> 	 reprogram it in case of a long idle sleep anyway. It also turned
> 	 out to be beneficial in high frequency scheduling scenarios.
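[Editor's sketch] The accounting behind that tradeoff can be modelled with a toy counter. All names here are made up; this only illustrates that lazy cancellation skips the expensive device access at the cost of possibly taking one spurious interrupt later:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the lazy cancellation tradeoff; not the kernel code */
struct ce_model {
	bool timer_queued;		/* a software timer is still pending */
	unsigned int reprograms;	/* clockevent accesses (VM-Exits) */
	unsigned int spurious;		/* pointless hrtimer interrupts taken */
};

static void timer_cancel(struct ce_model *ce, bool lazy)
{
	ce->timer_queued = false;
	if (!lazy)
		ce->reprograms++;	/* stop/reprogram the device right away */
	/* lazy: leave the device programmed for the stale expiry */
}

static void irq_fires(struct ce_model *ce)
{
	if (!ce->timer_queued)
		ce->spurious++;		/* nothing expired: the lazy cost */
}
```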
> 
> 
> With all the above modifications in place, enabling hrtick no longer
> results in regressions compared to the hrtick disabled mode.
> 
> The reprogramming frequency of the clockevent device went down from
> ~2500/sec to ~100/sec for a hackbench run, with a spurious hrtimer
> interrupt ratio of about 25%.
> 
> What's interesting is the astonishing improvement of a hackbench run with
> the following command line parameters: '-l$LOOPS -p -s8'. That uses pipes
> with a message size of 8 bytes. On a 112 CPU SKL machine this results in:
> 
>        	   NO HRTICK[_DL]		HRTICK[_DL]
> runtime:   0.840s			0.481s		~-42%
> 
> With other message sizes up to 256 bytes, HRTICK still results in
> improvements, but not of that magnitude. The cause has not been
> investigated yet.
> 
> While quite some parts of the series are independent enhancements, I've
> decided to keep them together in one big pile for now as all of the
> components are required to actually achieve the overall goal.
> 
> The patches have been already structured in a way that they can be
> distributed to different subsystem branches without causing major cross
> subsystem contamination or merge conflict headaches.
> 
> The series applies on v7.0-rc1 and is also available from git:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick
> 
> Thanks,
> 
> 	tglx
> ---
>  arch/x86/Kconfig                      |    2 
>  arch/x86/include/asm/clock_inlined.h  |   22 
>  arch/x86/kernel/apic/apic.c           |   41 -
>  arch/x86/kernel/tsc.c                 |    4 
>  include/asm-generic/thread_info_tif.h |    5 
>  include/linux/clockchips.h            |    8 
>  include/linux/clocksource.h           |    3 
>  include/linux/hrtimer.h               |   59 -
>  include/linux/hrtimer_defs.h          |   79 +-
>  include/linux/hrtimer_rearm.h         |   83 ++
>  include/linux/hrtimer_types.h         |   19 
>  include/linux/irq-entry-common.h      |   25 
>  include/linux/rbtree.h                |   81 ++
>  include/linux/rbtree_types.h          |   16 
>  include/linux/rseq_entry.h            |   14 
>  include/linux/timekeeper_internal.h   |    8 
>  include/linux/timerqueue.h            |   56 +
>  include/linux/timerqueue_types.h      |   15 
>  include/trace/events/timer.h          |   35 -
>  kernel/entry/common.c                 |    4 
>  kernel/sched/core.c                   |   89 ++
>  kernel/sched/deadline.c               |    2 
>  kernel/sched/fair.c                   |   55 -
>  kernel/sched/features.h               |    5 
>  kernel/sched/sched.h                  |   41 -
>  kernel/softirq.c                      |   15 
>  kernel/time/Kconfig                   |   16 
>  kernel/time/clockevents.c             |   48 +
>  kernel/time/hrtimer.c                 | 1116 +++++++++++++++++++---------------
>  kernel/time/tick-broadcast-hrtimer.c  |    1 
>  kernel/time/tick-sched.c              |   27 
>  kernel/time/timekeeping.c             |  184 +++++
>  kernel/time/timekeeping.h             |    2 
>  kernel/time/timer_list.c              |   12 
>  lib/rbtree.c                          |   17 
>  lib/timerqueue.c                      |   14 
>  36 files changed, 1497 insertions(+), 728 deletions(-)
> 
> 
> 

FWIW I tested various workloads for this on an arm64 rk3399 comparing
mainline NO_HRTICK
mainline HRTICK
rearm NO_HRTICK
rearm HRTICK
rearm being $SUBJECT + arm64 generic entry + enabling generic TIF bits.
https://lore.kernel.org/lkml/20260203133728.848283-1-ruanjinjie@huawei.com/

There's nothing statistically significant with HZ=1000 (it has 6 CPUs, so the
base slice granularity is 2.1ms).
With HZ=250 I get at least something, a selection:
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
| Test        | mainline NO_HRTICK  | mainline HRTICK     | rearm NO_HRTICK      | rearm HRTICK         | subject NO_HRTICK    | subject HRTICK       |
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
| schbench    | 306.83 ± 3.10       | 301.81 ± 1.07       | 298.67 ± 3.33        | (304.87 ± 3.29)      | (305.79 ± 3.64)      | (307.07 ± 1.05)      |
| ebizzy      | 10664 ± 19          | (10565 ± 285)       | (10510 ± 245)        | (10580 ± 240)        | (10674 ± 259)        | 10816 ± 27           |
| hackbench   | 19.715 ± 0.11       | (19.707 ± 0.10)     | (19.826 ± 0.15)      | (19.81 ± 0.12)       | 19.98 ± 0.10         | (19.74 ± 0.11)       |
| nullb0 IOPS | 102525 ± 367        | (101850 ± 262)      | 92209 ± 7624         | (103385 ± 422)       | (101854 ± 473)       | (102141 ± 149)       |
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
(subject is $SUBJECT only, so no REARM_DEFERRED on arm64).
But at least there is no regression with sched_feat HRTICK.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] x86/apic: Enable TSC coupled programming mode
  2026-03-03 21:30           ` Nathan Chancellor
@ 2026-03-04 18:40             ` Thomas Gleixner
  2026-03-04 18:49               ` [patch 20/48] clocksource: Update clocksource::freq_khz on registration Thomas Gleixner
  0 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-04 18:40 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Tue, Mar 03 2026 at 14:30, Nathan Chancellor wrote:
> No problem and thanks for the explanation! Unfortunately, in further
> testing, that diff appears to break booting my two AMD test systems,
> which had no problems with the current series. The output from that
> previous dmesg command from both systems on a vanilla next-20260303:
>
>   [    0.000000] tsc: Fast TSC calibration using PIT
>   [    0.000000] tsc: Detected 3792.761 MHz processor
>   [    0.061853] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6d574392f2f, max_idle_ns: 881590904565 ns
>   [    0.332910] clocksource: Switched to clocksource tsc-early
>   [    1.368506] tsc: Refined TSC clocksource calibration: 3792.899 MHz
>   [    1.368521] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6d584a46b0d, max_idle_ns: 881590977212 ns
>   [    1.368849] clocksource: Switched to clocksource tsc
>   [    4.497901] kvm_amd: TSC scaling supported
>
>   [    0.000000] tsc: Fast TSC calibration using PIT
>   [    0.000000] tsc: Detected 2994.309 MHz processor
>   [    0.179828] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2b29459826f, max_idle_ns: 440795319985 ns
>   [    0.452947] clocksource: Switched to clocksource tsc-early
>   [    1.485796] tsc: Refined TSC clocksource calibration: 2994.372 MHz
>   [    1.485810] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2b29812ce43, max_idle_ns: 440795323173 ns
>   [    1.486231] clocksource: Switched to clocksource tsc
>   [    7.870821] kvm_amd: TSC scaling supported

Borislav has observed that too. I send out a fix in a minute.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [patch 20/48] clocksource: Update clocksource::freq_khz on registration
  2026-03-04 18:40             ` Thomas Gleixner
@ 2026-03-04 18:49               ` Thomas Gleixner
  2026-03-04 19:10                 ` Borislav Petkov
                                   ` (2 more replies)
  0 siblings, 3 replies; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-04 18:49 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

Borislav reported a division by zero in the timekeeping code and random
hangs with the new coupled clocksource/clockevent functionality.

It turned out that the TSC clocksource is not always updating the
freq_khz field of the clocksource on registration. The coupled mode
conversion calculation requires the frequency and as it's not
initialized the resulting factor is zero or a random value. As a
consequence this causes a division by zero or random boot hangs.

Instead of chasing down all clocksources which fail to update that
member, fill it in at registration time where the caller has to supply
the frequency anyway. Except for special clocksources like jiffies which
never can have coupled mode.

To make this more robust put a check into the registration function to
validate that the caller supplied a frequency if the coupled mode
feature bit is set. If not, emit a warning and clear the feature bit.

Fixes: cd38bdb8e696 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/clocksource.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -1169,6 +1169,9 @@ void __clocksource_update_freq_scale(str
 
 		clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
 				       NSEC_PER_SEC / scale, sec * scale);
+
+		/* Update cs::freq_khz */
+		cs->freq_khz = div_u64((u64)freq * scale, 1000);
 	}
 
 	/*
@@ -1241,6 +1244,10 @@ int __clocksource_register_scale(struct
 
 	if (WARN_ON_ONCE((unsigned int)cs->id >= CSID_MAX))
 		cs->id = CSID_GENERIC;
+
+	if (WARN_ON_ONCE(!freq && cs->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+		cs->flags &= ~CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT;
+
 	if (cs->vdso_clock_mode < 0 ||
 	    cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) {
 		pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n",

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] clocksource: Update clocksource::freq_khz on registration
  2026-03-04 18:49               ` [patch 20/48] clocksource: Update clocksource::freq_khz on registration Thomas Gleixner
@ 2026-03-04 19:10                 ` Borislav Petkov
  2026-03-04 22:57                 ` Nathan Chancellor
  2026-03-05 16:47                 ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 128+ messages in thread
From: Borislav Petkov @ 2026-03-04 19:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nathan Chancellor, LKML, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Daniel Lezcano, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, x86, Peter Zijlstra, Frederic Weisbecker,
	Eric Dumazet

On Wed, Mar 04, 2026 at 07:49:29PM +0100, Thomas Gleixner wrote:
> Borislav reported a division by zero in the timekeeping code and random
> hangs with the new coupled clocksource/clockevent functionality.
> 
> It turned out that the TSC clocksource is not always updating the
> freq_khz field of the clocksource on registration. The coupled mode
> conversion calculation requires the frequency and as it's not
> initialized the resulting factor is zero or a random value. As a
> consequence this causes a division by zero or random boot hangs.
> 
> Instead of chasing down all clocksources which fail to update that
> member, fill it in at registration time where the caller has to supply
> the frequency anyway. Except for special clocksources like jiffies which
> never can have coupled mode.
> 
> To make this more robust put a check into the registration function to
> validate that the caller supplied a frequency if the coupled mode
> feature bit is set. If not, emit a warning and clear the feature bit.
> 
> Fixes: cd38bdb8e696 ("timekeeping: Provide infrastructure for coupled clockevents")
> Reported-by: Borislav Petkov <bp@alien8.de>
> Reported-by: Nathan Chancellor <nathan@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> ---
>  kernel/time/clocksource.c |    7 +++++++
>  1 file changed, 7 insertions(+)

Tested-by: Borislav Petkov (AMD) <bp@alien8.de>

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [patch 20/48] clocksource: Update clocksource::freq_khz on registration
  2026-03-04 18:49               ` [patch 20/48] clocksource: Update clocksource::freq_khz on registration Thomas Gleixner
  2026-03-04 19:10                 ` Borislav Petkov
@ 2026-03-04 22:57                 ` Nathan Chancellor
  2026-03-05 16:47                 ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 128+ messages in thread
From: Nathan Chancellor @ 2026-03-04 22:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, John Stultz, Stephen Boyd,
	Daniel Lezcano, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, x86,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Wed, Mar 04, 2026 at 07:49:29PM +0100, Thomas Gleixner wrote:
> Borislav reported a division by zero in the timekeeping code and random
> hangs with the new coupled clocksource/clockevent functionality.
> 
> It turned out that the TSC clocksource is not always updating the
> freq_khz field of the clocksource on registration. The coupled mode
> conversion calculation requires the frequency and as it's not
> initialized the resulting factor is zero or a random value. As a
> consequence this causes a division by zero or random boot hangs.
> 
> Instead of chasing down all clocksources which fail to update that
> member, fill it in at registration time where the caller has to supply
> the frequency anyway. Except for special clocksources like jiffies which
> never can have coupled mode.
> 
> To make this more robust put a check into the registration function to
> validate that the caller supplied a frequency if the coupled mode
> feature bit is set. If not, emit a warning and clear the feature bit.
> 
> Fixes: cd38bdb8e696 ("timekeeping: Provide infrastructure for coupled clockevents")
> Reported-by: Borislav Petkov <bp@alien8.de>
> Reported-by: Nathan Chancellor <nathan@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>

Tested-by: Nathan Chancellor <nathan@kernel.org>

> ---
>  kernel/time/clocksource.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -1169,6 +1169,9 @@ void __clocksource_update_freq_scale(str
>  
>  		clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
>  				       NSEC_PER_SEC / scale, sec * scale);
> +
> +		/* Update cs::freq_khz */
> +		cs->freq_khz = div_u64((u64)freq * scale, 1000);
>  	}
>  
>  	/*
> @@ -1241,6 +1244,10 @@ int __clocksource_register_scale(struct
>  
>  	if (WARN_ON_ONCE((unsigned int)cs->id >= CSID_MAX))
>  		cs->id = CSID_GENERIC;
> +
> +	if (WARN_ON_ONCE(!freq && cs->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
> +		cs->flags &= ~CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT;
> +
>  	if (cs->vdso_clock_mode < 0 ||
>  	    cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) {
>  		pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n",

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] clocksource: Update clocksource::freq_khz on registration
  2026-03-04 18:49               ` [patch 20/48] clocksource: Update clocksource::freq_khz on registration Thomas Gleixner
  2026-03-04 19:10                 ` Borislav Petkov
  2026-03-04 22:57                 ` Nathan Chancellor
@ 2026-03-05 16:47                 ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-03-05 16:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Borislav Petkov, Nathan Chancellor, Thomas Gleixner, x86,
	linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     53007d526e17d29f0e5b81c07eb594a93bc4d29c
Gitweb:        https://git.kernel.org/tip/53007d526e17d29f0e5b81c07eb594a93bc4d29c
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Wed, 04 Mar 2026 19:49:29 +01:00
Committer:     Thomas Gleixner <tglx@kernel.org>
CommitterDate: Thu, 05 Mar 2026 17:41:06 +01:00

clocksource: Update clocksource::freq_khz on registration

Borislav reported a division by zero in the timekeeping code and random
hangs with the new coupled clocksource/clockevent functionality.

It turned out that the TSC clocksource is not always updating the
freq_khz field of the clocksource on registration. The coupled mode
conversion calculation requires the frequency and as it's not
initialized the resulting factor is zero or a random value. As a
consequence this causes a division by zero or random boot hangs.

Instead of chasing down all clocksources which fail to update that
member, fill it in at registration time where the caller has to supply
the frequency anyway. Except for special clocksources like jiffies which
never can have coupled mode.

To make this more robust put a check into the registration function to
validate that the caller supplied a frequency if the coupled mode
feature bit is set. If not, emit a warning and clear the feature bit.

Fixes: cd38bdb8e696 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/87cy1jsa4m.ffs@tglx
Closes: https://lore.kernel.org/20260303213027.GA2168957@ax162
---
 kernel/time/clocksource.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index df71949..3c20544 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -1169,6 +1169,9 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
 
 		clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
 				       NSEC_PER_SEC / scale, sec * scale);
+
+		/* Update cs::freq_khz */
+		cs->freq_khz = div_u64((u64)freq * scale, 1000);
 	}
 
 	/*
@@ -1241,6 +1244,10 @@ int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
 
 	if (WARN_ON_ONCE((unsigned int)cs->id >= CSID_MAX))
 		cs->id = CSID_GENERIC;
+
+	if (WARN_ON_ONCE(!freq && cs->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+		cs->flags &= ~CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT;
+
 	if (cs->vdso_clock_mode < 0 ||
 	    cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) {
 		pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n",

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [tip: sched/hrtick] timekeeping: Initialize the coupled clocksource conversion completely
  2026-03-03 21:56         ` [PATCH] Subject: timekeeping: Initialize the coupled clocksource conversion completely Thomas Gleixner
  2026-03-03 23:16           ` John Stultz
@ 2026-03-05 16:47           ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 128+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-03-05 16:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Nathan Chancellor, Thomas Gleixner, John Stultz, x86,
	linux-kernel

The following commit has been merged into the sched/hrtick branch of tip:

Commit-ID:     9d5e25b361b7228b422fd32bd1c327fd7fb919b4
Gitweb:        https://git.kernel.org/tip/9d5e25b361b7228b422fd32bd1c327fd7fb919b4
Author:        Thomas Gleixner <tglx@kernel.org>
AuthorDate:    Tue, 03 Mar 2026 22:56:27 +01:00
Committer:     Thomas Gleixner <tglx@kernel.org>
CommitterDate: Thu, 05 Mar 2026 17:40:46 +01:00

timekeeping: Initialize the coupled clocksource conversion completely

Nathan reported a boot failure after the coupled clocksource/event support
was enabled for the TSC deadline timer. It turns out that on the affected
test systems the TSC frequency is not refined against HPET, so it is
registered with the same frequency as the TSC-early clocksource.

As a consequence the update function which checks for a change of the
shift/mult pair of the clocksource fails to compute the conversion
limit, which is zero initialized. This check is there to avoid pointless
computations on every timekeeping update cycle (tick).

So the actual clockevent conversion function limits the delta expiry to
zero, which means the timer is always programmed to expire in the
past. This obviously results in a spectacular timer interrupt storm,
which goes unnoticed because the per CPU interrupts on x86 are not
exposed to the runaway detection mechanism and the NMI watchdog is not
yet functional. So the machine simply stops booting.

That did not show up in testing. All test machines refine the TSC frequency
so TSC has a different shift/mult pair than TSC-early and the conversion
limit is properly initialized.

Cure that by setting the conversion limit right at the point where the new
clocksource is installed.

Fixes: cd38bdb8e696 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/87bjh4zies.ffs@tglx
Closes: https://lore.kernel.org/20260303012905.GA978396@ax162
---
 kernel/time/timekeeping.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index b7a0f93..5153218 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -404,6 +404,13 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 		 */
 		clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
 				       NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+		/*
+		 * Initialize the conversion limit as the previous clocksource
+		 * might have the same shift/mult pair so the quick check in
+		 * tk_update_ns_to_cyc() fails to update it after a clocksource
+		 * change leaving it effectively zero.
+		 */
+		tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [tip: sched/hrtick] sched/eevdf: Fix HRTICK duration
  2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
@ 2026-03-20 14:59     ` Shrikanth Hegde
  2026-03-20 15:38       ` Peter Zijlstra
  0 siblings, 1 reply; 128+ messages in thread
From: Shrikanth Hegde @ 2026-03-20 14:59 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra (Intel)
  Cc: linux-kernel, Juri Lelli, x86, linux-tip-commits

Sorry for very very late reply. I was trying to go through this.

On 2/28/26 9:07 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/hrtick branch of tip:
> 
> Commit-ID:     558c18d3fbb6c5b9c0b42629d7fe34476363ac00
> Gitweb:        https://git.kernel.org/tip/558c18d3fbb6c5b9c0b42629d7fe34476363ac00
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 24 Feb 2026 17:35:17 +01:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Fri, 27 Feb 2026 16:40:03 +01:00
> 
> sched/eevdf: Fix HRTICK duration
> 
> The nominal duration for an EEVDF task to run is until its deadline. At
> which point the deadline is moved ahead and a new task selection is done.
> 
> Try and predict the time 'lost' to higher scheduling classes. Since this is
> an estimate, the timer can be both early or late. In case it is early
> task_tick_fair() will take the !need_resched() path and restarts the timer.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
> Link: https://patch.msgid.link/20260224163428.798198874@kernel.org
> ---
>   kernel/sched/fair.c | 41 +++++++++++++++++++++++++++--------------
>   1 file changed, 27 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eea99ec..247fecd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6735,21 +6735,37 @@ static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct
>   static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>   {
>   	struct sched_entity *se = &p->se;
> +	unsigned long scale = 1024;
> +	unsigned long util = 0;
> +	u64 vdelta;
> +	u64 delta;
>   
>   	WARN_ON_ONCE(task_rq(p) != rq);
>   
> -	if (rq->cfs.h_nr_queued > 1) {
> -		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> -		u64 slice = se->slice;
> -		s64 delta = slice - ran;
> +	if (rq->cfs.h_nr_queued <= 1)
> +		return;
>   
> -		if (delta < 0) {
> -			if (task_current_donor(rq, p))
> -				resched_curr(rq);
> -			return;
> -		}
> -		hrtick_start(rq, delta);
> +	/*
> +	 * Compute time until virtual deadline
> +	 */
> +	vdelta = se->deadline - se->vruntime;
> +	if ((s64)vdelta < 0) {
> +		if (task_current_donor(rq, p))
> +			resched_curr(rq);
> +		return;
>   	}
> +	delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> +
> +	/*
> +	 * Correct for instantaneous load of other classes.
> +	 */
> +	util += cpu_util_irq(rq);
> +	if (util && util < 1024) {
> +		scale *= 1024;
> +		scale /= (1024 - util);
> +	}

Comments/Changelog says other classes.

Then why not consider cpu_util_dl, cpu_util_rt too?
Is there a reason why these are not taken into calculations?

> +
> +	hrtick_start(rq, (scale * delta) / 1024);
>   }
>   
>   /*
> @@ -13365,11 +13381,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>   		entity_tick(cfs_rq, se, queued);
>   	}
>   
> -	if (queued) {
> -		if (!need_resched())
> -			hrtick_start_fair(rq, curr);
> +	if (queued)
>   		return;
> -	}
>   
>   	if (static_branch_unlikely(&sched_numa_balancing))
>   		task_tick_numa(rq, curr);


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [tip: sched/hrtick] sched/eevdf: Fix HRTICK duration
  2026-03-20 14:59     ` Shrikanth Hegde
@ 2026-03-20 15:38       ` Peter Zijlstra
  2026-03-20 15:40         ` Shrikanth Hegde
  0 siblings, 1 reply; 128+ messages in thread
From: Peter Zijlstra @ 2026-03-20 15:38 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Thomas Gleixner, linux-kernel, Juri Lelli, x86, linux-tip-commits

On Fri, Mar 20, 2026 at 08:29:11PM +0530, Shrikanth Hegde wrote:

> > +	/*
> > +	 * Correct for instantaneous load of other classes.
> > +	 */
> > +	util += cpu_util_irq(rq);
> > +	if (util && util < 1024) {
> > +		scale *= 1024;
> > +		scale /= (1024 - util);
> > +	}
> 
> Comments/Changelog says other classes.
> 
> Then why not consider cpu_util_dl, cpu_util_rq too?
> Is there a reason why these are not taken into calculations?

Damn, forgot to fix that comment.

So yes, it used to correct for those, but then I realized that the
hrtick is strictly for current. So running RT/DL tasks means current is
different.

The only thing that can actually interrupt current and soak time are
interrupts.

Does that make sense?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [tip: sched/hrtick] sched/eevdf: Fix HRTICK duration
  2026-03-20 15:38       ` Peter Zijlstra
@ 2026-03-20 15:40         ` Shrikanth Hegde
  0 siblings, 0 replies; 128+ messages in thread
From: Shrikanth Hegde @ 2026-03-20 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, Juri Lelli, x86, linux-tip-commits



On 3/20/26 9:08 PM, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 08:29:11PM +0530, Shrikanth Hegde wrote:
> 
>>> +	/*
>>> +	 * Correct for instantaneous load of other classes.
>>> +	 */
>>> +	util += cpu_util_irq(rq);
>>> +	if (util && util < 1024) {
>>> +		scale *= 1024;
>>> +		scale /= (1024 - util);
>>> +	}
>>
>> Comments/Changelog says other classes.
>>
>> Then why not consider cpu_util_dl, cpu_util_rq too?
>> Is there a reason why these are not taken into calculations?
> 
> Damn, forgot to fix that comment.
> 
> So yes, it used to correct for those, but then I realized that the
> hrtick is strictly for current. So running RT/DL tasks means current is
> different.
> 
> The only thing that can actually interrupt current and soak time are
> interrupts.
> 
> Does that make sense?

Yes. That helps.


* RE: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-03-03 18:44   ` [patch 19/48] " Michael Kelley
  2026-03-03 19:14     ` Peter Zijlstra
@ 2026-03-23  4:24     ` Michael Kelley
  2026-03-23 21:36       ` Thomas Gleixner
  1 sibling, 1 reply; 128+ messages in thread
From: Michael Kelley @ 2026-03-23  4:24 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86@kernel.org,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

From: Michael Kelley Sent: Tuesday, March 3, 2026 10:45 AM
> 
> From: Thomas Gleixner <tglx@kernel.org> Sent: Tuesday, February 24, 2026 8:37 AM
> >
> > Some clockevent devices are coupled to the system clocksource by
> > implementing a less than or equal comparator which compares the programmed
> > absolute expiry time against the underlying time counter.
> 
> I've been playing with this in linux-next, and particularly to set up the Hyper-V
> TSC page clocksource and Hyper-V timer as coupled. Most Hyper-V guests these days
> are running on hardware that allows using the TSC directly as the clocksource. But
> even if the Hyper-V TSC page clocksource isn't used, the timer is still the Hyper-V
> timer, so the coupling isn't active. However, SEV-SNP and TDX CoCo VMs on Hyper-V
> must use both the Hyper-V TSC page clocksource and the Hyper-V timer, so they
> would benefit from coupling. It's a nice idea!
> 

[snip]

> > +
> > +static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
> > +{
> > +	u64 cycles;
> > +
> > +	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
> > +		return false;
> > +
> > +	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
> > +		return false;
> > +
> > +	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
> 
> Since COUPLED_INLINE is always selected for x64, there's no way to add the Hyper-V
> clockevent that is coupled but not inline. Adding the machinery to allow a second
> inline clockevent type may not be worth it, but adding a second coupled but not
> inline clockevent type on x64 should be supported. Thoughts?
> 
> After fixing the u64 typo, and temporarily not always selecting COUPLED_INLINE in
> arch/x86/Kconfig, the coupled Hyper-V TSC page clocksource and timer seem to work
> correctly, though I'm still doing some testing. I'm also working on counting the number
> of time reads to confirm the expected benefit.
> 

Thomas --

Gentle ping.  Any thoughts on this?  (And on Peter Zijlstra's "deliciously insane"
follow-up?)

Michael


* RE: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-03-23  4:24     ` Michael Kelley
@ 2026-03-23 21:36       ` Thomas Gleixner
  2026-03-24  0:22         ` mhklkml
  0 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-23 21:36 UTC (permalink / raw)
  To: Michael Kelley, LKML
  Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, x86@kernel.org,
	Peter Zijlstra, Frederic Weisbecker, Eric Dumazet

On Mon, Mar 23 2026 at 04:24, Michael Kelley wrote:
> From: Michael Kelley Sent: Tuesday, March 3, 2026 10:45 AM
>> From: Thomas Gleixner <tglx@kernel.org> Sent: Tuesday, February 24, 2026 8:37 AM
>> >
>> > Some clockevent devices are coupled to the system clocksource by
>> > implementing a less than or equal comparator which compares the programmed
>> > absolute expiry time against the underlying time counter.
>> 
>> I've been playing with this in linux-next, and particularly to set up the Hyper-V
>> TSC page clocksource and Hyper-V timer as coupled. Most Hyper-V guests these days
>> are running on hardware that allows using the TSC directly as the clocksource. But
>> even if the Hyper-V TSC page clocksource isn't used, the timer is still the Hyper-V
>> timer, so the coupling isn't active. However, SEV-SNP and TDX CoCo VMs on Hyper-V
>> must use both the Hyper-V TSC page clocksource and the Hyper-V timer, so they
>> would benefit from coupling. It's a nice idea!

Did not think about that. I try to avoid the virt dungeon as much as
possible :)

>> > +static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
>> > +{
>> > +	u64 cycles;
>> > +
>> > +	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
>> > +		return false;
>> > +
>> > +	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
>> > +		return false;
>> > +
>> > +	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
>> 
>> Since COUPLED_INLINE is always selected for x64, there's no way to add the Hyper-V
>> clockevent that is coupled but not inline. Adding the machinery to allow a second
>> inline clockevent type may not be worth it, but adding a second coupled but not
>> inline clockevent type on x64 should be supported. Thoughts?
>> 
>> After fixing the u64 typo, and temporarily not always selecting COUPLED_INLINE in
>> arch/x86/Kconfig, the coupled Hyper-V TSC page clocksource and timer seem to work
>> correctly, though I'm still doing some testing. I'm also working on counting the number
>> of time reads to confirm the expected benefit.
>
> Gentle ping.  Any thoughts on this?  (And on Peter Zijlstra's "deliciously insane"
> follow-up?)

Sure, we should be able to support that and I think Peter's suggestion
is pretty clever. Did you get it working?

Thanks,

        tglx


* RE: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-03-23 21:36       ` Thomas Gleixner
@ 2026-03-24  0:22         ` mhklkml
  2026-03-24  3:37           ` Michael Kelley
  0 siblings, 1 reply; 128+ messages in thread
From: mhklkml @ 2026-03-24  0:22 UTC (permalink / raw)
  To: 'Thomas Gleixner', 'LKML'
  Cc: 'Anna-Maria Behnsen', 'John Stultz',
	'Stephen Boyd', 'Daniel Lezcano',
	'Juri Lelli', 'Vincent Guittot',
	'Dietmar Eggemann', 'Steven Rostedt',
	'Ben Segall', 'Mel Gorman',
	'Valentin Schneider', x86, 'Peter Zijlstra',
	'Frederic Weisbecker', 'Eric Dumazet'

From: Thomas Gleixner <tglx@kernel.org> Sent: Monday, March 23, 2026 2:37 PM
> 
> On Mon, Mar 23 2026 at 04:24, Michael Kelley wrote:
> >> > +static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
> >> > +{
> >> > +	u64 cycles;
> >> > +
> >> > +	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
> >> > +		return false;
> >> > +
> >> > +	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
> >> > +		return false;
> >> > +
> >> > +	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
> >>
> >> Since COUPLED_INLINE is always selected for x64, there's no way to add the Hyper-V
> >> clockevent that is coupled but not inline. Adding the machinery to allow a second
> >> inline clockevent type may not be worth it, but adding a second coupled but not
> >> inline clockevent type on x64 should be supported. Thoughts?
> >>
> >> After fixing the u64 typo, and temporarily not always selecting COUPLED_INLINE in
> >> arch/x86/Kconfig, the coupled Hyper-V TSC page clocksource and timer seem to work
> >> correctly, though I'm still doing some testing. I'm also working on counting the number
> >> of time reads to confirm the expected benefit.
> >
> > Gentle ping.  Any thoughts on this?  (And on Peter Zijlstra's "deliciously insane"
> > follow-up?)
> 
> Sure, we should be able to support that and I think Peter's suggestion
> is pretty clever. Did you get it working?
> 

I got the coupling working with the Hyper-V clocksource and timer in
non-inlined form, and observed the expected reduction in time reads.
That was straightforward.

But I did not try Peter's suggestion, just to keep things simple. The
Hyper-V timer can be loaded with a single wrmsrq just like the TSC
deadline timer, but it's a different MSR that's synthetic and always
traps to the hypervisor. Given the unavoidable overhead of trapping
to the hypervisor, the relative gain of inlining would be small.

Michael



* RE: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-03-24  0:22         ` mhklkml
@ 2026-03-24  3:37           ` Michael Kelley
  2026-03-24 17:24             ` Thomas Gleixner
  0 siblings, 1 reply; 128+ messages in thread
From: Michael Kelley @ 2026-03-24  3:37 UTC (permalink / raw)
  To: 'Thomas Gleixner', 'LKML'
  Cc: 'Anna-Maria Behnsen', 'John Stultz',
	'Stephen Boyd', 'Daniel Lezcano',
	'Juri Lelli', 'Vincent Guittot',
	'Dietmar Eggemann', 'Steven Rostedt',
	'Ben Segall', 'Mel Gorman',
	'Valentin Schneider', x86@kernel.org,
	'Peter Zijlstra', 'Frederic Weisbecker',
	'Eric Dumazet'

From: mhklkml@zohomail.com <mhklkml@zohomail.com> Sent: Monday, March 23, 2026 5:22 PM
> 
> From: Thomas Gleixner <tglx@kernel.org> Sent: Monday, March 23, 2026 2:37 PM
> >
> > On Mon, Mar 23 2026 at 04:24, Michael Kelley wrote:
> > >> > +static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
> > >> > +{
> > >> > +	u64 cycles;
> > >> > +
> > >> > +	if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
> > >> > +		return false;
> > >> > +
> > >> > +	if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
> > >> > +		return false;
> > >> > +
> > >> > +	if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
> > >>
> > >> Since COUPLED_INLINE is always selected for x64, there's no way to add the Hyper-V
> > >> clockevent that is coupled but not inline. Adding the machinery to allow a second
> > >> inline clockevent type may not be worth it, but adding a second coupled but not
> > >> inline clockevent type on x64 should be supported. Thoughts?
> > >>
> > >> After fixing the u64 typo, and temporarily not always selecting COUPLED_INLINE in
> > >> arch/x86/Kconfig, the coupled Hyper-V TSC page clocksource and timer seem to work
> > >> correctly, though I'm still doing some testing. I'm also working on counting the number
> > >> of time reads to confirm the expected benefit.
> > >
> > > Gentle ping.  Any thoughts on this?  (And on Peter Zijlstra's "deliciously insane"
> > > follow-up?)
> >
> > Sure, we should be able to support that and I think Peter's suggestion
> > is pretty clever. Did you get it working?
> >
> 
> I got the coupling working with the Hyper-V clocksource and timer in
> non-inlined form, and observed the expected reduction in time reads.
> That was straightforward.
> 
> But I did not try Peter's suggestion, just to keep things simple. The
> Hyper-V timer can be loaded with a single wrmsrq just like the TSC
> deadline timer, but it's a different MSR that's synthetic and always
> traps to the hypervisor. Given the unavoidable overhead of trapping
> to the hypervisor, the relative gain of inlining would be small.
> 
> Michael
> 

Another thought occurred to me. Since the Hyper-V timer "set next event"
does wrmsrq just like the TSC deadline timer, add an "msr" field to struct
clock_event_device, and require clock events that specify COUPLED to also
specify the MSR. Here's a diff of the core changes:

diff --git a/arch/x86/include/asm/clock_inlined.h b/arch/x86/include/asm/clock_inlined.h
index b2dee8db2fb9..a3a3c3670ad4 100644
--- a/arch/x86/include/asm/clock_inlined.h
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -16,7 +16,7 @@ struct clock_event_device;
 static __always_inline void
 arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *evt)
 {
-	native_wrmsrq(MSR_IA32_TSC_DEADLINE, cycles);
+	native_wrmsrq(evt->msr, cycles);
 }
 
 #endif
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 60cab20b7901..42326c7d3f41 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -593,6 +593,7 @@ static void setup_APIC_timer(void)
 		levt->name = "lapic-deadline";
 		levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_DUMMY);
 		levt->features |= CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED;
+		levt->msr = MSR_IA32_TSC_DEADLINE;
 		levt->cs_id = CSID_X86_TSC;
 		levt->set_next_event = lapic_next_deadline;
 		clockevents_config_and_register(levt, tsc_khz * (1000 / TSC_DIVISOR), 0xF, ~0UL);
diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 92d90220c0d4..deea69580db0 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -110,6 +110,7 @@ struct clock_event_device {
 	enum clock_event_state	state_use_accessors;
 	unsigned int		features;
 	enum clocksource_ids	cs_id;
+	u32			msr;
 	unsigned long		retries;
 
 	int			(*set_state_periodic)(struct clock_event_device *);

This approach is not as general as Peter's, but it covers the Hyper-V timer
case, and is simpler. The cost is an extra memory reference in
arch_inlined_clockevent_set_next_coupled(). arch/x86/Kconfig can continue
to select GENERIC_CLOCKEVENTS_COUPLED_INLINE without preventing
coupling of the Hyper-V clocksource and timer. I've built and run this with
a coupled Hyper-V clocksource and timer, and a basic smoke test works.
So something to consider ...  

If you like this approach, I'm happy to submit this as a patch. It would
be a prerequisite to my patch for the Hyper-V clocksource/timer changes
to enable coupling.

Michael


* RE: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-03-24  3:37           ` Michael Kelley
@ 2026-03-24 17:24             ` Thomas Gleixner
  2026-03-24 17:34               ` Peter Zijlstra
  0 siblings, 1 reply; 128+ messages in thread
From: Thomas Gleixner @ 2026-03-24 17:24 UTC (permalink / raw)
  To: Michael Kelley, 'LKML'
  Cc: 'Anna-Maria Behnsen', 'John Stultz',
	'Stephen Boyd', 'Daniel Lezcano',
	'Juri Lelli', 'Vincent Guittot',
	'Dietmar Eggemann', 'Steven Rostedt',
	'Ben Segall', 'Mel Gorman',
	'Valentin Schneider', x86@kernel.org,
	'Peter Zijlstra', 'Frederic Weisbecker',
	'Eric Dumazet'

On Tue, Mar 24 2026 at 03:37, Michael Kelley wrote:
> This approach is not as general as Peter's, but it covers the Hyper-V timer
> case, and is simpler. The cost is an extra memory reference in
> arch_inlined_clockevent_set_next_coupled(). arch/x86/Kconfig can continue

Which can be avoided with a runtime_const if the decision between the
hyperv timer and the tscdeadline timer happens before either of them
registers the clockevent and does not change later on.

Thanks,

        tglx




* Re: [patch 19/48] clockevents: Provide support for clocksource coupled comparators
  2026-03-24 17:24             ` Thomas Gleixner
@ 2026-03-24 17:34               ` Peter Zijlstra
  0 siblings, 0 replies; 128+ messages in thread
From: Peter Zijlstra @ 2026-03-24 17:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Michael Kelley, 'LKML', 'Anna-Maria Behnsen',
	'John Stultz', 'Stephen Boyd',
	'Daniel Lezcano', 'Juri Lelli',
	'Vincent Guittot', 'Dietmar Eggemann',
	'Steven Rostedt', 'Ben Segall',
	'Mel Gorman', 'Valentin Schneider',
	x86@kernel.org, 'Frederic Weisbecker',
	'Eric Dumazet'

On Tue, Mar 24, 2026 at 06:24:16PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 24 2026 at 03:37, Michael Kelley wrote:
> > This approach is not as general as Peter's, but it covers the Hyper-V timer
> > case, and is simpler. The cost is an extra memory reference in
> > arch_inlined_clockevent_set_next_coupled(). arch/x86/Kconfig can continue
> 
> Which can be avoided with a runtime_const if the decision between hyperv
> timer and tscdeadline timer happens before either of them registered the
> clockevent and does not change later on.

If the wrmsr immediate form is faster than the current form (that was
their purpose), this will no longer work and we'll have to get more
creative. But yes, until that time this should work.



end of thread, other threads:[~2026-03-24 17:34 UTC | newest]

Thread overview: 128+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-02-24 16:35 [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Thomas Gleixner
2026-02-24 16:35 ` [patch 01/48] sched/eevdf: Fix HRTICK duration Thomas Gleixner
2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-03-20 14:59     ` Shrikanth Hegde
2026-03-20 15:38       ` Peter Zijlstra
2026-03-20 15:40         ` Shrikanth Hegde
2026-02-24 16:35 ` [patch 02/48] sched/fair: Simplify hrtick_update() Thomas Gleixner
2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra (Intel)
2026-02-24 16:35 ` [patch 03/48] sched/fair: Make hrtick resched hard Thomas Gleixner
2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra (Intel)
2026-02-24 16:35 ` [patch 04/48] sched: Avoid ktime_get() indirection Thomas Gleixner
2026-02-28 15:37   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:35 ` [patch 05/48] hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:35 ` [patch 06/48] hrtimer: Provide a static branch based hrtimer_hres_enabled() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:35 ` [patch 07/48] sched: Use hrtimer_highres_enabled() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:35 ` [patch 08/48] sched: Optimize hrtimer handling Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:35 ` [patch 09/48] sched/hrtick: Avoid tiny hrtick rearms Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 10/48] hrtimer: Provide LAZY_REARM mode Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:36 ` [patch 11/48] sched/hrtick: Mark hrtick timer LAZY_REARM Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:36 ` [patch 12/48] tick/sched: Avoid hrtimer_cancel/start() sequence Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 13/48] clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 14/48] timekeeping: Allow inlining clocksource::read() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 15/48] x86: Inline TSC reads in timekeeping Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 16/48] x86/apic: Remove pointless fence in lapic_next_deadline() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 17/48] x86/apic: Avoid the PVOPS indirection for the TSC deadline timer Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 18/48] timekeeping: Provide infrastructure for coupled clockevents Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 19/48] clockevents: Provide support for clocksource coupled comparators Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-03-03 18:44   ` [patch 19/48] " Michael Kelley
2026-03-03 19:14     ` Peter Zijlstra
2026-03-23  4:24     ` Michael Kelley
2026-03-23 21:36       ` Thomas Gleixner
2026-03-24  0:22         ` mhklkml
2026-03-24  3:37           ` Michael Kelley
2026-03-24 17:24             ` Thomas Gleixner
2026-03-24 17:34               ` Peter Zijlstra
2026-02-24 16:36 ` [patch 20/48] x86/apic: Enable TSC coupled programming mode Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-03-03  1:29   ` [patch 20/48] " Nathan Chancellor
2026-03-03 14:37     ` Thomas Gleixner
2026-03-03 14:45       ` Thomas Gleixner
2026-03-03 17:38       ` Nathan Chancellor
2026-03-03 20:21         ` Thomas Gleixner
2026-03-03 21:30           ` Nathan Chancellor
2026-03-04 18:40             ` Thomas Gleixner
2026-03-04 18:49               ` [patch 20/48] clocksource: Update clocksource::freq_khz on registration Thomas Gleixner
2026-03-04 19:10                 ` Borislav Petkov
2026-03-04 22:57                 ` Nathan Chancellor
2026-03-05 16:47                 ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-03-03 21:56         ` [PATCH] Subject: timekeeping: Initialize the coupled clocksource conversion completely Thomas Gleixner
2026-03-03 23:16           ` John Stultz
2026-03-05 16:47           ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 21/48] hrtimer: Add debug object init assertion Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:36 ` [patch 22/48] hrtimer: Reduce trace noise in hrtimer_start() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 23/48] hrtimer: Use guards where appropriate Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 24/48] hrtimer: Cleanup coding style and comments Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 25/48] hrtimer: Evaluate timer expiry only once Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 26/48] hrtimer: Replace the bitfield in hrtimer_cpu_base Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 27/48] hrtimer: Convert state and properties to boolean Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 28/48] hrtimer: Optimize for local timers Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 29/48] hrtimer: Use NOHZ information for locality Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 30/48] hrtimer: Separate remove/enqueue handling for local timers Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 31/48] hrtimer: Add hrtimer_rearm tracepoint Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 32/48] hrtimer: Re-arrange hrtimer_interrupt() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:37 ` [patch 33/48] hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:37 ` [patch 34/48] hrtimer: Prepare stubs for deferred rearming Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:38 ` [patch 35/48] entry: Prepare for deferred hrtimer rearming Thomas Gleixner
2026-02-27 15:57   ` Christian Loehle
2026-02-27 16:25     ` Peter Zijlstra
2026-02-27 16:32       ` Christian Loehle
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:38 ` [patch 36/48] softirq: " Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:38 ` [patch 37/48] sched/core: " Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:38 ` [patch 38/48] hrtimer: Push reprogramming timers into the interrupt return path Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-24 16:38 ` [patch 39/48] hrtimer: Avoid re-evaluation when nothing changed Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:38 ` [patch 40/48] hrtimer: Keep track of first expiring timer per clock base Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:38 ` [patch 41/48] hrtimer: Rework next event evaluation Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:38 ` [patch 42/48] hrtimer: Simplify run_hrtimer_queues() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:38 ` [patch 43/48] hrtimer: Optimize for_each_active_base() Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:38 ` [patch 44/48] rbtree: Provide rbtree with links Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:38 ` [patch 45/48] timerqueue: Provide linked timerqueue Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:38 ` [patch 46/48] hrtimer: Use " Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:39 ` [patch 47/48] hrtimer: Try to modify timers in place Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Thomas Gleixner
2026-02-24 16:39 ` [patch 48/48] sched: Default enable HRTICK when deferred rearming is enabled Thomas Gleixner
2026-02-28 15:36   ` [tip: sched/hrtick] " tip-bot2 for Peter Zijlstra
2026-02-25 15:25 ` [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement Peter Zijlstra
2026-02-25 16:02   ` Thomas Gleixner
2026-03-04 15:59 ` Christian Loehle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox