* [PATCH 0/8] hrtimer/sched: Improve hrtick
@ 2025-09-18 7:52 Peter Zijlstra
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
` (7 more replies)
0 siblings, 8 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Hi!
A few patches trying to improve the sched hrtick feature, which is currently
disabled due to overhead.
I wrote these after the last OSPM, but I've been sitting on them because 0-day
is not having a good time with the second-to-last patch. I've not been able to
reproduce :-(
Anyway, the first few patches should be 'good' and hopefully one of you will
spot my 'obvious' fail in that late patch.
For those of you rocking ARM64 systems, if you pick up the generic entry patches:
https://lkml.kernel.org/r/20250916082611.2972008-1-ruanjinjie@huawei.com
this late patch should also 'work' (or not as the case might be) for you.
Patches also here:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/hrtick
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/8] sched: Fix hrtick() vs scheduling context
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
2025-09-19 3:53 ` K Prateek Nayak
` (4 more replies)
2025-09-18 7:52 ` [PATCH 2/8] sched/fair: Limit hrtick work Peter Zijlstra
` (6 subsequent siblings)
7 siblings, 5 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang, jstultz
The sched_class::task_tick() method is called on the donor
sched_class, and sched_tick() hands it rq->donor as argument, which is
consistent.
However, while hrtick() uses the donor sched_class, it then passes
rq->curr, which is inconsistent. Fix it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -875,7 +875,7 @@ static enum hrtimer_restart hrtick(struc
rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->donor->sched_class->task_tick(rq, rq->curr, 1);
+ rq->donor->sched_class->task_tick(rq, rq->donor, 1);
rq_unlock(rq, &rf);
return HRTIMER_NORESTART;
* [PATCH 2/8] sched/fair: Limit hrtick work
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
2025-09-19 14:59 ` K Prateek Nayak
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-09-18 7:52 ` [PATCH 3/8] sched/eevdf: Fix HRTICK duration Peter Zijlstra
` (5 subsequent siblings)
7 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
The task_tick_fair() function does:
- update the hierarchical runtimes
- drive numa-balancing
- update load-balance statistics
- drive force-idle preemption
All but the very first can be limited to the periodic tick. Let hrtick
only update accounting and drive preemption, not load-balancing and
other bits.
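As a rough sketch of the resulting split (illustrative only; the flag names
are made up, only the gating on `queued` mirrors the patch):

```c
#include <stdbool.h>

/*
 * Model of the split this patch makes in task_tick_fair(): the
 * hrtick (queued == 1) only updates runtime accounting and drives
 * preemption; numa balancing and load-balance statistics stay on
 * the periodic tick (queued == 0). Illustrative only.
 */
enum tick_work {
    TICK_ACCOUNTING = 1 << 0,
    TICK_PREEMPTION = 1 << 1,
    TICK_NUMA       = 1 << 2,
    TICK_LB_STATS   = 1 << 3,
};

static int task_tick_work(bool queued)
{
    int work = TICK_ACCOUNTING;     /* entity_tick(): always done */

    if (queued)                     /* hrtick: accounting + preempt */
        return work | TICK_PREEMPTION;

    /* periodic tick: the heavier, rate-limited bits */
    return work | TICK_NUMA | TICK_LB_STATS;
}
```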
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13119,6 +13119,12 @@ static void task_tick_fair(struct rq *rq
entity_tick(cfs_rq, se, queued);
}
+ if (queued) {
+ if (!need_resched())
+ hrtick_start_fair(rq, curr);
+ return;
+ }
+
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
* [PATCH 3/8] sched/eevdf: Fix HRTICK duration
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
2025-09-18 7:52 ` [PATCH 2/8] sched/fair: Limit hrtick work Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
2025-09-19 15:34 ` K Prateek Nayak
2025-09-18 7:52 ` [PATCH 4/8] hrtimer: Optimize __hrtimer_start_range_ns() Peter Zijlstra
` (4 subsequent siblings)
7 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
The nominal duration for an EEVDF task to run is until its deadline,
at which point the deadline is moved ahead and a new task selection is
done.
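As a rough userspace sketch of the computation this patch does
(illustrative only; the weight and utilization inputs below are made-up
values, not kernel state): the distance to the virtual deadline is
converted back into wall time by inverting the vruntime rate, then
stretched by the fraction of CPU eaten by the DL/RT/IRQ classes:

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL

/*
 * Userspace model of the patch's hrtick duration computation.
 * vruntime advances at a NICE_0_LOAD/weight rate, so the wall time
 * until the virtual deadline is weight * vdelta / NICE_0_LOAD,
 * scaled up when other classes consume part of the CPU.
 */
static uint64_t hrtick_duration(uint64_t vruntime, uint64_t deadline,
                                uint64_t weight, uint64_t other_util)
{
    uint64_t vdelta, delta, scale = 1024;

    if ((int64_t)(deadline - vruntime) < 0)
        return 0;               /* past the deadline: resched instead */

    vdelta = deadline - vruntime;
    delta = (weight * vdelta) / NICE_0_LOAD;

    /* other classes eat (other_util / 1024) of the CPU */
    if (other_util && other_util < 1024) {
        scale *= 1024;
        scale /= (1024 - other_util);
    }
    return (scale * delta) / 1024;
}
```

So a nice-0 task with half the CPU stolen by RT/DL/IRQ gets its timer
programmed twice as far out in wall time.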
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6775,21 +6775,39 @@ static inline void sched_fair_update_sto
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
+ unsigned long scale = 1024;
+ unsigned long util = 0;
+ u64 vdelta;
+ u64 delta;
WARN_ON_ONCE(task_rq(p) != rq);
- if (rq->cfs.h_nr_queued > 1) {
- u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
- u64 slice = se->slice;
- s64 delta = slice - ran;
-
- if (delta < 0) {
- if (task_current_donor(rq, p))
- resched_curr(rq);
- return;
- }
- hrtick_start(rq, delta);
+ if (rq->cfs.h_nr_queued <= 1)
+ return;
+
+ /*
+ * Compute time until virtual deadline
+ */
+ vdelta = se->deadline - se->vruntime;
+ if ((s64)vdelta < 0) {
+ if (task_current_donor(rq, p))
+ resched_curr(rq);
+ return;
+ }
+ delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+ /*
+ * Correct for instantaneous load of other classes.
+ */
+ util += cpu_util_dl(rq);
+ util += cpu_util_rt(rq);
+ util += cpu_util_irq(rq);
+ if (util && util < 1024) {
+ scale *= 1024;
+ scale /= (1024 - util);
}
+
+ hrtick_start(rq, (scale * delta) / 1024);
}
/*
* [PATCH 4/8] hrtimer: Optimize __hrtimer_start_range_ns()
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
` (2 preceding siblings ...)
2025-09-18 7:52 ` [PATCH 3/8] sched/eevdf: Fix HRTICK duration Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
2025-09-18 7:52 ` [PATCH 5/8] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK Peter Zijlstra
` (3 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Much like hrtimer_reprogram(), skip programming if the cpu_base is
running the hrtimer interrupt.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/time/hrtimer.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1261,6 +1261,14 @@ static int __hrtimer_start_range_ns(stru
}
first = enqueue_hrtimer(timer, new_base, mode);
+
+ /*
+ * If the hrtimer interrupt is running, then it will reevaluate the
+ * clock bases and reprogram the clock event device.
+ */
+ if (new_base->cpu_base->in_hrtirq)
+ return 0;
+
if (!force_local) {
/*
* If the current CPU base is online, then the timer is
* [PATCH 5/8] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
` (3 preceding siblings ...)
2025-09-18 7:52 ` [PATCH 4/8] hrtimer: Optimize __hrtimer_start_range_ns() Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
2025-09-18 7:52 ` [PATCH 6/8] hrtimer: Re-arrange hrtimer_interrupt() Peter Zijlstra
` (2 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Upon schedule() HRTICK will cancel the current timer, pick the next
task and reprogram the timer. When schedule() consistently triggers
due to blocking conditions instead of the timer, this leads to endless
reprogramming without ever firing.
Mitigate this with a new hrtimer mode: fuzzy (not really happy with
that name); this mode does two things:
- skip reprogramming the hardware on timer remove;
- skip reprogramming the hardware when the new timer
is after cpu_base->expires_next
Both things are already possible;
- removing a remote timer will leave the hardware programmed and
cause a spurious interrupt.
- this remote CPU adding a timer can skip the reprogramming
when the timer's expiration is after the (spurious) expiration.
This new timer mode simply causes more of this 'fuzzy' behaviour; it
causes a few spurious interrupts, but similarly avoids endlessly
reprogramming the timer.
This makes the HRTICK match the NO_HRTICK hackbench runs.
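A minimal userspace model of the fuzzy policy (illustrative only; the
struct and helpers are made up, just the two skip conditions mirror the
patch):

```c
#include <stdint.h>

/*
 * Toy model of the 'fuzzy' reprogramming policy: trade a few spurious
 * interrupts for fewer (expensive) hardware reprograms.
 */
struct fuzzy_base {
    uint64_t programmed;        /* current hardware expiry */
    unsigned int reprograms;    /* hardware writes performed */
};

/* Fuzzy cancel: leave the hardware armed; a spurious fire is fine. */
static void fuzzy_cancel(struct fuzzy_base *b)
{
    (void)b;                    /* no hardware access at all */
}

/* Fuzzy start: only touch hardware if the new timer is earlier. */
static void fuzzy_start(struct fuzzy_base *b, uint64_t expires)
{
    if (expires < b->programmed) {
        b->programmed = expires;
        b->reprograms++;
    }
}
```

The schedule() pattern described above -- cancel followed by a start
with a later expiry -- then costs zero hardware writes.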
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/hrtimer.h | 1 +
include/linux/hrtimer_types.h | 1 +
kernel/sched/core.c | 3 ++-
kernel/time/hrtimer.c | 16 +++++++++++++++-
4 files changed, 19 insertions(+), 2 deletions(-)
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -38,6 +38,7 @@ enum hrtimer_mode {
HRTIMER_MODE_PINNED = 0x02,
HRTIMER_MODE_SOFT = 0x04,
HRTIMER_MODE_HARD = 0x08,
+ HRTIMER_MODE_FUZZY = 0x10,
HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -45,6 +45,7 @@ struct hrtimer {
u8 is_rel;
u8 is_soft;
u8 is_hard;
+ u8 is_fuzzy;
};
#endif /* _LINUX_HRTIMER_TYPES_H */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -928,7 +928,8 @@ void hrtick_start(struct rq *rq, u64 del
static void hrtick_rq_init(struct rq *rq)
{
INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
- hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+ hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL_HARD | HRTIMER_MODE_FUZZY);
}
#else /* !CONFIG_SCHED_HRTICK: */
static inline void hrtick_clear(struct rq *rq)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1122,7 +1122,7 @@ static void __remove_hrtimer(struct hrti
* an superfluous call to hrtimer_force_reprogram() on the
* remote cpu later on if the same timer gets enqueued again.
*/
- if (reprogram && timer == cpu_base->next_timer)
+ if (!timer->is_fuzzy && reprogram && timer == cpu_base->next_timer)
hrtimer_force_reprogram(cpu_base, 1);
}
@@ -1269,6 +1269,19 @@ static int __hrtimer_start_range_ns(stru
if (new_base->cpu_base->in_hrtirq)
return 0;
+ if (timer->is_fuzzy) {
+ /*
+ * XXX fuzzy implies pinned! not sure how to deal with
+ * retrigger_next_event() for the !local case.
+ */
+ WARN_ON_ONCE(!(mode & HRTIMER_MODE_PINNED));
+ /*
+ * Notably, by going into hrtimer_reprogram(), it will
+ * not reprogram if cpu_base->expires_next is earlier.
+ */
+ return first;
+ }
+
if (!force_local) {
/*
* If the current CPU base is online, then the timer is
@@ -1645,6 +1658,7 @@ static void __hrtimer_setup(struct hrtim
base += hrtimer_clockid_to_base(clock_id);
timer->is_soft = softtimer;
timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
+ timer->is_fuzzy = !!(mode & HRTIMER_MODE_FUZZY);
timer->base = &cpu_base->clock_base[base];
timerqueue_init(&timer->node);
* [PATCH 6/8] hrtimer: Re-arrange hrtimer_interrupt()
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
` (4 preceding siblings ...)
2025-09-18 7:52 ` [PATCH 5/8] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
2025-09-18 7:52 ` [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
2025-09-18 7:52 ` [RFC][PATCH 8/8] sched: Default enable HRTICK Peter Zijlstra
7 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Rework hrtimer_interrupt() such that reprogramming is split out into
an independent function at the end of the interrupt.
This prepares for reprogramming getting delayed beyond the end of
hrtimer_interrupt().
Notably, this changes the hang handling to always wait 100ms instead
of trying to keep it proportional to the actual delay. This simplifies
the state; also, this really shouldn't be happening anyway.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/time/hrtimer.c | 87 ++++++++++++++++++++++----------------------------
1 file changed, 39 insertions(+), 48 deletions(-)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1889,6 +1889,29 @@ static __latent_entropy void hrtimer_run
#ifdef CONFIG_HIGH_RES_TIMERS
/*
+ * Very similar to hrtimer_force_reprogram(), except it deals with
+ * in_hrtirq and hang_detected.
+ */
+static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+{
+ ktime_t expires_next = hrtimer_update_next_event(cpu_base);
+
+ cpu_base->expires_next = expires_next;
+ cpu_base->in_hrtirq = 0;
+
+ if (unlikely(cpu_base->hang_detected)) {
+ /*
+ * Give the system a chance to do something else than looping
+ * on hrtimer interrupts.
+ */
+ expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
+ cpu_base->hang_detected = 0;
+ }
+
+ tick_program_event(expires_next, 1);
+}
+
+/*
* High resolution timer interrupt
* Called with interrupts disabled
*/
@@ -1924,63 +1947,31 @@ void hrtimer_interrupt(struct clock_even
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the [soft] next expiry */
- expires_next = hrtimer_update_next_event(cpu_base);
- /*
- * Store the new expiry value so the migration code can verify
- * against it.
- */
- cpu_base->expires_next = expires_next;
- cpu_base->in_hrtirq = 0;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
- /* Reprogramming necessary ? */
- if (!tick_program_event(expires_next, 0)) {
- cpu_base->hang_detected = 0;
- return;
- }
-
/*
* The next timer was already expired due to:
* - tracing
* - long lasting callbacks
* - being scheduled away when running in a VM
*
- * We need to prevent that we loop forever in the hrtimer
- * interrupt routine. We give it 3 attempts to avoid
- * overreacting on some spurious event.
- *
- * Acquire base lock for updating the offsets and retrieving
- * the current time.
+ * We need to prevent that we loop forever in the hrtimer interrupt
+ * routine. We give it 3 attempts to avoid overreacting on some
+ * spurious event.
*/
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
+ expires_next = hrtimer_update_next_event(cpu_base);
now = hrtimer_update_base(cpu_base);
- cpu_base->nr_retries++;
- if (++retries < 3)
- goto retry;
- /*
- * Give the system a chance to do something else than looping
- * here. We stored the entry time, so we know exactly how long
- * we spent here. We schedule the next event this amount of
- * time away.
- */
- cpu_base->nr_hangs++;
- cpu_base->hang_detected = 1;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ if (expires_next < now) {
+ if (++retries < 3)
+ goto retry;
+
+ delta = ktime_sub(now, entry_time);
+ cpu_base->max_hang_time = max_t(unsigned int,
+ cpu_base->max_hang_time, delta);
+ cpu_base->nr_hangs++;
+ cpu_base->hang_detected = 1;
+ }
- delta = ktime_sub(now, entry_time);
- if ((unsigned int)delta > cpu_base->max_hang_time)
- cpu_base->max_hang_time = (unsigned int) delta;
- /*
- * Limit it to a sensible value as we enforce a longer
- * delay. Give the CPU at least 100ms to catch up.
- */
- if (delta > 100 * NSEC_PER_MSEC)
- expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
- else
- expires_next = ktime_add(now, delta);
- tick_program_event(expires_next, 1);
- pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+ __hrtimer_rearm(cpu_base, now);
+ raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
}
#endif /* !CONFIG_HIGH_RES_TIMERS */
* [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
` (5 preceding siblings ...)
2025-09-18 7:52 ` [PATCH 6/8] hrtimer: Re-arrange hrtimer_interrupt() Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
2025-09-20 9:29 ` Thomas Gleixner
2025-09-18 7:52 ` [RFC][PATCH 8/8] sched: Default enable HRTICK Peter Zijlstra
7 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Currently hrtimer_interrupt() runs expired timers, which can re-arm
themselves, after which it computes the next expiration time and
re-programs the hardware.
However, things like HRTICK, a highres timer driving preemption,
cannot re-arm itself at the point of running, since the next task has
not been determined yet. The schedule() in the interrupt return path
will switch to the next task, which then causes a new hrtimer to be
programmed.
This then results in reprogramming the hardware at least twice, once
after running the timers, and once upon selecting the new task.
Notably, *both* events happen in the interrupt.
By pushing the hrtimer reprogram all the way into the interrupt return
path, it runs after schedule() and this double reprogram can be
avoided.
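A counting sketch of the difference (illustrative only; the real
mechanism is the TIF_HRTIMER_REARM flag consumed on the irqentry_exit()
paths, not these toy functions):

```c
/*
 * Model of the double-reprogram this patch avoids. In the eager scheme
 * the hardware is written once after expiring timers and again when
 * schedule() arms the next HRTICK; deferring the rearm to the
 * interrupt return path folds both into one write.
 */
struct irq_model {
    int hw_writes;              /* clock event device programs */
    int rearm_pending;          /* models TIF_HRTIMER_REARM */
};

static void eager_irq(struct irq_model *m)
{
    m->hw_writes++;             /* rearm after running timers */
    m->hw_writes++;             /* rearm again after schedule() */
}

static void deferred_irq(struct irq_model *m)
{
    m->rearm_pending = 1;       /* hrtimer_interrupt() sets the flag */
    /* ... schedule() picks the next task and queues its HRTICK ... */
    if (m->rearm_pending) {     /* irq exit: one combined write */
        m->hw_writes++;
        m->rearm_pending = 0;
    }
}
```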
XXX: 0-day is unhappy with this patch -- it is reporting lockups that
very much look like a timer goes missing. Am unable to reproduce.
Notable: the lockup goes away when the workloads are run without perf
monitors.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/asm-generic/thread_info_tif.h | 5 ++++-
include/linux/hrtimer.h | 17 +++++++++++++++++
kernel/entry/common.c | 7 +++++++
kernel/sched/core.c | 6 ++++++
kernel/time/hrtimer.c | 28 ++++++++++++++++++++++++----
5 files changed, 58 insertions(+), 5 deletions(-)
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -41,8 +41,11 @@
#define _TIF_PATCH_PENDING BIT(TIF_PATCH_PENDING)
#ifdef HAVE_TIF_RESTORE_SIGMASK
-# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal() */
+# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal()
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
+#define TIF_HRTIMER_REARM 11 // re-arm the timer
+#define _TIF_HRTIMER_REARM BIT(TIF_HRTIMER_REARM)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -175,10 +175,27 @@ extern void hrtimer_interrupt(struct clo
extern unsigned int hrtimer_resolution;
+#ifdef TIF_HRTIMER_REARM
+extern void _hrtimer_rearm(void);
+/*
+ * This is to be called on all irqentry_exit() paths; as well as in the context
+ * switch path before switch_to().
+ */
+static inline void hrtimer_rearm(void)
+{
+ if (test_thread_flag(TIF_HRTIMER_REARM))
+ _hrtimer_rearm();
+}
+#else
+static inline void hrtimer_rearm(void) { }
+#endif /* TIF_HRTIMER_REARM */
+
#else
#define hrtimer_resolution (unsigned int)LOW_RES_NSEC
+static inline void hrtimer_rearm(void) { }
+
#endif
static inline ktime_t
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -7,6 +7,7 @@
#include <linux/kmsan.h>
#include <linux/livepatch.h>
#include <linux/tick.h>
+#include <linux/hrtimer.h>
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
@@ -71,6 +72,7 @@ noinstr void irqentry_exit_to_user_mode(
{
instrumentation_begin();
exit_to_user_mode_prepare(regs);
+ hrtimer_rearm();
instrumentation_end();
exit_to_user_mode();
}
@@ -183,6 +185,7 @@ noinstr void irqentry_exit(struct pt_reg
*/
if (state.exit_rcu) {
instrumentation_begin();
+ hrtimer_rearm();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare();
@@ -196,10 +199,14 @@ noinstr void irqentry_exit(struct pt_reg
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
+ hrtimer_rearm();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
} else {
+ instrumentation_begin();
+ hrtimer_rearm();
+ instrumentation_end();
/*
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5161,6 +5161,12 @@ prepare_task_switch(struct rq *rq, struc
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
+ /*
+ * Notably, this must be called after pick_next_task() but before
+ * switch_to(), since the new task need not be on the return from
+ * interrupt path.
+ */
+ hrtimer_rearm();
prepare_arch_switch(next);
}
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1892,10 +1892,9 @@ static __latent_entropy void hrtimer_run
* Very similar to hrtimer_force_reprogram(), except it deals with
* in_hrtirq and hang_detected.
*/
-static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+static void __hrtimer_rearm(struct hrtimer_cpu_base *cpu_base,
+ ktime_t now, ktime_t expires_next)
{
- ktime_t expires_next = hrtimer_update_next_event(cpu_base);
-
cpu_base->expires_next = expires_next;
cpu_base->in_hrtirq = 0;
@@ -1970,9 +1969,30 @@ void hrtimer_interrupt(struct clock_even
cpu_base->hang_detected = 1;
}
- __hrtimer_rearm(cpu_base, now);
+#ifdef TIF_HRTIMER_REARM
+ set_thread_flag(TIF_HRTIMER_REARM);
+#else
+ __hrtimer_rearm(cpu_base, now, expires_next);
+#endif
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
}
+
+#ifdef TIF_HRTIMER_REARM
+void _hrtimer_rearm(void)
+{
+ struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+ ktime_t now, expires_next;
+
+ lockdep_assert_irqs_disabled();
+
+ scoped_guard (raw_spinlock, &cpu_base->lock) {
+ now = hrtimer_update_base(cpu_base);
+ expires_next = hrtimer_update_next_event(cpu_base);
+ __hrtimer_rearm(cpu_base, now, expires_next);
+ clear_thread_flag(TIF_HRTIMER_REARM);
+ }
+}
+#endif /* TIF_HRTIMER_REARM */
#endif /* !CONFIG_HIGH_RES_TIMERS */
/*
* [RFC][PATCH 8/8] sched: Default enable HRTICK
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
` (6 preceding siblings ...)
2025-09-18 7:52 ` [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
@ 2025-09-18 7:52 ` Peter Zijlstra
7 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-18 7:52 UTC (permalink / raw)
To: tglx
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
For the robots.. let us find regressions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/features.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,8 @@ SCHED_FEAT(DELAY_ZERO, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
-SCHED_FEAT(HRTICK, false)
-SCHED_FEAT(HRTICK_DL, false)
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
/*
* Decrement CPU capacity based on time not spent running tasks
* Re: [PATCH 1/8] sched: Fix hrtick() vs scheduling context
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
@ 2025-09-19 3:53 ` K Prateek Nayak
2025-09-23 0:24 ` John Stultz
` (3 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: K Prateek Nayak @ 2025-09-19 3:53 UTC (permalink / raw)
To: Peter Zijlstra, tglx
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang, jstultz
Hello Peter,
On 9/18/2025 1:22 PM, Peter Zijlstra wrote:
> The sched_class::task_tick() method is called on the donor
> sched_class, and sched_tick() hands it rq->donor as argument, which is
> consistent.
>
> However, while hrtick() uses the donor sched_class, it then passes
> rq->curr, which is inconsistent. Fix it.
Can we add either a:
Fixes: 7de9d4f94638 ("sched: Start blocked_on chain processing in find_proxy_task()")
where this starts making a difference functionally since single CPU
proxy can have rq->curr != rq->donor, or we can target the same commit
where the task_tick() call in sched_tick() was updated with:
Fixes: af0c8b2bf67b ("sched: Split scheduler and execution contexts")
Other than that, this looks good to me. Feel free to include:
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -875,7 +875,7 @@ static enum hrtimer_restart hrtick(struc
>
> rq_lock(rq, &rf);
> update_rq_clock(rq);
> - rq->donor->sched_class->task_tick(rq, rq->curr, 1);
> + rq->donor->sched_class->task_tick(rq, rq->donor, 1);
> rq_unlock(rq, &rf);
>
> return HRTIMER_NORESTART;
>
>
>
--
Thanks and Regards,
Prateek
* Re: [PATCH 2/8] sched/fair: Limit hrtick work
2025-09-18 7:52 ` [PATCH 2/8] sched/fair: Limit hrtick work Peter Zijlstra
@ 2025-09-19 14:59 ` K Prateek Nayak
2025-11-28 8:25 ` Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 22+ messages in thread
From: K Prateek Nayak @ 2025-09-19 14:59 UTC (permalink / raw)
To: Peter Zijlstra, tglx
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Hello Peter,
On 9/18/2025 1:22 PM, Peter Zijlstra wrote:
> @@ -13119,6 +13119,12 @@ static void task_tick_fair(struct rq *rq
> entity_tick(cfs_rq, se, queued);
> }
>
> + if (queued) {
> + if (!need_resched())
> + hrtick_start_fair(rq, curr);
Do we need a hrtick_start_fair() here? Queued tick will always do a
resched_curr_lazy() - if another HRTICK fires before the next tick,
all it'll do is resched_curr_lazy() again and the next opportunity to
resched is either exit to userspace or the periodic tick firing and
promoting that LAZY to a full NEED_RESCHED.
The early return does make sense.
> + return;
> + }
> +
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
--
Thanks and Regards,
Prateek
* Re: [PATCH 3/8] sched/eevdf: Fix HRTICK duration
2025-09-18 7:52 ` [PATCH 3/8] sched/eevdf: Fix HRTICK duration Peter Zijlstra
@ 2025-09-19 15:34 ` K Prateek Nayak
2025-11-28 8:32 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: K Prateek Nayak @ 2025-09-19 15:34 UTC (permalink / raw)
To: Peter Zijlstra, tglx
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
Hello Peter,
On 9/18/2025 1:22 PM, Peter Zijlstra wrote:
> + /*
> + * Compute time until virtual deadline
> + */
> + vdelta = se->deadline - se->vruntime;
> + if ((s64)vdelta < 0) {
> + if (task_current_donor(rq, p))
> + resched_curr(rq);
Why the task_current_donor() check? If the scheduling context has run
out of gas, shouldn't we reschedule curr even if we were proxied?
> + return;
> + }
> + delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> +
> + /*
> + * Correct for instantaneous load of other classes.
> + */
> + util += cpu_util_dl(rq);
> + util += cpu_util_rt(rq);
> + util += cpu_util_irq(rq);
> + if (util && util < 1024) {
> + scale *= 1024;
> + scale /= (1024 - util);
> }
Could it be possible that we arrive here from the dl_server's pick and
end up inflating the HRTICK duration despite having an uninterrupted
period for fair tasks ahead?
> +
> + hrtick_start(rq, (scale * delta) / 1024);
> }
>
> /*
--
Thanks and Regards,
Prateek
* Re: [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path
2025-09-18 7:52 ` [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
@ 2025-09-20 9:29 ` Thomas Gleixner
2025-09-23 7:52 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2025-09-20 9:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: arnd, anna-maria, frederic, peterz, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Thu, Sep 18 2025 at 09:52, Peter Zijlstra wrote:
> Currently hrtimer_interrupt() runs expired timers, which can re-arm
> themselves, after which it computes the next expiration time and
> re-programs the hardware.
>
> However, things like HRTICK, a highres timer driving preemption,
> cannot re-arm itself at the point of running, since the next task has
> not been determined yet. The schedule() in the interrupt return path
> will switch to the next task, which then causes a new hrtimer to be
> programmed.
>
> This then results in reprogramming the hardware at least twice, once
> after running the timers, and once upon selecting the new task.
>
> Notably, *both* events happen in the interrupt.
>
> By pushing the hrtimer reprogram all the way into the interrupt return
> path, it runs after schedule() and this double reprogram can be
> avoided.
>
> XXX: 0-day is unhappy with this patch -- it is reporting lockups that
> very much look like a timer goes missing. Am unable to reproduce.
> Notable: the lockup goes away when the workloads are ran without perf
> monitors.
After staring at it for a while, I have two observations.
1) In the 0-day report the lockup detector triggers on a spinlock
contention in futex_wait_setup()
I'm not really seeing how that's related to a missing timer.
Without knowing what the other CPUs are doing and what holds the
lock, it's pretty much impossible to tell what the hell is going on.
So that might need a back trace triggered on all CPUs and perhaps
some debug output in the backtrace about the hrtimer state.
On the CPU where the lockup is detected, the timer is working.
2) I came up with the following scenario, which is broken with this
delayed rearm.
Assume this happens on the timekeeping CPU.
hrtimer_interrupt()
expire_timers();
set(TIF_REARM);
exit_to_user_mode_prepare()
handle_tif_muck()
...
to = jiffies + 2;
while (!cond() && time_before(jiffies, to))
relax();
If cond() does not become true for whatever reason, then this won't
make progress ever because the tick hrtimer which increments
jiffies is not happening.
It can also be a wait on a remote CPU preventing progress
indirectly or a subtle dependency on a timer (timer list or
hrtimer) to expire.
I have no idea whether that's related to the reported 0-day fallout,
but it definitely is a real problem lurking in the dark.
Thanks,
tglx
* Re: [PATCH 1/8] sched: Fix hrtick() vs scheduling context
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
2025-09-19 3:53 ` K Prateek Nayak
@ 2025-09-23 0:24 ` John Stultz
2025-12-03 18:25 ` [tip: sched/urgent] sched/hrtick: Fix hrtick() vs. " tip-bot2 for Peter Zijlstra
` (2 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: John Stultz @ 2025-09-23 0:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Thu, Sep 18, 2025 at 1:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The sched_class::task_tick() method is called on the donor
> sched_class, and sched_tick() hands it rq->donor as argument, which is
> consistent.
>
> However, while hrtick() uses the donor sched_class, it then passes
> rq->curr, which is inconsistent. Fix it.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -875,7 +875,7 @@ static enum hrtimer_restart hrtick(struc
>
> rq_lock(rq, &rf);
> update_rq_clock(rq);
> - rq->donor->sched_class->task_tick(rq, rq->curr, 1);
> + rq->donor->sched_class->task_tick(rq, rq->donor, 1);
> rq_unlock(rq, &rf);
Ah. Thanks for catching this! I've run through with some stress
testing on this and haven't seen any problems so far.
Acked-by: John Stultz <jstultz@google.com>
thanks
-john
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path
2025-09-20 9:29 ` Thomas Gleixner
@ 2025-09-23 7:52 ` Peter Zijlstra
2025-09-23 8:18 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-23 7:52 UTC (permalink / raw)
To: Thomas Gleixner
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Sat, Sep 20, 2025 at 11:29:43AM +0200, Thomas Gleixner wrote:
> On Thu, Sep 18 2025 at 09:52, Peter Zijlstra wrote:
> > Currently hrtimer_interrupt() runs expired timers, which can re-arm
> > themselves, after which it computes the next expiration time and
> > re-programs the hardware.
> >
> > However, things like HRTICK, a highres timer driving preemption,
> > cannot re-arm itself at the point of running, since the next task has
> > not been determined yet. The schedule() in the interrupt return path
> > will switch to the next task, which then causes a new hrtimer to be
> > programmed.
> >
> > This then results in reprogramming the hardware at least twice, once
> > after running the timers, and once upon selecting the new task.
> >
> > Notably, *both* events happen in the interrupt.
> >
> > By pushing the hrtimer reprogram all the way into the interrupt return
> > path, it runs after schedule() and this double reprogram can be
> > avoided.
> >
> > XXX: 0-day is unhappy with this patch -- it is reporting lockups that
> > very much look like a timer goes missing. Am unable to reproduce.
> > > Notable: the lockup goes away when the workloads are run without perf
> > monitors.
>
> After staring at it for a while, I have two observations.
>
> 1) In the 0-day report the lockup detector triggers on a spinlock
> contention in futex_wait_setup()
>
> I'm not really seeing how that's related to a missing timer.
>
> Without knowing what the other CPUs are doing and what holds the
> lock, it's pretty much impossible to tell what the hell is going on.
>
> So that might need a back trace triggered on all CPUs and perhaps
> some debug output in the backtrace about the hrtimer state.
>
> On the CPU where the lockup is detected, the timer is working.
Fair enough; I was thinking it got stuck on a missing timeout, but
indeed, that needs verifying.
> 2) I came up with the following scenario, which is broken with this
> delayed rearm.
>
> Assume this happens on the timekeeping CPU.
>
> hrtimer_interrupt()
> expire_timers();
> set(TIF_REARM);
>
> exit_to_user_mode_prepare()
> handle_tif_muck()
> ...
> to = jiffies + 2;
> while (!cond() && time_before(jiffies, to))
> relax();
>
> If cond() does not become true for whatever reason, then this won't
> make progress ever because the tick hrtimer which increments
> jiffies is not happening.
>
> It can also be a wait on a remote CPU preventing progress
> indirectly or a subtle dependency on a timer (timer list or
> hrtimer) to expire.
>
> I have no idea whether that's related to the reported 0-day fallout,
> but it definitely is a real problem lurking in the dark.
Argh... that exit_to_user_mode_loop() thing enables IRQs. Yes, buggered
something mighty.
Let me haz a poke.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path
2025-09-23 7:52 ` Peter Zijlstra
@ 2025-09-23 8:18 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-09-23 8:18 UTC (permalink / raw)
To: Thomas Gleixner
Cc: arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Tue, Sep 23, 2025 at 09:52:40AM +0200, Peter Zijlstra wrote:
> > 2) I came up with the following scenario, which is broken with this
> > delayed rearm.
> >
> > Assume this happens on the timekeeping CPU.
> >
> > hrtimer_interrupt()
> > expire_timers();
> > set(TIF_REARM);
> >
> > exit_to_user_mode_prepare()
> > handle_tif_muck()
> > ...
> > to = jiffies + 2;
> > while (!cond() && time_before(jiffies, to))
> > relax();
> >
> > If cond() does not become true for whatever reason, then this won't
> > make progress ever because the tick hrtimer which increments
> > jiffies is not happening.
> >
> > It can also be a wait on a remote CPU preventing progress
> > indirectly or a subtle dependency on a timer (timer list or
> > hrtimer) to expire.
> >
> > I have no idea whether that's related to the reported 0-day fallout,
> > but it definitely is a real problem lurking in the dark.
>
> Argh... that exit_to_user_mode_loop() thing enables IRQs. Yes, buggered
> something mighty.
>
> Let me haz a poke.
Bah. So schedule() is first in the TIF loop. Delaying hrtimer_rearm()
until that first schedule() call might just be enough, but that also
means running all of sched_submit_work() without timers... it might just
work, but urgh.
Let me try that anyway. I'll push it out to the robot, we'll see what
happens.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/8] sched/fair: Limit hrtick work
2025-09-19 14:59 ` K Prateek Nayak
@ 2025-11-28 8:25 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-11-28 8:25 UTC (permalink / raw)
To: K Prateek Nayak
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Fri, Sep 19, 2025 at 08:29:09PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 9/18/2025 1:22 PM, Peter Zijlstra wrote:
> > @@ -13119,6 +13119,12 @@ static void task_tick_fair(struct rq *rq
> > entity_tick(cfs_rq, se, queued);
> > }
> >
> > + if (queued) {
> > + if (!need_resched())
> > + hrtick_start_fair(rq, curr);
>
> Do we need a hrtick_start_fair() here? Queued tick will always do a
> resched_curr_lazy() - if another HRTICK fires before the next tick,
> all it'll do is resched_curr_lazy() again and the next opportunity to
> resched is either exit to userspace or the periodic tick firing and
> promoting that LAZY to a full NEED_RESCHED.
I think I had a version where entity_tick() doesn't force need_resched
on queue, and in that case the timer, which is wallclock, and
update_curr(), which is task_clock, might disagree and we might not have
reached the deadline, and so we need to try again.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/8] sched/eevdf: Fix HRTICK duration
2025-09-19 15:34 ` K Prateek Nayak
@ 2025-11-28 8:32 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-11-28 8:32 UTC (permalink / raw)
To: K Prateek Nayak
Cc: tglx, arnd, anna-maria, frederic, luto, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, oliver.sang
On Fri, Sep 19, 2025 at 09:04:02PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 9/18/2025 1:22 PM, Peter Zijlstra wrote:
> > + /*
> > + * Compute time until virtual deadline
> > + */
> > + vdelta = se->deadline - se->vruntime;
> > + if ((s64)vdelta < 0) {
> > + if (task_current_donor(rq, p))
> > + resched_curr(rq);
>
> Why the task_current_donor() check? If the scheduling context has run
> out of gas, shouldn't we reschedule curr even if we were proxied?
task_current_donor() is the current scheduling context, right? So we've
just determined that vruntime is ahead of deadline, which means we
should reschedule now.
> > + return;
> > + }
> > + delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> > +
> > + /*
> > + * Correct for instantaneous load of other classes.
> > + */
> > + util += cpu_util_dl(rq);
> > + util += cpu_util_rt(rq);
> > + util += cpu_util_irq(rq);
> > + if (util && util < 1024) {
> > + scale *= 1024;
> > + scale /= (1024 - util);
> > }
>
> Could it be possible that we arrive here from the dl_server's pick and
> end up inflating the HRTICK duration despite having an uninterrupted
> period for fair tasks ahead?
Yes, but since this is all approximation anyway, how many correction
terms do we want to stack on top? :-)
> > + hrtick_start(rq, (scale * delta) / 1024);
> > }
> >
> > /*
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [tip: sched/urgent] sched/hrtick: Fix hrtick() vs. scheduling context
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
2025-09-19 3:53 ` K Prateek Nayak
2025-09-23 0:24 ` John Stultz
@ 2025-12-03 18:25 ` tip-bot2 for Peter Zijlstra
2025-12-03 18:31 ` tip-bot2 for Peter Zijlstra
2025-12-06 9:10 ` tip-bot2 for Peter Zijlstra
4 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-03 18:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ingo Molnar, John Stultz, x86,
linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 8720ba2d028f1aff08a55d8fe1a124dd5a6cfb0a
Gitweb: https://git.kernel.org/tip/8720ba2d028f1aff08a55d8fe1a124dd5a6cfb0a
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 01 Sep 2025 22:46:29 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 02 Dec 2025 15:37:52 +01:00
sched/hrtick: Fix hrtick() vs. scheduling context
The sched_class::task_tick() method is called on the donor
sched_class, and sched_tick() hands it rq->donor as argument,
which is consistent.
However, while hrtick() uses the donor sched_class, it then passes
rq->curr, which is inconsistent. Fix it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20250918080205.442967033@infradead.org
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7dfb6a9..be55f95 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -878,7 +878,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->donor->sched_class->task_tick(rq, rq->curr, 1);
+ rq->donor->sched_class->task_tick(rq, rq->donor, 1);
rq_unlock(rq, &rf);
return HRTIMER_NORESTART;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip: sched/urgent] sched/hrtick: Fix hrtick() vs. scheduling context
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
` (2 preceding siblings ...)
2025-12-03 18:25 ` [tip: sched/urgent] sched/hrtick: Fix hrtick() vs. " tip-bot2 for Peter Zijlstra
@ 2025-12-03 18:31 ` tip-bot2 for Peter Zijlstra
2025-12-06 9:10 ` tip-bot2 for Peter Zijlstra
4 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-03 18:31 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ingo Molnar, John Stultz, x86,
linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 40671f3f91986844df8947b65b9af5e770752047
Gitweb: https://git.kernel.org/tip/40671f3f91986844df8947b65b9af5e770752047
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 01 Sep 2025 22:46:29 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 03 Dec 2025 19:26:00 +01:00
sched/hrtick: Fix hrtick() vs. scheduling context
The sched_class::task_tick() method is called on the donor
sched_class, and sched_tick() hands it rq->donor as argument,
which is consistent.
However, while hrtick() uses the donor sched_class, it then passes
rq->curr, which is inconsistent. Fix it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20250918080205.442967033@infradead.org
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc358c1..1711e9e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -878,7 +878,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->donor->sched_class->task_tick(rq, rq->curr, 1);
+ rq->donor->sched_class->task_tick(rq, rq->donor, 1);
rq_unlock(rq, &rf);
return HRTIMER_NORESTART;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip: sched/urgent] sched/hrtick: Fix hrtick() vs. scheduling context
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
` (3 preceding siblings ...)
2025-12-03 18:31 ` tip-bot2 for Peter Zijlstra
@ 2025-12-06 9:10 ` tip-bot2 for Peter Zijlstra
4 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-06 9:10 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ingo Molnar, John Stultz, x86,
linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: e38e5299747b23015b00b0109891815db44a2f30
Gitweb: https://git.kernel.org/tip/e38e5299747b23015b00b0109891815db44a2f30
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 01 Sep 2025 22:46:29 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Dec 2025 10:03:13 +01:00
sched/hrtick: Fix hrtick() vs. scheduling context
The sched_class::task_tick() method is called on the donor
sched_class, and sched_tick() hands it rq->donor as argument,
which is consistent.
However, while hrtick() uses the donor sched_class, it then passes
rq->curr, which is inconsistent. Fix it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20250918080205.442967033@infradead.org
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc358c1..1711e9e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -878,7 +878,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
rq_lock(rq, &rf);
update_rq_clock(rq);
- rq->donor->sched_class->task_tick(rq, rq->curr, 1);
+ rq->donor->sched_class->task_tick(rq, rq->donor, 1);
rq_unlock(rq, &rf);
return HRTIMER_NORESTART;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip: sched/core] sched/fair: Limit hrtick work
2025-09-18 7:52 ` [PATCH 2/8] sched/fair: Limit hrtick work Peter Zijlstra
2025-09-19 14:59 ` K Prateek Nayak
@ 2025-12-14 7:46 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-14 7:46 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 95a0155224a658965f34ed4b1943b238d9be1fea
Gitweb: https://git.kernel.org/tip/95a0155224a658965f34ed4b1943b238d9be1fea
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 01 Sep 2025 22:50:56 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 14 Dec 2025 08:25:02 +01:00
sched/fair: Limit hrtick work
The task_tick_fair() function does:
- update the hierarchical runtimes
- drive NUMA-balancing
- update load-balance statistics
- drive force-idle preemption
All but the very first can be limited to the periodic tick. Let hrtick
only update accounting and drive preemption, not load-balancing and
other bits.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20250918080205.563385766@infradead.org
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 496a30a..f79951f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13332,6 +13332,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
entity_tick(cfs_rq, se, queued);
}
+ if (queued) {
+ if (!need_resched())
+ hrtick_start_fair(rq, curr);
+ return;
+ }
+
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
^ permalink raw reply related [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-12-14 7:46 UTC | newest]
Thread overview: 22+ messages
2025-09-18 7:52 [PATCH 0/8] hrtimer/sched: Improve hrtick Peter Zijlstra
2025-09-18 7:52 ` [PATCH 1/8] sched: Fix hrtick() vs scheduling context Peter Zijlstra
2025-09-19 3:53 ` K Prateek Nayak
2025-09-23 0:24 ` John Stultz
2025-12-03 18:25 ` [tip: sched/urgent] sched/hrtick: Fix hrtick() vs. " tip-bot2 for Peter Zijlstra
2025-12-03 18:31 ` tip-bot2 for Peter Zijlstra
2025-12-06 9:10 ` tip-bot2 for Peter Zijlstra
2025-09-18 7:52 ` [PATCH 2/8] sched/fair: Limit hrtick work Peter Zijlstra
2025-09-19 14:59 ` K Prateek Nayak
2025-11-28 8:25 ` Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-09-18 7:52 ` [PATCH 3/8] sched/eevdf: Fix HRTICK duration Peter Zijlstra
2025-09-19 15:34 ` K Prateek Nayak
2025-11-28 8:32 ` Peter Zijlstra
2025-09-18 7:52 ` [PATCH 4/8] hrtimer: Optimize __hrtimer_start_range_ns() Peter Zijlstra
2025-09-18 7:52 ` [PATCH 5/8] hrtimer,sched: Add fuzzy hrtimer mode for HRTICK Peter Zijlstra
2025-09-18 7:52 ` [PATCH 6/8] hrtimer: Re-arrange hrtimer_interrupt() Peter Zijlstra
2025-09-18 7:52 ` [RFC][PATCH 7/8] entry,hrtimer: Push reprogramming timers into the interrupt return path Peter Zijlstra
2025-09-20 9:29 ` Thomas Gleixner
2025-09-23 7:52 ` Peter Zijlstra
2025-09-23 8:18 ` Peter Zijlstra
2025-09-18 7:52 ` [RFC][PATCH 8/8] sched: Default enable HRTICK Peter Zijlstra