Date: Tue, 24 Feb 2026 17:35:12 +0100
Message-ID: <20260224163022.795809588@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Mel Gorman, Valentin Schneider, x86@kernel.org,
    Peter Zijlstra, Frederic Weisbecker, Eric Dumazet
Subject: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement

Peter recently posted a series tweaking the hrtimer subsystem to reduce
the overhead of the scheduler hrtick timer so it can be enabled by
default:

    https://lore.kernel.org/20260121162010.647043073@infradead.org

That turned out to be incomplete and led to a deeper investigation of
the related bits and pieces.

The problem is that the hrtick deadline changes on every context switch
and is also modified by wakeups and balancing. On a hackbench run this
results in about 2500 clockevent reprogramming cycles per second, which
is especially hurtful in a VM as accessing the clockevent device implies
a VM exit.

The following series addresses various aspects of the overall related
problem space:

1) Scheduler

   Aside from some trivial fixes, the handling of the hrtick timer in
   the scheduler is suboptimal:

   - schedule() modifies the hrtick when picking the next task

   - schedule() can modify the hrtick again when the balance callback
     runs before releasing rq->lock

   - the expiry time is unfiltered, which can result in really tiny
     changes of the expiry time that are functionally completely
     irrelevant

   Solved by deferring the hrtick update to the end of schedule() and
   filtering out tiny changes.
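For illustration only, the filtering idea can be sketched as below. The
function name and the slack value are made up for this sketch and are
not taken from the series; the point is merely that an expiry update
within a small slack window is treated as a no-op instead of triggering
clockevent reprogramming:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slack threshold; the actual value is a tuning decision */
#define HRTICK_SLACK_NS	10000ULL

/*
 * Treat a new hrtick expiry as a real change only when it moves by
 * more than the slack threshold, so functionally irrelevant
 * nanosecond-level updates do not cause reprogramming.
 */
static bool hrtick_expiry_changed(uint64_t old_ns, uint64_t new_ns)
{
	uint64_t delta = new_ns > old_ns ? new_ns - old_ns
					 : old_ns - new_ns;

	return delta > HRTICK_SLACK_NS;
}
```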
2) Clocksource, clockevents, timekeeping

   - Reading the current clocksource involves an indirect call, which
     is expensive, especially for clocksources where the actual read is
     a single instruction like the TSC read on x86.

     This could be solved with a static call, but the architecture
     coverage for static calls is meager, and a static call still has
     the overhead of a function call and, in the worst case, a return
     speculation mitigation. As x86 and other architectures like s390
     have one preferred clocksource which is normally used on all
     contemporary systems, this calls for a fully inlined solution.

     This is achieved by a config option which tells the core code to
     use an architecture provided inline read guarded by a static
     branch. If the branch is disabled, the indirect function call is
     used as before. If it is enabled, the inlined read is used. The
     branch is disabled by default and only enabled after a clocksource
     with the INLINE feature flag set has been installed. When the
     clocksource is replaced, the branch is disabled before the
     clocksource change happens.

   - Programming clock events is based on calculating a relative expiry
     time, converting it to the clock cycles corresponding to the
     clockevent device frequency, and invoking the set_next_event()
     callback of the clockevent device.

     That works perfectly fine as most hardware timers are count-down
     implementations which require a relative time for programming. But
     clockevent devices which are coupled to the clocksource and
     provide a less-than-or-equal comparator suffer from this scheme:
     the core calculates the relative expiry time based on a clock
     read, and the set_next_event() callback has to read the same clock
     again to convert the relative time back to an absolute time which
     can be programmed into the comparator.

     The other issue is that the conversion factor of the clockevent
     device is calculated at boot time and does not take the NTP/PTP
     adjustments of the clocksource frequency into account.
     Depending on the direction of the adjustment, this can cause
     timers to fire early or late. Early is the more problematic case
     as the timer interrupt has to reprogram the device with a very
     short delta because it cannot expire timers early.

     This can be optimized by introducing a 'coupled' mode for the
     clocksource and the clockevent device:

     A) If the clocksource indicates support for 'coupled' mode, the
        timekeeping core calculates an (NTP adjusted) reverse
        conversion factor from the clocksource-to-nanoseconds
        conversion. This takes NTP adjustments into account and keeps
        the conversions in sync.

     B) The timekeeping core provides a function to convert an absolute
        CLOCK_MONOTONIC expiry time into an absolute time in
        clocksource cycles, which can be programmed directly into the
        comparator without reading the clocksource at all.

        This is possible because timekeeping keeps a pair of the base
        cycle count and the corresponding CLOCK_MONOTONIC base time
        from the last update of the timekeeper. The absolute cycle time
        can therefore be calculated by taking the delta between the
        expiry time and the CLOCK_MONOTONIC base time, converting that
        delta into cycles with the help of #A, and adding the base
        cycle count, i.e. roughly:

            cycles_abs = cycles_base + to_cycles(expiry - mono_base)

        Pure math, no hardware access.

     C) The clockevent reprogramming code invokes this conversion
        function when the clockevent device indicates 'coupled' mode.
        The function returns false when the corresponding clocksource
        is not the current system clocksource (based on a clocksource
        ID check), and true if the clocksource matches and the
        conversion succeeded. If false, the regular relative
        set_next_event() mechanism is used; otherwise a new
        set_next_coupled() callback is invoked, which takes the
        calculated absolute expiry time as argument. Similar to the
        clocksource read, this new callback can optionally be inlined.

3) hrtimers

   It turned out that the hrtimer code needed a long overdue spring
   cleaning independent of the problem at hand.
   That was conducted before tackling the actual performance issues:

   - Timer locality

     The handling of timer locality is suboptimal and often results in
     pointless invocations of switch_hrtimer_base() which end up
     keeping the CPU base unchanged. Aside from the pointless overhead,
     this prevents further optimizations for the common local case.

     Addressed by improving the decision logic for keeping the clock
     base local and by splitting out the (re)arm handling into a
     unified operation.

   - Evaluation of the clock base expiries

     The clock bases (MONOTONIC, REALTIME, BOOT, TAI) cache the first
     expiring timer, but not the corresponding expiry time, which means
     that a re-evaluation of the clock bases for the next expiring
     timer on the CPU has to touch up to four extra cache lines.
     Trivial to solve by caching the earliest expiry time in the clock
     base itself.

   - Reprogramming of the clock event device

     The hrtimer interrupt already defers reprogramming until the
     interrupt handler completes, but in the case of the hrtick timer
     that is not sufficient: the hrtick timer callback only sets the
     NEED_RESCHED flag and has no information about the next hrtick
     expiry time, which can only be determined in the scheduler.

     Expand the deferred reprogramming so that it can ideally be
     handled in the subsequent schedule(), after the new hrtick value
     has been established. If there is no schedule(), if soft
     interrupts have to be processed on return from interrupt, or if a
     nested interrupt hits before schedule() is reached, the deferred
     reprogramming is handled in those contexts.

   - Modification of queued timers

     If a timer is already queued, modifying the expiry time requires
     dequeueing it from the RB tree and requeueing it after the expiry
     value has been updated. It turned out that the hrtick timer
     modifications very often end up at the same spot in the RB tree as
     before, which means the dequeue/enqueue cycle along with the
     related rebalancing could have been avoided.
     The timer wheel timers have a similar mechanism: they check
     upfront whether the resulting expiry time keeps them in the same
     hash bucket. It was tried to check this with rb_prev() and
     rb_next() to evaluate whether the modification keeps the timer in
     the same spot, but that turned out to be really inefficient.

     Solved by providing an RB tree variant which extends the node with
     links to the previous and next nodes. These links are established
     when a node is linked into the tree and adjusted when a node is
     removed. They allow a quick peek at the previous and next expiry
     times, and if the new expiry stays within those boundaries, the
     whole RB tree operation can be avoided. This also simplifies the
     caching and update of the leftmost node, as the rb_next() walk on
     removal can be completely avoided. The variant could obviously
     provide a cached rightmost pointer too, but there is no use case
     for that (yet).

     On a hackbench run this results in about 35% of the updates being
     handled that way, which cuts the execution time of
     hrtimer_start_range_ns() down to 50ns on a 2GHz machine.

   - Cancellation of queued timers

     Cancelling a timer or moving its expiry time past the programmed
     time can result in reprogramming the clock event device.
     Especially with frequent modifications of a queued timer this
     causes substantial overhead, particularly in VMs.

     Provide an option for hrtimers to tell the core to handle
     reprogramming lazily in those cases, which trades frequent
     reprogramming against an occasional pointless hrtimer interrupt.
     For the hrtick timer this turned out to be a reasonable tradeoff.
     It is especially valuable when transitioning to idle, where the
     timer has to be cancelled, but the NOHZ idle code will reprogram
     the device anyway in case of a long idle sleep. But also in high
     frequency scheduling scenarios this turned out to be beneficial.

With all of the above modifications in place, enabling hrtick no longer
results in regressions compared to the hrtick-disabled mode.
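As an illustration of the fast path which the prev/next links under
'Modification of queued timers' enable, consider the sketch below. The
structure and function names are hypothetical, not the series' actual
code; it only demonstrates the boundary check which lets an expiry
update stay out of the RB tree entirely:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timer queue node with prev/next links maintained on
 * insert and remove, in addition to the usual RB tree linkage. */
struct tq_node {
	uint64_t	expires;
	struct tq_node	*prev;
	struct tq_node	*next;
};

/*
 * If the new expiry still lies between the neighbouring nodes'
 * expiries, the node keeps its spot: update the value in place and
 * skip the dequeue/requeue cycle plus the related rebalancing.
 * Returns false when a real RB tree operation is required.
 */
static bool tq_update_in_place(struct tq_node *node, uint64_t new_expires)
{
	if (node->prev && node->prev->expires > new_expires)
		return false;
	if (node->next && node->next->expires < new_expires)
		return false;
	node->expires = new_expires;
	return true;
}
```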
The reprogramming frequency of the clockevent device went down from
~2500/sec to ~100/sec for a hackbench run, with a spurious hrtimer
interrupt ratio of about 25%.

What's interesting is the astonishing improvement of a hackbench run
with the following command line parameters: '-l$LOOPS -p -s8'. That
uses pipes with a message size of 8 bytes. On a 112 CPU SKL machine
this results in:

		NO HRTICK[_DL]	HRTICK[_DL]
   runtime:	0.840s		0.481s		~-42%

With other message sizes up to 256 bytes, HRTICK still results in
improvements, but not of that magnitude. I haven't investigated the
cause of that yet.

While quite some parts of the series are independent enhancements, I've
decided to keep them together in one big pile for now as all of the
components are required to actually achieve the overall goal. The
patches have already been structured in a way that allows them to be
distributed to different subsystem branches without causing major cross
subsystem contamination or merge conflict headaches.

The series applies on v7.0-rc1 and is also available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick

Thanks,

	tglx
---
 arch/x86/Kconfig                      |    2 
 arch/x86/include/asm/clock_inlined.h  |   22 
 arch/x86/kernel/apic/apic.c           |   41 -
 arch/x86/kernel/tsc.c                 |    4 
 include/asm-generic/thread_info_tif.h |    5 
 include/linux/clockchips.h            |    8 
 include/linux/clocksource.h           |    3 
 include/linux/hrtimer.h               |   59 -
 include/linux/hrtimer_defs.h          |   79 +-
 include/linux/hrtimer_rearm.h         |   83 ++
 include/linux/hrtimer_types.h         |   19 
 include/linux/irq-entry-common.h      |   25 
 include/linux/rbtree.h                |   81 ++
 include/linux/rbtree_types.h          |   16 
 include/linux/rseq_entry.h            |   14 
 include/linux/timekeeper_internal.h   |    8 
 include/linux/timerqueue.h            |   56 +
 include/linux/timerqueue_types.h      |   15 
 include/trace/events/timer.h          |   35 -
 kernel/entry/common.c                 |    4 
 kernel/sched/core.c                   |   89 ++
 kernel/sched/deadline.c               |    2 
 kernel/sched/fair.c                   |   55 -
 kernel/sched/features.h               |    5 
 kernel/sched/sched.h                  |   41 -
 kernel/softirq.c                      |   15 
 kernel/time/Kconfig                   |   16 
 kernel/time/clockevents.c             |   48 +
 kernel/time/hrtimer.c                 | 1116 +++++++++++++++++++---------------
 kernel/time/tick-broadcast-hrtimer.c  |    1 
 kernel/time/tick-sched.c              |   27 
 kernel/time/timekeeping.c             |  184 +++++
 kernel/time/timekeeping.h             |    2 
 kernel/time/timer_list.c              |   12 
 lib/rbtree.c                          |   17 
 lib/timerqueue.c                      |   14 
 36 files changed, 1497 insertions(+), 728 deletions(-)