Date: Tue, 24 Feb 2026 17:35:12 +0100
Message-ID: <20260224163022.795809588@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Anna-Maria Behnsen, John Stultz, Stephen Boyd, Daniel Lezcano,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Mel Gorman, Valentin Schneider, x86@kernel.org,
    Peter Zijlstra, Frederic Weisbecker, Eric Dumazet
Subject: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement

Peter recently posted a series tweaking the hrtimer subsystem to reduce
the overhead of the scheduler hrtick timer so it can be enabled by
default:

    https://lore.kernel.org/20260121162010.647043073@infradead.org

That turned out to be incomplete and led to a deeper investigation of
the related bits and pieces.

The problem is that the hrtick deadline changes on every context switch
and is also modified by wakeups and balancing. On a hackbench run this
results in about 2500 clockevent reprogramming cycles per second, which
is especially hurtful in a VM as accessing the clockevent device implies
a VM exit.

The following series addresses various aspects of the overall related
problem space:

1) Scheduler

   Aside from some trivial fixes, the handling of the hrtick timer in
   the scheduler is suboptimal:

   - schedule() modifies the hrtick when picking the next task

   - schedule() can modify the hrtick again when the balance callback
     runs before releasing rq->lock

   - the expiry time is unfiltered, which can result in really tiny
     changes of the expiry time that are functionally completely
     irrelevant

   Solved by deferring the hrtick update to the end of schedule() and
   filtering out tiny changes.
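For illustration only, the filtering idea can be sketched as below. The
function name and the slack value are made up for this sketch and are
not taken from the series; the point is merely that an expiry update
within a small slack window is treated as a no-op instead of triggering
clockevent reprogramming:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slack threshold; the actual value is a tuning decision */
#define HRTICK_SLACK_NS	10000ULL

/*
 * Treat a new hrtick expiry as a real change only when it moves by
 * more than the slack threshold, so functionally irrelevant
 * nanosecond-level updates do not cause reprogramming.
 */
static bool hrtick_expiry_changed(uint64_t old_ns, uint64_t new_ns)
{
	uint64_t delta = new_ns > old_ns ? new_ns - old_ns
					 : old_ns - new_ns;

	return delta > HRTICK_SLACK_NS;
}
```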
2) Clocksource, clockevents, timekeeping

   - Reading the current clocksource involves an indirect call, which
     is expensive, especially for clocksources where the actual read is
     a single instruction like the TSC read on x86.

     This could be solved with a static call, but the architecture
     coverage for static calls is meager, and a static call still has
     the overhead of a function call and, in the worst case, a return
     speculation mitigation. As x86 and other architectures like s390
     have one preferred clocksource which is normally used on all
     contemporary systems, this calls for a fully inlined solution.

     This is achieved by a config option which tells the core code to
     use an architecture provided inline read guarded by a static
     branch. If the branch is disabled, the indirect function call is
     used as before. If it is enabled, the inlined read is used. The
     branch is disabled by default and only enabled after a clocksource
     with the INLINE feature flag set has been installed. When the
     clocksource is replaced, the branch is disabled before the
     clocksource change happens.

   - Programming clock events is based on calculating a relative expiry
     time, converting it to the clock cycles corresponding to the
     clockevent device frequency, and invoking the set_next_event()
     callback of the clockevent device.

     That works perfectly fine as most hardware timers are count-down
     implementations which require a relative time for programming. But
     clockevent devices which are coupled to the clocksource and
     provide a less-than-or-equal comparator suffer from this scheme:
     the core calculates the relative expiry time based on a clock
     read, and the set_next_event() callback has to read the same clock
     again to convert the relative time back to an absolute time which
     can be programmed into the comparator.

     The other issue is that the conversion factor of the clockevent
     device is calculated at boot time and does not take the NTP/PTP
     adjustments of the clocksource frequency into account.
     Depending on the direction of the adjustment, this can cause
     timers to fire early or late. Early is the more problematic case
     as the timer interrupt has to reprogram the device with a very
     short delta because it cannot expire timers early.

     This can be optimized by introducing a 'coupled' mode for the
     clocksource and the clockevent device:

     A) If the clocksource indicates support for 'coupled' mode, the
        timekeeping core calculates an (NTP adjusted) reverse
        conversion factor from the clocksource-to-nanoseconds
        conversion. This takes NTP adjustments into account and keeps
        the conversions in sync.

     B) The timekeeping core provides a function to convert an absolute
        CLOCK_MONOTONIC expiry time into an absolute time in
        clocksource cycles, which can be programmed directly into the
        comparator without reading the clocksource at all.

        This is possible because timekeeping keeps a pair of the base
        cycle count and the corresponding CLOCK_MONOTONIC base time
        from the last update of the timekeeper. The absolute cycle time
        can therefore be calculated by taking the delta between the
        expiry time and the CLOCK_MONOTONIC base time, converting that
        delta into cycles with the help of #A, and adding the base
        cycle count, i.e. roughly:

            cycles_abs = cycles_base + to_cycles(expiry - mono_base)

        Pure math, no hardware access.

     C) The clockevent reprogramming code invokes this conversion
        function when the clockevent device indicates 'coupled' mode.
        The function returns false when the corresponding clocksource
        is not the current system clocksource (based on a clocksource
        ID check), and true if the clocksource matches and the
        conversion succeeded. If false, the regular relative
        set_next_event() mechanism is used; otherwise a new
        set_next_coupled() callback is invoked, which takes the
        calculated absolute expiry time as argument. Similar to the
        clocksource read, this new callback can optionally be inlined.

3) hrtimers

   It turned out that the hrtimer code needed a long overdue spring
   cleaning independent of the problem at hand.
   That was conducted before tackling the actual performance issues:

   - Timer locality

     The handling of timer locality is suboptimal and often results in
     pointless invocations of switch_hrtimer_base() which end up
     keeping the CPU base unchanged. Aside from the pointless overhead,
     this prevents further optimizations for the common local case.

     Addressed by improving the decision logic for keeping the clock
     base local and by splitting out the (re)arm handling into a
     unified operation.

   - Evaluation of the clock base expiries

     The clock bases (MONOTONIC, REALTIME, BOOT, TAI) cache the first
     expiring timer, but not the corresponding expiry time, which means
     that a re-evaluation of the clock bases for the next expiring
     timer on the CPU has to touch up to four extra cache lines.
     Trivial to solve by caching the earliest expiry time in the clock
     base itself.

   - Reprogramming of the clock event device

     The hrtimer interrupt already defers reprogramming until the
     interrupt handler completes, but in the case of the hrtick timer
     that is not sufficient: the hrtick timer callback only sets the
     NEED_RESCHED flag and has no information about the next hrtick
     expiry time, which can only be determined in the scheduler.

     Expand the deferred reprogramming so that it can ideally be
     handled in the subsequent schedule(), after the new hrtick value
     has been established. If there is no schedule(), if soft
     interrupts have to be processed on return from interrupt, or if a
     nested interrupt hits before schedule() is reached, the deferred
     reprogramming is handled in those contexts.

   - Modification of queued timers

     If a timer is already queued, modifying the expiry time requires
     dequeueing it from the RB tree and requeueing it after the expiry
     value has been updated. It turned out that the hrtick timer
     modifications very often end up at the same spot in the RB tree as
     before, which means the dequeue/enqueue cycle along with the
     related rebalancing could have been avoided.
     The timer wheel timers have a similar mechanism: they check
     upfront whether the resulting expiry time keeps them in the same
     hash bucket. It was tried to check this with rb_prev() and
     rb_next() to evaluate whether the modification keeps the timer in
     the same spot, but that turned out to be really inefficient.

     Solved by providing an RB tree variant which extends the node with
     links to the previous and next nodes. These links are established
     when a node is linked into the tree and adjusted when a node is
     removed. They allow a quick peek at the previous and next expiry
     times, and if the new expiry stays within those boundaries, the
     whole RB tree operation can be avoided. This also simplifies the
     caching and update of the leftmost node, as the rb_next() walk on
     removal can be completely avoided. The variant could obviously
     provide a cached rightmost pointer too, but there is no use case
     for that (yet).

     On a hackbench run this results in about 35% of the updates being
     handled that way, which cuts the execution time of
     hrtimer_start_range_ns() down to 50ns on a 2GHz machine.

   - Cancellation of queued timers

     Cancelling a timer or moving its expiry time past the programmed
     time can result in reprogramming the clock event device.
     Especially with frequent modifications of a queued timer this
     causes substantial overhead, particularly in VMs.

     Provide an option for hrtimers to tell the core to handle
     reprogramming lazily in those cases, which trades frequent
     reprogramming against an occasional pointless hrtimer interrupt.
     For the hrtick timer this turned out to be a reasonable tradeoff.
     It is especially valuable when transitioning to idle, where the
     timer has to be cancelled, but the NOHZ idle code will reprogram
     the device anyway in case of a long idle sleep. But also in high
     frequency scheduling scenarios this turned out to be beneficial.

With all of the above modifications in place, enabling hrtick no longer
results in regressions compared to the hrtick-disabled mode.
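As an illustration of the fast path which the prev/next links under
'Modification of queued timers' enable, consider the sketch below. The
structure and function names are hypothetical, not the series' actual
code; it only demonstrates the boundary check which lets an expiry
update stay out of the RB tree entirely:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timer queue node with prev/next links maintained on
 * insert and remove, in addition to the usual RB tree linkage. */
struct tq_node {
	uint64_t	expires;
	struct tq_node	*prev;
	struct tq_node	*next;
};

/*
 * If the new expiry still lies between the neighbouring nodes'
 * expiries, the node keeps its spot: update the value in place and
 * skip the dequeue/requeue cycle plus the related rebalancing.
 * Returns false when a real RB tree operation is required.
 */
static bool tq_update_in_place(struct tq_node *node, uint64_t new_expires)
{
	if (node->prev && node->prev->expires > new_expires)
		return false;
	if (node->next && node->next->expires < new_expires)
		return false;
	node->expires = new_expires;
	return true;
}
```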
The reprogramming frequency of the clockevent device went down from
~2500/sec to ~100/sec for a hackbench run, with a spurious hrtimer
interrupt ratio of about 25%.

What's interesting is the astonishing improvement of a hackbench run
with the following command line parameters: '-l$LOOPS -p -s8'. That
uses pipes with a message size of 8 bytes. On a 112 CPU SKL machine
this results in:

		NO HRTICK[_DL]	HRTICK[_DL]
   runtime:	0.840s		0.481s		~-42%

With other message sizes up to 256 bytes, HRTICK still results in
improvements, but not of that magnitude. I haven't investigated the
cause of that yet.

While quite some parts of the series are independent enhancements, I've
decided to keep them together in one big pile for now as all of the
components are required to actually achieve the overall goal. The
patches have already been structured in a way that allows them to be
distributed to different subsystem branches without causing major cross
subsystem contamination or merge conflict headaches.

The series applies on v7.0-rc1 and is also available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick

Thanks,

	tglx
---
 arch/x86/Kconfig                      |    2 
 arch/x86/include/asm/clock_inlined.h  |   22 
 arch/x86/kernel/apic/apic.c           |   41 -
 arch/x86/kernel/tsc.c                 |    4 
 include/asm-generic/thread_info_tif.h |    5 
 include/linux/clockchips.h            |    8 
 include/linux/clocksource.h           |    3 
 include/linux/hrtimer.h               |   59 -
 include/linux/hrtimer_defs.h          |   79 +-
 include/linux/hrtimer_rearm.h         |   83 ++
 include/linux/hrtimer_types.h         |   19 
 include/linux/irq-entry-common.h      |   25 
 include/linux/rbtree.h                |   81 ++
 include/linux/rbtree_types.h          |   16 
 include/linux/rseq_entry.h            |   14 
 include/linux/timekeeper_internal.h   |    8 
 include/linux/timerqueue.h            |   56 +
 include/linux/timerqueue_types.h      |   15 
 include/trace/events/timer.h          |   35 -
 kernel/entry/common.c                 |    4 
 kernel/sched/core.c                   |   89 ++
 kernel/sched/deadline.c               |    2 
 kernel/sched/fair.c                   |   55 -
 kernel/sched/features.h               |    5 
 kernel/sched/sched.h                  |   41 -
 kernel/softirq.c                      |   15 
 kernel/time/Kconfig                   |   16 
 kernel/time/clockevents.c             |   48 +
 kernel/time/hrtimer.c                 | 1116 +++++++++++++++++++---------------
 kernel/time/tick-broadcast-hrtimer.c  |    1 
 kernel/time/tick-sched.c              |   27 
 kernel/time/timekeeping.c             |  184 +++++
 kernel/time/timekeeping.h             |    2 
 kernel/time/timer_list.c              |   12 
 lib/rbtree.c                          |   17 
 lib/timerqueue.c                      |   14 
 36 files changed, 1497 insertions(+), 728 deletions(-)