From: Thomas Gleixner
To: Calvin Owens
Cc: Borislav Petkov, Petr Mladek, linux-kernel@vger.kernel.org, arighi@nvidia.com,
	yaozhenguo1@gmail.com, tj@kernel.org, feng.tang@linux.alibaba.com,
	lirongqing@baidu.com, realwujing@gmail.com, hu.shengming@zte.com.cn,
	dianders@chromium.org, joel.granados@kernel.org, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Anna-Maria Behnsen, x86@kernel.org
Subject: Re: [PATCH] clockevents: Prevent timer interrupt starvation
In-Reply-To: 
References: <87v7ejetl1.ffs@tglx> <875x6a913n.ffs@tglx>
	<20260401163435.GGac1JG42tWmsCKL37@fat_crate.local> <87jyup70ka.ffs@tglx>
Date: Fri, 03 Apr 2026 16:41:58 +0200
Message-ID: <87bjg06r7t.ffs@tglx>

On Thu, Apr 02 2026 at 22:11, Calvin Owens wrote:
> On Thursday 04/02 at 19:07 +0200, Thomas Gleixner wrote:
>> Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
>> up in user space. He provided a reproducer, which sets up a timerfd based
>> timer and then rearms it in a loop with an absolute expiry time of 1ns.
>>
>> As the expiry time is in the past, the timer ends up as the first expiring
>> timer in the per CPU hrtimer base and the clockevent device is programmed
>> with the minimum delta value. If the machine is fast enough, this ends up
>> in an endless loop of programming the delta value to the minimum value
>> defined by the clock event device before the timer interrupt can fire,
>> which starves the interrupt and consequently triggers the lockup detector
>> because the hrtimer callback of the lockup mechanism is never invoked.
>>
>> As a first step to prevent this, avoid reprogramming the clock event
>> device when:
>>
>>  - a forced minimum delta event is pending
>>  - the new expiry delta is less than or equal to the minimum delta
>>
>> Thanks to Calvin for providing the reproducer and to Borislav for testing
>> and providing data from his Zen5 machine.
>>
>> The problem is not limited to Zen5, but depending on the underlying
>> clock event device (e.g. TSC deadline timer on Intel) and the CPU speed
>> it is not necessarily observable.
>>
>> This change serves only as a last resort and further changes will be made
>> to prevent this scenario earlier in the call chain.
>>
>> Reported-by: Calvin Owens
>> Signed-off-by: Thomas Gleixner
>> ---
>> P.S.: I'm working on the other changes, but wanted to get this out ASAP
>> for testing.
>
> Unfortunately the AMD boxes won't boot with this: one gives me no
> video console output, the other gets to userspace but hangs trying to
> mount the rootfs and then prints hard lockup traces with idle stacks.
>
> Sorry not to have more info yet, I'll have time tomorrow to sit down and
> get more data for you. If there's anything specific that you'd like me to
> grab, just let me know.

I'm an idiot. When I polished the patch up, I dropped the hunks which clear
the flag in the interrupt handler, and my tired brain did not notice despite
checking five times in a row.

Updated version below.

Thanks,

        tglx
---
From: Thomas Gleixner
Subject: clockevents: Prevent timer interrupt starvation
Date: Thu, 02 Apr 2026 19:07:49 +0200

From: Thomas Gleixner

Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space. He provided a reproducer, which sets up a timerfd based
timer and then rearms it in a loop with an absolute expiry time of 1ns.

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value.
If the machine is fast enough, this ends up in an endless loop of
programming the delta value to the minimum value defined by the clock event
device before the timer interrupt can fire, which starves the interrupt and
consequently triggers the lockup detector because the hrtimer callback of
the lockup mechanism is never invoked.

As a first step to prevent this, avoid reprogramming the clock event device
when:

 - a forced minimum delta event is pending
 - the new expiry delta is less than or equal to the minimum delta

Thanks to Calvin for providing the reproducer and to Borislav for testing
and providing data from his Zen5 machine.

The problem is not limited to Zen5, but depending on the underlying clock
event device (e.g. TSC deadline timer on Intel) and the CPU speed it is not
necessarily observable.

This change serves only as a last resort and further changes will be made
to prevent this scenario earlier in the call chain.

Reported-by: Calvin Owens
Signed-off-by: Thomas Gleixner
---
P.S.: I'm working on the other changes, but wanted to get this out ASAP
for testing.
---
 include/linux/clockchips.h |    2 ++
 kernel/time/clockevents.c  |   37 +++++++++++++++++++++++--------------
 kernel/time/hrtimer.c      |    1 +
 kernel/time/tick-common.c  |    1 +
 kernel/time/tick-sched.c   |    1 +
 5 files changed, 28 insertions(+), 14 deletions(-)

--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -80,6 +80,7 @@ enum clock_event_state {
 * @shift:		nanoseconds to cycles divisor (power of two)
 * @state_use_accessors:current state of the device, assigned by the core code
 * @features:		features
+ * @next_event_forced:	True if the last programming was a forced event
 * @retries:		number of forced programming retries
 * @set_state_periodic:	switch state to periodic
 * @set_state_oneshot:	switch state to oneshot
@@ -108,6 +109,7 @@ struct clock_event_device {
	u32			shift;
	enum clock_event_state	state_use_accessors;
	unsigned int		features;
+	unsigned int		next_event_forced;
	unsigned long		retries;

	int			(*set_state_periodic)(struct clock_event_device *);
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -172,6 +172,7 @@ void clockevents_shutdown(struct clock_e
 {
	clockevents_switch_state(dev, CLOCK_EVT_STATE_SHUTDOWN);
	dev->next_event = KTIME_MAX;
+	dev->next_event_forced = 0;
 }

 /**
@@ -224,13 +225,7 @@ static int clockevents_increase_min_delt
	return 0;
 }

-/**
- * clockevents_program_min_delta - Set clock event device to the minimum delay.
- * @dev:	device to program
- *
- * Returns 0 on success, -ETIME when the retry loop failed.
- */
-static int clockevents_program_min_delta(struct clock_event_device *dev)
+static int __clockevents_program_min_delta(struct clock_event_device *dev)
 {
	unsigned long long clc;
	int64_t delta;
@@ -263,13 +258,7 @@ static int clockevents_program_min_delta

 #else /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */

-/**
- * clockevents_program_min_delta - Set clock event device to the minimum delay.
- * @dev:	device to program
- *
- * Returns 0 on success, -ETIME when the retry loop failed.
- */
-static int clockevents_program_min_delta(struct clock_event_device *dev)
+static int __clockevents_program_min_delta(struct clock_event_device *dev)
 {
	unsigned long long clc;
	int64_t delta = 0;
@@ -293,6 +282,21 @@ static int clockevents_program_min_delta
 #endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */

 /**
+ * clockevents_program_min_delta - Set clock event device to the minimum delay.
+ * @dev:	device to program
+ *
+ * Returns 0 on success, -ETIME when the retry loop failed.
+ */
+static int clockevents_program_min_delta(struct clock_event_device *dev)
+{
+	if (dev->next_event_forced)
+		return 0;
+
+	dev->next_event_forced = 1;
+	return __clockevents_program_min_delta(dev);
+}
+
+/**
 * clockevents_program_event - Reprogram the clock event device.
 * @dev:	device to program
 * @expires:	absolute expiry time (monotonic clock)
@@ -324,6 +328,11 @@ int clockevents_program_event(struct clo
		return dev->set_next_ktime(expires, dev);

	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+
+	/* Don't reprogram when a forced event is pending */
+	if (dev->next_event_forced && delta <= (int64_t)dev->min_delta_ns)
+		return 0;
+
	if (delta <= 0)
		return force ? clockevents_program_min_delta(dev) : -ETIME;

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1888,6 +1888,7 @@ void hrtimer_interrupt(struct clock_even
	BUG_ON(!cpu_base->hres_active);
	cpu_base->nr_events++;
	dev->next_event = KTIME_MAX;
+	dev->next_event_forced = 0;

	raw_spin_lock_irqsave(&cpu_base->lock, flags);
	entry_time = now = hrtimer_update_base(cpu_base);
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -110,6 +110,7 @@ void tick_handle_periodic(struct clock_e
	int cpu = smp_processor_id();
	ktime_t next = dev->next_event;

+	dev->next_event_forced = 0;
	tick_periodic(cpu);

	/*
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1513,6 +1513,7 @@ static void tick_nohz_lowres_handler(str
	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

	dev->next_event = KTIME_MAX;
+	dev->next_event_forced = 0;

	if (likely(tick_nohz_handler(&ts->sched_timer) == HRTIMER_RESTART))
		tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);