From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Gleixner
To: Calvin Owens
Cc: Petr Mladek, linux-kernel@vger.kernel.org, arighi@nvidia.com,
    yaozhenguo1@gmail.com, tj@kernel.org, feng.tang@linux.alibaba.com,
    lirongqing@baidu.com, realwujing@gmail.com, hu.shengming@zte.com.cn,
    dianders@chromium.org, joel.granados@kernel.org, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
    Frederic Weisbecker, Anna-Maria Behnsen, x86@kernel.org
Subject: Re: [BUG] Random hard lockup with userspace %ip on 7.0-rc5
References: <87v7ejetl1.ffs@tglx>
Date: Wed, 01 Apr 2026 17:01:00 +0200
Message-ID: <875x6a913n.ffs@tglx>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain

On Tue, Mar 31 2026 at 18:58, Calvin Owens wrote:
> On Wednesday 03/25 at 17:56 +0100, Thomas Gleixner wrote:
> The below userspace reproducer consistently triggers the hard lockup
> on two different machines with an AMD 7950X3D and an AMD 9950X3D CPU.

Is that instantaneous or does it take some time?

> However, it never reproduces at all on a Xeon E-2124. Maybe a clue?

Not really, but there is a difference in how the timer hardware is
programmed. The XEON uses the TSC deadline timer, the AMD CPUs use the
good old local APIC timer.

But you can disable the deadline timer on the XEON with 'notscdeadline'
on the kernel command line.

> I wish I had a nice clever story for how I found it, but I just guessed
> based on how systemd uses timerfd_settime().
> :)
>
> #ifndef NR_THREADS
> #define NR_THREADS 32
> #endif
>
> static void set(int fd)
> {
> 	struct itimerspec new = {
> 		.it_value = {
> 			.tv_sec = 0,
> 			.tv_nsec = 1,
> 		},
> 	};
>
> 	if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
> 			    &new, NULL))
> 		err(2, "Can't set timer");

So this [re]starts the timer which immediately expires. Most likely even
before the syscall returns.

TFD_TIMER_CANCEL_ON_SET has no effect because the timer is based on
CLOCK_MONOTONIC, which cannot be set.

> static void *fn(void *arg)
> {
> 	int fd = timerfd_create(CLOCK_MONOTONIC, 0);
>
> 	while (1)
> 		set(fd);

and does so in an endless loop with NR_THREADS in parallel. That means
all 32 CPUs are hogged by this.

But the scheduler has full control of the tasks, so there is no real
good explanation why the machine would actually lock up.

Now that you have a reproducer, can you verify that the machine really
locks up hard?

Disable the NMI watchdog either via the kernel command line
'nmi_watchdog=0' or via

    echo 0 >/proc/sys/kernel/nmi_watchdog

If that works and the machine stays usable then the watchdog is
hallucinating.

Thanks,

        tglx