From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AB9CA20F08C; Tue, 13 Jan 2026 19:30:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768332649; cv=none; b=KY1UcSF0gSDIvzW80B3rP4P2Vgdb2yEK2ZAiH3T+KUlv76zFjNya5CIqrx+yvjz+y/Y53q+A1SQZfdQPAlD1AZAQ4O5f2Lhc4v2KOjqJUz6cfEEmZ7hEDMnX+i6odyoOKN5GI2uePYbVQfNnDO/Ep106+ivCGYHrdcmYP9U+N24= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1768332649; c=relaxed/simple; bh=utkO7qAUKLK1RGX5DjRctYVXaccfBqPH23O3mAXzJXM=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=MzR0txhu7kfRu69Gs17Nzi15pk5HMsS+G+73iVmf5A340r+WGCsOMOFbslSO0Ax7MeUxHNexwR+gJAocKadyvTTtmiRvUv4HybpNDrnkVnImrYCy9v3+75pEOPDtDM4cL+mDDqC+G4Nc9idsSJzOlfUjBbqXC3jkfii4LYFhnbE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=pZWKN/Ny; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="pZWKN/Ny" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B9330C19422; Tue, 13 Jan 2026 19:30:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1768332649; bh=utkO7qAUKLK1RGX5DjRctYVXaccfBqPH23O3mAXzJXM=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=pZWKN/Ny5D3bdiWT78yZVnjkTuVPLGjheLzekAKnfOf7tQRHkZJHxKVQNxzmzM1La BHYcdQgrQPaUCEksIZgKPbKhw5VCAuayX4L9oai0ZuzJ0W2opl8tLrS99cUNUlbBT5 w6SirAuUsER1+8+JLuBzrKPWU3TeqGP+TEjuer+HvcojK/Ryik5PyJ9pknsJV982GA L+VHWVgfgG/a012qHzG4rIwT8A91LNB8mnZ7AWp98A2n3OQnRSHQpuOJmR53nMRQLY +K0DXIbrnaBwEOAIKx6Un6ChgD4mP62y8Jj1PJOLXCxWRbFpWTOXlRobLlXeL54ZVe cj3OBi3JQNikg== From: Thomas Gleixner To: Bert Karwatzki , linux-kernel@vger.kernel.org Cc: linux-next@vger.kernel.org, spasswolf@web.de, Mario Limonciello , Sebastian Andrzej Siewior , Clark Williams , Steven Rostedt , Christian =?utf-8?Q?K=C3=B6nig?= , regressions@lists.linux.dev, linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org, "Rafael J . Wysocki" , acpica-devel@lists.linux.dev, Robert Moore , Saket Dumbre , Bjorn Helgaas , Clemens Ladisch , Jinchao Wang , Yury Norov , Anna Schumaker , Baoquan He , "Darrick J. Wong" , Dave Young , Doug Anderson , "Guilherme G. Piccoli" , Helge Deller , Ingo Molnar , Jason Gunthorpe , Joanthan Cameron , Joel Granados , John Ogness , Kees Cook , Li Huafei , "Luck, Tony" , Luo Gengkun , Max Kellermann , Nam Cao , oushixiong , Petr Mladek , Qianqiang Liu , Sergey Senozhatsky , Sohil Mehta , Tejun Heo , Thomas Zimemrmann , Thorsten Blum , Ville Syrjala , Vivek Goyal , Yunhui Cui , Andrew Morton , W_Armin@gmx.de Subject: Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y In-Reply-To: References: <20260113094129.3357-1-spasswolf@web.de> <87h5spk01t.ffs@tglx> Date: Tue, 13 Jan 2026 20:30:46 +0100 Message-ID: <87v7h5ia3d.ffs@tglx> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain On Tue, Jan 13 2026 at 18:50, Bert Karwatzki wrote: > Am Dienstag, dem 13.01.2026 um 16:24 +0100 schrieb Thomas Gleixner: > >> What's more likely is that after a while _ALL_ CPUs are hung up in >> the NMI handler after they tripped over the HPET read. > > I'm not sure about that, my latest testrun (with v6.18) crashed with > only one message from exc_nmi(). What means crashed? Did it actually crash and output something or does the machine just go dead? I assume the latter as you have no output. >> along with the full output of /proc/iomem > > The physical address is 0xf0100000 > > $ cat /proc/iomem > f0000000-fcffffff : PCI Bus 0000:00 > f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f] > f0000000-f7ffffff : pnp 00:00 That's the memory mapped PCI config space and this tries to access: MMIO_START 0xf0000000 BUSNUM 0x01 << 20 0x00100000 SLOT/FN 0x00 << 12 0x00000000 OFFSET 0x00 << 0 0x00000000 ----------- 0xf0100000 Offset 0 is vendor/device ID IIRC. Anyway if that access does not complete because of a hardware issue, then any subsequent access to the MMIO mapped HPET goes stale as well. As the HPET is the active clocksource on your machine, this obviously does not only affect the NMI watchdog readout, it affects the regular timekeeper accesses too and all other MMIO accesses all over the place. So gradually your machine just stalls on outstanding MMIO transactions w/o further notice... The NMI is just a red herring. You need to figure out why that MMIO access to that device's configuration space stalls as anything else is just subsequent damage. There is not much what can be done about that unless the PCI bus raises a failure interrupt and some magic reset sequence aborts the outstanding stalled transactions. Whether that's feasible or not, I don't know. The failure mechanism might run into the same stall scenario when accessing the PCI muck for reset... Sorry for not being helpful. Thanks, tglx