From: Bert Karwatzki <spasswolf@web.de>
To: Thomas Gleixner <tglx@kernel.org>, linux-kernel@vger.kernel.org
Cc: spasswolf@web.de, linux-next@vger.kernel.org,
"Mario Limonciello" <mario.limonciello@amd.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
"Clark Williams" <clrkwllms@kernel.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Christian König" <christian.koenig@amd.com>,
regressions@lists.linux.dev, linux-pci@vger.kernel.org,
linux-acpi@vger.kernel.org,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
acpica-devel@lists.linux.dev,
"Robert Moore" <robert.moore@intel.com>,
"Saket Dumbre" <saket.dumbre@intel.com>,
"Bjorn Helgaas" <bhelgaas@google.com>,
"Clemens Ladisch" <clemens@ladisch.de>,
"Jinchao Wang" <wangjinchao600@gmail.com>,
"Yury Norov" <yury.norov@gmail.com>,
"Anna Schumaker" <anna.schumaker@oracle.com>,
"Baoquan He" <bhe@redhat.com>,
"Darrick J. Wong" <djwong@kernel.org>,
"Dave Young" <dyoung@redhat.com>,
"Doug Anderson" <dianders@chromium.org>,
"Guilherme G. Piccoli" <gpiccoli@igalia.com>,
"Helge Deller" <deller@gmx.de>, "Ingo Molnar" <mingo@kernel.org>,
"Jason Gunthorpe" <jgg@ziepe.ca>,
"Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
"Joel Granados" <joel.granados@kernel.org>,
"John Ogness" <john.ogness@linutronix.de>,
"Kees Cook" <kees@kernel.org>, "Li Huafei" <lihuafei1@huawei.com>,
"Luck, Tony" <tony.luck@intel.com>,
"Luo Gengkun" <luogengkun@huaweicloud.com>,
"Max Kellermann" <max.kellermann@ionos.com>,
"Nam Cao" <namcao@linutronix.de>,
oushixiong <oushixiong@kylinos.cn>,
"Petr Mladek" <pmladek@suse.com>,
"Qianqiang Liu" <qianqiang.liu@163.com>,
"Sergey Senozhatsky" <senozhatsky@chromium.org>,
"Sohil Mehta" <sohil.mehta@intel.com>,
"Tejun Heo" <tj@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
"Thorsten Blum" <thorsten.blum@linux.dev>,
"Ville Syrjala" <ville.syrjala@linux.intel.com>,
"Vivek Goyal" <vgoyal@redhat.com>,
"Yunhui Cui" <cuiyunhui@bytedance.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
W_Armin@gmx.de
Subject: Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
Date: Tue, 13 Jan 2026 23:19:30 +0100
Message-ID: <99f1aaba32030d2b9285dbd983fdf8518a181a8d.camel@web.de>
In-Reply-To: <87v7h5ia3d.ffs@tglx>
On Tuesday, 13 Jan 2026 at 20:30 +0100, Thomas Gleixner wrote:
> On Tue, Jan 13 2026 at 18:50, Bert Karwatzki wrote:
> > On Tuesday, 13 Jan 2026 at 16:24 +0100, Thomas Gleixner wrote:
> >
> > > What's more likely is that after a while _ALL_ CPUs are hung up in
> > > the NMI handler after they tripped over the HPET read.
> >
> > I'm not sure about that, my latest testrun (with v6.18) crashed with
> > only one message from exc_nmi().
>
> What does crashed mean? Did it actually crash and output something or does
> the machine just go dead? I assume the latter as you have no output.
>
Crashed means that the machine reboots by itself and then fails to boot into
the new kernel (i.e. it drops to rescue mode) because one of my two NVMe
devices (not the root fs) is missing from the PCI bus. The missing device
reappears after a shutdown. If sound is playing when such a crash occurs, it
loops for about 2s before the reboot (a behaviour I've seen once before,
during an unrelated NULL pointer dereference in an interrupt handler). No
additional error message is output when such a crash occurs, only my
printk()s.
> > > along with the full output of /proc/iomem
> >
> > The physical address is 0xf0100000
> >
> > $ cat /proc/iomem
> > f0000000-fcffffff : PCI Bus 0000:00
> >   f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f]
> >     f0000000-f7ffffff : pnp 00:00
>
> That's the memory mapped PCI config space and this tries to access:
>
>     MMIO_START                0xf0000000
>     BUSNUM      0x01 << 20    0x00100000
>     SLOT/FN     0x00 << 12    0x00000000
>     OFFSET      0x00 <<  0    0x00000000
>                               ----------
>                               0xf0100000
>
> Offset 0 is vendor/device ID IIRC.
>
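For reference, the address arithmetic quoted above can be sketched in a few lines of Python. This is only an illustration of the standard ECAM layout (bus in bits 27:20, device in bits 19:15, function in bits 14:12, register offset in bits 11:0). The function names ecam_address() and ecam_decode() are made up for this example and are not kernel APIs.

```python
def ecam_address(base, bus, device, function, offset=0):
    """Physical address of a config-space register inside an ECAM window."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F
            and 0 <= function <= 0x7 and 0 <= offset <= 0xFFF):
        raise ValueError("bus/device/function/offset out of range")
    return base + ((bus << 20) | (device << 15) | (function << 12) | offset)


def ecam_decode(base, addr):
    """Split an address inside the ECAM window into (bus, device, function, offset)."""
    rel = addr - base
    if rel < 0 or (rel >> 20) > 0xFF:
        raise ValueError("address is outside the ECAM window")
    return (rel >> 20, (rel >> 15) & 0x1F, (rel >> 12) & 0x7, rel & 0xFFF)


# Values from this thread: ECAM base 0xf0000000, faulting address 0xf0100000.
print(hex(ecam_address(0xF0000000, bus=1, device=0, function=0)))
print(ecam_decode(0xF0000000, 0xF0100000))
```

With the values from /proc/iomem above, 0xf0100000 decodes to bus 0x01, device 0, function 0, offset 0, which matches the breakdown in the quoted table.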
> Anyway if that access does not complete because of a hardware issue,
> then any subsequent access to the MMIO mapped HPET goes stale as well.
>
> As the HPET is the active clocksource on your machine, this obviously
> does not only affect the NMI watchdog readout, it affects the regular
> timekeeper accesses too and all other MMIO accesses all over the place.
> So gradually your machine just stalls on outstanding MMIO transactions
> w/o further notice... The NMI is just a red herring.
>
I've already tried different clocksources. tsc does not work on my system as
it is unstable, and acpi_pm also crashed. With jiffies I had mixed results:
it might have avoided one crash, but it crashed on another test run in the
same situation.
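
To record which clocksource was actually active in each test run, something like the following could be used. It is a minimal sketch that reads the standard sysfs files current_clocksource and available_clocksource. The helper name read_clocksources() is made up for this example, and the directory is a parameter so the function can be pointed at a copy of the files.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(sysfs_dir=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported by sysfs."""
    sysfs_dir = Path(sysfs_dir)
    current = (sysfs_dir / "current_clocksource").read_text().strip()
    available = (sysfs_dir / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__" and CLOCKSOURCE_DIR.is_dir():
    current, available = read_clocksources()
    print("current:", current)
    print("available:", ", ".join(available))
```

The clocksource can be switched at runtime by writing one of the available names to current_clocksource as root, or selected at boot with the clocksource= kernel parameter.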
> You need to figure out why that MMIO access to that device's
> configuration space stalls as anything else is just subsequent
> damage.
>
> There is not much that can be done about that unless the PCI bus raises
> a failure interrupt and some magic reset sequence aborts the outstanding
> stalled transactions.
>
> Whether that's feasible or not, I don't know. The failure mechanism
> might run into the same stall scenario when accessing the PCI muck for
> reset...
In the two test runs without CONFIG_HARDLOCKUP_DETECTOR I simply lost the
discrete GPU without a crash, i.e. after 4h and 8.5h respectively the GPU would
not resume any more, and accessing the GPU with DRI_PRIME=1 glxinfo just gives
a (userspace) segfault. But I need more test runs to see if this behaviour is
consistent.
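
A possible way to check from userspace whether a device still answers config reads is to look at the vendor/device ID at offset 0 of its config space. A read that returns all ones (vendor ID 0xffff) usually means nothing responded. This is a rough sketch: the helper names are made up, the BDF is whatever device is under test, and the sample device ID in the demo is arbitrary. On a machine that hits the stall described above, the read itself may hang, so it is a diagnostic aid only.

```python
import struct


def parse_ids(header):
    """Return (vendor_id, device_id) from the first 4 bytes of config space."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes of config space")
    vendor, device = struct.unpack_from("<HH", header)
    return vendor, device


def device_responds(header):
    """False if the vendor ID reads as all ones, i.e. no device answered."""
    vendor, _ = parse_ids(header)
    return vendor != 0xFFFF


def read_config_header(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the first 4 config bytes of a device such as '0000:01:00.0'."""
    with open(f"{sysfs_root}/{bdf}/config", "rb") as f:
        return f.read(4)


# Demo on literal bytes (no hardware access): an all-ones read versus a
# made-up header with vendor ID 0x1002 and an arbitrary device ID.
print(device_responds(b"\xff\xff\xff\xff"))
print(parse_ids(bytes.fromhex("02107e74")))
```

On real hardware, device_responds(read_config_header("0000:01:00.0")) would perform the check for the device at bus 01, slot 00, function 0.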
>
> Sorry for not being helpful.
>
> Thanks,
>
> tglx
>
Thank you,
Bert Karwatzki