public inbox for linux-kernel@vger.kernel.org
From: Bert Karwatzki <spasswolf@web.de>
To: linux-kernel@vger.kernel.org
Cc: "Bert Karwatzki" <spasswolf@web.de>,
	linux-next@vger.kernel.org,
	"Mario Limonciello" <mario.limonciello@amd.com>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	"Clark Williams" <clrkwllms@kernel.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Christian König" <christian.koenig@amd.com>,
	regressions@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-acpi@vger.kernel.org,
	"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
	acpica-devel@lists.linux.dev,
	"Robert Moore" <robert.moore@intel.com>,
	"Saket Dumbre" <saket.dumbre@intel.com>,
	"Bjorn Helgaas" <bhelgaas@google.com>,
	"Clemens Ladisch" <clemens@ladisch.de>,
	"Jinchao Wang" <wangjinchao600@gmail.com>,
	"Yury Norov" <yury.norov@gmail.com>,
	"Anna Schumaker" <anna.schumaker@oracle.com>,
	"Baoquan He" <bhe@redhat.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	"Dave Young" <dyoung@redhat.com>,
	"Doug Anderson" <dianders@chromium.org>,
	"Guilherme G. Piccoli" <gpiccoli@igalia.com>,
	"Helge Deller" <deller@gmx.de>, "Ingo Molnar" <mingo@kernel.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
	"Joel Granados" <joel.granados@kernel.org>,
	"John Ogness" <john.ogness@linutronix.de>,
	"Kees Cook" <kees@kernel.org>, "Li Huafei" <lihuafei1@huawei.com>,
	"Luck, Tony" <tony.luck@intel.com>,
	"Luo Gengkun" <luogengkun@huaweicloud.com>,
	"Max Kellermann" <max.kellermann@ionos.com>,
	"Nam Cao" <namcao@linutronix.de>,
	oushixiong <oushixiong@kylinos.cn>,
	"Petr Mladek" <pmladek@suse.com>,
	"Qianqiang Liu" <qianqiang.liu@163.com>,
	"Sergey Senozhatsky" <senozhatsky@chromium.org>,
	"Sohil Mehta" <sohil.mehta@intel.com>,
	"Tejun Heo" <tj@kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	"Thorsten Blum" <thorsten.blum@linux.dev>,
	"Ville Syrjala" <ville.syrjala@linux.intel.com>,
	"Vivek Goyal" <vgoyal@redhat.com>,
	"Yicong Yang" <yangyicong@hisilicon.com>,
	"Yunhui Cui" <cuiyunhui@bytedance.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	W_Armin@gmx.de
Subject: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
Date: Tue, 13 Jan 2026 10:41:25 +0100	[thread overview]
Message-ID: <20260113094129.3357-1-spasswolf@web.de> (raw)
In-Reply-To: ec99725df78fdd0fd9d4398d00fdebb84cb38ee6.camel@web.de

The investigation into this bug has taken yet another dramatic turn.
I'll summarize what I've found so far:

On my MSI Alpha 15 Dual GPU Laptop with the following hardware

$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
06:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] [2646:5013] (rev 01)
07:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD[Frampton] [c0a9:2263] (rev 03)
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Audio Coprocessor [1022:15e2] (rev 01)
08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3]
08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]

I've been encountering random crashes when resuming the discrete GPU

03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)

These random crashes can be provoked by the following script:

#!/bin/bash
for i in {0..10000}
do
	echo $i
	evolution &
	sleep 5
	killall evolution
	sleep 5
done

or

#!/bin/bash

while :
do
	DRI_PRIME=1 glxinfo > /dev/null
	sleep 10
done

though it still takes between 2 and 5 hours to trigger a crash.

The actual crash happens when resuming the PCI bridge to which the GPU is connected

00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]

As there are no error messages for this in dmesg (using netconsole), and neither
kdump nor kgdboe works in this case, I used printk()s in the resume code to try
to locate the exact place of the crash. I found that the last printk message shown
before the crash was in the ACPICA interpreter, in acpi_ex_system_memory_space_handler():

acpi_ex_system_memory_space_handler(...)
{
	[...]
	/*
	 * Perform the memory read or write
	 *
	 * Note: For machines that do not support non-aligned transfers, the target
	 * address was checked for alignment above. We do not attempt to break the
	 * transfer up into smaller (byte-size) chunks because the AML specifically
	 * asked for a transfer width that the hardware may require.
	 */
	switch (function) {
	case ACPI_READ:
		if (debug)
			printk(KERN_INFO "%s %d value = %px\n", __func__, __LINE__, value);

		*value = 0;
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);
		switch (bit_width) {
		case 8:

			*value = (u64)ACPI_GET8(logical_addr_ptr);
			break;

		case 16:

			*value = (u64)ACPI_GET16(logical_addr_ptr);
			break;

		case 32:

			if (debug) // This is the last message shown on netconsole!
				printk(KERN_INFO "%s %d: logical_addr_ptr = %px\n", __func__, __LINE__, logical_addr_ptr);
			*value = (u64)ACPI_GET32(logical_addr_ptr);
			if (debug)
				printk(KERN_INFO "%s %d\n", __func__, __LINE__);
			break;

		case 64:

			*value = (u64)ACPI_GET64(logical_addr_ptr);
			break;

		default:

			/* bit_width was already validated */

			break;
		}
		break;

	case ACPI_WRITE:
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);

		switch (bit_width) {
			[...]
		}
		break;

	default:
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);

		status = AE_BAD_PARAMETER;
		break;
	}

	[...]
}

The memory which ACPICA is trying to read is at physical address 0xf0100000,
which lies in the PCI ECAM region on my machine (from /proc/iomem):

f0000000-fcffffff : PCI Bus 0000:00
  f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f]
    f0000000-f7ffffff : pnp 00:00

According to the PCIe specification the failing address 0xf0100000 belongs to bus 01:
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
so the error occurs when trying to read ECAM memory that belongs to a failing device.
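As a sanity check on the ECAM math, here is a small bash helper (ecam_decode is
just an illustrative name; the decode follows the ECAM layout from the PCIe spec,
offset = (bus << 20) | (dev << 15) | (fn << 12)):

```shell
#!/bin/bash
# Decode a PCIe ECAM address into bus/device/function, given the ECAM base.
ecam_decode() {
	local base=$1 addr=$2
	local off=$(( addr - base ))
	printf 'bus=%02x dev=%02x fn=%x\n' \
		$(( off >> 20 )) $(( (off >> 15) & 0x1f )) $(( (off >> 12) & 0x7 ))
}

# ECAM base 0xf0000000 (from /proc/iomem), failing address 0xf0100000:
ecam_decode 0xf0000000 0xf0100000
# prints: bus=01 dev=00 fn=0
```

That is the config space of device 01:00.0, the Navi 10 XL upstream port behind
the 00:01.1 GPP bridge.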

The code used to get to this point (based on v6.14, and rather messy) can be found here:
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume_4?ref_type=heads
and more details on the investigation can be found here:
https://github.com/acpica/acpica/issues/1060

So we seem to have a read from I/O memory where the physical address fails because
the device has stopped working. I consulted the documentation (AMD64 Architecture
Programmer's Manual, Volume 2: System Programming) to find out whether an exception
is raised in this case, but the documentation does not cover it.

So I put printk()s in most of the exception handlers to find out whether there is
a chance to catch this failed memory access and work around it. The code used in this investigation can be found here:
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume_fault_handler?ref_type=heads

The result is that after the debug messages from the ACPICA interpreter
stop (where I previously thought a crash had occurred), there are messages from
exc_nmi() and the functions it calls.

Here's what I've found for a normal NMI:
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 0
2026-01-12T04:23:56.396721+01:00 C10;exc_nmi: 10.3
2026-01-12T04:23:56.396721+01:00 C10;default_do_nmi 
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: type=0x0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 0
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 2
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2.6
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 0
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 0
2026-01-12T04:23:56.396721+01:00 C10;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:23:56.396721+01:00 C10;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:23:56.396721+01:00 C10;read_hpet: 0
2026-01-12T04:23:56.396721+01:00 C10;read_hpet: 0.1
2026-01-12T04:23:56.396721+01:00 C10;timekeeping_cycles_to_ns_debug: 0
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 0
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 1
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 2
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 3
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 1
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 2
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 7
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: ret=0x0
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2.7
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: handled=0x1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: ret = 1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 3
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x1
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa1623040
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=nmi_cpu_backtrace_handler+0x0/0x20
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa16148e0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=perf_ibs_nmi_handler+0x0/0x60
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x0
2026-01-12T04:23:56.396721+01:00 C10;exc_nmi: 10.4
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 11
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 12
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 13

Here the NMI handling completes without triggering additional NMIs.

Here's the result in case of the crash:
2026-01-12T04:24:36.809904+01:00 T1510;acpi_ex_system_memory_space_handler 255: logical_addr_ptr = ffffc066977b3000
2026-01-12T04:24:36.846170+01:00 C14;exc_nmi: 0
2026-01-12T04:24:36.960760+01:00 C14;exc_nmi: 10.3
2026-01-12T04:24:36.960760+01:00 C14;default_do_nmi 
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: type=0x0
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 0
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 1
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 2
2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2
2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 0
2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:36.960760+01:00 C14;watchdog_overflow_callback: 0
2026-01-12T04:24:36.960760+01:00 C14;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:36.960760+01:00 C14;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0
2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0.1
2026-01-12T04:24:36.960760+01:00 T0;exc_nmi: 0
2026-01-12T04:24:38.674625+01:00 C13;exc_nmi: 10.3
2026-01-12T04:24:38.674625+01:00 C13;default_do_nmi 
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: type=0x0
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 0
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 1
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 2
2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2
2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 0
2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:38.674625+01:00 C13;watchdog_overflow_callback: 0
2026-01-12T04:24:38.674625+01:00 C13;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:38.674625+01:00 C13;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0
2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0.1
2026-01-12T04:24:38.674625+01:00 T0;exc_nmi: 0
2026-01-12T04:24:39.355101+01:00 C2;exc_nmi: 10.3
2026-01-12T04:24:39.355101+01:00 C2;default_do_nmi 
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: type=0x0
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 0
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 1
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 2
2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2
2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 0
2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:39.355101+01:00 C2;watchdog_overflow_callback: 0
2026-01-12T04:24:39.355101+01:00 C2;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:39.355101+01:00 C2;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0
2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0.1
2026-01-12T04:24:39.355101+01:00 T0;exc_nmi: 0
2026-01-12T04:24:39.410207+01:00 C0;exc_nmi: 10.3
2026-01-12T04:24:39.410207+01:00 C0;default_do_nmi 
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: type=0x0
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 0
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 1
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 2
2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2
2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 0
2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:39.410207+01:00 C0;watchdog_overflow_callback: 0
2026-01-12T04:24:39.410207+01:00 C0;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:39.410207+01:00 C0;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0
2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0.1
2026-01-12T04:24:39.410207+01:00 T0;exc_nmi: 0

In the crash case the NMI handler never returns: while accessing the HPET,
another NMI is triggered. This goes on until a crash happens, probably because
of a stack overflow.

One can work around this bug by disabling CONFIG_HARDLOCKUP_DETECTOR in .config,
but I've only tested this twice so far.

The behaviour described here seems similar to the bug that commit
3d5f4f15b778 ("watchdog: skip checks when panic is in progress") fixes, but
this is actually a different bug, as kernel 6.18 (which contains 3d5f4f15b778)
is also affected: I've run 5 tests with 6.18 so far and got 4 crashes, after
0.5h, 1h, 4.5h and 1.5h of testing.
Nevertheless these look similar enough to CC the people involved.

Bert Karwatzki


Thread overview: 12+ messages
2026-01-13  9:41 Bert Karwatzki [this message]
2026-01-13 15:24 ` NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y Thomas Gleixner
2026-01-13 17:50   ` Bert Karwatzki
2026-01-13 19:30     ` Thomas Gleixner
2026-01-13 21:15       ` Jason Gunthorpe
2026-01-13 22:19       ` Bert Karwatzki
2026-01-20 10:27         ` crash during resume of PCIe bridge in v5.17 (v5.16 works) Bert Karwatzki
2026-02-01  0:36           ` crash during resume of PCIe bridge from v5.17 to next-20260130 " Bert Karwatzki
2026-02-01 10:19             ` Armin Wolf
2026-02-01 11:42               ` Rafael J. Wysocki
2026-02-01 16:42             ` Thomas Gleixner
2026-02-02 10:37               ` Christian König
