* NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
@ 2026-01-13 9:41 Bert Karwatzki
2026-01-13 15:24 ` Thomas Gleixner
0 siblings, 1 reply; 12+ messages in thread
From: Bert Karwatzki @ 2026-01-13 9:41 UTC (permalink / raw)
To: linux-kernel
Cc: Bert Karwatzki, linux-next, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Gleixner,
Thomas Zimmermann, Thorsten Blum, Ville Syrjala, Vivek Goyal,
Yicong Yang, Yunhui Cui, Andrew Morton, W_Armin
The investigation into this bug has taken yet another dramatic turn.
I'll summarize what I've found so far:
On my MSI Alpha 15 dual-GPU laptop with the following hardware:
$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
06:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] [2646:5013] (rev 01)
07:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD[Frampton] [c0a9:2263] (rev 03)
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Audio Coprocessor [1022:15e2] (rev 01)
08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3]
08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]
I've been encountering random crashes when resuming the discrete GPU
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
These random crashes can actually be provoked by the following script
#!/bin/bash
for i in {0..10000}
do
	echo $i
	evolution &
	sleep 5
	killall evolution
	sleep 5
done
or
#!/bin/bash
while :
do
	DRI_PRIME=1 glxinfo > /dev/null
	sleep 10
done
though it still takes between 2 and 5 hours to trigger a crash.
The actual crash happens when resuming the PCI bridge to which the GPU is connected
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
As there are no error messages for this in dmesg (using netconsole), and neither kdump nor kgdboE
work in this case, I used printk()s in the resume code to try and locate the
exact place of the crash. I found that the last printk message shown before the crash
was in the ACPICA interpreter, in acpi_ex_system_memory_space_handler():
acpi_ex_system_memory_space_handler(...)
{
	[...]

	/*
	 * Perform the memory read or write
	 *
	 * Note: For machines that do not support non-aligned transfers, the target
	 * address was checked for alignment above. We do not attempt to break the
	 * transfer up into smaller (byte-size) chunks because the AML specifically
	 * asked for a transfer width that the hardware may require.
	 */
	switch (function) {
	case ACPI_READ:
		if (debug)
			printk(KERN_INFO "%s %d value = %px\n", __func__, __LINE__, value);
		*value = 0;
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);
		switch (bit_width) {
		case 8:
			*value = (u64)ACPI_GET8(logical_addr_ptr);
			break;
		case 16:
			*value = (u64)ACPI_GET16(logical_addr_ptr);
			break;
		case 32:
			if (debug) // This is the last message shown on netconsole!
				printk(KERN_INFO "%s %d: logical_addr_ptr = %px\n", __func__, __LINE__, logical_addr_ptr);
			*value = (u64)ACPI_GET32(logical_addr_ptr);
			if (debug)
				printk(KERN_INFO "%s %d\n", __func__, __LINE__);
			break;
		case 64:
			*value = (u64)ACPI_GET64(logical_addr_ptr);
			break;
		default:
			/* bit_width was already validated */
			break;
		}
		break;
	case ACPI_WRITE:
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);
		switch (bit_width) {
		[...]
		}
		break;
	default:
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);
		status = AE_BAD_PARAMETER;
		break;
	}

	[...]
}
The memory which ACPICA is trying to read is at physical address 0xf0100000,
which belongs to the PCI ECAM memory on my machine (from /proc/iomem):
f0000000-fcffffff : PCI Bus 0000:00
f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f]
f0000000-f7ffffff : pnp 00:00
According to the PCIe specification, the failing address 0xf0100000 belongs to bus 01:
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
so the error occurs when trying to read ECAM memory that belongs to a failing device.
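(For reference, the ECAM mapping defined by the PCIe specification is
base + (bus << 20) + (device << 15) + (function << 12) + offset. Below is a small
standalone decoder just to make that arithmetic explicit; the helper is mine, not
kernel code:)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Decode a PCIe ECAM address into bus/device/function/offset. */
static void ecam_decode(uint64_t base, uint64_t phys)
{
	uint64_t off = phys - base;

	printf("%#" PRIx64 " -> bus %02" PRIx64 " dev %02" PRIx64
	       " fn %" PRIx64 " reg %#" PRIx64 "\n",
	       phys,
	       (off >> 20) & 0xff,	/* bus number    */
	       (off >> 15) & 0x1f,	/* device (slot) */
	       (off >> 12) & 0x7,	/* function      */
	       off & 0xfff);		/* config offset */
}

int main(void)
{
	/* 0xf0100000 - 0xf0000000 = 0x100000 -> bus 01, dev 00, fn 0, reg 0 */
	ecam_decode(0xf0000000, 0xf0100000);
	return 0;
}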
The code used to get to this point (based on v6.14) can be found here (it's rather messy, though):
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume_4?ref_type=heads
and more details on the investigation can be found here:
https://github.com/acpica/acpica/issues/1060
So we seem to have a read from I/O memory where the physical address fails because the
device stopped working. I consulted the documentation (AMD64 Architecture Programmer’s Manual,
Volume 2: System Programming) to find out whether an exception is raised in this case, but the
documentation doesn't consider this case.
So I put printk()s in most of the exception handlers to find out if there is a chance
to catch this failed memory access and work around it. The code used in this investigation can be found here:
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume_fault_handler?ref_type=heads
The result from this is that after the debug messages from the ACPICA interpreter
stop (where I previously thought a crash had occurred), there are messages from exc_nmi()
and the functions called by it.
Here's what I've found for a normal NMI:
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 0
2026-01-12T04:23:56.396721+01:00 C10;exc_nmi: 10.3
2026-01-12T04:23:56.396721+01:00 C10;default_do_nmi
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: type=0x0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 0
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 2
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2.6
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 0
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 0
2026-01-12T04:23:56.396721+01:00 C10;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:23:56.396721+01:00 C10;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:23:56.396721+01:00 C10;read_hpet: 0
2026-01-12T04:23:56.396721+01:00 C10;read_hpet: 0.1
2026-01-12T04:23:56.396721+01:00 C10;timekeeping_cycles_to_ns_debug: 0
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 0
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 1
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 2
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 3
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 1
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 2
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 7
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: ret=0x0
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2.7
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: handled=0x1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: ret = 1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 3
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x1
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa1623040
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=nmi_cpu_backtrace_handler+0x0/0x20
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa16148e0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=perf_ibs_nmi_handler+0x0/0x60
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x0
2026-01-12T04:23:56.396721+01:00 C10;exc_nmi: 10.4
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 11
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 12
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 13
Here the NMI handling completes without triggering additional NMIs.
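(For context: the hardlockup detector arms a perf cycle counter whose overflow is
delivered as an NMI, and watchdog_overflow_callback() rate-limits itself via the
NMI-safe fast monotonic clock, which still has to read the active clocksource, here
the HPET. That is why read_hpet shows up inside the NMI path above. A simplified
sketch, loosely following watchdog_check_timestamp() in kernel/watchdog_perf.c; the
exact code differs between kernel versions:)

/* Simplified sketch of the NMI-side timestamp filter. */
static DEFINE_PER_CPU(ktime_t, last_timestamp);
static ktime_t watchdog_hrtimer_sample_threshold;	/* ~2 sample periods */

static bool watchdog_check_timestamp(void)
{
	/* This clocksource read is the read_hpet seen in the traces. */
	ktime_t delta, now = ktime_get_mono_fast_ns();

	delta = now - __this_cpu_read(last_timestamp);
	if (delta < watchdog_hrtimer_sample_threshold)
		return false;	/* PMU NMI came in too early; ignore it */

	__this_cpu_write(last_timestamp, now);
	return true;
}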
Here's the result in case of the crash:
2026-01-12T04:24:36.809904+01:00 T1510;acpi_ex_system_memory_space_handler 255: logical_addr_ptr = ffffc066977b3000
2026-01-12T04:24:36.846170+01:00 C14;exc_nmi: 0
2026-01-12T04:24:36.960760+01:00 C14;exc_nmi: 10.3
2026-01-12T04:24:36.960760+01:00 C14;default_do_nmi
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: type=0x0
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 0
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 1
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 2
2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2
2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 0
2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:36.960760+01:00 C14;watchdog_overflow_callback: 0
2026-01-12T04:24:36.960760+01:00 C14;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:36.960760+01:00 C14;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0
2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0.1
2026-01-12T04:24:36.960760+01:00 T0;exc_nmi: 0
2026-01-12T04:24:38.674625+01:00 C13;exc_nmi: 10.3
2026-01-12T04:24:38.674625+01:00 C13;default_do_nmi
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: type=0x0
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 0
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 1
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 2
2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2
2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 0
2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:38.674625+01:00 C13;watchdog_overflow_callback: 0
2026-01-12T04:24:38.674625+01:00 C13;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:38.674625+01:00 C13;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0
2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0.1
2026-01-12T04:24:38.674625+01:00 T0;exc_nmi: 0
2026-01-12T04:24:39.355101+01:00 C2;exc_nmi: 10.3
2026-01-12T04:24:39.355101+01:00 C2;default_do_nmi
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: type=0x0
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 0
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 1
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 2
2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2
2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 0
2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:39.355101+01:00 C2;watchdog_overflow_callback: 0
2026-01-12T04:24:39.355101+01:00 C2;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:39.355101+01:00 C2;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0
2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0.1
2026-01-12T04:24:39.355101+01:00 T0;exc_nmi: 0
2026-01-12T04:24:39.410207+01:00 C0;exc_nmi: 10.3
2026-01-12T04:24:39.410207+01:00 C0;default_do_nmi
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: type=0x0
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 0
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 1
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 2
2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2
2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 0
2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:39.410207+01:00 C0;watchdog_overflow_callback: 0
2026-01-12T04:24:39.410207+01:00 C0;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:39.410207+01:00 C0;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0
2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0.1
2026-01-12T04:24:39.410207+01:00 T0;exc_nmi: 0
In the case of the crash the NMI handler never returns, because accessing
the HPET triggers another NMI. This goes on until a crash happens, probably because
of a stack overflow.
One can work around this bug by disabling CONFIG_HARDLOCKUP_DETECTOR in .config, but
I've only tested this twice so far.
The behaviour described here seems to be similar to the bug that commit
3d5f4f15b778 ("watchdog: skip checks when panic is in progress") fixes, but
this is actually a different bug, as kernel 6.18 (which contains 3d5f4f15b778)
is also affected (I've conducted 5 tests with 6.18 so far and got 4 crashes,
after 0.5h, 1h, 4.5h and 1.5h of testing).
Nevertheless these look similar enough to CC the involved people.
Bert Karwatzki
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
2026-01-13 9:41 NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y Bert Karwatzki
@ 2026-01-13 15:24 ` Thomas Gleixner
2026-01-13 17:50 ` Bert Karwatzki
0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2026-01-13 15:24 UTC (permalink / raw)
To: Bert Karwatzki, linux-kernel
Cc: Bert Karwatzki, linux-next, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yicong Yang,
Yunhui Cui, Andrew Morton, W_Armin
On Tue, Jan 13 2026 at 10:41, Bert Karwatzki wrote:
> Here's the result in case of the crash:
> 2026-01-12T04:24:36.809904+01:00 T1510;acpi_ex_system_memory_space_handler 255: logical_addr_ptr = ffffc066977b3000
> 2026-01-12T04:24:36.846170+01:00 C14;exc_nmi: 0
Here the NMI triggers in non-task context on CPU14
> 2026-01-12T04:24:36.960760+01:00 C14;exc_nmi: 10.3
> 2026-01-12T04:24:36.960760+01:00 C14;default_do_nmi
> 2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: type=0x0
> 2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a=0xffffffffa1612de0
> 2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
> 2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 0
> 2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 1
> 2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 2
> 2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2
> 2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2.6
> 2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 0
> 2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
> 2026-01-12T04:24:36.960760+01:00 C14;watchdog_overflow_callback: 0
> 2026-01-12T04:24:36.960760+01:00 C14;__ktime_get_fast_ns_debug: 0.1
> 2026-01-12T04:24:36.960760+01:00 C14;tk_clock_read_debug: read=read_hpet+0x0/0xf0
> 2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0
> 2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0.1
> 2026-01-12T04:24:36.960760+01:00 T0;exc_nmi: 0
This one triggers in task context of PID 0, aka the idle task, but it's not
clear on which CPU that happens. It's probably CPU13 as that continues
with the expected 10.3 output, but that's ~1.71 seconds later.
> 2026-01-12T04:24:38.674625+01:00 C13;exc_nmi: 10.3
> 2026-01-12T04:24:38.674625+01:00 C13;default_do_nmi
> 2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: type=0x0
> 2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a=0xffffffffa1612de0
> 2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
> 2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 0
> 2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 1
> 2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 2
> 2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2
> 2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2.6
> 2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 0
> 2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
> 2026-01-12T04:24:38.674625+01:00 C13;watchdog_overflow_callback: 0
> 2026-01-12T04:24:38.674625+01:00 C13;__ktime_get_fast_ns_debug: 0.1
> 2026-01-12T04:24:38.674625+01:00 C13;tk_clock_read_debug: read=read_hpet+0x0/0xf0
> 2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0
> 2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0.1
> 2026-01-12T04:24:38.674625+01:00 T0;exc_nmi: 0
Same picture as above, but this time on CPU2, with a delay of 0.68 seconds.
> 2026-01-12T04:24:39.355101+01:00 C2;exc_nmi: 10.3
> 2026-01-12T04:24:39.355101+01:00 C2;default_do_nmi
> 2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: type=0x0
> 2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a=0xffffffffa1612de0
> 2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
> 2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 0
> 2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 1
> 2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 2
> 2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2
> 2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2.6
> 2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 0
> 2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
> 2026-01-12T04:24:39.355101+01:00 C2;watchdog_overflow_callback: 0
> 2026-01-12T04:24:39.355101+01:00 C2;__ktime_get_fast_ns_debug: 0.1
> 2026-01-12T04:24:39.355101+01:00 C2;tk_clock_read_debug: read=read_hpet+0x0/0xf0
> 2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0
> 2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0.1
> 2026-01-12T04:24:39.355101+01:00 T0;exc_nmi: 0
Again on CPU0, with a delay of 0.06 seconds.
> 2026-01-12T04:24:39.410207+01:00 C0;exc_nmi: 10.3
> 2026-01-12T04:24:39.410207+01:00 C0;default_do_nmi
> 2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: type=0x0
> 2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a=0xffffffffa1612de0
> 2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
> 2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 0
> 2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 1
> 2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 2
> 2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2
> 2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2.6
> 2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 0
> 2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
> 2026-01-12T04:24:39.410207+01:00 C0;watchdog_overflow_callback: 0
> 2026-01-12T04:24:39.410207+01:00 C0;__ktime_get_fast_ns_debug: 0.1
> 2026-01-12T04:24:39.410207+01:00 C0;tk_clock_read_debug: read=read_hpet+0x0/0xf0
> 2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0
> 2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0.1
> 2026-01-12T04:24:39.410207+01:00 T0;exc_nmi: 0
....
> In the case of the crash the NMI handler never returns, because accessing
> the HPET triggers another NMI. This goes on until a crash happens, probably because
> of a stack overflow.
No. NMI nesting is only one level deep and immediately returns:
	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
		this_cpu_write(nmi_state, NMI_LATCHED);
		return;
	}
So it's not a stack overflow. What's more likely is that after a while
_ALL_ CPUs are hung up in the NMI handler after they tripped over the
HPET read.
> The behaviour described here seems to be similar to the bug that commit
> 3d5f4f15b778 ("watchdog: skip checks when panic is in progress") fixes, but
> this is actually a different bug, as kernel 6.18 (which contains 3d5f4f15b778)
> is also affected (I've conducted 5 tests with 6.18 so far and got 4 crashes,
> after 0.5h, 1h, 4.5h and 1.5h of testing).
> Nevertheless these look similar enough to CC the involved people.
There is nothing similar.
Your problem originates from a screwed up hardware state which in turn
causes the HPET to go haywire for unknown reasons.
What is the physical address of this ACPI handler access:
logical_addr_ptr = ffffc066977b3000
along with the full output of /proc/iomem
Thanks,
tglx
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
2026-01-13 15:24 ` Thomas Gleixner
@ 2026-01-13 17:50 ` Bert Karwatzki
2026-01-13 19:30 ` Thomas Gleixner
0 siblings, 1 reply; 12+ messages in thread
From: Bert Karwatzki @ 2026-01-13 17:50 UTC (permalink / raw)
To: Thomas Gleixner, linux-kernel
Cc: linux-next, spasswolf, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton, W_Armin
On Tuesday, 13.01.2026 at 16:24 +0100, Thomas Gleixner wrote:
> On Tue, Jan 13 2026 at 10:41, Bert Karwatzki wrote:
> > Here's the result in case of the crash:
> > 2026-01-12T04:24:36.809904+01:00 T1510;acpi_ex_system_memory_space_handler 255: logical_addr_ptr = ffffc066977b3000
> > 2026-01-12T04:24:36.846170+01:00 C14;exc_nmi: 0
>
> Here the NMI triggers in non-task context on CPU14
>
> > 2026-01-12T04:24:36.960760+01:00 C14;exc_nmi: 10.3
> > 2026-01-12T04:24:36.960760+01:00 C14;default_do_nmi
> > 2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: type=0x0
> > 2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a=0xffffffffa1612de0
> > 2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
> > 2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 0
> > 2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 1
> > 2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 2
> > 2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2
> > 2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2.6
> > 2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 0
> > 2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
> > 2026-01-12T04:24:36.960760+01:00 C14;watchdog_overflow_callback: 0
> > 2026-01-12T04:24:36.960760+01:00 C14;__ktime_get_fast_ns_debug: 0.1
> > 2026-01-12T04:24:36.960760+01:00 C14;tk_clock_read_debug: read=read_hpet+0x0/0xf0
> > 2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0
> > 2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0.1
>
> > 2026-01-12T04:24:36.960760+01:00 T0;exc_nmi: 0
>
> This one triggers in task context of PID0, aka idle task, but it's not
> clear on which CPU that happens. It's probably CPU13 as that continues
> with the expected 10.3 output, but that's almost ~1.71 seconds later.
>
The long delays seem to be typical for the first NMI after trying to access
the broken memory at phys_addr 0xf0100000. Here's an example from an earlier
run with more printk()s in that part of the code (too many printk()s seem to
cause additional system freezes ...):
2026-01-03T14:10:10.312182+01:00 T1511;acpi_ex_system_memory_space_handler 255: logical_addr_ptr = ffffbaa49c15d000
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 0
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 1
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 2
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 3
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 4
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 5
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 6
2026-01-03T14:10:10.616281+01:00 T0;exc_nmi: 7
2026-01-03T14:10:10.616281+01:00 T0;irqentry_nmi_enter: 0
2026-01-03T14:10:10.616281+01:00 T0;irqentry_nmi_enter: 1
2026-01-03T14:10:11.055800+01:00 C8;irqentry_nmi_enter: 2
2026-01-03T14:10:11.055800+01:00 C8;irqentry_nmi_enter: 3
2026-01-03T14:10:11.055800+01:00 C8;irqentry_nmi_enter: 4
2026-01-03T14:10:11.055800+01:00 C8;irqentry_nmi_enter: 5
2026-01-03T14:10:11.055800+01:00 C8;irqentry_nmi_enter: irq_state=0x0
2026-01-03T14:10:11.055800+01:00 C8;exc_nmi: 8
2026-01-03T14:10:11.055800+01:00 C8;exc_nmi: 9
2026-01-03T14:10:11.055800+01:00 C8;exc_nmi: 10.3
The positions of the printk()s in irqentry_nmi_enter() were as follows:
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e33691d5adf7..42cba2ea7aa1 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -370,12 +370,18 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 {
 	irqentry_state_t irq_state;
 
+	printk(KERN_INFO "%s: 0\n", __func__);
 	irq_state.lockdep = lockdep_hardirqs_enabled();
 
+	printk(KERN_INFO "%s: 1\n", __func__);
 	__nmi_enter();
+	printk(KERN_INFO "%s: 2\n", __func__);
 	lockdep_hardirqs_off(CALLER_ADDR0);
+	printk(KERN_INFO "%s: 3\n", __func__);
 	lockdep_hardirq_enter();
+	printk(KERN_INFO "%s: 4\n", __func__);
 	ct_nmi_enter();
+	printk(KERN_INFO "%s: 5\n", __func__);
 
 	instrumentation_begin();
 	kmsan_unpoison_entry_regs(regs);
@@ -383,6 +389,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
 	instrumentation_end();
 
+	printk(KERN_INFO "%s: irq_state=0x%x\n", __func__, irq_state);
 	return irq_state;
 }
> What's more likely is that after a while
> _ALL_ CPUs are hung up in the NMI handler after they tripped over the
> HPET read.
I'm not sure about that; my latest test run (with v6.18) crashed with only one message
from exc_nmi().
>
> > The behaviour described here seems to be similar to the bug that commit
> > 3d5f4f15b778 ("watchdog: skip checks when panic is in progress") fixes, but
> > this is actually a different bug, as kernel 6.18 (which contains 3d5f4f15b778)
> > is also affected (I've conducted 5 tests with 6.18 so far and got 4 crashes,
> > after 0.5h, 1h, 4.5h and 1.5h of testing).
> > Nevertheless these look similar enough to CC the involved people.
>
> There is nothing similar.
>
> Your problem originates from a screwed up hardware state which in turn
> causes the HPET to go haywire for unknown reasons.
>
> What is the physical address of this ACPI handler access:
>
> logical_addr_ptr = ffffc066977b3000
>
> along with the full output of /proc/iomem
The physical address is 0xf0100000
$ cat /proc/iomem
00000000-00000fff : Reserved
00001000-0009ffff : System RAM
000a0000-000fffff : Reserved
000a0000-000dffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-09bfefff : System RAM
09bff000-0a000fff : Reserved
0a001000-0a1fffff : System RAM
0a200000-0a20efff : ACPI Non-volatile Storage
0a20f000-e6057fff : System RAM
15000000-15b252c1 : Kernel code
15c00000-15f60fff : Kernel rodata
16000000-1610e27f : Kernel data
165ce000-167fffff : Kernel bss
9c000000-dbffffff : Crash kernel
e6058000-e614bfff : Reserved
e614c000-e868afff : System RAM
e868b000-e868bfff : Reserved
e868c000-e9cdefff : System RAM
e9cdf000-eb1fdfff : Reserved
eb1dd000-eb1e0fff : MSFT0101:00
eb1e1000-eb1e4fff : MSFT0101:00
eb1fe000-eb25dfff : ACPI Tables
eb25e000-eb555fff : ACPI Non-volatile Storage
eb556000-ed1fefff : Reserved
ed1ff000-edffffff : System RAM
ee000000-efffffff : Reserved
f0000000-fcffffff : PCI Bus 0000:00
f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f]
f0000000-f7ffffff : pnp 00:00
fc500000-fc9fffff : PCI Bus 0000:08
fc500000-fc5fffff : 0000:08:00.7
fc500000-fc5fffff : pcie_mp2_amd
fc600000-fc6fffff : 0000:08:00.4
fc600000-fc6fffff : xhci-hcd
fc700000-fc7fffff : 0000:08:00.3
fc700000-fc7fffff : xhci-hcd
fc800000-fc8fffff : 0000:08:00.2
fc800000-fc8fffff : ccp
fc900000-fc97ffff : 0000:08:00.0
fc980000-fc9bffff : 0000:08:00.5
fc980000-fc9bffff : AMD ACP3x audio
fc980000-fc990200 : acp_pdm_iomem
fc9c0000-fc9c7fff : 0000:08:00.6
fc9c0000-fc9c7fff : ICH HD audio
fc9c8000-fc9cbfff : 0000:08:00.1
fc9c8000-fc9cbfff : ICH HD audio
fc9cc000-fc9cdfff : 0000:08:00.7
fc9ce000-fc9cffff : 0000:08:00.2
fc9ce000-fc9cffff : ccp
fca00000-fccfffff : PCI Bus 0000:01
fca00000-fcbfffff : PCI Bus 0000:02
fca00000-fcbfffff : PCI Bus 0000:03
fca00000-fcafffff : 0000:03:00.0
fcb00000-fcb1ffff : 0000:03:00.0
fcb20000-fcb23fff : 0000:03:00.1
fcb20000-fcb23fff : ICH HD audio
fcc00000-fcc03fff : 0000:01:00.0
fcd00000-fcdfffff : PCI Bus 0000:07
fcd00000-fcd03fff : 0000:07:00.0
fcd00000-fcd03fff : nvme
fce00000-fcefffff : PCI Bus 0000:06
fce00000-fce03fff : 0000:06:00.0
fce00000-fce03fff : nvme
fcf00000-fcffffff : PCI Bus 0000:05
fcf00000-fcf03fff : 0000:05:00.0
fcf04000-fcf04fff : 0000:05:00.0
fcf04000-fcf04fff : r8169
fd300000-fd37ffff : amd_iommu
fec00000-fec003ff : IOAPIC 0
fec01000-fec013ff : IOAPIC 1
fec10000-fec10fff : Reserved
fec10000-fec10fff : pnp 00:04
fed00000-fed00fff : Reserved
fed00000-fed003ff : HPET 0
fed00000-fed003ff : PNP0103:00
fed40000-fed44fff : Reserved
fed80000-fed8ffff : Reserved
fed81200-fed812ff : AMDI0030:00
fed81500-fed818ff : AMDI0030:00
fed81500-fed818ff : AMDI0030:00 AMDI0030:00
fedc0000-fedc0fff : pnp 00:04
fedc4000-fedc9fff : Reserved
fedc5000-fedc5fff : AMDI0010:03
fedc5000-fedc5fff : AMDI0010:03 AMDI0010:03
fedcc000-fedcefff : Reserved
fedd5000-fedd5fff : Reserved
fee00000-fee00fff : pnp 00:04
ff000000-ffffffff : pnp 00:04
100000000-3ee2fffff : System RAM
3ee300000-40fffffff : Reserved
410000000-ffffffffff : PCI Bus 0000:00
fc00000000-fe0fffffff : PCI Bus 0000:01
fc00000000-fe0fffffff : PCI Bus 0000:02
fc00000000-fe0fffffff : PCI Bus 0000:03
fc00000000-fdffffffff : 0000:03:00.0
fe00000000-fe0fffffff : 0000:03:00.0
fe20000000-fe301fffff : PCI Bus 0000:08
fe20000000-fe2fffffff : 0000:08:00.0
fe30000000-fe301fffff : 0000:08:00.0
fe30300000-fe304fffff : PCI Bus 0000:04
fe30300000-fe303fffff : 0000:04:00.0
fe30300000-fe303fffff : 0000:04:00.0
fe30400000-fe30403fff : 0000:04:00.0
fe30404000-fe30404fff : 0000:04:00.0
>
> Thanks,
>
> tglx
Thank you,
Bert Karwatzki
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
2026-01-13 17:50 ` Bert Karwatzki
@ 2026-01-13 19:30 ` Thomas Gleixner
2026-01-13 21:15 ` Jason Gunthorpe
2026-01-13 22:19 ` Bert Karwatzki
0 siblings, 2 replies; 12+ messages in thread
From: Thomas Gleixner @ 2026-01-13 19:30 UTC (permalink / raw)
To: Bert Karwatzki, linux-kernel
Cc: linux-next, spasswolf, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton, W_Armin
On Tue, Jan 13 2026 at 18:50, Bert Karwatzki wrote:
> On Tuesday, 13.01.2026 at 16:24 +0100, Thomas Gleixner wrote:
>
>> What's more likely is that after a while _ALL_ CPUs are hung up in
>> the NMI handler after they tripped over the HPET read.
>
> I'm not sure about that; my latest test run (with v6.18) crashed with
> only one message from exc_nmi().
What does "crashed" mean? Did it actually crash and output something, or does
the machine just go dead? I assume the latter, as you have no output.
>> along with the full output of /proc/iomem
>
> The physical address is 0xf0100000
>
> $ cat /proc/iomem
> f0000000-fcffffff : PCI Bus 0000:00
> f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f]
> f0000000-f7ffffff : pnp 00:00
That's the memory mapped PCI config space and this tries to access:
  MMIO_START             0xf0000000
  BUSNUM    0x01 << 20   0x00100000
  SLOT/FN   0x00 << 12   0x00000000
  OFFSET    0x00 <<  0   0x00000000
                         ----------
                         0xf0100000
Offset 0 is vendor/device ID IIRC.
Anyway if that access does not complete because of a hardware issue,
then any subsequent access to the MMIO mapped HPET goes stale as well.
As the HPET is the active clocksource on your machine, this obviously
does not only affect the NMI watchdog readout, it affects the regular
timekeeper accesses too and all other MMIO accesses all over the place.
So gradually your machine just stalls on outstanding MMIO transactions
w/o further notice... The NMI is just a red herring.
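(For reference, the HPET clocksource readout boils down to a single MMIO load of
the main counter; a heavily simplified sketch of read_hpet() from
arch/x86/kernel/hpet.c, which in reality wraps this load in a lockless contention
protocol:)

/* Heavily simplified from read_hpet() in arch/x86/kernel/hpet.c. */
static u64 read_hpet_simplified(struct clocksource *cs)
{
	/* HPET_COUNTER is offset 0xf0 of the MMIO block at 0xfed00000 on
	 * this machine (see the HPET entry in /proc/iomem). If the fabric
	 * never completes this load, the CPU stalls right here. */
	return (u64)hpet_readl(HPET_COUNTER);
}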
You need to figure out why that MMIO access to that device's
configuration space stalls as anything else is just subsequent
damage.
There is not much that can be done about that unless the PCI bus raises
a failure interrupt and some magic reset sequence aborts the outstanding
stalled transactions.
Whether that's feasible or not, I don't know. The failure mechanism
might run into the same stall scenario when accessing the PCI muck for
reset...
Sorry for not being helpful.
Thanks,
tglx
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
2026-01-13 19:30 ` Thomas Gleixner
@ 2026-01-13 21:15 ` Jason Gunthorpe
2026-01-13 22:19 ` Bert Karwatzki
1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-01-13 21:15 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Bert Karwatzki, linux-kernel, linux-next, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jonathan Cameron, Joel Granados, John Ogness, Kees Cook,
Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann, Nam Cao,
oushixiong, Petr Mladek, Qianqiang Liu, Sergey Senozhatsky,
Sohil Mehta, Tejun Heo, Thomas Zimmermann, Thorsten Blum,
Ville Syrjala, Vivek Goyal, Yunhui Cui, Andrew Morton, W_Armin
On Tue, Jan 13, 2026 at 08:30:46PM +0100, Thomas Gleixner wrote:
> So gradually your machine just stalls on outstanding MMIO transactions
> w/o further notice... The NMI is just a red herring.
CPUs usually have timeouts for these things and return all-0xFF data back
for the timed-out read. Beyond that, "it depends" whether any other RAS
indications are raised.
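(For illustration, this is how such a timeout usually becomes visible to software:
the all-ones completion makes a vendor-ID probe fail. A sketch of the common
presence test, along the lines of what pci_device_is_present() does; the helper
name is made up:)

/* A config read to a dead or stalled device typically completes as all
 * ones, if the root complex times it out at all. */
static bool dev_still_there(struct pci_dev *pdev)
{
	u32 id;

	if (pci_read_config_dword(pdev, PCI_VENDOR_ID, &id))
		return false;

	return id != 0xffffffff;	/* all ones -> no completer responded */
}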
> You need to figure out why that MMIO access to that device's
> configuration space stalls as anything else is just subsequent
> damage.
Given this is a resume, it seems likely the PCI routing inside the
bridge chip has been messed up somehow during the suspend/resume,
possibly due to errata in the bridge; there are many weird bridge
errata :\
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
2026-01-13 19:30 ` Thomas Gleixner
2026-01-13 21:15 ` Jason Gunthorpe
@ 2026-01-13 22:19 ` Bert Karwatzki
2026-01-20 10:27 ` crash during resume of PCIe bridge in v5.17 (v5.16 works) Bert Karwatzki
1 sibling, 1 reply; 12+ messages in thread
From: Bert Karwatzki @ 2026-01-13 22:19 UTC (permalink / raw)
To: Thomas Gleixner, linux-kernel
Cc: spasswolf, linux-next, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton, W_Armin
On Tuesday, 13.01.2026 at 20:30 +0100, Thomas Gleixner wrote:
> On Tue, Jan 13 2026 at 18:50, Bert Karwatzki wrote:
> > On Tuesday, 13.01.2026 at 16:24 +0100, Thomas Gleixner wrote:
> >
> > > What's more likely is that after a while _ALL_ CPUs are hung up in
> > > the NMI handler after they tripped over the HPET read.
> >
> > I'm not sure about that; my latest test run (with v6.18) crashed with
> > only one message from exc_nmi().
>
> What does "crashed" mean? Did it actually crash and output something, or does
> the machine just go dead? I assume the latter, as you have no output.
>
"Crashed" means the machine reboots, and booting into the new kernel then fails
(i.e. it drops to rescue mode)
because one of my 2 NVMe devices (not the root fs) is missing from the PCI bus
(the missing device reappears after a shutdown).
When sound is running while such a crash occurs, the sound loops for about ~2s
before the reboot occurs (a behaviour I've seen once during an unrelated NULL ptr deref in an
interrupt handler). There's no additional error message output when such
a crash occurs, only my printk()s.
> > > along with the full output of /proc/iomem
> >
> > The physical address is 0xf0100000
> >
> > $ cat /proc/iomem
> > f0000000-fcffffff : PCI Bus 0000:00
> > f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f]
> > f0000000-f7ffffff : pnp 00:00
>
> That's the memory mapped PCI config space and this tries to access:
>
>   MMIO_START             0xf0000000
>   BUSNUM    0x01 << 20   0x00100000
>   SLOT/FN   0x00 << 12   0x00000000
>   OFFSET    0x00 <<  0   0x00000000
>                          ----------
>                          0xf0100000
>
> Offset 0 is vendor/device ID IIRC.
>
> Anyway if that access does not complete because of a hardware issue,
> then any subsequent access to the MMIO mapped HPET goes stale as well.
>
> As the HPET is the active clocksource on your machine, this obviously
> does not only affect the NMI watchdog readout, it affects the regular
> timekeeper accesses too and all other MMIO accesses all over the place.
> So gradually your machine just stalls on outstanding MMIO transactions
> w/o further notice... The NMI is just a red herring.
>
I've already tried different clocksources, with mixed results. tsc does not
work on my system as it is unstable, and acpi_pm also crashed. With jiffies
I had mixed results: it might have avoided one crash, but crashed on another
test run in the same situation.
> You need to figure out why that MMIO access to that device's
> configuration space stalls as anything else is just subsequent
> damage.
>
> There is not much that can be done about that unless the PCI bus raises
> a failure interrupt and some magic reset sequence aborts the outstanding
> stalled transactions.
>
> Whether that's feasible or not, I don't know. The failure mechanism
> might run into the same stall scenario when accessing the PCI muck for
> reset...
In the two test runs without CONFIG_HARDLOCKUP_DETECTOR I simply lost the discrete
GPU without a crash, i.e. after 4 and 8.5h the GPU would not resume any more, and
accessing the GPU with DRI_PRIME=1 glxinfo just gives a (user) segfault. But I need
more test runs to see if this behaviour is consistent.
>
> Sorry for not being helpful.
>
> Thanks,
>
> tglx
>
Thank you,
Bert Karwatzki
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: crash during resume of PCIe bridge in v5.17 (v5.16 works)
2026-01-13 22:19 ` Bert Karwatzki
@ 2026-01-20 10:27 ` Bert Karwatzki
2026-02-01 0:36 ` crash during resume of PCIe bridge from v5.17 to next-20260130 " Bert Karwatzki
0 siblings, 1 reply; 12+ messages in thread
From: Bert Karwatzki @ 2026-01-20 10:27 UTC (permalink / raw)
To: Thomas Gleixner, linux-kernel
Cc: linux-next, spasswolf, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton, W_Armin
There is exciting news here:
I tested older Linux versions with this script:
#!/bin/bash
for i in {0..20000}
do
	echo $i
	evolution &
	sleep 5
	killall evolution
	sleep 5
done
with the following results:
Versions v5.17, v6.1 and v6.8 show more or less the same
behaviour as v6.12+, i.e. repeated resumes crash after a
while (sometimes the discrete GPU is lost without a crash).
Version v5.15 shows a different behaviour:
5.15.0-stable-dirty booted 14:28, 16.1.2026 error 14:52 (25min, 142 resumes)
[ 1453.515962] [ T18093] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
[ 1453.515978] [ T18093] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
[ 1453.516046] [ T18093] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62). (-ETIME!)
5.15.0-stable-dirty booted 17:09, 16.1.2026 error 20:18 (3h, 1102 resumes)
[11337.547257] [ T157373] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
[11337.547273] [ T157373] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
[11337.547358] [ T157373] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
5.15.0-stable-dirty booted 20:51, 16.1.2026 error 21:20 (30min, 164 resumes)
[ 1698.065653] [ T22129] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
[ 1698.065665] [ T22129] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
[ 1698.065734] [ T22129] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
5.15.0-stable-dirty booted 21:25, 16.1.2026 error 21:41 (10min, 91 resumes)
[ 965.908197] [ T3843] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
[ 965.908212] [ T3843] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
[ 965.908284] [ T3843] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
5.15.0-stable-dirty booted 21:46, 16.1.2026 error 1:43 17.1.2026 (4h, 1411 resumes)
[14220.044577] [ T203585] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
[14220.044593] [ T203585] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
[14220.044662] [ T203585] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
In all 5 tests the resume failed, but with an error message not seen before; also, no crash occurred.
And most importantly, v5.16 did not crash at all and did not show any GPU suspend/resume-related error
despite being tested for 36h and 11644 resumes:
Testing (dirty because of -Wno-error=use-after-free and -Wno-error=format-truncation compile fix)
5.16.0-stable-dirty booted 2:09, 17.1.2026; no error until 14:20, 18.1.2026 (neither crash nor loss of device)
(36h, 11644 resumes)
So the whole issue seems to be a pure software issue after all (not just an issue related
to probably broken hardware). I'm currently bisecting this between v5.16 and v5.17, but getting a
result can take 2 weeks given the length of the test runs.
The first step of the bisection is already finished and GOOD:
Testing (dirty because of -Wno-error=use-after-free and -Wno-error=format-truncation compile fix)
5.16.0-bisect-07203-g22ef12195e13-dirty booted 14:27, 18.1.2026 no error 1:57 (35.5h, 12508 resumes)
Bert Karwatzki
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: crash during resume of PCIe bridge from v5.17 to next-20260130 (v5.16 works)
2026-01-20 10:27 ` crash during resume of PCIe bridge in v5.17 (v5.16 works) Bert Karwatzki
@ 2026-02-01 0:36 ` Bert Karwatzki
2026-02-01 10:19 ` Armin Wolf
2026-02-01 16:42 ` Thomas Gleixner
0 siblings, 2 replies; 12+ messages in thread
From: Bert Karwatzki @ 2026-02-01 0:36 UTC (permalink / raw)
To: Thomas Gleixner, linux-kernel
Cc: linux-next, spasswolf, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton, W_Armin
I found the error: the commit
("drm/amd: Check if ASPM is enabled from PCIe subsystem")
has been applied twice, first as cba07cce39ac and a second time
as 7294863a6f01, after it had been superseded by commit
0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device").
This effectively disables ASPM globally after the built-in GPU (which does not
support ASPM) is probed. This is the reason for the crashes and loss-of-device
errors, which on average occur after ~1000 resumes of the discrete GPU.
snippet from git log --oneline drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c in linux-next:
158a05a0b885 drm/amdgpu: Add use_xgmi_p2p module parameter
7294863a6f01 drm/amd: Check if ASPM is enabled from PCIe subsystem <--- This does not belong here!
b784f42cf78b drm/amdgpu: drop testing module parameter
0b1a63487b0f drm/amdgpu: drop benchmark module parameter
cec2cc7b1c4a drm/amdgpu: Fix typo in *whether* in comment
0ab5d711ec74 drm/amd: Refactor `amdgpu_aspm` to be evaluated per device <--- This removes the code from the previous commit.
cba07cce39ac drm/amd: Check if ASPM is enabled from PCIe subsystem <--- The first time the commit was applied.
dfcc3e8c24cc drm/amdgpu: make cyan skillfish support code more consistent
The fix is simply to revert commit 7294863a6f01.
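To illustrate the mechanism (a sketch reconstructed from the commit subjects and the
behaviour described above, not the verbatim hunks; amdgpu_aspm and
pcie_aspm_enabled() are real, the surrounding shape is my reading):

/* Re-applied hunk (7294863a6f01), running in the probe path: the iGPU,
 * which has no ASPM, clears the *module-wide* parameter and thereby
 * disables ASPM for every device probed afterwards, including the dGPU. */
if (amdgpu_aspm == -1 && !pcie_aspm_enabled(pdev))
	amdgpu_aspm = 0;

/* The refactor (0ab5d711ec74) answers the question per device instead,
 * so one device can no longer affect the others: */
bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev)
{
	switch (amdgpu_aspm) {
	case -1:
		break;
	case 0:
		return false;
	default:
		return true;
	}
	return pcie_aspm_enabled(adev->pdev);
}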
I sent a patch for linux-next (unfortunately without CC'ing stable) and a separate patch for
v6.18.8; I hope this does not cause confusion ...
Bert Karwatzki
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: crash during resume of PCIe bridge from v5.17 to next-20260130 (v5.16 works)
2026-02-01 0:36 ` crash during resume of PCIe bridge from v5.17 to next-20260130 " Bert Karwatzki
@ 2026-02-01 10:19 ` Armin Wolf
2026-02-01 11:42 ` Rafael J. Wysocki
2026-02-01 16:42 ` Thomas Gleixner
1 sibling, 1 reply; 12+ messages in thread
From: Armin Wolf @ 2026-02-01 10:19 UTC (permalink / raw)
To: Bert Karwatzki, Thomas Gleixner, linux-kernel
Cc: linux-next, Mario Limonciello, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, Christian König, regressions,
linux-pci, linux-acpi, Rafael J. Wysocki, acpica-devel,
Robert Moore, Saket Dumbre, Bjorn Helgaas, Clemens Ladisch,
Jinchao Wang, Yury Norov, Anna Schumaker, Baoquan He,
Darrick J. Wong, Dave Young, Doug Anderson, Guilherme G. Piccoli,
Helge Deller, Ingo Molnar, Jason Gunthorpe, Jonathan Cameron,
Joel Granados, John Ogness, Kees Cook, Li Huafei, Luck, Tony,
Luo Gengkun, Max Kellermann, Nam Cao, oushixiong, Petr Mladek,
Qianqiang Liu, Sergey Senozhatsky, Sohil Mehta, Tejun Heo,
Thomas Zimmermann, Thorsten Blum, Ville Syrjala, Vivek Goyal,
Yunhui Cui, Andrew Morton
On 01.02.26 at 01:36, Bert Karwatzki wrote:
> I found the error: the commit
> ("drm/amd: Check if ASPM is enabled from PCIe subsystem")
> has been applied twice, first as cba07cce39ac and a second time
> as 7294863a6f01, after it had been superseded by commit
> 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device").
> This effectively disables ASPM globally after the built-in GPU (which does not
> support ASPM) is probed. This is the reason for the crashes and loss-of-device
> errors, which on average occur after ~1000 resumes of the discrete GPU.
>
> snippet from git log --oneline drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c in linux-next:
> 158a05a0b885 drm/amdgpu: Add use_xgmi_p2p module parameter
> 7294863a6f01 drm/amd: Check if ASPM is enabled from PCIe subsystem <--- This does not belong here!
> b784f42cf78b drm/amdgpu: drop testing module parameter
> 0b1a63487b0f drm/amdgpu: drop benchmark module parameter
> cec2cc7b1c4a drm/amdgpu: Fix typo in *whether* in comment
> 0ab5d711ec74 drm/amd: Refactor `amdgpu_aspm` to be evaluated per device <--- This removes the code from the previous commit.
> cba07cce39ac drm/amd: Check if ASPM is enabled from PCIe subsystem <--- The first time the commit was applied.
> dfcc3e8c24cc drm/amdgpu: make cyan skillfish support code more consistent
>
> The fix is simply to revert commit 7294863a6f01.
>
> I sent a patch for linux-next (unfortunately without CC'ing stable) and a separate patch for
> v6.18.8; I hope this does not cause confusion ...
>
> Bert Karwatzki
Good work! Thank you for researching the faulty commit that led to this strange behavior.
Thanks,
Armin Wolf
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: crash during resume of PCIe bridge from v5.17 to next-20260130 (v5.16 works)
2026-02-01 10:19 ` Armin Wolf
@ 2026-02-01 11:42 ` Rafael J. Wysocki
0 siblings, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2026-02-01 11:42 UTC (permalink / raw)
To: Armin Wolf, Bert Karwatzki
Cc: Thomas Gleixner, linux-kernel, linux-next, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton
On Sun, Feb 1, 2026 at 11:20 AM Armin Wolf <W_Armin@gmx.de> wrote:
>
> On 01.02.26 at 01:36, Bert Karwatzki wrote:
>
> > I found the error: the commit
> > ("drm/amd: Check if ASPM is enabled from PCIe subsystem")
> > has been applied twice, first as cba07cce39ac and a second time
> > as 7294863a6f01, after it had been superseded by commit
> > 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device").
> > This effectively disables ASPM globally after the built-in GPU (which does not
> > support ASPM) is probed. This is the reason for the crashes and loss-of-device
> > errors, which on average occur after ~1000 resumes of the discrete GPU.
> >
> > snippet from git log --oneline drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c in linux-next:
> > 158a05a0b885 drm/amdgpu: Add use_xgmi_p2p module parameter
> > 7294863a6f01 drm/amd: Check if ASPM is enabled from PCIe subsystem <--- This does not belong here!
> > b784f42cf78b drm/amdgpu: drop testing module parameter
> > 0b1a63487b0f drm/amdgpu: drop benchmark module parameter
> > cec2cc7b1c4a drm/amdgpu: Fix typo in *whether* in comment
> > 0ab5d711ec74 drm/amd: Refactor `amdgpu_aspm` to be evaluated per device <--- This removes the code from the previous commit.
> > cba07cce39ac drm/amd: Check if ASPM is enabled from PCIe subsystem <--- The first time the commit was applied.
> > dfcc3e8c24cc drm/amdgpu: make cyan skillfish support code more consistent
> >
> > The fix is simply to revert commit 7294863a6f01.
> >
> > I sent a patch for linux-next (unfortunately without CC'ing stable) and a separate patch for
> > v6.18.8; I hope this does not cause confusion ...
> >
> > Bert Karwatzki
>
> Good work! Thank you for researching the faulty commit that lead to this strange behavior.
Yes, nice work, thanks!
I wish all of the reporters of kernel issues were so persistent.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: crash during resume of PCIe bridge from v5.17 to next-20260130 (v5.16 works)
2026-02-01 0:36 ` crash during resume of PCIe bridge from v5.17 to next-20260130 " Bert Karwatzki
2026-02-01 10:19 ` Armin Wolf
@ 2026-02-01 16:42 ` Thomas Gleixner
2026-02-02 10:37 ` Christian König
1 sibling, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2026-02-01 16:42 UTC (permalink / raw)
To: Bert Karwatzki, linux-kernel
Cc: linux-next, spasswolf, Mario Limonciello,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
Christian König, regressions, linux-pci, linux-acpi,
Rafael J. Wysocki, acpica-devel, Robert Moore, Saket Dumbre,
Bjorn Helgaas, Clemens Ladisch, Jinchao Wang, Yury Norov,
Anna Schumaker, Baoquan He, Darrick J. Wong, Dave Young,
Doug Anderson, Guilherme G. Piccoli, Helge Deller, Ingo Molnar,
Jason Gunthorpe, Jonathan Cameron, Joel Granados, John Ogness,
Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun, Max Kellermann,
Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton, W_Armin
On Sun, Feb 01 2026 at 01:36, Bert Karwatzki wrote:
> I found the error: the commit
> ("drm/amd: Check if ASPM is enabled from PCIe subsystem")
> has been applied twice, first as cba07cce39ac and a second time
> as 7294863a6f01, after it had been superseded by commit
> 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device").
> This effectively disables ASPM globally after the built-in GPU (which does not
> support ASPM) is probed. This is the reason for the crashes and loss-of-device
> errors, which on average occur after ~1000 resumes of the discrete GPU.
Wow. Nice detective work...
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: crash during resume of PCIe bridge from v5.17 to next-20260130 (v5.16 works)
2026-02-01 16:42 ` Thomas Gleixner
@ 2026-02-02 10:37 ` Christian König
0 siblings, 0 replies; 12+ messages in thread
From: Christian König @ 2026-02-02 10:37 UTC (permalink / raw)
To: Thomas Gleixner, Bert Karwatzki, linux-kernel
Cc: linux-next, Mario Limonciello, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, regressions, linux-pci,
linux-acpi, Rafael J. Wysocki, acpica-devel, Robert Moore,
Saket Dumbre, Bjorn Helgaas, Clemens Ladisch, Jinchao Wang,
Yury Norov, Anna Schumaker, Baoquan He, Darrick J. Wong,
Dave Young, Doug Anderson, Guilherme G. Piccoli, Helge Deller,
Ingo Molnar, Jason Gunthorpe, Jonathan Cameron, Joel Granados,
John Ogness, Kees Cook, Li Huafei, Luck, Tony, Luo Gengkun,
Max Kellermann, Nam Cao, oushixiong, Petr Mladek, Qianqiang Liu,
Sergey Senozhatsky, Sohil Mehta, Tejun Heo, Thomas Zimmermann,
Thorsten Blum, Ville Syrjala, Vivek Goyal, Yunhui Cui,
Andrew Morton, W_Armin
On 2/1/26 17:42, Thomas Gleixner wrote:
> On Sun, Feb 01 2026 at 01:36, Bert Karwatzki wrote:
>> I found the error: the commit
>> ("drm/amd: Check if ASPM is enabled from PCIe subsystem")
>> has been applied twice, first as cba07cce39ac and a second time
>> as 7294863a6f01, after it had been superseded by commit
>> 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device").
>> This effectively disables ASPM globally after the built-in GPU (which does not
>> support ASPM) is probed. This is the reason for the crashes and loss-of-device
>> errors, which on average occur after ~1000 resumes of the discrete GPU.
>
> Wow. Nice detective work...
Good catch, indeed.
But it is not clear to me why disabling ASPM causes trouble; usually it is the other way around.
Regards,
Christian.
^ permalink raw reply [flat|nested] 12+ messages in thread
Thread overview: 12+ messages
2026-01-13 9:41 NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y Bert Karwatzki
2026-01-13 15:24 ` Thomas Gleixner
2026-01-13 17:50 ` Bert Karwatzki
2026-01-13 19:30 ` Thomas Gleixner
2026-01-13 21:15 ` Jason Gunthorpe
2026-01-13 22:19 ` Bert Karwatzki
2026-01-20 10:27 ` crash during resume of PCIe bridge in v5.17 (v5.16 works) Bert Karwatzki
2026-02-01 0:36 ` crash during resume of PCIe bridge from v5.17 to next-20260130 " Bert Karwatzki
2026-02-01 10:19 ` Armin Wolf
2026-02-01 11:42 ` Rafael J. Wysocki
2026-02-01 16:42 ` Thomas Gleixner
2026-02-02 10:37 ` Christian König