* Re: RE: [Intel-wired-lan] Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
From: Tim Connors @ 2026-04-13 14:01 UTC (permalink / raw)
To: 1104670
Cc: Loktionov, Aleksandr, Hutchings, Ben,
intel-wired-lan@lists.osuosl.org, linux-pci, Pavan Chebbi,
Michael Chan, Laurent Bonnaud, netdev@vger.kernel.org
On Mon, 14 Jul 2025 09:21:25 +0000 "Loktionov, Aleksandr"
<aleksandr.loktionov@intel.com> wrote:
> > On Sun, 2025-05-04 at 13:45 +0200, Laurent Bonnaud wrote:
> > [...]
> > > - Previously the kernel would output an error in
> > >   /var/lib/systemd/pstore/ but would shut down anyway.
> > >
> > > - Now, with kernel 6.1.135-1, the shutdown is blocked as with
> > >   6.12.x kernels (see below).
> > > <30>[ 961.098671] systemd-shutdown[1]: Rebooting.
> > > <6>[ 961.098743] kvm: exiting hardware virtualization
> > > <6>[ 961.361878] megaraid_sas 0000:17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> > > <6>[ 961.414526] ACPI: PM: Preparing to enter system sleep state S5
> > > <0>[ 963.828210] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> > > <0>[ 963.828213] {1}[Hardware Error]: event severity: fatal
> > > <0>[ 963.828214] {1}[Hardware Error]: Error 0, type: fatal
> > > <0>[ 963.828216] {1}[Hardware Error]: section_type: PCIe error
> > > <0>[ 963.828216] {1}[Hardware Error]: port_type: 0, PCIe end point
> > > <0>[ 963.828217] {1}[Hardware Error]: version: 3.0
> > > <0>[ 963.828218] {1}[Hardware Error]: command: 0x0002, status: 0x0010
> > > <0>[ 963.828220] {1}[Hardware Error]: device_id: 0000:01:00.1
> > > <0>[ 963.828221] {1}[Hardware Error]: slot: 6
> > > <0>[ 963.828222] {1}[Hardware Error]: secondary_bus: 0x00
> > > <0>[ 963.828223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
> > > <0>[ 963.828224] {1}[Hardware Error]: class_code: 020000
> > > <0>[ 963.828225] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00018000
> > > <0>[ 963.828226] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
> > > <0>[ 963.828227] {1}[Hardware Error]: TLP Header: 40000001 0000000f 90028090 00000000
>> [...]
>>
>> It seems that this is a known bug in the BIOS of several Dell
>> PowerEdge models including (in this case) the R540.
Yup, R730XD here.
>> A workaround was added to the tg3 driver
>> <https://git.kernel.org/linus/e0efe83ed325277bb70f9435d4d9fc70bebdcca8>
>> and a similar change was proposed (but not accepted) in the i40e
>> driver
>> <https://lore.kernel.org/all/20241227035459.90602-1-yue.zhao@shopee.com/>.
>> On this system the error log points to a device handled by the ixgbe
>> driver, and no workaround has been implemented for that.
>>
>> Since this issue seems to affect multiple different NIC vendors and
>> drivers, would it make more sense to implement this workaround as a
>> PCI quirk?
It's not just network devices either.
<5>[965917.449277] sd 4:0:0:0: [sda] Synchronizing SCSI cache
<6>[965917.614364] [drm] PCIE GART of 256M enabled (table at
0x000000F47FF80000).
<6>[965917.820364] [drm] UVD and UVD ENC initialized successfully.
<6>[965917.921559] [drm] VCE initialized successfully.
<6>[965917.926574] amdgpu 0000:04:00.0: [drm] Cannot find any crtc or sizes
<6>[965917.934684] amdgpu 0000:04:00.0: [drm] Cannot find any crtc or sizes
<0>[965919.725575] {1}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 3
<0>[965919.725582] {1}[Hardware Error]: event severity: fatal
<0>[965919.725587] {1}[Hardware Error]: Error 0, type: fatal
<0>[965919.725591] {1}[Hardware Error]: section_type: PCIe error
<0>[965919.725595] {1}[Hardware Error]: port_type: 1, legacy PCI end point
<0>[965919.725598] {1}[Hardware Error]: version: 1.16
<0>[965919.725602] {1}[Hardware Error]: command: 0x0407, status: 0x0010
<0>[965919.725607] {1}[Hardware Error]: device_id: 0000:04:00.1
<0>[965919.725611] {1}[Hardware Error]: slot: 0
<0>[965919.725614] {1}[Hardware Error]: secondary_bus: 0x00
<0>[965919.725617] {1}[Hardware Error]: vendor_id: 0x1002, device_id:
0xaae0
<0>[965919.725622] {1}[Hardware Error]: class_code: 040300
<0>[965919.725625] {1}[Hardware Error]: aer_cor_status: 0x00002000,
aer_cor_mask: 0x000031c0
<0>[965919.725630] {1}[Hardware Error]: aer_uncor_status: 0x00100000,
aer_uncor_mask: 0x00010000
<0>[965919.725635] {1}[Hardware Error]: aer_uncor_severity: 0x004e7030
<0>[965919.725638] {1}[Hardware Error]: TLP Header: 40008001 00000a0f
96a121a0 00000000
<0>[965919.725646] GHES: Fatal hardware error but panic disabled
<0>[965919.725650] Kernel panic - not syncing: GHES: Fatal hardware error
<4>[965919.725662] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: P
O 6.14.11-5-bpo12-pve #1
<4>[965919.725676] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
<4>[965919.725689] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS
2.19.0 12/12/2023
<4>[965919.725694] Call Trace:
<4>[965919.725700] <NMI>
<4>[965919.725706] dump_stack_lvl+0x27/0xa0
<4>[965919.725722] dump_stack+0x10/0x20
<4>[965919.725729] panic+0x358/0x3b0
<4>[965919.725742] __ghes_panic+0x60/0x80
<4>[965919.725756] ghes_notify_nmi+0x1d5/0x380
<4>[965919.725768] nmi_handle.part.0+0x58/0x160
<4>[965919.725781] default_do_nmi+0x131/0x170
<4>[965919.725792] exc_nmi+0x1c4/0x290
<4>[965919.725799] end_repeat_nmi+0xf/0x53
<4>[965919.725816] RIP: 0010:intel_idle+0x51/0x90
<4>[965919.725824] Code: 2d 80 ca 2b 00 eb 52 cc cc cc 48 89 f0 0f 1f 00 31
d2 48 89 d1 0f 01 c8 48 8b 06 a8 08 75 0b b9 01 00 00 00 4c 89 c0 0f 01 c9
<f0> 80 66 02 df f0 83 44 24 fc 00 48 8b 06 a8 08 74 0b 65 81 25 ea
<4>[965919.725830] RSP: 0018:ffffffff8ec03db0 EFLAGS: 00000046
<4>[965919.725837] RAX: 0000000000000020 RBX: ffff8aa2ffa44680 RCX:
0000000000000001
<4>[965919.725841] RDX: 0000000000000000 RSI: ffffffff8ec107c0 RDI:
0000000000000004
<4>[965919.725849] RBP: ffffffff8ec03df0 R08: 0000000000000020 R09:
0000000000000000
<4>[965919.725854] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000004
<4>[965919.725857] R13: ffffffff8ee86960 R14: ffffffff8ee86b18 R15:
0000000000000004
<4>[965919.725866] ? intel_idle+0x51/0x90
<4>[965919.725873] ? intel_idle+0x51/0x90
<4>[965919.725879] </NMI>
<4>[965919.725882] <TASK>
<4>[965919.725884] ? cpuidle_enter_state+0x85/0x450
<4>[965919.725895] cpuidle_enter+0x2e/0x50
<4>[965919.725908] call_cpuidle+0x22/0x60
<4>[965919.725918] do_idle+0x1de/0x240
<4>[965919.725925] cpu_startup_entry+0x29/0x30
<4>[965919.725930] rest_init+0xd0/0xd0
<4>[965919.725934] start_kernel+0x779/0xb60
<4>[965919.725941] ? load_ucode_intel_bsp+0x43/0xa0
<4>[965919.725952] x86_64_start_reservations+0x18/0x30
<4>[965919.725961] x86_64_start_kernel+0xbf/0x110
<4>[965919.725968] common_startup_64+0x13e/0x141
<4>[965919.725980] </TASK>
<0>[965919.726136] Kernel Offset: 0xb600000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
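As a rough aid to reading the aer_* values in the log above, here is a minimal Python sketch that names the bits set in the status registers. The bit tables are my transcription of the PCIe AER Uncorrectable and Correctable Error Status register layouts and are not exhaustive; the helper name `decode` is hypothetical. This is only arithmetic on the logged numbers, not a diagnosis.

```python
# Partial bit tables for the PCIe AER status registers (assumed layout,
# transcribed from the PCIe AER capability description; not exhaustive).
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}

COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode(value, names):
    """Return the names of all bits set in a 32-bit register value."""
    return [
        names.get(bit, f"bit {bit} (not in table)")
        for bit in range(32)
        if (value >> bit) & 1
    ]


# Values copied from the R730xd log above.
aer_uncor_status = 0x00100000
aer_uncor_severity = 0x004E7030
aer_cor_status = 0x00002000

print("uncorrectable:", decode(aer_uncor_status, UNCOR_BITS))
print("correctable:  ", decode(aer_cor_status, COR_BITS))
# A bit that is clear in the severity register is classified non-fatal
# by the device itself.
print("marked fatal in severity register:",
      bool(aer_uncor_status & aer_uncor_severity))
```

With these tables, the one uncorrectable bit set is bit 20 (Unsupported Request) and the one correctable bit set is bit 13 (Advisory Non-Fatal). The same uncorrectable bit is set in the quoted R540 log.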
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00
[VGA controller])
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon
540/540X/550/550X / RX 540X/550/550X]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 174
NUMA node: 0
IOMMU group: 29
Region 0: Memory at 38000000000 (64-bit, prefetchable) [size=2G]
Region 2: Memory at 38080000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at 2000 [size=256]
Region 5: Memory at 96a00000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at 96a60000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
04:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP
Audio [Radeon RX 550 640SP / RX 560/560X]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP
Audio [Radeon RX 550 640SP / RX 560/560X]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 172
NUMA node: 0
IOMMU group: 30
Region 0: Memory at 96a40000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
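For anyone wanting to cross-check the logged TLP header against the BARs listed above, here is a small Python sketch. It assumes the four words are logged in header order, with the first word holding Fmt in bits 31:29, Type in bits 28:24 and Length in bits 9:0, and the third word holding a 32-bit address; the function names and the `BARS` table are hypothetical, and the addresses and sizes are copied from the lspci output. Treat it as arithmetic under those assumptions only.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# (function, region, base, size) copied from the lspci output above.
BARS = [
    ("04:00.0", "Region 0", 0x38000000000, 2 * GIB),
    ("04:00.0", "Region 2", 0x38080000000, 2 * MIB),
    ("04:00.0", "Region 5", 0x96A00000, 256 * KIB),
    ("04:00.1", "Region 0", 0x96A40000, 16 * KIB),
]


def decode_tlp_dw0(dw0):
    """Split the first header DWORD into (fmt, type, length_in_dwords)."""
    fmt = (dw0 >> 29) & 0x7
    tlp_type = (dw0 >> 24) & 0x1F
    length = dw0 & 0x3FF
    return fmt, tlp_type, length


def find_bar(address):
    """Return (function, region) of the BAR containing address, or None."""
    for function, region, base, size in BARS:
        if base <= address < base + size:
            return function, region
    return None


# "TLP Header: 40008001 00000a0f 96a121a0 00000000" from the log above.
header = [0x40008001, 0x00000A0F, 0x96A121A0, 0x00000000]

fmt, tlp_type, length = decode_tlp_dw0(header[0])
# In a 3DW header the low two bits of the third DWORD are not address bits.
address = header[2] & ~0x3

print(f"fmt={fmt:#05b} type={tlp_type:#07b} length={length} DW")
print(f"address={address:#x} ->", find_bar(address))
```

Under these assumptions the header decodes as a one-DWORD 32-bit memory write (fmt 0b010, type 0b00000), and the target address 0x96a121a0 falls inside Region 5 of 04:00.0 and not inside the 16K BAR of 04:00.1, which is the function named in the error record.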
The device was completely idle and unused for the whole boot session, and
the reboot was a routine one after patching. The kernel is 6.14.11-5-bpo12
from Proxmox backports (so Ubuntu backports, essentially).
> I support the idea of a PCI workaround, but who will implement it?