* 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
@ 2025-10-22 16:51 Alex Bennée
2025-10-22 17:08 ` Ard Biesheuvel
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Alex Bennée @ 2025-10-22 16:51 UTC (permalink / raw)
To: linux-pci
Cc: Ard Biesheuvel, Lorenzo Pieralisi, Alex Deucher,
Christian König, amd-gfx, Bjorn Helgaas, Ilpo Järvinen,
D Scott Phillips
Hi,
I've been tracking a regression on my Arm64 (Altra) AVA platform between
6.14 and 6.15. It looks like the rework commit broke the ability of the
amdgpu driver to resize its BAR, resulting in an SError and failure to
boot:
[ 15.348097] amdgpu 000d:03:00.0: amdgpu: detected ip block number 8 <vcn_v4_0>
[ 15.355901] amdgpu 000d:03:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0>
[ 15.363202] amdgpu 000d:03:00.0: amdgpu: detected ip block number 10 <mes_v11_0>
[ 15.384163] amdgpu 000d:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 15.390434] amdgpu: ATOM BIOS: 113-4481LHS-UC1
[ 15.400079] amdgpu 000d:03:00.0: amdgpu: CP RS64 enable
[ 15.411830] amdgpu 000d:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 15.419932] amdgpu 000d:03:00.0: amdgpu: PCIE atomic ops is not supported
[ 15.426719] [drm] GPU posting now...
[ 15.430329] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 15.438871] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
[ 15.447648] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
[ 15.456452] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
[ 15.466095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
[ 15.475738] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
[ 15.485386] pcieport 000d:00:01.0: bridge window [io 0x1000-0x0fff] to [bus 01-03] add_size 1000
[ 15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.503809] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
[ 15.512063] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
[ 15.519796] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
[ 15.528049] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
[ 15.535787] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.545349] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.554911] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
[ 15.563612] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
[ 15.572313] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
[ 15.577962] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
[ 15.585175] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
[ 15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref]
Failure to claim space for the bridge window...
[ 15.611321] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
[ 15.616971] pcieport 000d:00:01.0: bridge window [io size 0x1000]
[ 15.623315] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
[ 15.630527] pcieport 000d:00:01.0: bridge window [mem size 0x18000000 64bit pref]
[ 15.638174] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
[ 15.650508] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
[ 15.656164] pcieport 000d:01:00.0: bridge window [mem 0x50000000-0x501fffff]
[ 15.663381] pcieport 000d:01:00.0: bridge window [mem size 0x18000000 64bit pref]
[ 15.671036] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
[ 15.683370] pcieport 000d:02:00.0: PCI bridge to [bus 03]
[ 15.688764] pcieport 000d:02:00.0: bridge window [mem 0x50000000-0x501fffff]
[ 15.695982] pcieport 000d:02:00.0: bridge window [mem size 0x18000000 64bit pref]
[ 15.703643] [drm] Not enough PCI address space for a large BAR.
Realisation that there is not enough space for the BAR:
[ 15.703648] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 15.719119] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[ 15.727470] [drm] Detected VRAM RAM=8176M, BAR=256M
[ 15.732339] [drm] RAM width 128bits GDDR6
[ 15.736552] [drm] amdgpu: 8176M of VRAM memory ready
[ 15.741516] [drm] amdgpu: 15888M of GTT memory ready.
[ 15.746592] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 15.752862] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 15.850408] [drm] Loading DMUB firmware via PSP: version=0x07002D00
[ 16.128604] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
[ 16.446347] SError Interrupt on CPU3, code 0x00000000be000411 -- SError
[ 16.446354] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G U 6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
[ 16.446359] Tainted: [U]=USER
[ 16.446360] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
[ 16.446362] Workqueue: efi_rts_wq efi_call_rts
[ 16.446371] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 16.446374] pc : __wake_up_common_lock+0x40/0xc0
[ 16.446379] lr : __wake_up+0x20/0x40
[ 16.446382] sp : ffff800080aa3790
[ 16.446383] x29: ffff800080aa3790 x28: ffff3e8780bcb780 x27: 00000000fa481000
[ 16.446387] x26: ffff3e87a7e14b98 x25: ffffb6df6e1e2978 x24: ffffb6df6e351ed8
[ 16.446390] x23: ffff3e87a7e10000 x22: 00000000000000c0 x21: 0000000000000003
[ 16.446392] x20: 0000000000000000 x19: ffff3e87a7e14b98 x18: 0000000000000000
[ 16.446395] x17: ffff3e878245d180 x16: ffffb6dfa26e0c28 x15: ffff3e87810bcbc0
[ 16.446398] x14: 00000000fa481758 x13: 0000000000000000 x12: ffff800080aa3dd7
[ 16.446401] x11: 0000000000000040 x10: ffff3e87801ba830 x9 : ffffb6dfa26e0c48
[ 16.446403] x8 : ffff3e8786eb5268 x7 : 0000000000000000 x6 : 0000000000000000
[ 16.446406] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 16.446408] x2 : 0000000000000000 x1 : 0000000000000003 x0 : 0000000000000001
[ 16.446412] Kernel panic - not syncing: Asynchronous SError Interrupt
Boom - unrecoverable bus error triggered by the PCI access.
[ 16.446414] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G U 6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
[ 16.446417] Tainted: [U]=USER
[ 16.446418] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
[ 16.446419] Workqueue: efi_rts_wq efi_call_rts
[ 16.446424] Call trace:
[ 16.446425] show_stack+0x34/0x98 (C)
[ 16.446431] dump_stack_lvl+0x60/0x80
[ 16.446436] dump_stack+0x18/0x24
[ 16.446440] panic+0x164/0x378
[ 16.446443] nmi_panic+0x90/0x98
[ 16.446448] arm64_serror_panic+0x6c/0x80
[ 16.446452] do_serror+0x30/0x78
[ 16.446456] el1h_64_error_handler+0x30/0x50
[ 16.446462] el1h_64_error+0x6c/0x70
[ 16.446464] __wake_up_common_lock+0x40/0xc0 (P)
[ 16.446468] __wake_up+0x20/0x40
[ 16.446471] amdgpu_ih_process+0x100/0x160 [amdgpu]
[ 16.447083] amdgpu_irq_handler+0x34/0xa0 [amdgpu]
[ 16.447637] __handle_irq_event_percpu+0x60/0x1d8
[ 16.447642] handle_irq_event+0x4c/0x110
[ 16.447646] handle_fasteoi_irq+0xb4/0x220
[ 16.447649] handle_irq_desc+0x3c/0x68
[ 16.447652] generic_handle_domain_irq+0x24/0x40
[ 16.447656] gic_handle_irq+0x54/0x124
[ 16.447658] do_interrupt_handler+0x58/0xa0
[ 16.447661] el1_interrupt+0x34/0x58
[ 16.447665] el1h_64_irq_handler+0x18/0x28
[ 16.447669] el1h_64_irq+0x6c/0x70
[ 16.447672] 0xfad10918 (P)
[ 16.447674] 0xfabe01c8
[ 16.447676] 0xfabe02d4
[ 16.447677] 0xfa3e209c
[ 16.447679] 0xfa43ae7c
[ 16.447680] 0xfa43b6bc
[ 16.447681] 0xfa436e44
[ 16.447683] 0xfa43c3f8
[ 16.447684] __efi_rt_asm_wrapper+0x50/0x78
[ 16.447687] efi_call_rts+0x1c8/0x280
[ 16.447691] process_one_work+0x178/0x3e0
[ 16.447695] worker_thread+0x204/0x3f0
[ 16.447698] kthread+0x10c/0x1f0
[ 16.447703] ret_from_fork+0x10/0x20
[ 16.447705] SMP: stopping secondary CPUs
[ 16.447796] Kernel Offset: 0x36df225a0000 from 0xffff800080000000
[ 16.447798] PHYS_OFFSET: 0xffffc97880000000
[ 16.447799] CPU features: 0x200,00002170,00901250,8241720b
[ 16.447802] Memory Limit: none
[ 16.471034] pstore: backend (efi_pstore) writing error (-16)
[ 16.801136] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
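As a cross-check on the log above, the window and BAR sizes can be computed directly from the inclusive address ranges. The following is only a minimal sketch: the helper names (`range_size`, `human`) are made up for illustration, and the ranges are copied by hand from the dmesg lines.

```python
# Inclusive [start-end] ranges copied by hand from the dmesg output above.
# The helper names are illustrative only; nothing here is kernel code.

def range_size(start: int, end: int) -> int:
    """Size in bytes of an inclusive address range."""
    return end - start + 1


def human(n: int) -> str:
    """Format a byte count in binary units when it is an exact multiple."""
    for unit, shift in (("GiB", 30), ("MiB", 20), ("KiB", 10)):
        if n >= 1 << shift and n % (1 << shift) == 0:
            return f"{n >> shift} {unit}"
    return f"{n} B"


RANGES = {
    "BAR 0 before resize": (0x340000000000, 0x34000FFFFFFF),
    "BAR 0 after resize": (0x340000000000, 0x3401FFFFFFFF),
    "BAR 2 after resize": (0x340200000000, 0x3402001FFFFF),
    "bridge window before": (0x340000000000, 0x340017FFFFFF),
    "bridge window after": (0x340000000000, 0x3402FFFFFFFF),
}

SIZES = {name: range_size(*bounds) for name, bounds in RANGES.items()}

if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name:22s} {size:#x} ({human(size)})")
```

The "bridge window before" size works out to 0x18000000, which is the same value the failing log reports as `mem size 0x18000000 64bit pref` once the bridge windows are left without an assigned address.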
The bisection was slightly complicated by the fact that I'm carrying some
additional patches to work around other PCIe issues, which nevertheless work
fine before the failing commit. For convenience I've pushed a branch with the
workarounds applied here:
https://gitlab.com/stsquad/linux/-/commits/testing/pci-amdgpu-regression-reference
Additional information
lspci -vv info for card
000d:03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600] (rev cf) (prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Device e448
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 151
NUMA node: 0
IOMMU group: 21
Region 0: Memory at 340000000000 (64-bit, prefetchable) [size=8G]
Region 2: Memory at 340200000000 (64-bit, prefetchable) [size=2M]
Region 5: Memory at 50000000 (32-bit, non-prefetchable) [size=1M]
Expansion ROM at 50100000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Legacy Endpoint, IntMsgNum 0
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Form Factor Dev Specific, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
AtomicOpsCtl: ReqEn-
IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000ffb77040 Data: 0000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [200 v1] Physical Resizable BAR
BAR 0: current size: 8GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB
BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
Capabilities: [240 v1] Power Budgeting <?>
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [2d0 v1] Process Address Space ID (PASID)
PASIDCap: Exec+ Priv+, Max PASID Width: 10
PASIDCtl: Enable+ Exec+ Priv+
Capabilities: [320 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [450 v1] Lane Margining at the Receiver
PortCap: Uses Driver-
PortSta: MargReady+ MargSoftReady-
Kernel driver in use: amdgpu
Kernel modules: amdgpu
iomem layout from a working bootup (e89df6d2beae):
08000000-0fffffff : PCI Bus 0002:00
08000000-081fffff : PCI Bus 0002:01
08200000-083fffff : PCI Bus 0002:02
20000000-2fffffff : PCI Bus 0004:00
20000000-217fffff : PCI Bus 0004:01
20000000-217fffff : PCI Bus 0004:02
20000000-20ffffff : 0004:02:00.0
20000000-202fffff : efifb
21000000-2101ffff : 0004:02:00.0
21800000-219fffff : PCI Bus 0004:03
21800000-21801fff : 0004:03:00.0
21800000-21801fff : xhci-hcd
21a00000-21bfffff : PCI Bus 0004:04
21a00000-21a7ffff : 0004:04:00.0
21a00000-21a7ffff : igb
21a80000-21a83fff : 0004:04:00.0
21a80000-21a83fff : igb
21c00000-21dfffff : PCI Bus 0004:05
30000000-3fffffff : PCI Bus 0005:00
30000000-301fffff : PCI Bus 0005:01
30200000-303fffff : PCI Bus 0005:02
30400000-305fffff : PCI Bus 0005:03
30400000-30403fff : 0005:03:00.0
30400000-30403fff : nvme
30600000-307fffff : PCI Bus 0005:04
30600000-30603fff : 0005:04:00.0
30600000-30603fff : nvme
40000000-4fffffff : PCI Bus 000c:00
40000000-401fffff : PCI Bus 000c:01
50000000-5fffffff : PCI Bus 000d:00
50000000-502fffff : PCI Bus 000d:01
50000000-501fffff : PCI Bus 000d:02
50000000-501fffff : PCI Bus 000d:03
50000000-500fffff : 000d:03:00.0
50100000-5011ffff : 000d:03:00.0
50120000-50123fff : 000d:03:00.1
50120000-50123fff : ICH HD audio
50200000-50203fff : 000d:01:00.0
70000000-7fffffff : PCI Bus 0000:00
70000000-701fffff : PCI Bus 0000:01
88300000-883fffff : reserved
88500000-885fffff : IFX0785:00
88500000-885fffff : IFX0785:00
88900000-8891ffff : AMPC0005:00
90000000-91ffffff : System RAM
92000000-927bffff : reserved
927c0000-f896ffff : System RAM
d54f0000-d6adffff : Kernel code
d6ae0000-d6daffff : reserved
d6db0000-d717ffff : Kernel data
ef650000-f3650fff : reserved
f3850000-f49a2fff : reserved
f88b0000-f88bffff : reserved
f8970000-f898ffff : reserved
f8990000-f899ffff : System RAM
f89a0000-f89fffff : reserved
f8a00000-f9196fff : System RAM
f8a00000-f8a00fff : reserved
f8a02000-f8a02fff : reserved
f9197000-f91ecfff : reserved
f91ed000-f94cffff : System RAM
f91fb000-f91fbfff : reserved
f94d0000-f950ffff : reserved
f9510000-f98bffff : System RAM
f98c0000-f98fffff : reserved
f9900000-f999ffff : System RAM
f99a0000-f99dffff : reserved
f99e0000-f9f4ffff : System RAM
f9ef0000-f9f1ffff : reserved
f9f50000-f9f6ffff : reserved
f9f70000-fa0affff : System RAM
fa0b0000-fa0effff : reserved
fa0f0000-fa1cffff : System RAM
fa1d0000-fa26ffff : reserved
fa270000-fa33ffff : System RAM
fa340000-fa4affff : reserved
fa4b0000-fa4bffff : System RAM
fa4c0000-fa57ffff : reserved
fa580000-fa72ffff : System RAM
fa730000-fa7cffff : reserved
fa7d0000-faa4ffff : System RAM
faa50000-faaeffff : reserved
faaf0000-fab7ffff : System RAM
fab80000-fac1ffff : reserved
fac20000-facaffff : System RAM
facb0000-fad4ffff : reserved
fad50000-fae1ffff : System RAM
fae20000-faebffff : reserved
faec0000-faf4ffff : System RAM
faf50000-fafeffff : reserved
faff0000-ffefffff : System RAM
fbe00000-ffdfffff : reserved
fff00000-fff4ffff : reserved
fff50000-fffaffff : System RAM
fffb0000-fffdffff : reserved
fffc0000-fffc0fff : reserved
fffe0000-ffffffff : System RAM
fffe0000-fffeffff : reserved
80000000000-8007fffffff : System RAM
800002bc000-800002bcfff : reserved
80000840000-8000084ffff : reserved
80000850000-8000085ffff : reserved
80000860000-8000086ffff : reserved
80000870000-8000087ffff : reserved
80000880000-8000088ffff : reserved
80000890000-8000089ffff : reserved
800008a0000-800008affff : reserved
800008b0000-800008bffff : reserved
800008c0000-800008cffff : reserved
800008d0000-800008dffff : reserved
800008e0000-800008effff : reserved
800008f0000-800008fffff : reserved
80000900000-8000090ffff : reserved
80000910000-8000091ffff : reserved
80000920000-8000092ffff : reserved
80000930000-8000093ffff : reserved
80000940000-8000094ffff : reserved
80000950000-8000095ffff : reserved
80000960000-8000096ffff : reserved
80000970000-8000097ffff : reserved
80000980000-8000098ffff : reserved
80000990000-8000099ffff : reserved
800009a0000-800009affff : reserved
800009b0000-800009bffff : reserved
800009c0000-800009cffff : reserved
800009d0000-800009dffff : reserved
800009e0000-800009effff : reserved
800009f0000-800009fffff : reserved
80000a00000-80000a0ffff : reserved
80000a10000-80000a1ffff : reserved
80000a20000-80000a2ffff : reserved
80000a30000-80000a3ffff : reserved
80000a40000-80000a4ffff : reserved
80100000000-807ffffffff : System RAM
807d8c10000-807fbffffff : reserved
807fc009000-807fc039fff : reserved
807fc03c000-807fc03ffff : reserved
807fc040000-807fc040fff : reserved
807fc041000-807fc044fff : reserved
807fc045000-807fc06afff : reserved
807fc06b000-807ffffffff : reserved
100002600000-100002600fff : ARMH0011:00
100002600000-100002600fff : ARMH0011:00 ARMH0011:00
100002620000-100002620fff : ARMH0011:01
100002620000-100002620fff : ARMH0011:01 ARMH0011:01
1000026c0000-1000026cffff : APMC0D0F:00
1000026c0000-1000026cffff : APMC0D0F:00 APMC0D0F:00
1000026d0000-1000026dffff : APMC0D07:02
1000026f0000-1000026fffff : APMC0D07:00
100002730000-100002730fff : arch_mem_timer
100002750000-10000275ffff : APMC0D0F:01
100002750000-10000275ffff : APMC0D0F:01 APMC0D0F:01
100002780000-10000278ffff : APMC0D0F:02
100002780000-10000278ffff : APMC0D0F:02 APMC0D0F:02
1000027b0000-1000027bffff : APMC0D07:01
1000027c0000-1000027c0fff : sbsa-gwdt.0
1000027c0000-1000027c0fff : sbsa-gwdt.0 sbsa-gwdt.0
1000027d0000-1000027d0fff : sbsa-gwdt.0
1000027d0000-1000027d0fff : sbsa-gwdt.0 sbsa-gwdt.0
100010000000-10001fffffff : ARMHC600:00
100012500000-1000164fffff : ARMHC600:00
10008c000a00-10008c000bff : ARMHD620:00
10008d000a00-10008d000bff : ARMHD620:04
100100000000-10010000ffff : GICD
100100140000-10010113ffff : GICR
200000000000-23ffdfffffff : PCI Bus 0002:00
200000000000-2000001fffff : PCI Bus 0002:01
200000200000-2000003fffff : PCI Bus 0002:02
23ffe0000000-23ffe001ffff : arm-smmu-v3.3.auto
23ffe0000000-23ffe0000dff : arm-smmu-v3.3.auto
23ffe0010000-23ffe0010dff : arm-smmu-v3.3.auto
23fff0000000-23ffffffffff : PCI ECAM
27fff0000000-27ffffffffff : pnp 00:00
280000000000-2bffdfffffff : PCI Bus 0004:00
280000000000-2800001fffff : PCI Bus 0004:01
280000200000-2800003fffff : PCI Bus 0004:03
280000400000-2800005fffff : PCI Bus 0004:04
280000600000-2800007fffff : PCI Bus 0004:05
2bffe0000000-2bffe001ffff : arm-smmu-v3.4.auto
2bffe0000000-2bffe0000dff : arm-smmu-v3.4.auto
2bffe0010000-2bffe0010dff : arm-smmu-v3.4.auto
2bfff0000000-2bffffffffff : PCI ECAM
2c0000000000-2fffdfffffff : PCI Bus 0005:00
2c0000000000-2c00001fffff : PCI Bus 0005:01
2c0000200000-2c00003fffff : PCI Bus 0005:02
2c0000400000-2c00005fffff : PCI Bus 0005:03
2c0000600000-2c00007fffff : PCI Bus 0005:04
2fffe0000000-2fffe001ffff : arm-smmu-v3.5.auto
2fffe0000000-2fffe0000dff : arm-smmu-v3.5.auto
2fffe0010000-2fffe0010dff : arm-smmu-v3.5.auto
2ffff0000000-2fffffffffff : PCI ECAM
300000000000-33ffdfffffff : PCI Bus 000c:00
300000000000-3000001fffff : PCI Bus 000c:01
33ffe0000000-33ffe001ffff : arm-smmu-v3.0.auto
33ffe0000000-33ffe0000dff : arm-smmu-v3.0.auto
33ffe0010000-33ffe0010dff : arm-smmu-v3.0.auto
33fff0000000-33ffffffffff : PCI ECAM
340000000000-37ffdfffffff : PCI Bus 000d:00
340000000000-3402ffffffff : PCI Bus 000d:01
340000000000-3402ffffffff : PCI Bus 000d:02
340000000000-3402ffffffff : PCI Bus 000d:03
340000000000-3401ffffffff : 000d:03:00.0
340200000000-3402001fffff : 000d:03:00.0
37ffe0000000-37ffe001ffff : arm-smmu-v3.1.auto
37ffe0000000-37ffe0000dff : arm-smmu-v3.1.auto
37ffe0010000-37ffe0010dff : arm-smmu-v3.1.auto
37fff0000000-37ffffffffff : PCI ECAM
3bfff0000000-3bffffffffff : pnp 00:00
3c0000000000-3fffdfffffff : PCI Bus 0000:00
3c0000000000-3c00001fffff : PCI Bus 0000:01
3fffe0000000-3fffe001ffff : arm-smmu-v3.2.auto
3fffe0000000-3fffe0000dff : arm-smmu-v3.2.auto
3fffe0010000-3fffe0010dff : arm-smmu-v3.2.auto
3ffff0000000-3fffffffffff : PCI ECAM
63fff0000000-63ffffffffff : pnp 00:00
67fff0000000-67ffffffffff : pnp 00:00
6bfff0000000-6bffffffffff : pnp 00:00
6ffff0000000-6fffffffffff : pnp 00:00
7bfff0000000-7bffffffffff : pnp 00:00
7ffff0000000-7fffffffffff : pnp 00:00
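To pick the relevant entries out of a listing this long, something like the following can help. It is a rough sketch, not a complete /proc/iomem parser: it ignores nesting and simply filters entries that fall inside a given window. The function names are hypothetical, and the sample lines are copied from the layout above.

```python
import re

# One "start-end : name" entry per line, as in the listing above.
IOMEM_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.+)$")

SAMPLE = """\
340000000000-37ffdfffffff : PCI Bus 000d:00
340000000000-3402ffffffff : PCI Bus 000d:01
340000000000-3402ffffffff : PCI Bus 000d:02
340000000000-3402ffffffff : PCI Bus 000d:03
340000000000-3401ffffffff : 000d:03:00.0
340200000000-3402001fffff : 000d:03:00.0
37ffe0000000-37ffe001ffff : arm-smmu-v3.1.auto
"""


def parse_iomem(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, name) tuples for every well-formed line."""
    entries = []
    for line in text.splitlines():
        m = IOMEM_RE.match(line)
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            entries.append((start, end, m.group(3).strip()))
    return entries


def within(entries, start: int, end: int):
    """Entries lying entirely inside the inclusive window [start, end]."""
    return [e for e in entries if start <= e[0] and e[1] <= end]


if __name__ == "__main__":
    # Everything inside the resized bridge window below the root port.
    for s, e, name in within(parse_iomem(SAMPLE),
                             0x340000000000, 0x3402FFFFFFFF):
        print(f"{s:#x}-{e:#x} ({e - s + 1:#x} bytes) : {name}")
```

On the sample, the filter keeps the three bus windows and the two GPU BARs, and drops the host bridge aperture and the SMMU region.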
working dmesg from same:
[ 15.500492] [drm] GPU posting now...
[ 15.504110] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 15.512654] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
[ 15.521431] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
[ 15.530230] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
[ 15.539881] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
[ 15.549528] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
[ 15.549535] pcieport 000d:00:01.0: bridge window [io 0x1000-0x0fff] to [bus 01-03] add_size 1000
[ 15.549544] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.549546] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
[ 15.549549] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
[ 15.596468] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
[ 15.607594] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
[ 15.618090] pcieport 000d:00:01.0: bridge window [io size 0x1000]: ignoring failure in optional allocation
[ 15.618095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.628249] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.637806] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
[ 15.646506] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
[ 15.655205] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
[ 15.660856] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
[ 15.668069] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
[ 15.676931] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
[ 15.682586] pcieport 000d:01:00.0: bridge window [mem 0x50000000-0x501fffff]
[ 15.689804] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
[ 15.698672] pcieport 000d:02:00.0: PCI bridge to [bus 03]
[ 15.704067] pcieport 000d:02:00.0: bridge window [mem 0x50000000-0x501fffff]
[ 15.711285] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
[ 15.720157] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 15.729714] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[ 15.738064] [drm] Detected VRAM RAM=8176M, BAR=8192M
[ 15.743019] [drm] RAM width 128bits GDDR6
[ 15.747258] [drm] amdgpu: 8176M of VRAM memory ready
[ 15.752219] [drm] amdgpu: 15888M of GTT memory ready.
[ 15.757297] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 15.763558] [drm] PCIE GART of 512M enabled (table at 0x00000081FEB00000).
[ 15.884845] [drm] Loading DMUB firmware via PSP: version=0x07002D00
[ 16.129125] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
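The difference between the two logs can also be spotted mechanically. The sketch below is an illustration only: the regular expression is written against the message format seen in these two logs, the function names are hypothetical, and the sample strings are excerpts copied from the logs. It records the last prefetchable window reported for each bridge and flags those that end up with only a size and no address.

```python
import re

# Matches the prefetchable bridge-window messages seen in the logs above.
WINDOW_RE = re.compile(
    r"pcieport (?P<dev>\S+):\s+bridge window\s+"
    r"\[mem (?P<what>size 0x[0-9a-f]+|0x[0-9a-f]+-0x[0-9a-f]+) 64bit pref\]"
    r"(?P<suffix>.*)$"
)

FAILING = """\
[ 15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref]
[ 15.630527] pcieport 000d:00:01.0: bridge window [mem size 0x18000000 64bit pref]
"""

WORKING = """\
[ 15.549544] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
[ 15.668069] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
"""


def final_pref_windows(dmesg: str) -> dict[str, str]:
    """Last reported prefetchable window per bridge.

    Lines that only announce a release or a failed claim are skipped,
    since they do not describe the resulting state.
    """
    final = {}
    for line in dmesg.splitlines():
        m = WINDOW_RE.search(line)
        if not m:
            continue
        suffix = m.group("suffix")
        if "releasing" in suffix or "can't claim" in suffix:
            continue
        final[m.group("dev")] = m.group("what")
    return final


def unassigned(dmesg: str) -> list[str]:
    """Bridges whose final window has a size but no address range."""
    return sorted(dev for dev, what in final_pref_windows(dmesg).items()
                  if what.startswith("size "))


if __name__ == "__main__":
    print("failing:", unassigned(FAILING))
    print("working:", unassigned(WORKING))
```

Run over the full logs rather than the excerpts, the same check should list all three bridges (000d:00:01.0, 000d:01:00.0 and 000d:02:00.0) for the failing boot, since each ends with a `mem size 0x18000000 64bit pref` line.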
From discussions with Ard, it seems that if the firmware had resized the BAR
first and then assigned the resources, there would be no issue. However, there
is no later firmware for the platform.
While the PCI change has provoked this regression, I suspect the amdgpu code
could handle the failure to resize the BAR better and, if it can't get
what it wants, just not initialise the driver. I did hit some cases while
bisecting where the GPU just wasn't visible.
I'm available to test patches and generate additional debug info so do
let me know if there is anything I can do to help.
Thanks,
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
2025-10-22 16:51 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform Alex Bennée
@ 2025-10-22 17:08 ` Ard Biesheuvel
2025-10-23 16:20 ` Bjorn Helgaas
2025-10-23 17:24 ` Ilpo Järvinen
2 siblings, 0 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2025-10-22 17:08 UTC (permalink / raw)
To: Alex Bennée
Cc: linux-pci, Lorenzo Pieralisi, Alex Deucher, Christian König,
amd-gfx, Bjorn Helgaas, Ilpo Järvinen, D Scott Phillips
On Wed, 22 Oct 2025 at 18:51, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Hi,
>
> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize it's bar, resulting in an SError and failure to
> boot:
>
...
> From discussions with Ard it seems if the firmware had resized the BAR first,
> and then assigned the resources, there would be no issue. However there
> is no latter firmware for the platform.
>
> While the PCI change has provoked this regression I suspect the amdgpu code
> could handle the failure to resize the BAR better and if it can't get
> what it wants just not initialise the driver.
Actually, looking again at the below, which follows the error about
overlapping resource windows, it seems the PCI code is failing to roll
back the changes, and it is not the driver at fault here.
> [ 15.611321] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
> [ 15.616971] pcieport 000d:00:01.0: bridge window [io size 0x1000]
> [ 15.623315] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
> [ 15.630527] pcieport 000d:00:01.0: bridge window [mem size 0x18000000 64bit pref]
> [ 15.638174] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
> [ 15.650508] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
> [ 15.656164] pcieport 000d:01:00.0: bridge window [mem 0x50000000-0x501fffff]
> [ 15.663381] pcieport 000d:01:00.0: bridge window [mem size 0x18000000 64bit pref]
> [ 15.671036] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
> [ 15.683370] pcieport 000d:02:00.0: PCI bridge to [bus 03]
> [ 15.688764] pcieport 000d:02:00.0: bridge window [mem 0x50000000-0x501fffff]
> [ 15.695982] pcieport 000d:02:00.0: bridge window [mem size 0x18000000 64bit pref]
On Wed, 22 Oct 2025 at 18:51, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Hi,
>
> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize it's bar, resulting in an SError and failure to
> boot:
>
> [ 15.348097] amdgpu 000d:03:00.0: amdgpu: detected ip block number 8 <vcn_v4_0>
> [ 15.355901] amdgpu 000d:03:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0>
> [ 15.363202] amdgpu 000d:03:00.0: amdgpu: detected ip block number 10 <mes_v11_0>
> [ 15.384163] amdgpu 000d:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
> [ 15.390434] amdgpu: ATOM BIOS: 113-4481LHS-UC1
> [ 15.400079] amdgpu 000d:03:00.0: amdgpu: CP RS64 enable
> [ 15.411830] amdgpu 000d:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
> [ 15.419932] amdgpu 000d:03:00.0: amdgpu: PCIE atomic ops is not supported
> [ 15.426719] [drm] GPU posting now...
> [ 15.430329] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
> [ 15.438871] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
> [ 15.447648] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
> [ 15.456452] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.466095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.475738] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.485386] pcieport 000d:00:01.0: bridge window [io 0x1000-0x0fff] to [bus 01-03] add_size 1000
> [ 15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.503809] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
> [ 15.512063] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
> [ 15.519796] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
> [ 15.528049] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
> [ 15.535787] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.545349] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.554911] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
> [ 15.563612] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
> [ 15.572313] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
> [ 15.577962] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
> [ 15.585175] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
> [ 15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x3
> 40017ffffff 64bit pref]
>
> Failure to claim space for the bridge window...
>
> [ 15.611321] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
> [ 15.616971] pcieport 000d:00:01.0: bridge window [io size 0x1000]
> [ 15.623315] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
> [ 15.630527] pcieport 000d:00:01.0: bridge window [mem size 0x18000000 64bit pref]
> [ 15.638174] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
> [ 15.650508] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
> [ 15.656164] pcieport 000d:01:00.0: bridge window [mem 0x50000000-0x501fffff]
> [ 15.663381] pcieport 000d:01:00.0: bridge window [mem size 0x18000000 64bit pref]
> [ 15.671036] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
> [ 15.683370] pcieport 000d:02:00.0: PCI bridge to [bus 03]
> [ 15.688764] pcieport 000d:02:00.0: bridge window [mem 0x50000000-0x501fffff]
> [ 15.695982] pcieport 000d:02:00.0: bridge window [mem size 0x18000000 64bit pref]
> [ 15.703643] [drm] Not enough PCI address space for a large BAR.
>
> Realisation that there is not enough space for the BAR
>
> [ 15.703648] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
> [ 15.719119] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
> [ 15.727470] [drm] Detected VRAM RAM=8176M, BAR=256M
> [ 15.732339] [drm] RAM width 128bits GDDR6
> [ 15.736552] [drm] amdgpu: 8176M of VRAM memory ready
> [ 15.741516] [drm] amdgpu: 15888M of GTT memory ready.
> [ 15.746592] [drm] GART: num cpu pages 131072, num gpu pages 131072
> [ 15.752862] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
> [ 15.850408] [drm] Loading DMUB firmware via PSP: version=0x07002D00
> [ 16.128604] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
> [ 16.446347] SError Interrupt on CPU3, code 0x00000000be000411 -- SError
> [ 16.446354] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G U 6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
> [ 16.446359] Tainted: [U]=USER
> [ 16.446360] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
> [ 16.446362] Workqueue: efi_rts_wq efi_call_rts
> [ 16.446371] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 16.446374] pc : __wake_up_common_lock+0x40/0xc0
> [ 16.446379] lr : __wake_up+0x20/0x40
> [ 16.446382] sp : ffff800080aa3790
> [ 16.446383] x29: ffff800080aa3790 x28: ffff3e8780bcb780 x27: 00000000fa481000
> [ 16.446387] x26: ffff3e87a7e14b98 x25: ffffb6df6e1e2978 x24: ffffb6df6e351ed8
> [ 16.446390] x23: ffff3e87a7e10000 x22: 00000000000000c0 x21: 0000000000000003
> [ 16.446392] x20: 0000000000000000 x19: ffff3e87a7e14b98 x18: 0000000000000000
> [ 16.446395] x17: ffff3e878245d180 x16: ffffb6dfa26e0c28 x15: ffff3e87810bcbc0
> [ 16.446398] x14: 00000000fa481758 x13: 0000000000000000 x12: ffff800080aa3dd7
> [ 16.446401] x11: 0000000000000040 x10: ffff3e87801ba830 x9 : ffffb6dfa26e0c48
> [ 16.446403] x8 : ffff3e8786eb5268 x7 : 0000000000000000 x6 : 0000000000000000
> [ 16.446406] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 16.446408] x2 : 0000000000000000 x1 : 0000000000000003 x0 : 0000000000000001
> [ 16.446412] Kernel panic - not syncing: Asynchronous SError Interrupt
>
> Boom - unrecoverable bus error triggered by the PCI access.
>
> [ 16.446414] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G U 6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
> [ 16.446417] Tainted: [U]=USER
> [ 16.446418] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
> [ 16.446419] Workqueue: efi_rts_wq efi_call_rts
> [ 16.446424] Call trace:
> [ 16.446425] show_stack+0x34/0x98 (C)
> [ 16.446431] dump_stack_lvl+0x60/0x80
> [ 16.446436] dump_stack+0x18/0x24
> [ 16.446440] panic+0x164/0x378
> [ 16.446443] nmi_panic+0x90/0x98
> [ 16.446448] arm64_serror_panic+0x6c/0x80
> [ 16.446452] do_serror+0x30/0x78
> [ 16.446456] el1h_64_error_handler+0x30/0x50
> [ 16.446462] el1h_64_error+0x6c/0x70
> [ 16.446464] __wake_up_common_lock+0x40/0xc0 (P)
> [ 16.446468] __wake_up+0x20/0x40
> [ 16.446471] amdgpu_ih_process+0x100/0x160 [amdgpu]
> [ 16.447083] amdgpu_irq_handler+0x34/0xa0 [amdgpu]
> [ 16.447637] __handle_irq_event_percpu+0x60/0x1d8
> [ 16.447642] handle_irq_event+0x4c/0x110
> [ 16.447646] handle_fasteoi_irq+0xb4/0x220
> [ 16.447649] handle_irq_desc+0x3c/0x68
> [ 16.447652] generic_handle_domain_irq+0x24/0x40
> [ 16.447656] gic_handle_irq+0x54/0x124
> [ 16.447658] do_interrupt_handler+0x58/0xa0
> [ 16.447661] el1_interrupt+0x34/0x58
> [ 16.447665] el1h_64_irq_handler+0x18/0x28
> [ 16.447669] el1h_64_irq+0x6c/0x70
> [ 16.447672] 0xfad10918 (P)
> [ 16.447674] 0xfabe01c8
> [ 16.447676] 0xfabe02d4
> [ 16.447677] 0xfa3e209c
> [ 16.447679] 0xfa43ae7c
> [ 16.447680] 0xfa43b6bc
> [ 16.447681] 0xfa436e44
> [ 16.447683] 0xfa43c3f8
> [ 16.447684] __efi_rt_asm_wrapper+0x50/0x78
> [ 16.447687] efi_call_rts+0x1c8/0x280
> [ 16.447691] process_one_work+0x178/0x3e0
> [ 16.447695] worker_thread+0x204/0x3f0
> [ 16.447698] kthread+0x10c/0x1f0
> [ 16.447703] ret_from_fork+0x10/0x20
> [ 16.447705] SMP: stopping secondary CPUs
> [ 16.447796] Kernel Offset: 0x36df225a0000 from 0xffff800080000000
> [ 16.447798] PHYS_OFFSET: 0xffffc97880000000
> [ 16.447799] CPU features: 0x200,00002170,00901250,8241720b
> [ 16.447802] Memory Limit: none
> [ 16.471034] pstore: backend (efi_pstore) writing error (-16)
> [ 16.801136] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>
> The bisection was slightly complicated by the fact that I'm carrying some
> additional patches to work around other PCIe issues; these work fine
> before the failing commit. For convenience I've pushed a branch with the
> workarounds applied here:
>
> https://gitlab.com/stsquad/linux/-/commits/testing/pci-amdgpu-regression-reference
>
> Additional information
>
> lspci -vv info for card
>
> 000d:03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600] (rev cf) (prog-if 00 [VGA controller])
> Subsystem: Sapphire Technology Limited Device e448
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 151
> NUMA node: 0
> IOMMU group: 21
> Region 0: Memory at 340000000000 (64-bit, prefetchable) [size=8G]
> Region 2: Memory at 340200000000 (64-bit, prefetchable) [size=2M]
> Region 5: Memory at 50000000 (32-bit, non-prefetchable) [size=1M]
> Expansion ROM at 50100000 [disabled] [size=128K]
> Capabilities: [48] Vendor Specific Information: Len=08 <?>
> Capabilities: [50] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [64] Express (v2) Legacy Endpoint, IntMsgNum 0
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
> DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
> LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
> ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 16GT/s, Width x8
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
> 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
> EmergencyPowerReduction Form Factor Dev Specific, EmergencyPowerReductionInit-
> FRS-
> AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
> AtomicOpsCtl: ReqEn-
> IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
> 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
> LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
> LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
> LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
> EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
> Retimer- 2Retimers- CrosslinkRes: unsupported
> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000ffb77040 Data: 0000
> Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> Capabilities: [150 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
> ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
> PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
> ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
> PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
> ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
> PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 00000000 00000000 00000000 00000000
> Capabilities: [200 v1] Physical Resizable BAR
> BAR 0: current size: 8GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB
> BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
> Capabilities: [240 v1] Power Budgeting <?>
> Capabilities: [270 v1] Secondary PCI Express
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> LaneErrStat: 0
> Capabilities: [2a0 v1] Access Control Services
> ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
> ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
> Capabilities: [2d0 v1] Process Address Space ID (PASID)
> PASIDCap: Exec+ Priv+, Max PASID Width: 10
> PASIDCtl: Enable+ Exec+ Priv+
> Capabilities: [320 v1] Latency Tolerance Reporting
> Max snoop latency: 0ns
> Max no snoop latency: 0ns
> Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
> Capabilities: [450 v1] Lane Margining at the Receiver
> PortCap: Uses Driver-
> PortSta: MargReady+ MargSoftReady-
> Kernel driver in use: amdgpu
> Kernel modules: amdgpu
>
> iomem layout from a working bootup (e89df6d2beae):
>
> 08000000-0fffffff : PCI Bus 0002:00
> 08000000-081fffff : PCI Bus 0002:01
> 08200000-083fffff : PCI Bus 0002:02
> 20000000-2fffffff : PCI Bus 0004:00
> 20000000-217fffff : PCI Bus 0004:01
> 20000000-217fffff : PCI Bus 0004:02
> 20000000-20ffffff : 0004:02:00.0
> 20000000-202fffff : efifb
> 21000000-2101ffff : 0004:02:00.0
> 21800000-219fffff : PCI Bus 0004:03
> 21800000-21801fff : 0004:03:00.0
> 21800000-21801fff : xhci-hcd
> 21a00000-21bfffff : PCI Bus 0004:04
> 21a00000-21a7ffff : 0004:04:00.0
> 21a00000-21a7ffff : igb
> 21a80000-21a83fff : 0004:04:00.0
> 21a80000-21a83fff : igb
> 21c00000-21dfffff : PCI Bus 0004:05
> 30000000-3fffffff : PCI Bus 0005:00
> 30000000-301fffff : PCI Bus 0005:01
> 30200000-303fffff : PCI Bus 0005:02
> 30400000-305fffff : PCI Bus 0005:03
> 30400000-30403fff : 0005:03:00.0
> 30400000-30403fff : nvme
> 30600000-307fffff : PCI Bus 0005:04
> 30600000-30603fff : 0005:04:00.0
> 30600000-30603fff : nvme
> 40000000-4fffffff : PCI Bus 000c:00
> 40000000-401fffff : PCI Bus 000c:01
> 50000000-5fffffff : PCI Bus 000d:00
> 50000000-502fffff : PCI Bus 000d:01
> 50000000-501fffff : PCI Bus 000d:02
> 50000000-501fffff : PCI Bus 000d:03
> 50000000-500fffff : 000d:03:00.0
> 50100000-5011ffff : 000d:03:00.0
> 50120000-50123fff : 000d:03:00.1
> 50120000-50123fff : ICH HD audio
> 50200000-50203fff : 000d:01:00.0
> 70000000-7fffffff : PCI Bus 0000:00
> 70000000-701fffff : PCI Bus 0000:01
> 88300000-883fffff : reserved
> 88500000-885fffff : IFX0785:00
> 88500000-885fffff : IFX0785:00
> 88900000-8891ffff : AMPC0005:00
> 90000000-91ffffff : System RAM
> 92000000-927bffff : reserved
> 927c0000-f896ffff : System RAM
> d54f0000-d6adffff : Kernel code
> d6ae0000-d6daffff : reserved
> d6db0000-d717ffff : Kernel data
> ef650000-f3650fff : reserved
> f3850000-f49a2fff : reserved
> f88b0000-f88bffff : reserved
> f8970000-f898ffff : reserved
> f8990000-f899ffff : System RAM
> f89a0000-f89fffff : reserved
> f8a00000-f9196fff : System RAM
> f8a00000-f8a00fff : reserved
> f8a02000-f8a02fff : reserved
> f9197000-f91ecfff : reserved
> f91ed000-f94cffff : System RAM
> f91fb000-f91fbfff : reserved
> f94d0000-f950ffff : reserved
> f9510000-f98bffff : System RAM
> f98c0000-f98fffff : reserved
> f9900000-f999ffff : System RAM
> f99a0000-f99dffff : reserved
> f99e0000-f9f4ffff : System RAM
> f9ef0000-f9f1ffff : reserved
> f9f50000-f9f6ffff : reserved
> f9f70000-fa0affff : System RAM
> fa0b0000-fa0effff : reserved
> fa0f0000-fa1cffff : System RAM
> fa1d0000-fa26ffff : reserved
> fa270000-fa33ffff : System RAM
> fa340000-fa4affff : reserved
> fa4b0000-fa4bffff : System RAM
> fa4c0000-fa57ffff : reserved
> fa580000-fa72ffff : System RAM
> fa730000-fa7cffff : reserved
> fa7d0000-faa4ffff : System RAM
> faa50000-faaeffff : reserved
> faaf0000-fab7ffff : System RAM
> fab80000-fac1ffff : reserved
> fac20000-facaffff : System RAM
> facb0000-fad4ffff : reserved
> fad50000-fae1ffff : System RAM
> fae20000-faebffff : reserved
> faec0000-faf4ffff : System RAM
> faf50000-fafeffff : reserved
> faff0000-ffefffff : System RAM
> fbe00000-ffdfffff : reserved
> fff00000-fff4ffff : reserved
> fff50000-fffaffff : System RAM
> fffb0000-fffdffff : reserved
> fffc0000-fffc0fff : reserved
> fffe0000-ffffffff : System RAM
> fffe0000-fffeffff : reserved
> 80000000000-8007fffffff : System RAM
> 800002bc000-800002bcfff : reserved
> 80000840000-8000084ffff : reserved
> 80000850000-8000085ffff : reserved
> 80000860000-8000086ffff : reserved
> 80000870000-8000087ffff : reserved
> 80000880000-8000088ffff : reserved
> 80000890000-8000089ffff : reserved
> 800008a0000-800008affff : reserved
> 800008b0000-800008bffff : reserved
> 800008c0000-800008cffff : reserved
> 800008d0000-800008dffff : reserved
> 800008e0000-800008effff : reserved
> 800008f0000-800008fffff : reserved
> 80000900000-8000090ffff : reserved
> 80000910000-8000091ffff : reserved
> 80000920000-8000092ffff : reserved
> 80000930000-8000093ffff : reserved
> 80000940000-8000094ffff : reserved
> 80000950000-8000095ffff : reserved
> 80000960000-8000096ffff : reserved
> 80000970000-8000097ffff : reserved
> 80000980000-8000098ffff : reserved
> 80000990000-8000099ffff : reserved
> 800009a0000-800009affff : reserved
> 800009b0000-800009bffff : reserved
> 800009c0000-800009cffff : reserved
> 800009d0000-800009dffff : reserved
> 800009e0000-800009effff : reserved
> 800009f0000-800009fffff : reserved
> 80000a00000-80000a0ffff : reserved
> 80000a10000-80000a1ffff : reserved
> 80000a20000-80000a2ffff : reserved
> 80000a30000-80000a3ffff : reserved
> 80000a40000-80000a4ffff : reserved
> 80100000000-807ffffffff : System RAM
> 807d8c10000-807fbffffff : reserved
> 807fc009000-807fc039fff : reserved
> 807fc03c000-807fc03ffff : reserved
> 807fc040000-807fc040fff : reserved
> 807fc041000-807fc044fff : reserved
> 807fc045000-807fc06afff : reserved
> 807fc06b000-807ffffffff : reserved
> 100002600000-100002600fff : ARMH0011:00
> 100002600000-100002600fff : ARMH0011:00 ARMH0011:00
> 100002620000-100002620fff : ARMH0011:01
> 100002620000-100002620fff : ARMH0011:01 ARMH0011:01
> 1000026c0000-1000026cffff : APMC0D0F:00
> 1000026c0000-1000026cffff : APMC0D0F:00 APMC0D0F:00
> 1000026d0000-1000026dffff : APMC0D07:02
> 1000026f0000-1000026fffff : APMC0D07:00
> 100002730000-100002730fff : arch_mem_timer
> 100002750000-10000275ffff : APMC0D0F:01
> 100002750000-10000275ffff : APMC0D0F:01 APMC0D0F:01
> 100002780000-10000278ffff : APMC0D0F:02
> 100002780000-10000278ffff : APMC0D0F:02 APMC0D0F:02
> 1000027b0000-1000027bffff : APMC0D07:01
> 1000027c0000-1000027c0fff : sbsa-gwdt.0
> 1000027c0000-1000027c0fff : sbsa-gwdt.0 sbsa-gwdt.0
> 1000027d0000-1000027d0fff : sbsa-gwdt.0
> 1000027d0000-1000027d0fff : sbsa-gwdt.0 sbsa-gwdt.0
> 100010000000-10001fffffff : ARMHC600:00
> 100012500000-1000164fffff : ARMHC600:00
> 10008c000a00-10008c000bff : ARMHD620:00
> 10008d000a00-10008d000bff : ARMHD620:04
> 100100000000-10010000ffff : GICD
> 100100140000-10010113ffff : GICR
> 200000000000-23ffdfffffff : PCI Bus 0002:00
> 200000000000-2000001fffff : PCI Bus 0002:01
> 200000200000-2000003fffff : PCI Bus 0002:02
> 23ffe0000000-23ffe001ffff : arm-smmu-v3.3.auto
> 23ffe0000000-23ffe0000dff : arm-smmu-v3.3.auto
> 23ffe0010000-23ffe0010dff : arm-smmu-v3.3.auto
> 23fff0000000-23ffffffffff : PCI ECAM
> 27fff0000000-27ffffffffff : pnp 00:00
> 280000000000-2bffdfffffff : PCI Bus 0004:00
> 280000000000-2800001fffff : PCI Bus 0004:01
> 280000200000-2800003fffff : PCI Bus 0004:03
> 280000400000-2800005fffff : PCI Bus 0004:04
> 280000600000-2800007fffff : PCI Bus 0004:05
> 2bffe0000000-2bffe001ffff : arm-smmu-v3.4.auto
> 2bffe0000000-2bffe0000dff : arm-smmu-v3.4.auto
> 2bffe0010000-2bffe0010dff : arm-smmu-v3.4.auto
> 2bfff0000000-2bffffffffff : PCI ECAM
> 2c0000000000-2fffdfffffff : PCI Bus 0005:00
> 2c0000000000-2c00001fffff : PCI Bus 0005:01
> 2c0000200000-2c00003fffff : PCI Bus 0005:02
> 2c0000400000-2c00005fffff : PCI Bus 0005:03
> 2c0000600000-2c00007fffff : PCI Bus 0005:04
> 2fffe0000000-2fffe001ffff : arm-smmu-v3.5.auto
> 2fffe0000000-2fffe0000dff : arm-smmu-v3.5.auto
> 2fffe0010000-2fffe0010dff : arm-smmu-v3.5.auto
> 2ffff0000000-2fffffffffff : PCI ECAM
> 300000000000-33ffdfffffff : PCI Bus 000c:00
> 300000000000-3000001fffff : PCI Bus 000c:01
> 33ffe0000000-33ffe001ffff : arm-smmu-v3.0.auto
> 33ffe0000000-33ffe0000dff : arm-smmu-v3.0.auto
> 33ffe0010000-33ffe0010dff : arm-smmu-v3.0.auto
> 33fff0000000-33ffffffffff : PCI ECAM
> 340000000000-37ffdfffffff : PCI Bus 000d:00
> 340000000000-3402ffffffff : PCI Bus 000d:01
> 340000000000-3402ffffffff : PCI Bus 000d:02
> 340000000000-3402ffffffff : PCI Bus 000d:03
> 340000000000-3401ffffffff : 000d:03:00.0
> 340200000000-3402001fffff : 000d:03:00.0
> 37ffe0000000-37ffe001ffff : arm-smmu-v3.1.auto
> 37ffe0000000-37ffe0000dff : arm-smmu-v3.1.auto
> 37ffe0010000-37ffe0010dff : arm-smmu-v3.1.auto
> 37fff0000000-37ffffffffff : PCI ECAM
> 3bfff0000000-3bffffffffff : pnp 00:00
> 3c0000000000-3fffdfffffff : PCI Bus 0000:00
> 3c0000000000-3c00001fffff : PCI Bus 0000:01
> 3fffe0000000-3fffe001ffff : arm-smmu-v3.2.auto
> 3fffe0000000-3fffe0000dff : arm-smmu-v3.2.auto
> 3fffe0010000-3fffe0010dff : arm-smmu-v3.2.auto
> 3ffff0000000-3fffffffffff : PCI ECAM
> 63fff0000000-63ffffffffff : pnp 00:00
> 67fff0000000-67ffffffffff : pnp 00:00
> 6bfff0000000-6bffffffffff : pnp 00:00
> 6ffff0000000-6fffffffffff : pnp 00:00
> 7bfff0000000-7bffffffffff : pnp 00:00
> 7ffff0000000-7fffffffffff : pnp 00:00
>
> working dmesg from same:
>
> [ 15.500492] [drm] GPU posting now...
> [ 15.504110] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
> [ 15.512654] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
> [ 15.521431] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
> [ 15.530230] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.539881] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.549528] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.549535] pcieport 000d:00:01.0: bridge window [io 0x1000-0x0fff] to [bus 01-03] add_size 1000
> [ 15.549544] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.549546] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
> [ 15.549549] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
> [ 15.596468] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
> [ 15.607594] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
> [ 15.618090] pcieport 000d:00:01.0: bridge window [io size 0x1000]: ignoring failure in optional allocation
> [ 15.618095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.628249] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.637806] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
> [ 15.646506] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
> [ 15.655205] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
> [ 15.660856] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
> [ 15.668069] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
> [ 15.676931] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
> [ 15.682586] pcieport 000d:01:00.0: bridge window [mem 0x50000000-0x501fffff]
> [ 15.689804] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
> [ 15.698672] pcieport 000d:02:00.0: PCI bridge to [bus 03]
> [ 15.704067] pcieport 000d:02:00.0: bridge window [mem 0x50000000-0x501fffff]
> [ 15.711285] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
> [ 15.720157] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
> [ 15.729714] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
> [ 15.738064] [drm] Detected VRAM RAM=8176M, BAR=8192M
> [ 15.743019] [drm] RAM width 128bits GDDR6
> [ 15.747258] [drm] amdgpu: 8176M of VRAM memory ready
> [ 15.752219] [drm] amdgpu: 15888M of GTT memory ready.
> [ 15.757297] [drm] GART: num cpu pages 131072, num gpu pages 131072
> [ 15.763558] [drm] PCIE GART of 512M enabled (table at 0x00000081FEB00000).
> [ 15.884845] [drm] Loading DMUB firmware via PSP: version=0x07002D00
> [ 16.129125] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
>
> From discussions with Ard it seems if the firmware had resized the BAR first,
> and then assigned the resources, there would be no issue. However there
> is no later firmware for the platform.
>
> While the PCI change has provoked this regression I suspect the amdgpu code
> could handle the failure to resize the BAR better and if it can't get
> what it wants just not initialise the driver. I did hit some cases while
> bisecting where the GPU just wasn't visible.
>
> I'm available to test patches and generate additional debug info so do
> let me know if there is anything I can do to help.
>
> Thanks,
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
2025-10-22 16:51 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform Alex Bennée
2025-10-22 17:08 ` Ard Biesheuvel
@ 2025-10-23 16:20 ` Bjorn Helgaas
2025-10-23 17:24 ` Ilpo Järvinen
2 siblings, 0 replies; 4+ messages in thread
From: Bjorn Helgaas @ 2025-10-23 16:20 UTC (permalink / raw)
To: Alex Bennée
Cc: linux-pci, Ard Biesheuvel, Lorenzo Pieralisi, Alex Deucher,
Christian König, amd-gfx, Bjorn Helgaas, Ilpo Järvinen,
D Scott Phillips, regressions
On Wed, Oct 22, 2025 at 05:51:24PM +0100, Alex Bennée wrote:
> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize its BAR, resulting in an SError and failure to
> boot:
> ...
#regzbot ^introduced: 2499f5348431 ("PCI: Rework optional resource handling")
#regzbot title: arm64 SError panic with amdgpu BAR resize
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
2025-10-22 16:51 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform Alex Bennée
2025-10-22 17:08 ` Ard Biesheuvel
2025-10-23 16:20 ` Bjorn Helgaas
@ 2025-10-23 17:24 ` Ilpo Järvinen
2 siblings, 0 replies; 4+ messages in thread
From: Ilpo Järvinen @ 2025-10-23 17:24 UTC (permalink / raw)
To: Alex Bennée
Cc: linux-pci, Ard Biesheuvel, Lorenzo Pieralisi, Alex Deucher,
Christian König, amd-gfx, Bjorn Helgaas, D Scott Phillips
On Wed, 22 Oct 2025, Alex Bennée wrote:
> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize its BAR, resulting in an SError and failure to
> boot:
>
> [ 15.348097] amdgpu 000d:03:00.0: amdgpu: detected ip block number 8 <vcn_v4_0>
> [ 15.355901] amdgpu 000d:03:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0>
> [ 15.363202] amdgpu 000d:03:00.0: amdgpu: detected ip block number 10 <mes_v11_0>
> [ 15.384163] amdgpu 000d:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
> [ 15.390434] amdgpu: ATOM BIOS: 113-4481LHS-UC1
> [ 15.400079] amdgpu 000d:03:00.0: amdgpu: CP RS64 enable
> [ 15.411830] amdgpu 000d:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
> [ 15.419932] amdgpu 000d:03:00.0: amdgpu: PCIE atomic ops is not supported
> [ 15.426719] [drm] GPU posting now...
> [ 15.430329] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
> [ 15.438871] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
> [ 15.447648] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
> [ 15.456452] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.466095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.475738] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
> [ 15.485386] pcieport 000d:00:01.0: bridge window [io 0x1000-0x0fff] to [bus 01-03] add_size 1000
> [ 15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.503809] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
> [ 15.512063] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
> [ 15.519796] pcieport 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space
> [ 15.528049] pcieport 000d:00:01.0: bridge window [io size 0x1000]: failed to assign
> [ 15.535787] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.545349] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
> [ 15.554911] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
> [ 15.563612] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
> [ 15.572313] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
> [ 15.577962] pcieport 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]
> [ 15.585175] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
> [ 15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref]
>
> Failure to claim space for the bridge window...
Thanks for the report.
I was just looking at a similar oddity from another reporter and, thanks
to getting this second case with an "impossible" claim conflict, I was
finally able to zero in on a bug in the resize code which has been there
since the introduction of BAR resizing.
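To put numbers on the conflicting ranges, here is a minimal Python sketch (not kernel code; the helper names are made up for illustration). It only restates the window and BAR ranges from the failing log above and does not model the resize bug itself.

```python
def size(start, end):
    """Size in bytes of an inclusive [start, end] resource range."""
    return end - start + 1

def overlaps(a, b):
    """True if two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]

# Ranges copied from the failing dmesg.
old_window = (0x340000000000, 0x340017ffffff)  # pref window before the resize
new_window = (0x340000000000, 0x3402ffffffff)  # pref window after the resize
bar0_8g = (0x340000000000, 0x3401ffffffff)     # BAR 0 once resized to 8G

MIB = 1 << 20
print(size(*old_window) // MIB)          # 384
print(size(*new_window) // MIB)          # 12288
print(size(*bar0_8g) // MIB)             # 8192
print(overlaps(old_window, new_window))  # True
```

The old 384M window (the "mem size 0x18000000" in the log) cannot hold an 8G BAR 0, which is consistent with the "Not enough PCI address space for a large BAR" message once the claim of the old window is retried.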
It will take a few days for me to come up with fixes that also address the
problems you'd likely hit next after this claim conflict bug is fixed.
> From discussions with Ard it seems if the firmware had resized the BAR first,
> and then assigned the resources, there would be no issue. However there
> is no later firmware for the platform.
We want to make the kernel capable of considering BARs with their maximum
sizes eventually so it wouldn't matter what FW does. I've been working
in that direction for a while now but I keep getting distracted by
fixing all these other bugs in the existing code. :-)
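As a rough illustration of what "maximum sizes" means for this card, the sketch below uses the Resizable BAR size encoding (index n stands for 2^(n+20) bytes, so 0 is 1MB). The function names are hypothetical Python helpers, not kernel APIs, and the rounding rule is an assumption about how a driver might pick a size.

```python
def rebar_size_to_bytes(size_index):
    """Resizable BAR encoding: index n means 2^(n + 20) bytes."""
    return 1 << (size_index + 20)

def rebar_bytes_to_size(nbytes):
    """Smallest size index whose BAR covers nbytes (at least 1MB)."""
    order = max((nbytes - 1).bit_length(), 20)  # round up to a power of two
    return order - 20

MIB = 1 << 20

# Indexes 8..13 give the BAR 0 sizes lspci lists as supported.
print([rebar_size_to_bytes(i) // MIB for i in range(8, 14)])
# [256, 512, 1024, 2048, 4096, 8192]

# 8176M of VRAM (from the dmesg) rounds up to the 8G setting.
print(rebar_bytes_to_size(8176 * MIB))  # 13
```

With BAR 0 at index 13 (8G), the 384M window from before the resize is far too small, which is why the bridge windows have to be released and reassigned first.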
> While the PCI change has provoked this regression I suspect the amdgpu code
> could handle the failure to resize the BAR better and if it can't get
> what it wants just not initialise the driver. I did hit some cases while
> bisecting where the GPU just wasn't visible.
Indeed, things could be better on multiple levels.
Also the entire pci_resize_resource() API is flawed in that it isn't
currently able to restore all of the device's resources as they were in
case of a failure. It seems I might have to fix it now as there seems to
be no other way to fix this claim conflict problem.
...And the fix will be a bit invasive as I need to merge
pbus_reassign_bridge_resources() and pci_resize_resource() into a new
pci_release_and_resize_resource() API that handles rollback properly
in case of an error.
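The rollback idea can be sketched in a few lines of Python. This is a toy model under loud assumptions: the function name, the dict of sizes and the single "window limit" check are all hypothetical, alignment is ignored, and none of it reflects the actual kernel implementation.

```python
import copy

class ResizeError(Exception):
    """Raised when the resized resources do not fit (toy model)."""

def release_and_resize(resources, bar, new_size, window_limit):
    """Resize `bar` to `new_size`, or leave `resources` exactly as it was."""
    saved = copy.deepcopy(resources)  # snapshot everything for rollback
    try:
        resources[bar] = new_size     # tentatively apply the new size
        if sum(resources.values()) > window_limit:
            raise ResizeError("not enough space in bridge window")
        return True
    except ResizeError:
        resources.clear()
        resources.update(saved)       # restore all resources, not just `bar`
        return False

res = {"BAR0": 256 << 20, "BAR2": 2 << 20}
print(release_and_resize(res, "BAR0", 8 << 30, 0x18000000))   # False
print(res["BAR0"] == 256 << 20)                               # True
print(release_and_resize(res, "BAR0", 8 << 30, 0x300000000))  # True
```

The point of the snapshot is that a failed attempt leaves every resource at its previous value, instead of leaving the device half-reconfigured.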
> I'm available to test patches and generate additional debug info so do
> let me know if there is anything I can do to help.
Thanks, I'll send the fix series for testing once it is ready.
--
i.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-10-23 17:24 UTC | newest]
Thread overview: 4+ messages
2025-10-22 16:51 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform Alex Bennée
2025-10-22 17:08 ` Ard Biesheuvel
2025-10-23 16:20 ` Bjorn Helgaas
2025-10-23 17:24 ` Ilpo Järvinen