linux-pci.vger.kernel.org archive mirror
* 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
@ 2025-10-22 16:51 Alex Bennée
  2025-10-22 17:08 ` Ard Biesheuvel
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Alex Bennée @ 2025-10-22 16:51 UTC (permalink / raw)
  To: linux-pci
  Cc: Ard Biesheuvel, Lorenzo Pieralisi, Alex Deucher,
	Christian König, amd-gfx, Bjorn Helgaas, Ilpo Järvinen,
	D Scott Phillips


Hi,

I've been tracking a regression on my Arm64 (Altra) AVA platform between
6.14 and 6.15. It looks like the rework commit broke the ability of the
amdgpu driver to resize its BAR, resulting in an SError and failure to
boot:

  [   15.348097] amdgpu 000d:03:00.0: amdgpu: detected ip block number 8 <vcn_v4_0>
  [   15.355901] amdgpu 000d:03:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0>
  [   15.363202] amdgpu 000d:03:00.0: amdgpu: detected ip block number 10 <mes_v11_0>
  [   15.384163] amdgpu 000d:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
  [   15.390434] amdgpu: ATOM BIOS: 113-4481LHS-UC1
  [   15.400079] amdgpu 000d:03:00.0: amdgpu: CP RS64 enable
  [   15.411830] amdgpu 000d:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
  [   15.419932] amdgpu 000d:03:00.0: amdgpu: PCIE atomic ops is not supported
  [   15.426719] [drm] GPU posting now...
  [   15.430329] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
  [   15.438871] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
  [   15.447648] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
  [   15.456452] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
  [   15.466095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
  [   15.475738] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
  [   15.485386] pcieport 000d:00:01.0: bridge window [io  0x1000-0x0fff] to [bus 01-03] add_size 1000
  [   15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
  [   15.503809] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
  [   15.512063] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
  [   15.519796] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
  [   15.528049] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
  [   15.535787] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
  [   15.545349] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
  [   15.554911] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
  [   15.563612] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
  [   15.572313] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
  [   15.577962] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
  [   15.585175] pcieport 000d:00:01.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
  [   15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref]

Failure to claim space for the bridge window...

  [   15.611321] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
  [   15.616971] pcieport 000d:00:01.0:   bridge window [io  size 0x1000]
  [   15.623315] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
  [   15.630527] pcieport 000d:00:01.0:   bridge window [mem size 0x18000000 64bit pref]
  [   15.638174] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
  [   15.650508] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
  [   15.656164] pcieport 000d:01:00.0:   bridge window [mem 0x50000000-0x501fffff]
  [   15.663381] pcieport 000d:01:00.0:   bridge window [mem size 0x18000000 64bit pref]
  [   15.671036] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
  [   15.683370] pcieport 000d:02:00.0: PCI bridge to [bus 03]
  [   15.688764] pcieport 000d:02:00.0:   bridge window [mem 0x50000000-0x501fffff]
  [   15.695982] pcieport 000d:02:00.0:   bridge window [mem size 0x18000000 64bit pref]
  [   15.703643] [drm] Not enough PCI address space for a large BAR.

Realisation that there is not enough space for the BAR
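
To make the sizes in the log lines above concrete, here is a minimal
sketch that computes them from the inclusive address ranges. The ranges
are copied from the log; the helper name span_size is made up for this
illustration and is not a kernel function.

```python
def span_size(start: int, end: int) -> int:
    """Size in bytes of an inclusive [start, end] address range."""
    return end - start + 1


MiB = 1 << 20
GiB = 1 << 30

# Ranges copied from the failing boot log above.
old_window = span_size(0x340000000000, 0x340017FFFFFF)  # bridge window before the resize
new_window = span_size(0x340000000000, 0x3402FFFFFFFF)  # bridge window assigned for the resize
bar0_small = span_size(0x340000000000, 0x34000FFFFFFF)  # BAR 0 before the resize
bar0_large = span_size(0x340000000000, 0x3401FFFFFFFF)  # BAR 0 after the resize
bar2 = span_size(0x340200000000, 0x3402001FFFFF)        # BAR 2 after the resize

print(f"old window: {old_window // MiB} MiB ({old_window:#x})")
print(f"new window: {new_window // GiB} GiB")
print(f"BAR 0: {bar0_small // MiB} MiB -> {bar0_large // GiB} GiB")
print(f"BAR 2: {bar2 // MiB} MiB")

# The resized BAR 0 cannot fit in a window of the old size.
print("resized BAR 0 fits old window:", bar0_large <= old_window)
```

The old window size matches the "mem size 0x18000000" lines printed for
the bridges once the claim fails, which is far smaller than the resized
BAR 0.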

  [   15.703648] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
  [   15.719119] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
  [   15.727470] [drm] Detected VRAM RAM=8176M, BAR=256M
  [   15.732339] [drm] RAM width 128bits GDDR6
  [   15.736552] [drm] amdgpu: 8176M of VRAM memory ready
  [   15.741516] [drm] amdgpu: 15888M of GTT memory ready.
  [   15.746592] [drm] GART: num cpu pages 131072, num gpu pages 131072
  [   15.752862] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
  [   15.850408] [drm] Loading DMUB firmware via PSP: version=0x07002D00
  [   16.128604] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
  [   16.446347] SError Interrupt on CPU3, code 0x00000000be000411 -- SError
  [   16.446354] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G     U             6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
  [   16.446359] Tainted: [U]=USER
  [   16.446360] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
  [   16.446362] Workqueue: efi_rts_wq efi_call_rts
  [   16.446371] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [   16.446374] pc : __wake_up_common_lock+0x40/0xc0
  [   16.446379] lr : __wake_up+0x20/0x40
  [   16.446382] sp : ffff800080aa3790
  [   16.446383] x29: ffff800080aa3790 x28: ffff3e8780bcb780 x27: 00000000fa481000
  [   16.446387] x26: ffff3e87a7e14b98 x25: ffffb6df6e1e2978 x24: ffffb6df6e351ed8
  [   16.446390] x23: ffff3e87a7e10000 x22: 00000000000000c0 x21: 0000000000000003
  [   16.446392] x20: 0000000000000000 x19: ffff3e87a7e14b98 x18: 0000000000000000
  [   16.446395] x17: ffff3e878245d180 x16: ffffb6dfa26e0c28 x15: ffff3e87810bcbc0
  [   16.446398] x14: 00000000fa481758 x13: 0000000000000000 x12: ffff800080aa3dd7
  [   16.446401] x11: 0000000000000040 x10: ffff3e87801ba830 x9 : ffffb6dfa26e0c48
  [   16.446403] x8 : ffff3e8786eb5268 x7 : 0000000000000000 x6 : 0000000000000000
  [   16.446406] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
  [   16.446408] x2 : 0000000000000000 x1 : 0000000000000003 x0 : 0000000000000001
  [   16.446412] Kernel panic - not syncing: Asynchronous SError Interrupt

Boom - unrecoverable bus error triggered by the PCI access.
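
For readers who want to pick apart the SError code in the panic above,
here is a small sketch that splits 0xbe000411 into fields. It assumes
the usual ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits
[24:0]) and, for an SError, DFSC in ISS[5:0] and AET in ISS[12:10]. The
function name and the interpretation in the comments are assumptions to
be checked against the Arm architecture manual, not taken from the
report.

```python
def decode_esr(esr: int) -> dict:
    """Split a 32-bit ESR_ELx value into its top-level fields."""
    iss = esr & 0x1FFFFFF            # ISS: bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class: bits [31:26]
        "il": (esr >> 25) & 0x1,     # instruction length bit: bit 25
        "iss": iss,
        "dfsc": iss & 0x3F,          # fault status code (SError view): ISS[5:0]
        "aet": (iss >> 10) & 0x7,    # asynchronous error type (SError view): ISS[12:10]
    }


fields = decode_esr(0xBE000411)
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

Under those assumptions EC comes out as 0x2f (SError interrupt), DFSC as
0x11 (asynchronous SError) and AET as 1, which would be consistent with
the error being treated as unrecoverable.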

  [   16.446414] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G     U             6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
  [   16.446417] Tainted: [U]=USER
  [   16.446418] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
  [   16.446419] Workqueue: efi_rts_wq efi_call_rts
  [   16.446424] Call trace:
  [   16.446425]  show_stack+0x34/0x98 (C)
  [   16.446431]  dump_stack_lvl+0x60/0x80
  [   16.446436]  dump_stack+0x18/0x24
  [   16.446440]  panic+0x164/0x378
  [   16.446443]  nmi_panic+0x90/0x98
  [   16.446448]  arm64_serror_panic+0x6c/0x80
  [   16.446452]  do_serror+0x30/0x78
  [   16.446456]  el1h_64_error_handler+0x30/0x50
  [   16.446462]  el1h_64_error+0x6c/0x70
  [   16.446464]  __wake_up_common_lock+0x40/0xc0 (P)
  [   16.446468]  __wake_up+0x20/0x40
  [   16.446471]  amdgpu_ih_process+0x100/0x160 [amdgpu]
  [   16.447083]  amdgpu_irq_handler+0x34/0xa0 [amdgpu]
  [   16.447637]  __handle_irq_event_percpu+0x60/0x1d8
  [   16.447642]  handle_irq_event+0x4c/0x110
  [   16.447646]  handle_fasteoi_irq+0xb4/0x220
  [   16.447649]  handle_irq_desc+0x3c/0x68
  [   16.447652]  generic_handle_domain_irq+0x24/0x40
  [   16.447656]  gic_handle_irq+0x54/0x124
  [   16.447658]  do_interrupt_handler+0x58/0xa0
  [   16.447661]  el1_interrupt+0x34/0x58
  [   16.447665]  el1h_64_irq_handler+0x18/0x28
  [   16.447669]  el1h_64_irq+0x6c/0x70
  [   16.447672]  0xfad10918 (P)
  [   16.447674]  0xfabe01c8
  [   16.447676]  0xfabe02d4
  [   16.447677]  0xfa3e209c
  [   16.447679]  0xfa43ae7c
  [   16.447680]  0xfa43b6bc
  [   16.447681]  0xfa436e44
  [   16.447683]  0xfa43c3f8
  [   16.447684]  __efi_rt_asm_wrapper+0x50/0x78
  [   16.447687]  efi_call_rts+0x1c8/0x280
  [   16.447691]  process_one_work+0x178/0x3e0
  [   16.447695]  worker_thread+0x204/0x3f0
  [   16.447698]  kthread+0x10c/0x1f0
  [   16.447703]  ret_from_fork+0x10/0x20
  [   16.447705] SMP: stopping secondary CPUs
  [   16.447796] Kernel Offset: 0x36df225a0000 from 0xffff800080000000
  [   16.447798] PHYS_OFFSET: 0xffffc97880000000
  [   16.447799] CPU features: 0x200,00002170,00901250,8241720b
  [   16.447802] Memory Limit: none
  [   16.471034] pstore: backend (efi_pstore) writing error (-16)
  [   16.801136] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

The bisection was slightly complicated by the fact I'm carrying some
additional patches to work around other PCIe issues, which however work
fine before the failing commit. For convenience I've pushed a branch with the
workarounds applied here:

  https://gitlab.com/stsquad/linux/-/commits/testing/pci-amdgpu-regression-reference

Additional information

lspci -vv info for card

  000d:03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600] (rev cf) (prog-if 00 [VGA controller])
          Subsystem: Sapphire Technology Limited Device e448
          Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
          Latency: 0
          Interrupt: pin A routed to IRQ 151
          NUMA node: 0
          IOMMU group: 21
          Region 0: Memory at 340000000000 (64-bit, prefetchable) [size=8G]
          Region 2: Memory at 340200000000 (64-bit, prefetchable) [size=2M]
          Region 5: Memory at 50000000 (32-bit, non-prefetchable) [size=1M]
          Expansion ROM at 50100000 [disabled] [size=128K]
          Capabilities: [48] Vendor Specific Information: Len=08 <?>
          Capabilities: [50] Power Management version 3
                  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
          Capabilities: [64] Express (v2) Legacy Endpoint, IntMsgNum 0
                  DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                          ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
                  DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
                          RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                          MaxPayload 128 bytes, MaxReadReq 512 bytes
                  DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                  LnkCap:	Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
                          ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                  LnkCtl:	ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                  LnkSta:	Speed 16GT/s, Width x8
                          TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                  DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                           10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                           EmergencyPowerReduction Form Factor Dev Specific, EmergencyPowerReductionInit-
                           FRS-
                           AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                  DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                           AtomicOpsCtl: ReqEn-
                           IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                           10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                  LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                  LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                           Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                           Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                  LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                           EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                           Retimer- 2Retimers- CrosslinkRes: unsupported
          Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                  Address: 00000000ffb77040  Data: 0000
          Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
          Capabilities: [150 v2] Advanced Error Reporting
                  UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                          ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                          PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                  UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                          ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                          PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                  UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                          ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                          PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                  CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
                  CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
                  AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                          MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                  HeaderLog: 00000000 00000000 00000000 00000000
          Capabilities: [200 v1] Physical Resizable BAR
                  BAR 0: current size: 8GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB
                  BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
          Capabilities: [240 v1] Power Budgeting <?>
          Capabilities: [270 v1] Secondary PCI Express
                  LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                  LaneErrStat: 0
          Capabilities: [2a0 v1] Access Control Services
                  ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                  ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
          Capabilities: [2d0 v1] Process Address Space ID (PASID)
                  PASIDCap: Exec+ Priv+, Max PASID Width: 10
                  PASIDCtl: Enable+ Exec+ Priv+
          Capabilities: [320 v1] Latency Tolerance Reporting
                  Max snoop latency: 0ns
                  Max no snoop latency: 0ns
          Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
          Capabilities: [450 v1] Lane Margining at the Receiver
                  PortCap: Uses Driver-
                  PortSta: MargReady+ MargSoftReady-
          Kernel driver in use: amdgpu
          Kernel modules: amdgpu
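
The Resizable BAR capability in the lspci output above lists sizes as
powers of two. As a rough sketch, the capability encodes a size as an
index n meaning 2^(n+20) bytes (index 0 is 1MB). The two helpers below
are hypothetical names written for this illustration; they show why an
8176M VRAM needs the 8GB setting for BAR 0.

```python
MIB = 1 << 20


def rebar_index_to_bytes(index: int) -> int:
    """Resizable BAR size index -> size in bytes (index 0 is 1 MiB)."""
    return 1 << (index + 20)


def rebar_bytes_to_index(nbytes: int) -> int:
    """Smallest size index whose BAR size covers nbytes (minimum 1 MiB)."""
    nbytes = max(nbytes, MIB)
    return (nbytes - 1).bit_length() - 20


# Sizes advertised for BAR 0 in the lspci output: 256MB .. 8GB.
bar0_supported = [rebar_bytes_to_index(mb * MIB)
                  for mb in (256, 512, 1024, 2048, 4096, 8192)]
print("BAR 0 supported indices:", bar0_supported)

# VRAM size reported by amdgpu in the dmesg output.
vram_bytes = 8176 * MIB
needed = rebar_bytes_to_index(vram_bytes)
print("index needed for full VRAM:", needed,
      "=", rebar_index_to_bytes(needed) // MIB, "MiB")
print("supported by BAR 0:", needed in bar0_supported)
```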

iomem layout from a working bootup (e89df6d2beae):

  08000000-0fffffff : PCI Bus 0002:00
    08000000-081fffff : PCI Bus 0002:01
    08200000-083fffff : PCI Bus 0002:02
  20000000-2fffffff : PCI Bus 0004:00
    20000000-217fffff : PCI Bus 0004:01
      20000000-217fffff : PCI Bus 0004:02
        20000000-20ffffff : 0004:02:00.0
          20000000-202fffff : efifb
        21000000-2101ffff : 0004:02:00.0
    21800000-219fffff : PCI Bus 0004:03
      21800000-21801fff : 0004:03:00.0
        21800000-21801fff : xhci-hcd
    21a00000-21bfffff : PCI Bus 0004:04
      21a00000-21a7ffff : 0004:04:00.0
        21a00000-21a7ffff : igb
      21a80000-21a83fff : 0004:04:00.0
        21a80000-21a83fff : igb
    21c00000-21dfffff : PCI Bus 0004:05
  30000000-3fffffff : PCI Bus 0005:00
    30000000-301fffff : PCI Bus 0005:01
    30200000-303fffff : PCI Bus 0005:02
    30400000-305fffff : PCI Bus 0005:03
      30400000-30403fff : 0005:03:00.0
        30400000-30403fff : nvme
    30600000-307fffff : PCI Bus 0005:04
      30600000-30603fff : 0005:04:00.0
        30600000-30603fff : nvme
  40000000-4fffffff : PCI Bus 000c:00
    40000000-401fffff : PCI Bus 000c:01
  50000000-5fffffff : PCI Bus 000d:00
    50000000-502fffff : PCI Bus 000d:01
      50000000-501fffff : PCI Bus 000d:02
        50000000-501fffff : PCI Bus 000d:03
          50000000-500fffff : 000d:03:00.0
          50100000-5011ffff : 000d:03:00.0
          50120000-50123fff : 000d:03:00.1
            50120000-50123fff : ICH HD audio
      50200000-50203fff : 000d:01:00.0
  70000000-7fffffff : PCI Bus 0000:00
    70000000-701fffff : PCI Bus 0000:01
  88300000-883fffff : reserved
  88500000-885fffff : IFX0785:00
    88500000-885fffff : IFX0785:00
  88900000-8891ffff : AMPC0005:00
  90000000-91ffffff : System RAM
  92000000-927bffff : reserved
  927c0000-f896ffff : System RAM
    d54f0000-d6adffff : Kernel code
    d6ae0000-d6daffff : reserved
    d6db0000-d717ffff : Kernel data
    ef650000-f3650fff : reserved
    f3850000-f49a2fff : reserved
    f88b0000-f88bffff : reserved
  f8970000-f898ffff : reserved
  f8990000-f899ffff : System RAM
  f89a0000-f89fffff : reserved
  f8a00000-f9196fff : System RAM
    f8a00000-f8a00fff : reserved
    f8a02000-f8a02fff : reserved
  f9197000-f91ecfff : reserved
  f91ed000-f94cffff : System RAM
    f91fb000-f91fbfff : reserved
  f94d0000-f950ffff : reserved
  f9510000-f98bffff : System RAM
  f98c0000-f98fffff : reserved
  f9900000-f999ffff : System RAM
  f99a0000-f99dffff : reserved
  f99e0000-f9f4ffff : System RAM
    f9ef0000-f9f1ffff : reserved
  f9f50000-f9f6ffff : reserved
  f9f70000-fa0affff : System RAM
  fa0b0000-fa0effff : reserved
  fa0f0000-fa1cffff : System RAM
  fa1d0000-fa26ffff : reserved
  fa270000-fa33ffff : System RAM
  fa340000-fa4affff : reserved
  fa4b0000-fa4bffff : System RAM
  fa4c0000-fa57ffff : reserved
  fa580000-fa72ffff : System RAM
  fa730000-fa7cffff : reserved
  fa7d0000-faa4ffff : System RAM
  faa50000-faaeffff : reserved
  faaf0000-fab7ffff : System RAM
  fab80000-fac1ffff : reserved
  fac20000-facaffff : System RAM
  facb0000-fad4ffff : reserved
  fad50000-fae1ffff : System RAM
  fae20000-faebffff : reserved
  faec0000-faf4ffff : System RAM
  faf50000-fafeffff : reserved
  faff0000-ffefffff : System RAM
    fbe00000-ffdfffff : reserved
  fff00000-fff4ffff : reserved
  fff50000-fffaffff : System RAM
  fffb0000-fffdffff : reserved
    fffc0000-fffc0fff : reserved
  fffe0000-ffffffff : System RAM
    fffe0000-fffeffff : reserved
  80000000000-8007fffffff : System RAM
    800002bc000-800002bcfff : reserved
    80000840000-8000084ffff : reserved
    80000850000-8000085ffff : reserved
    80000860000-8000086ffff : reserved
    80000870000-8000087ffff : reserved
    80000880000-8000088ffff : reserved
    80000890000-8000089ffff : reserved
    800008a0000-800008affff : reserved
    800008b0000-800008bffff : reserved
    800008c0000-800008cffff : reserved
    800008d0000-800008dffff : reserved
    800008e0000-800008effff : reserved
    800008f0000-800008fffff : reserved
    80000900000-8000090ffff : reserved
    80000910000-8000091ffff : reserved
    80000920000-8000092ffff : reserved
    80000930000-8000093ffff : reserved
    80000940000-8000094ffff : reserved
    80000950000-8000095ffff : reserved
    80000960000-8000096ffff : reserved
    80000970000-8000097ffff : reserved
    80000980000-8000098ffff : reserved
    80000990000-8000099ffff : reserved
    800009a0000-800009affff : reserved
    800009b0000-800009bffff : reserved
    800009c0000-800009cffff : reserved
    800009d0000-800009dffff : reserved
    800009e0000-800009effff : reserved
    800009f0000-800009fffff : reserved
    80000a00000-80000a0ffff : reserved
    80000a10000-80000a1ffff : reserved
    80000a20000-80000a2ffff : reserved
    80000a30000-80000a3ffff : reserved
    80000a40000-80000a4ffff : reserved
  80100000000-807ffffffff : System RAM
    807d8c10000-807fbffffff : reserved
    807fc009000-807fc039fff : reserved
    807fc03c000-807fc03ffff : reserved
    807fc040000-807fc040fff : reserved
    807fc041000-807fc044fff : reserved
    807fc045000-807fc06afff : reserved
    807fc06b000-807ffffffff : reserved
  100002600000-100002600fff : ARMH0011:00
    100002600000-100002600fff : ARMH0011:00 ARMH0011:00
  100002620000-100002620fff : ARMH0011:01
    100002620000-100002620fff : ARMH0011:01 ARMH0011:01
  1000026c0000-1000026cffff : APMC0D0F:00
    1000026c0000-1000026cffff : APMC0D0F:00 APMC0D0F:00
  1000026d0000-1000026dffff : APMC0D07:02
  1000026f0000-1000026fffff : APMC0D07:00
  100002730000-100002730fff : arch_mem_timer
  100002750000-10000275ffff : APMC0D0F:01
    100002750000-10000275ffff : APMC0D0F:01 APMC0D0F:01
  100002780000-10000278ffff : APMC0D0F:02
    100002780000-10000278ffff : APMC0D0F:02 APMC0D0F:02
  1000027b0000-1000027bffff : APMC0D07:01
  1000027c0000-1000027c0fff : sbsa-gwdt.0
    1000027c0000-1000027c0fff : sbsa-gwdt.0 sbsa-gwdt.0
  1000027d0000-1000027d0fff : sbsa-gwdt.0
    1000027d0000-1000027d0fff : sbsa-gwdt.0 sbsa-gwdt.0
  100010000000-10001fffffff : ARMHC600:00
    100012500000-1000164fffff : ARMHC600:00
  10008c000a00-10008c000bff : ARMHD620:00
  10008d000a00-10008d000bff : ARMHD620:04
  100100000000-10010000ffff : GICD
  100100140000-10010113ffff : GICR
  200000000000-23ffdfffffff : PCI Bus 0002:00
    200000000000-2000001fffff : PCI Bus 0002:01
    200000200000-2000003fffff : PCI Bus 0002:02
  23ffe0000000-23ffe001ffff : arm-smmu-v3.3.auto
    23ffe0000000-23ffe0000dff : arm-smmu-v3.3.auto
    23ffe0010000-23ffe0010dff : arm-smmu-v3.3.auto
  23fff0000000-23ffffffffff : PCI ECAM
  27fff0000000-27ffffffffff : pnp 00:00
  280000000000-2bffdfffffff : PCI Bus 0004:00
    280000000000-2800001fffff : PCI Bus 0004:01
    280000200000-2800003fffff : PCI Bus 0004:03
    280000400000-2800005fffff : PCI Bus 0004:04
    280000600000-2800007fffff : PCI Bus 0004:05
  2bffe0000000-2bffe001ffff : arm-smmu-v3.4.auto
    2bffe0000000-2bffe0000dff : arm-smmu-v3.4.auto
    2bffe0010000-2bffe0010dff : arm-smmu-v3.4.auto
  2bfff0000000-2bffffffffff : PCI ECAM
  2c0000000000-2fffdfffffff : PCI Bus 0005:00
    2c0000000000-2c00001fffff : PCI Bus 0005:01
    2c0000200000-2c00003fffff : PCI Bus 0005:02
    2c0000400000-2c00005fffff : PCI Bus 0005:03
    2c0000600000-2c00007fffff : PCI Bus 0005:04
  2fffe0000000-2fffe001ffff : arm-smmu-v3.5.auto
    2fffe0000000-2fffe0000dff : arm-smmu-v3.5.auto
    2fffe0010000-2fffe0010dff : arm-smmu-v3.5.auto
  2ffff0000000-2fffffffffff : PCI ECAM
  300000000000-33ffdfffffff : PCI Bus 000c:00
    300000000000-3000001fffff : PCI Bus 000c:01
  33ffe0000000-33ffe001ffff : arm-smmu-v3.0.auto
    33ffe0000000-33ffe0000dff : arm-smmu-v3.0.auto
    33ffe0010000-33ffe0010dff : arm-smmu-v3.0.auto
  33fff0000000-33ffffffffff : PCI ECAM
  340000000000-37ffdfffffff : PCI Bus 000d:00
    340000000000-3402ffffffff : PCI Bus 000d:01
      340000000000-3402ffffffff : PCI Bus 000d:02
        340000000000-3402ffffffff : PCI Bus 000d:03
          340000000000-3401ffffffff : 000d:03:00.0
          340200000000-3402001fffff : 000d:03:00.0
  37ffe0000000-37ffe001ffff : arm-smmu-v3.1.auto
    37ffe0000000-37ffe0000dff : arm-smmu-v3.1.auto
    37ffe0010000-37ffe0010dff : arm-smmu-v3.1.auto
  37fff0000000-37ffffffffff : PCI ECAM
  3bfff0000000-3bffffffffff : pnp 00:00
  3c0000000000-3fffdfffffff : PCI Bus 0000:00
    3c0000000000-3c00001fffff : PCI Bus 0000:01
  3fffe0000000-3fffe001ffff : arm-smmu-v3.2.auto
    3fffe0000000-3fffe0000dff : arm-smmu-v3.2.auto
    3fffe0010000-3fffe0010dff : arm-smmu-v3.2.auto
  3ffff0000000-3fffffffffff : PCI ECAM
  63fff0000000-63ffffffffff : pnp 00:00
  67fff0000000-67ffffffffff : pnp 00:00
  6bfff0000000-6bffffffffff : pnp 00:00
  6ffff0000000-6fffffffffff : pnp 00:00
  7bfff0000000-7bffffffffff : pnp 00:00
  7ffff0000000-7fffffffffff : pnp 00:00
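
A quick way to sanity-check a layout like the one above is to parse the
indentation and confirm every child range sits inside its parent. The
sketch below does that for the 000d segment excerpt only; it assumes the
two-space indentation used by /proc/iomem, and the function names are
invented for this example.

```python
def parse_iomem(text: str) -> list:
    """Parse /proc/iomem-style text into (depth, start, end, name) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        span, name = line.strip().split(" : ", 1)
        start, end = (int(part, 16) for part in span.split("-"))
        entries.append((indent // 2, start, end, name))
    return entries


def find_unnested(entries: list) -> list:
    """Return names of entries that are not contained in their parent range."""
    stack, bad = [], []
    for depth, start, end, name in entries:
        del stack[depth:]              # drop siblings and deeper levels
        if stack:
            p_start, p_end = stack[-1]
            if start < p_start or end > p_end:
                bad.append(name)
        stack.append((start, end))
    return bad


# Excerpt copied from the working layout above (indentation rebased to 0).
excerpt = "\n".join([
    "340000000000-37ffdfffffff : PCI Bus 000d:00",
    "  340000000000-3402ffffffff : PCI Bus 000d:01",
    "    340000000000-3402ffffffff : PCI Bus 000d:02",
    "      340000000000-3402ffffffff : PCI Bus 000d:03",
    "        340000000000-3401ffffffff : 000d:03:00.0",
    "        340200000000-3402001fffff : 000d:03:00.0",
])

entries = parse_iomem(excerpt)
print("entries parsed:", len(entries))
print("ranges outside their parent:", find_unnested(entries))
```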

working dmesg from same:

  [   15.500492] [drm] GPU posting now...
  [   15.504110] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
  [   15.512654] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
  [   15.521431] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
  [   15.530230] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
  [   15.539881] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
  [   15.549528] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
  [   15.549535] pcieport 000d:00:01.0: bridge window [io  0x1000-0x0fff] to [bus 01-03] add_size 1000
  [   15.549544] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
  [   15.549546] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
  [   15.549549] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
  [   15.596468] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
  [   15.607594] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
  [   15.618090] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: ignoring failure in optional allocation
  [   15.618095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
  [   15.628249] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
  [   15.637806] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
  [   15.646506] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
  [   15.655205] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
  [   15.660856] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
  [   15.668069] pcieport 000d:00:01.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
  [   15.676931] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
  [   15.682586] pcieport 000d:01:00.0:   bridge window [mem 0x50000000-0x501fffff]
  [   15.689804] pcieport 000d:01:00.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
  [   15.698672] pcieport 000d:02:00.0: PCI bridge to [bus 03]
  [   15.704067] pcieport 000d:02:00.0:   bridge window [mem 0x50000000-0x501fffff]
  [   15.711285] pcieport 000d:02:00.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
  [   15.720157] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
  [   15.729714] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
  [   15.738064] [drm] Detected VRAM RAM=8176M, BAR=8192M
  [   15.743019] [drm] RAM width 128bits GDDR6
  [   15.747258] [drm] amdgpu: 8176M of VRAM memory ready
  [   15.752219] [drm] amdgpu: 15888M of GTT memory ready.
  [   15.757297] [drm] GART: num cpu pages 131072, num gpu pages 131072
  [   15.763558] [drm] PCIE GART of 512M enabled (table at 0x00000081FEB00000).
  [   15.884845] [drm] Loading DMUB firmware via PSP: version=0x07002D00
  [   16.129125] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
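
The clearest difference between the two boots is the "Detected VRAM"
line. As a small sketch, the snippet below pulls the RAM and BAR values
out of that line from each log and checks whether the BAR covers all of
VRAM. The regular expression only matches the exact message format seen
in these logs, and the helper names are made up for this example.

```python
import re

DETECT_RE = re.compile(r"Detected VRAM RAM=(\d+)M, BAR=(\d+)M")


def vram_and_bar(line: str):
    """Return (vram_mib, bar_mib) from a 'Detected VRAM' line, or None."""
    match = DETECT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def bar_covers_vram(line: str) -> bool:
    """True if the BAR reported on the line is at least as large as VRAM."""
    parsed = vram_and_bar(line)
    if parsed is None:
        raise ValueError("not a 'Detected VRAM' line")
    vram_mib, bar_mib = parsed
    return bar_mib >= vram_mib


# Lines copied from the failing and working logs in this report.
failing = "[   15.727470] [drm] Detected VRAM RAM=8176M, BAR=256M"
working = "[   15.738064] [drm] Detected VRAM RAM=8176M, BAR=8192M"

print("failing boot:", vram_and_bar(failing), bar_covers_vram(failing))
print("working boot:", vram_and_bar(working), bar_covers_vram(working))
```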

From discussions with Ard it seems that if the firmware had resized the BAR
first, and then assigned the resources, there would be no issue. However, there
is no later firmware for the platform.

While the PCI change has provoked this regression, I suspect the amdgpu code
could handle the failure to resize the BAR better and, if it can't get
what it wants, just not initialise the driver. I did hit some cases while
bisecting where the GPU just wasn't visible.

I'm available to test patches and generate additional debug info so do
let me know if there is anything I can do to help.

Thanks,

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


* Re: 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
  2025-10-22 16:51 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform Alex Bennée
@ 2025-10-22 17:08 ` Ard Biesheuvel
  2025-10-23 16:20 ` Bjorn Helgaas
  2025-10-23 17:24 ` Ilpo Järvinen
  2 siblings, 0 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2025-10-22 17:08 UTC (permalink / raw)
  To: Alex Bennée
  Cc: linux-pci, Lorenzo Pieralisi, Alex Deucher, Christian König,
	amd-gfx, Bjorn Helgaas, Ilpo Järvinen, D Scott Phillips

On Wed, 22 Oct 2025 at 18:51, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Hi,
>
> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize its BAR, resulting in an SError and failure to
> boot:
>
...

> From discussions with Ard it seems if the firmware had resized the BAR first,
> and then assigned the resources, there would be no issue. However there
> is no later firmware for the platform.
>
> While the PCI change has provoked this regression I suspect the amdgpu code
> could handle the failure to resize the BAR better and if it can't get
> what it wants just not initialise the driver.

Actually, looking again at the below, which follows the error about
overlapping resource windows, it seems the PCI code is failing to roll
back the changes, and it is not the driver at fault here.

>   [   15.611321] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
>   [   15.616971] pcieport 000d:00:01.0:   bridge window [io  size 0x1000]
>   [   15.623315] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
>   [   15.630527] pcieport 000d:00:01.0:   bridge window [mem size 0x18000000 64bit pref]
>   [   15.638174] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
>   [   15.650508] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
>   [   15.656164] pcieport 000d:01:00.0:   bridge window [mem 0x50000000-0x501fffff]
>   [   15.663381] pcieport 000d:01:00.0:   bridge window [mem size 0x18000000 64bit pref]
>   [   15.671036] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
>   [   15.683370] pcieport 000d:02:00.0: PCI bridge to [bus 03]
>   [   15.688764] pcieport 000d:02:00.0:   bridge window [mem 0x50000000-0x501fffff]
>   [   15.695982] pcieport 000d:02:00.0:   bridge window [mem size 0x18000000 64bit pref]

On Wed, 22 Oct 2025 at 18:51, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Hi,
>
> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize its BAR, resulting in an SError and failure to
> boot:
>
>   [   15.348097] amdgpu 000d:03:00.0: amdgpu: detected ip block number 8 <vcn_v4_0>
>   [   15.355901] amdgpu 000d:03:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0>
>   [   15.363202] amdgpu 000d:03:00.0: amdgpu: detected ip block number 10 <mes_v11_0>
>   [   15.384163] amdgpu 000d:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
>   [   15.390434] amdgpu: ATOM BIOS: 113-4481LHS-UC1
>   [   15.400079] amdgpu 000d:03:00.0: amdgpu: CP RS64 enable
>   [   15.411830] amdgpu 000d:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
>   [   15.419932] amdgpu 000d:03:00.0: amdgpu: PCIE atomic ops is not supported
>   [   15.426719] [drm] GPU posting now...
>   [   15.430329] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
>   [   15.438871] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
>   [   15.447648] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
>   [   15.456452] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.466095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.475738] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.485386] pcieport 000d:00:01.0: bridge window [io  0x1000-0x0fff] to [bus 01-03] add_size 1000
>   [   15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.503809] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.512063] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.519796] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.528049] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.535787] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.545349] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.554911] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
>   [   15.563612] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
>   [   15.572313] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
>   [   15.577962] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
>   [   15.585175] pcieport 000d:00:01.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
>   [   15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref]
>
> Failure to claim space for the bridge window...
>
>   [   15.611321] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
>   [   15.616971] pcieport 000d:00:01.0:   bridge window [io  size 0x1000]
>   [   15.623315] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
>   [   15.630527] pcieport 000d:00:01.0:   bridge window [mem size 0x18000000 64bit pref]
>   [   15.638174] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
>   [   15.650508] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
>   [   15.656164] pcieport 000d:01:00.0:   bridge window [mem 0x50000000-0x501fffff]
>   [   15.663381] pcieport 000d:01:00.0:   bridge window [mem size 0x18000000 64bit pref]
>   [   15.671036] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; no compatible bridge window
>   [   15.683370] pcieport 000d:02:00.0: PCI bridge to [bus 03]
>   [   15.688764] pcieport 000d:02:00.0:   bridge window [mem 0x50000000-0x501fffff]
>   [   15.695982] pcieport 000d:02:00.0:   bridge window [mem size 0x18000000 64bit pref]
>   [   15.703643] [drm] Not enough PCI address space for a large BAR.
>
> Realisation that there is not enough space for the BAR
>
>   [   15.703648] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
>   [   15.719119] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
>   [   15.727470] [drm] Detected VRAM RAM=8176M, BAR=256M
>   [   15.732339] [drm] RAM width 128bits GDDR6
>   [   15.736552] [drm] amdgpu: 8176M of VRAM memory ready
>   [   15.741516] [drm] amdgpu: 15888M of GTT memory ready.
>   [   15.746592] [drm] GART: num cpu pages 131072, num gpu pages 131072
>   [   15.752862] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
>   [   15.850408] [drm] Loading DMUB firmware via PSP: version=0x07002D00
>   [   16.128604] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
>   [   16.446347] SError Interrupt on CPU3, code 0x00000000be000411 -- SError
>   [   16.446354] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G     U             6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
>   [   16.446359] Tainted: [U]=USER
>   [   16.446360] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
>   [   16.446362] Workqueue: efi_rts_wq efi_call_rts
>   [   16.446371] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>   [   16.446374] pc : __wake_up_common_lock+0x40/0xc0
>   [   16.446379] lr : __wake_up+0x20/0x40
>   [   16.446382] sp : ffff800080aa3790
>   [   16.446383] x29: ffff800080aa3790 x28: ffff3e8780bcb780 x27: 00000000fa481000
>   [   16.446387] x26: ffff3e87a7e14b98 x25: ffffb6df6e1e2978 x24: ffffb6df6e351ed8
>   [   16.446390] x23: ffff3e87a7e10000 x22: 00000000000000c0 x21: 0000000000000003
>   [   16.446392] x20: 0000000000000000 x19: ffff3e87a7e14b98 x18: 0000000000000000
>   [   16.446395] x17: ffff3e878245d180 x16: ffffb6dfa26e0c28 x15: ffff3e87810bcbc0
>   [   16.446398] x14: 00000000fa481758 x13: 0000000000000000 x12: ffff800080aa3dd7
>   [   16.446401] x11: 0000000000000040 x10: ffff3e87801ba830 x9 : ffffb6dfa26e0c48
>   [   16.446403] x8 : ffff3e8786eb5268 x7 : 0000000000000000 x6 : 0000000000000000
>   [   16.446406] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
>   [   16.446408] x2 : 0000000000000000 x1 : 0000000000000003 x0 : 0000000000000001
>   [   16.446412] Kernel panic - not syncing: Asynchronous SError Interrupt
>
> Boom - unrecoverable bus error triggered by the PCI access.
>
>   [   16.446414] CPU: 3 UID: 0 PID: 11 Comm: kworker/u128:0 Tainted: G     U             6.14.0-rc1-ajb-debian-bisect-00027-g2499f5348431-dirty #68
>   [   16.446417] Tainted: [U]=USER
>   [   16.446418] Hardware name: ADLINK AVA Developer Platform/AVA Developer Platform, BIOS TianoCore 2.04.100.07 (SYS: 2.06.20220308) 09/08/2022
>   [   16.446419] Workqueue: efi_rts_wq efi_call_rts
>   [   16.446424] Call trace:
>   [   16.446425]  show_stack+0x34/0x98 (C)
>   [   16.446431]  dump_stack_lvl+0x60/0x80
>   [   16.446436]  dump_stack+0x18/0x24
>   [   16.446440]  panic+0x164/0x378
>   [   16.446443]  nmi_panic+0x90/0x98
>   [   16.446448]  arm64_serror_panic+0x6c/0x80
>   [   16.446452]  do_serror+0x30/0x78
>   [   16.446456]  el1h_64_error_handler+0x30/0x50
>   [   16.446462]  el1h_64_error+0x6c/0x70
>   [   16.446464]  __wake_up_common_lock+0x40/0xc0 (P)
>   [   16.446468]  __wake_up+0x20/0x40
>   [   16.446471]  amdgpu_ih_process+0x100/0x160 [amdgpu]
>   [   16.447083]  amdgpu_irq_handler+0x34/0xa0 [amdgpu]
>   [   16.447637]  __handle_irq_event_percpu+0x60/0x1d8
>   [   16.447642]  handle_irq_event+0x4c/0x110
>   [   16.447646]  handle_fasteoi_irq+0xb4/0x220
>   [   16.447649]  handle_irq_desc+0x3c/0x68
>   [   16.447652]  generic_handle_domain_irq+0x24/0x40
>   [   16.447656]  gic_handle_irq+0x54/0x124
>   [   16.447658]  do_interrupt_handler+0x58/0xa0
>   [   16.447661]  el1_interrupt+0x34/0x58
>   [   16.447665]  el1h_64_irq_handler+0x18/0x28
>   [   16.447669]  el1h_64_irq+0x6c/0x70
>   [   16.447672]  0xfad10918 (P)
>   [   16.447674]  0xfabe01c8
>   [   16.447676]  0xfabe02d4
>   [   16.447677]  0xfa3e209c
>   [   16.447679]  0xfa43ae7c
>   [   16.447680]  0xfa43b6bc
>   [   16.447681]  0xfa436e44
>   [   16.447683]  0xfa43c3f8
>   [   16.447684]  __efi_rt_asm_wrapper+0x50/0x78
>   [   16.447687]  efi_call_rts+0x1c8/0x280
>   [   16.447691]  process_one_work+0x178/0x3e0
>   [   16.447695]  worker_thread+0x204/0x3f0
>   [   16.447698]  kthread+0x10c/0x1f0
>   [   16.447703]  ret_from_fork+0x10/0x20
>   [   16.447705] SMP: stopping secondary CPUs
>   [   16.447796] Kernel Offset: 0x36df225a0000 from 0xffff800080000000
>   [   16.447798] PHYS_OFFSET: 0xffffc97880000000
>   [   16.447799] CPU features: 0x200,00002170,00901250,8241720b
>   [   16.447802] Memory Limit: none
>   [   16.471034] pstore: backend (efi_pstore) writing error (-16)
>   [   16.801136] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>
> The bisection was slightly complicated by the fact I'm carrying some
> additional patches to work around other PCIe issues; these, however, work
> fine before the failing commit. For convenience I've pushed a branch with
> the workarounds applied here:
>
>   https://gitlab.com/stsquad/linux/-/commits/testing/pci-amdgpu-regression-reference
>
> Additional information
>
> lspci -vv info for card
>
>   000d:03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600] (rev cf) (prog-if 00 [VGA controller])
>           Subsystem: Sapphire Technology Limited Device e448
>           Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>           Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>           Latency: 0
>           Interrupt: pin A routed to IRQ 151
>           NUMA node: 0
>           IOMMU group: 21
>           Region 0: Memory at 340000000000 (64-bit, prefetchable) [size=8G]
>           Region 2: Memory at 340200000000 (64-bit, prefetchable) [size=2M]
>           Region 5: Memory at 50000000 (32-bit, non-prefetchable) [size=1M]
>           Expansion ROM at 50100000 [disabled] [size=128K]
>           Capabilities: [48] Vendor Specific Information: Len=08 <?>
>           Capabilities: [50] Power Management version 3
>                   Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
>                   Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>           Capabilities: [64] Express (v2) Legacy Endpoint, IntMsgNum 0
>                   DevCap:       MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
>                           ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
>                   DevCtl:       CorrErr- NonFatalErr- FatalErr- UnsupReq-
>                           RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
>                           MaxPayload 128 bytes, MaxReadReq 512 bytes
>                   DevSta:       CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
>                   LnkCap:       Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
>                           ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
>                   LnkCtl:       ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
>                           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                   LnkSta:       Speed 16GT/s, Width x8
>                           TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                   DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
>                            10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
>                            EmergencyPowerReduction Form Factor Dev Specific, EmergencyPowerReductionInit-
>                            FRS-
>                            AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
>                   DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
>                            AtomicOpsCtl: ReqEn-
>                            IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
>                            10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
>                   LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
>                   LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
>                            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                            Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
>                   LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
>                            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
>                            Retimer- 2Retimers- CrosslinkRes: unsupported
>           Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                   Address: 00000000ffb77040  Data: 0000
>           Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
>           Capabilities: [150 v2] Advanced Error Reporting
>                   UESta:        DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
>                           ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
>                           PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
>                   UEMsk:        DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
>                           ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
>                           PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
>                   UESvrt:       DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
>                           ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
>                           PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
>                   CESta:        RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
>                   CEMsk:        RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
>                   AERCap:       First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>                           MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                   HeaderLog: 00000000 00000000 00000000 00000000
>           Capabilities: [200 v1] Physical Resizable BAR
>                   BAR 0: current size: 8GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB
>                   BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
>           Capabilities: [240 v1] Power Budgeting <?>
>           Capabilities: [270 v1] Secondary PCI Express
>                   LnkCtl3: LnkEquIntrruptEn- PerformEqu-
>                   LaneErrStat: 0
>           Capabilities: [2a0 v1] Access Control Services
>                   ACSCap:       SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>                   ACSCtl:       SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>           Capabilities: [2d0 v1] Process Address Space ID (PASID)
>                   PASIDCap: Exec+ Priv+, Max PASID Width: 10
>                   PASIDCtl: Enable+ Exec+ Priv+
>           Capabilities: [320 v1] Latency Tolerance Reporting
>                   Max snoop latency: 0ns
>                   Max no snoop latency: 0ns
>           Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
>           Capabilities: [450 v1] Lane Margining at the Receiver
>                   PortCap: Uses Driver-
>                   PortSta: MargReady+ MargSoftReady-
>           Kernel driver in use: amdgpu
>           Kernel modules: amdgpu
>
> iomem layout from a working bootup (e89df6d2beae):
>
>   08000000-0fffffff : PCI Bus 0002:00
>     08000000-081fffff : PCI Bus 0002:01
>     08200000-083fffff : PCI Bus 0002:02
>   20000000-2fffffff : PCI Bus 0004:00
>     20000000-217fffff : PCI Bus 0004:01
>       20000000-217fffff : PCI Bus 0004:02
>         20000000-20ffffff : 0004:02:00.0
>           20000000-202fffff : efifb
>         21000000-2101ffff : 0004:02:00.0
>     21800000-219fffff : PCI Bus 0004:03
>       21800000-21801fff : 0004:03:00.0
>         21800000-21801fff : xhci-hcd
>     21a00000-21bfffff : PCI Bus 0004:04
>       21a00000-21a7ffff : 0004:04:00.0
>         21a00000-21a7ffff : igb
>       21a80000-21a83fff : 0004:04:00.0
>         21a80000-21a83fff : igb
>     21c00000-21dfffff : PCI Bus 0004:05
>   30000000-3fffffff : PCI Bus 0005:00
>     30000000-301fffff : PCI Bus 0005:01
>     30200000-303fffff : PCI Bus 0005:02
>     30400000-305fffff : PCI Bus 0005:03
>       30400000-30403fff : 0005:03:00.0
>         30400000-30403fff : nvme
>     30600000-307fffff : PCI Bus 0005:04
>       30600000-30603fff : 0005:04:00.0
>         30600000-30603fff : nvme
>   40000000-4fffffff : PCI Bus 000c:00
>     40000000-401fffff : PCI Bus 000c:01
>   50000000-5fffffff : PCI Bus 000d:00
>     50000000-502fffff : PCI Bus 000d:01
>       50000000-501fffff : PCI Bus 000d:02
>         50000000-501fffff : PCI Bus 000d:03
>           50000000-500fffff : 000d:03:00.0
>           50100000-5011ffff : 000d:03:00.0
>           50120000-50123fff : 000d:03:00.1
>             50120000-50123fff : ICH HD audio
>       50200000-50203fff : 000d:01:00.0
>   70000000-7fffffff : PCI Bus 0000:00
>     70000000-701fffff : PCI Bus 0000:01
>   88300000-883fffff : reserved
>   88500000-885fffff : IFX0785:00
>     88500000-885fffff : IFX0785:00
>   88900000-8891ffff : AMPC0005:00
>   90000000-91ffffff : System RAM
>   92000000-927bffff : reserved
>   927c0000-f896ffff : System RAM
>     d54f0000-d6adffff : Kernel code
>     d6ae0000-d6daffff : reserved
>     d6db0000-d717ffff : Kernel data
>     ef650000-f3650fff : reserved
>     f3850000-f49a2fff : reserved
>     f88b0000-f88bffff : reserved
>   f8970000-f898ffff : reserved
>   f8990000-f899ffff : System RAM
>   f89a0000-f89fffff : reserved
>   f8a00000-f9196fff : System RAM
>     f8a00000-f8a00fff : reserved
>     f8a02000-f8a02fff : reserved
>   f9197000-f91ecfff : reserved
>   f91ed000-f94cffff : System RAM
>     f91fb000-f91fbfff : reserved
>   f94d0000-f950ffff : reserved
>   f9510000-f98bffff : System RAM
>   f98c0000-f98fffff : reserved
>   f9900000-f999ffff : System RAM
>   f99a0000-f99dffff : reserved
>   f99e0000-f9f4ffff : System RAM
>     f9ef0000-f9f1ffff : reserved
>   f9f50000-f9f6ffff : reserved
>   f9f70000-fa0affff : System RAM
>   fa0b0000-fa0effff : reserved
>   fa0f0000-fa1cffff : System RAM
>   fa1d0000-fa26ffff : reserved
>   fa270000-fa33ffff : System RAM
>   fa340000-fa4affff : reserved
>   fa4b0000-fa4bffff : System RAM
>   fa4c0000-fa57ffff : reserved
>   fa580000-fa72ffff : System RAM
>   fa730000-fa7cffff : reserved
>   fa7d0000-faa4ffff : System RAM
>   faa50000-faaeffff : reserved
>   faaf0000-fab7ffff : System RAM
>   fab80000-fac1ffff : reserved
>   fac20000-facaffff : System RAM
>   facb0000-fad4ffff : reserved
>   fad50000-fae1ffff : System RAM
>   fae20000-faebffff : reserved
>   faec0000-faf4ffff : System RAM
>   faf50000-fafeffff : reserved
>   faff0000-ffefffff : System RAM
>     fbe00000-ffdfffff : reserved
>   fff00000-fff4ffff : reserved
>   fff50000-fffaffff : System RAM
>   fffb0000-fffdffff : reserved
>     fffc0000-fffc0fff : reserved
>   fffe0000-ffffffff : System RAM
>     fffe0000-fffeffff : reserved
>   80000000000-8007fffffff : System RAM
>     800002bc000-800002bcfff : reserved
>     80000840000-8000084ffff : reserved
>     80000850000-8000085ffff : reserved
>     80000860000-8000086ffff : reserved
>     80000870000-8000087ffff : reserved
>     80000880000-8000088ffff : reserved
>     80000890000-8000089ffff : reserved
>     800008a0000-800008affff : reserved
>     800008b0000-800008bffff : reserved
>     800008c0000-800008cffff : reserved
>     800008d0000-800008dffff : reserved
>     800008e0000-800008effff : reserved
>     800008f0000-800008fffff : reserved
>     80000900000-8000090ffff : reserved
>     80000910000-8000091ffff : reserved
>     80000920000-8000092ffff : reserved
>     80000930000-8000093ffff : reserved
>     80000940000-8000094ffff : reserved
>     80000950000-8000095ffff : reserved
>     80000960000-8000096ffff : reserved
>     80000970000-8000097ffff : reserved
>     80000980000-8000098ffff : reserved
>     80000990000-8000099ffff : reserved
>     800009a0000-800009affff : reserved
>     800009b0000-800009bffff : reserved
>     800009c0000-800009cffff : reserved
>     800009d0000-800009dffff : reserved
>     800009e0000-800009effff : reserved
>     800009f0000-800009fffff : reserved
>     80000a00000-80000a0ffff : reserved
>     80000a10000-80000a1ffff : reserved
>     80000a20000-80000a2ffff : reserved
>     80000a30000-80000a3ffff : reserved
>     80000a40000-80000a4ffff : reserved
>   80100000000-807ffffffff : System RAM
>     807d8c10000-807fbffffff : reserved
>     807fc009000-807fc039fff : reserved
>     807fc03c000-807fc03ffff : reserved
>     807fc040000-807fc040fff : reserved
>     807fc041000-807fc044fff : reserved
>     807fc045000-807fc06afff : reserved
>     807fc06b000-807ffffffff : reserved
>   100002600000-100002600fff : ARMH0011:00
>     100002600000-100002600fff : ARMH0011:00 ARMH0011:00
>   100002620000-100002620fff : ARMH0011:01
>     100002620000-100002620fff : ARMH0011:01 ARMH0011:01
>   1000026c0000-1000026cffff : APMC0D0F:00
>     1000026c0000-1000026cffff : APMC0D0F:00 APMC0D0F:00
>   1000026d0000-1000026dffff : APMC0D07:02
>   1000026f0000-1000026fffff : APMC0D07:00
>   100002730000-100002730fff : arch_mem_timer
>   100002750000-10000275ffff : APMC0D0F:01
>     100002750000-10000275ffff : APMC0D0F:01 APMC0D0F:01
>   100002780000-10000278ffff : APMC0D0F:02
>     100002780000-10000278ffff : APMC0D0F:02 APMC0D0F:02
>   1000027b0000-1000027bffff : APMC0D07:01
>   1000027c0000-1000027c0fff : sbsa-gwdt.0
>     1000027c0000-1000027c0fff : sbsa-gwdt.0 sbsa-gwdt.0
>   1000027d0000-1000027d0fff : sbsa-gwdt.0
>     1000027d0000-1000027d0fff : sbsa-gwdt.0 sbsa-gwdt.0
>   100010000000-10001fffffff : ARMHC600:00
>     100012500000-1000164fffff : ARMHC600:00
>   10008c000a00-10008c000bff : ARMHD620:00
>   10008d000a00-10008d000bff : ARMHD620:04
>   100100000000-10010000ffff : GICD
>   100100140000-10010113ffff : GICR
>   200000000000-23ffdfffffff : PCI Bus 0002:00
>     200000000000-2000001fffff : PCI Bus 0002:01
>     200000200000-2000003fffff : PCI Bus 0002:02
>   23ffe0000000-23ffe001ffff : arm-smmu-v3.3.auto
>     23ffe0000000-23ffe0000dff : arm-smmu-v3.3.auto
>     23ffe0010000-23ffe0010dff : arm-smmu-v3.3.auto
>   23fff0000000-23ffffffffff : PCI ECAM
>   27fff0000000-27ffffffffff : pnp 00:00
>   280000000000-2bffdfffffff : PCI Bus 0004:00
>     280000000000-2800001fffff : PCI Bus 0004:01
>     280000200000-2800003fffff : PCI Bus 0004:03
>     280000400000-2800005fffff : PCI Bus 0004:04
>     280000600000-2800007fffff : PCI Bus 0004:05
>   2bffe0000000-2bffe001ffff : arm-smmu-v3.4.auto
>     2bffe0000000-2bffe0000dff : arm-smmu-v3.4.auto
>     2bffe0010000-2bffe0010dff : arm-smmu-v3.4.auto
>   2bfff0000000-2bffffffffff : PCI ECAM
>   2c0000000000-2fffdfffffff : PCI Bus 0005:00
>     2c0000000000-2c00001fffff : PCI Bus 0005:01
>     2c0000200000-2c00003fffff : PCI Bus 0005:02
>     2c0000400000-2c00005fffff : PCI Bus 0005:03
>     2c0000600000-2c00007fffff : PCI Bus 0005:04
>   2fffe0000000-2fffe001ffff : arm-smmu-v3.5.auto
>     2fffe0000000-2fffe0000dff : arm-smmu-v3.5.auto
>     2fffe0010000-2fffe0010dff : arm-smmu-v3.5.auto
>   2ffff0000000-2fffffffffff : PCI ECAM
>   300000000000-33ffdfffffff : PCI Bus 000c:00
>     300000000000-3000001fffff : PCI Bus 000c:01
>   33ffe0000000-33ffe001ffff : arm-smmu-v3.0.auto
>     33ffe0000000-33ffe0000dff : arm-smmu-v3.0.auto
>     33ffe0010000-33ffe0010dff : arm-smmu-v3.0.auto
>   33fff0000000-33ffffffffff : PCI ECAM
>   340000000000-37ffdfffffff : PCI Bus 000d:00
>     340000000000-3402ffffffff : PCI Bus 000d:01
>       340000000000-3402ffffffff : PCI Bus 000d:02
>         340000000000-3402ffffffff : PCI Bus 000d:03
>           340000000000-3401ffffffff : 000d:03:00.0
>           340200000000-3402001fffff : 000d:03:00.0
>   37ffe0000000-37ffe001ffff : arm-smmu-v3.1.auto
>     37ffe0000000-37ffe0000dff : arm-smmu-v3.1.auto
>     37ffe0010000-37ffe0010dff : arm-smmu-v3.1.auto
>   37fff0000000-37ffffffffff : PCI ECAM
>   3bfff0000000-3bffffffffff : pnp 00:00
>   3c0000000000-3fffdfffffff : PCI Bus 0000:00
>     3c0000000000-3c00001fffff : PCI Bus 0000:01
>   3fffe0000000-3fffe001ffff : arm-smmu-v3.2.auto
>     3fffe0000000-3fffe0000dff : arm-smmu-v3.2.auto
>     3fffe0010000-3fffe0010dff : arm-smmu-v3.2.auto
>   3ffff0000000-3fffffffffff : PCI ECAM
>   63fff0000000-63ffffffffff : pnp 00:00
>   67fff0000000-67ffffffffff : pnp 00:00
>   6bfff0000000-6bffffffffff : pnp 00:00
>   6ffff0000000-6fffffffffff : pnp 00:00
>   7bfff0000000-7bffffffffff : pnp 00:00
>   7ffff0000000-7fffffffffff : pnp 00:00
>
> working dmesg from same:
>
>   [   15.500492] [drm] GPU posting now...
>   [   15.504110] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
>   [   15.512654] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
>   [   15.521431] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
>   [   15.530230] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.539881] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.549528] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.549535] pcieport 000d:00:01.0: bridge window [io  0x1000-0x0fff] to [bus 01-03] add_size 1000
>   [   15.549544] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.549546] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.549549] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.596468] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.607594] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.618090] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: ignoring failure in optional allocation
>   [   15.618095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.628249] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.637806] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
>   [   15.646506] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
>   [   15.655205] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
>   [   15.660856] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
>   [   15.668069] pcieport 000d:00:01.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
>   [   15.676931] pcieport 000d:01:00.0: PCI bridge to [bus 02-03]
>   [   15.682586] pcieport 000d:01:00.0:   bridge window [mem 0x50000000-0x501fffff]
>   [   15.689804] pcieport 000d:01:00.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
>   [   15.698672] pcieport 000d:02:00.0: PCI bridge to [bus 03]
>   [   15.704067] pcieport 000d:02:00.0:   bridge window [mem 0x50000000-0x501fffff]
>   [   15.711285] pcieport 000d:02:00.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
>   [   15.720157] amdgpu 000d:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
>   [   15.729714] amdgpu 000d:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
>   [   15.738064] [drm] Detected VRAM RAM=8176M, BAR=8192M
>   [   15.743019] [drm] RAM width 128bits GDDR6
>   [   15.747258] [drm] amdgpu: 8176M of VRAM memory ready
>   [   15.752219] [drm] amdgpu: 15888M of GTT memory ready.
>   [   15.757297] [drm] GART: num cpu pages 131072, num gpu pages 131072
>   [   15.763558] [drm] PCIE GART of 512M enabled (table at 0x00000081FEB00000).
>   [   15.884845] [drm] Loading DMUB firmware via PSP: version=0x07002D00
>   [   16.129125] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
>
> From discussions with Ard it seems if the firmware had resized the BAR first,
> and then assigned the resources, there would be no issue. However there
> is no later firmware for the platform.
>
> While the PCI change has provoked this regression I suspect the amdgpu code
> could handle the failure to resize the BAR better and if it can't get
> what it wants just not initialise the driver. I did hit some cases while
> bisecting where the GPU just wasn't visible.
>
> I'm available to test patches and generate additional debug info so do
> let me know if there is anything I can do to help.
>
> Thanks,
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro

* Re: 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
  2025-10-22 16:51 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform Alex Bennée
  2025-10-22 17:08 ` Ard Biesheuvel
@ 2025-10-23 16:20 ` Bjorn Helgaas
  2025-10-23 17:24 ` Ilpo Järvinen
  2 siblings, 0 replies; 4+ messages in thread
From: Bjorn Helgaas @ 2025-10-23 16:20 UTC (permalink / raw)
  To: Alex Bennée
  Cc: linux-pci, Ard Biesheuvel, Lorenzo Pieralisi, Alex Deucher,
	Christian König, amd-gfx, Bjorn Helgaas, Ilpo Järvinen,
	D Scott Phillips, regressions

On Wed, Oct 22, 2025 at 05:51:24PM +0100, Alex Bennée wrote:
> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize its BAR, resulting in an SError and failure to
> boot:
> ...

#regzbot ^introduced: 2499f5348431 ("PCI: Rework optional resource handling")
#regzbot title: arm64 SError panic with amdgpu BAR resize

* Re: 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
  2025-10-22 16:51 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform Alex Bennée
  2025-10-22 17:08 ` Ard Biesheuvel
  2025-10-23 16:20 ` Bjorn Helgaas
@ 2025-10-23 17:24 ` Ilpo Järvinen
  2 siblings, 0 replies; 4+ messages in thread
From: Ilpo Järvinen @ 2025-10-23 17:24 UTC (permalink / raw)
  To: Alex Bennée
  Cc: linux-pci, Ard Biesheuvel, Lorenzo Pieralisi, Alex Deucher,
	Christian König, amd-gfx, Bjorn Helgaas, D Scott Phillips

[-- Attachment #1: Type: text/plain, Size: 5118 bytes --]

On Wed, 22 Oct 2025, Alex Bennée wrote:

> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize its BAR, resulting in an SError and failure to
> boot:
> 
>   [   15.348097] amdgpu 000d:03:00.0: amdgpu: detected ip block number 8 <vcn_v4_0>
>   [   15.355901] amdgpu 000d:03:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0>
>   [   15.363202] amdgpu 000d:03:00.0: amdgpu: detected ip block number 10 <mes_v11_0>
>   [   15.384163] amdgpu 000d:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
>   [   15.390434] amdgpu: ATOM BIOS: 113-4481LHS-UC1
>   [   15.400079] amdgpu 000d:03:00.0: amdgpu: CP RS64 enable
>   [   15.411830] amdgpu 000d:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
>   [   15.419932] amdgpu 000d:03:00.0: amdgpu: PCIE atomic ops is not supported
>   [   15.426719] [drm] GPU posting now...
>   [   15.430329] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
>   [   15.438871] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
>   [   15.447648] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
>   [   15.456452] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.466095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.475738] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.485386] pcieport 000d:00:01.0: bridge window [io  0x1000-0x0fff] to [bus 01-03] add_size 1000
>   [   15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.503809] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.512063] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.519796] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.528049] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.535787] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.545349] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.554911] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
>   [   15.563612] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
>   [   15.572313] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
>   [   15.577962] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
>   [   15.585175] pcieport 000d:00:01.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
>   [   15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref]
> 
> Failure to claim space for the bridge window...

Thanks for the report.

I was just looking at a similar oddity from another reporter, and thanks
to this second case with an "impossible" claim conflict, I was finally
able to zero in on a bug in the resize code that has been there since
BAR resizing was introduced.

It will take me a few days to come up with fixes that also address the
problems you'd likely hit next once this claim conflict bug is fixed.

> From discussions with Ard it seems if the firmware had resized the BAR first,
> and then assigned the resources, there would be no issue. However there
> is no latter firmware for the platform.

We want to make the kernel capable of considering BARs at their maximum
sizes eventually, so it wouldn't matter what the FW does. I've been
working in that direction for a while now, but I keep getting distracted
by fixing all these other bugs in the existing code. :-)
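
As a rough illustration of "maximum size" here, the sketch below assumes
the usual Resizable BAR encoding, where size index n stands for
2^(n + 20) bytes (index 0 = 1 MB). The toy_* names and the example mask
are made up for illustration and are not kernel API or values read from
this device.

```c
#include <stdint.h>

/* Hypothetical helper: size index n means 2^(n + 20) bytes (0 = 1 MB). */
uint64_t toy_rebar_size_to_bytes(unsigned int size_index)
{
	return 1ULL << (size_index + 20);
}

/*
 * Hypothetical helper: given a bitmask in which bit n means "size index n
 * is supported", return the largest supported BAR size in bytes, or 0 if
 * the mask is empty.
 */
uint64_t toy_rebar_max_bytes(uint32_t possible_sizes)
{
	int bit;

	/* Walk from the highest bit down and stop at the first one set. */
	for (bit = 31; bit >= 0; bit--) {
		if (possible_sizes & (1U << bit))
			return toy_rebar_size_to_bytes((unsigned int)bit);
	}
	return 0;
}
```

With this encoding, index 8 is the 256 MB that BAR 0 had before the
resize in the log, and index 13 is the 8 GB it was assigned afterwards.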

> While the PCI change has provoked this regression I suspect the amdgpu code
> could handle the failure to resize the BAR better and if it can't get
> what it wants just not initialise the driver. I did hit some cases while
> bisecting where the GPU just wasn't visible.

Indeed, things could be better on multiple levels.

Also, the entire pci_resize_resource() API is flawed in that it currently
can't restore all of the device's resources to their previous state if
it fails. It seems I might have to fix that now, as there seems to be no
other way to fix this claim conflict problem.

...And the fix will be a bit invasive, as I need to merge
pbus_reassign_bridge_resources() and pci_resize_resource() into a new
pci_release_and_resize_resource() API that handles rollback properly
in case of an error.
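
To make the rollback idea concrete, here is a minimal userspace sketch of
the save / try / restore pattern. It is not the planned kernel code:
struct toy_res, toy_resize_with_rollback() and the simple "pack from the
start of the window" placement are assumptions made for illustration,
and sizes are assumed to be powers of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_RES 8

/* Hypothetical stand-in for a resource; 'end' is inclusive, as in the log. */
struct toy_res {
	uint64_t start;
	uint64_t end;
};

static uint64_t toy_align_up(uint64_t v, uint64_t a)
{
	return (v + a - 1) & ~(a - 1);	/* 'a' must be a power of two */
}

/*
 * Resize res[idx] to new_size bytes and re-place all n resources inside
 * the window 'win'. Returns 0 on success. On failure returns -1 and
 * leaves every resource exactly as it was before the call.
 */
int toy_resize_with_rollback(struct toy_res *res, size_t n, size_t idx,
			     uint64_t new_size, const struct toy_res *win)
{
	struct toy_res saved[TOY_MAX_RES];
	uint64_t sizes[TOY_MAX_RES];
	uint64_t cursor = win->start;
	size_t i;

	assert(n <= TOY_MAX_RES && idx < n);

	/* Snapshot the current assignments before touching anything. */
	memcpy(saved, res, n * sizeof(*res));
	for (i = 0; i < n; i++)
		sizes[i] = (i == idx) ? new_size : res[i].end - res[i].start + 1;

	for (i = 0; i < n; i++) {
		uint64_t start = toy_align_up(cursor, sizes[i]);
		uint64_t end = start + sizes[i] - 1;

		if (end < start || end > win->end) {
			/* Does not fit: roll back all resources, not just this one. */
			memcpy(res, saved, n * sizeof(*res));
			return -1;
		}
		res[i].start = start;
		res[i].end = end;
		cursor = end + 1;
	}
	return 0;
}
```

Fed the BAR 0 and BAR 2 ranges from the log with an 8 GB request, the
sketch succeeds inside the enlarged window
[0x340000000000-0x3402ffffffff] and fails, restoring both BARs, inside
the original [0x340000000000-0x340017ffffff] window.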

> I'm available to test patches and generate additional debug info so do
> let me know if there is anything I can do to help.

Thanks, I'll send the fix series for testing once it is ready.

--
 i.
