[REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code

public inbox for linux-pci@vger.kernel.org

* [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
@ 2026-03-27 23:02 Jonas Höglund
  2026-03-28  8:46 ` Thorsten Leemhuis
  2026-03-30  7:21 ` Thorsten Leemhuis
  0 siblings, 2 replies; 11+ messages in thread
From: Jonas Höglund @ 2026-03-27 23:02 UTC
  To: stable; +Cc: linux-pci, regressions

[-- Attachment #1: Type: text/plain, Size: 2072 bytes --]

Hello,

I have an AMD GPU in an external Thunderbolt enclosure that recently
stopped working with the latest longterm kernel release.  The GPU in
question is an AMD RX 6750 XT.

    [...]
    amdgpu 0000:3e:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
    amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: releasing
    amdgpu 0000:3e:00.0: amdgpu: Problem resizing BAR0 (-16).
    amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned
    amdgpu 0000:3e:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
    amdgpu 0000:3e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
    resource: resource sanity check: requesting [mem 0x0000000000000000-0xffffffffffffffff], which spans more than PCI Bus 0000:00 [mem 0x000a0000-0x000bffff window]
    ------------[ cut here ]------------
    WARNING: CPU: 7 PID: 2260 at arch/x86/mm/pat/memtype.c:720 memtype_reserve_io+0xfd/0x110
    [...]

Searching for the issue I found this very similar report from last year:
https://lkml.org/lkml/2025/6/9/88

so I suppose this might be a re-regression of the same issue(?).


I checked out latest longterm (v6.18.20) and bisected, which took me to
commit b855d99 (upstream commit 3958bf16), which is
"PCI: Stop over-estimating bridge window size".


After the bisect I tried reverting the commit on top of v6.18.20 (going
back to the old way of calculating alignment), and this is sufficient
for the eGPU to dock properly again.

I've attached a longer excerpt of the kernel logs from a failing boot
(let me know if a full dmesg would be helpful and I can find somewhere
to upload it).  I've also attached an excerpt when connecting the eGPU
in a "good" case, since I figured the memory address ranges could be
useful.
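
As a rough aid for reading those address ranges, here is a minimal Python
sketch (not part of the original report) that computes the size of a
[mem start-end] range as it is printed in the log lines.  The helper name
mem_range_size and the regular expression are assumptions made for
illustration; the sample lines are abbreviated from the attached excerpts.

```python
import re

# Matches "[mem 0xSTART-0xEND" as it appears in the log excerpts.
# (Pattern and helper name are illustrative assumptions, not kernel code.)
RANGE_RE = re.compile(r"\[mem 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)")


def mem_range_size(line):
    """Return the size in bytes of the first [mem start-end] range, or None."""
    match = RANGE_RE.search(line)
    if match is None:
        return None
    start = int(match.group(1), 16)
    end = int(match.group(2), 16)
    return end - start + 1  # the printed end address is inclusive


# Sample lines abbreviated from the attached dmesg excerpts.
lines = [
    "amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned",
    "amdgpu 0000:3e:00.0: BAR 0 [mem 0x6030000000-0x603fffffff 64bit pref]: assigned",
    "pcieport 0000:3a:00.0: bridge window [mem 0x6030000000-0x604fffffff 64bit pref]",
]

for line in lines:
    size = mem_range_size(line)
    print(f"{size // (1 << 20):6d} MiB  {line}")
```

With these sample lines, BAR 2 comes out at 2 MiB, the assigned BAR 0 at
256 MiB and the 3a:00.0 bridge window at 512 MiB, while the BAR 0 resize
attempted in the "good" log asks for 0x400000000 bytes (16 GiB).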


Hardware:
  Machine: Dell XPS 13 9310 (0991)
  GPU: AMD RX 6750 XT
  Dock: EXP GDC TH3P4G3

Distribution: NixOS
Architecture: x86-64

#regzbot introduced: b855d99

Let me know if any additional information would be helpful.

Thanks,
Jonas

[-- Attachment #2: amdgpu-egpu-crash-excerpt.dmesg --]
[-- Type: application/octet-stream, Size: 11576 bytes --]

Mar 25 21:04:36 ar kernel: amdgpu: Virtual CRAT table created for CPU
Mar 25 21:04:36 ar kernel: amdgpu: Topology: Add CPU node
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: enabling device (0000 -> 0003)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1DA2:0xE445 0xC5).
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: register mmio base: 0x74000000
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: register mmio size: 1048576
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 1 <gmc_v10_0_0> (gmc_v10_0)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 2 <ih_v5_0_0> (navi10_ih)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 3 <psp_v11_0_0> (psp)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 4 <smu_v11_0_0> (smu)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 6 <gfx_v10_0_0> (gfx_v10_0)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 7 <sdma_v5_2_0> (sdma_v5_2)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 8 <vcn_v3_0_0> (vcn_v3_0)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: detected ip block number 9 <jpeg_v3_0_0> (jpeg_v3_0)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: Fetched VBIOS from ROM BAR
Mar 25 21:04:36 ar kernel: amdgpu: ATOM BIOS: 113-D5122200-S05
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: PCIE atomic ops is not supported
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: GPU posting now...
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: releasing
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: Problem resizing BAR0 (-16).
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
Mar 25 21:04:36 ar kernel: resource: resource sanity check: requesting [mem 0x0000000000000000-0xffffffffffffffff], which spans more than PCI Bus 0000:00 [mem 0x000a0000-0x000bffff window]
Mar 25 21:04:36 ar kernel: ------------[ cut here ]------------
Mar 25 21:04:36 ar kernel: WARNING: CPU: 7 PID: 2260 at arch/x86/mm/pat/memtype.c:720 memtype_reserve_io+0xfd/0x110
Mar 25 21:04:36 ar kernel: Modules linked in: typec_thunderbolt amdgpu(+) snd_hda_codec_atihdmi amdxcp drm_panel_backlight_quirks fuse rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_set ip_set nft_chain_nat xt_addrtype nft_compat overlay ccm michael_mic af_packet nf_log_syslog nft_log nft_ct cmac algif_hash nft_fib_inet algif_skcipher nft_fib_ipv4 nft_fib_ipv6 af_alg nft_fib bnep nf_tables msr snd_ctl_led snd_soc_skl_hda_dsp snd_soc_intel_sof_board_helpers snd_soc_intel_hda_dsp_common snd_sof_probes xe snd_hda_codec_intelhdmi nls_iso8859_1 nls_cp437 snd_hda_codec_alc269 drm_gpuvm vfat snd_hda_scodec_component fat drm_gpusvm_helper snd_hda_codec_realtek_lib gpu_sched drm_exec snd_hda_codec_generic drm_suballoc_helper drm_ttm_helper configfs snd_soc_dmic snd_hda_intel qrtr_mhi snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel snd_sof_intel_hda_sdw_bpt snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda
Mar 25 21:04:36 ar kernel:  snd_hda_codec_hdmi soundwire_cadence qrtr snd_sof_pci snd_sof_xtensa_dsp ath11k_pci snd_sof ath11k snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi_intel_sdca_quirks soundwire_generic_allocation snd_soc_acpi soundwire_bus snd_soc_sdca mac80211 crc8 snd_soc_avs snd_soc_hda_codec snd_hda_ext_core snd_hda_codec dell_pc joydev mousedev snd_hda_core hid_sensor_als snd_intel_dspcfg hid_sensor_trigger snd_intel_sdw_acpi industrialio_triggered_buffer snd_hwdep kfifo_buf hid_sensor_iio_common snd_soc_core industrialio cfg80211 snd_compress hci_uart ac97_bus hid_sensor_hub snd_pcm_dmaengine uvcvideo btqca snd_pcm ofpart videobuf2_vmalloc btbcm wacom uvc cmdlinepart iTCO_wdt videobuf2_memops intel_uncore_frequency hid_multitouch videobuf2_v4l2 intel_pmc_bxt dell_laptop pwrseq_core dell_wmi snd_timer intel_ishtp_hid intel_uncore_frequency_common spi_nor usbhid bluetooth mtd hid_generic 8250_dw dell_wmi_ddv x86_pkg_temp_thermal qmi_helpers ucsi_acpi dell_smbios snd i2c_i801 videobuf2_common mhi ecdh_generic
Mar 25 21:04:36 ar kernel:  typec_ucsi i2c_smbus dcdbas intel_powerclamp spi_intel_pci dell_wmi_sysman mei_pxp mei_hdcp i915 coretemp firmware_attributes_class intel_rapl_msr polyval_clmulni videodev dell_wmi_descriptor ghash_clmulni_intel rapl wmi_bmof dell_smm_hwmon intel_cstate intel_uncore mc roles efi_pstore soundcore spi_intel i2c_mux typec libarc4 rfkill i2c_hid_acpi i2c_hid ecc hid drm_buddy ttm drm_display_helper intel_pmc_core processor_thermal_device_pci_legacy processor_thermal_device processor_thermal_wt_hint platform_temperature_control processor_thermal_soc_slider intel_skl_int3472_tps68470 tps68470_regulator cec platform_profile pmt_telemetry intel_gtt clk_tps68470 pmt_discovery processor_thermal_rfim i2c_algo_bit tiny_power_button intel_skl_int3472_discrete intel_hid pmt_class processor_thermal_rapl intel_oc_wdt int3400_thermal video watchdog int3403_thermal rtc_cmos intel_skl_int3472_common battery button sparse_keymap pinctrl_tigerlake intel_pmc_ssram_telemetry acpi_thermal_rel intel_rapl_common mei_me
Mar 25 21:04:36 ar kernel:  intel_ish_ipc processor_thermal_wt_req acpi_tad ac intel_lpss_pci acpi_pad processor_thermal_power_floor intel_ishtp igen6_edac wmi intel_vsec intel_lpss mei processor_thermal_mbox int340x_thermal_zone idma64 evdev 8250_pci edac_core virt_dma mac_hid intel_soc_dts_iosf serio_raw sch_fq_codel kvm_intel kvm xt_nat x_tables nf_nat br_netfilter nf_conntrack bridge nf_defrag_ipv6 nf_defrag_ipv4 uinput veth loop irqbypass stp cpufreq_powersave llc nfnetlink efivarfs dmi_sysfs autofs4 ext4 crc16 mbcache jbd2 dm_crypt encrypted_keys trusted asn1_encoder tee input_leds led_class atkbd rtsx_pci_sdmmc libps2 mmc_core vivaldi_fmap nvme nvme_core aesni_intel i8042 nvme_keyring xhci_pci nvme_auth thunderbolt serio xhci_hcd rtsx_pci hkdf dm_mod dax
Mar 25 21:04:36 ar kernel: CPU: 7 UID: 0 PID: 2260 Comm: (udev-worker) Not tainted 6.18.19 #1-NixOS PREEMPT(lazy)
Mar 25 21:04:36 ar kernel: Hardware name: Dell Inc. XPS 13 9310/0DMPXV, BIOS 3.34.0 08/08/2025
Mar 25 21:04:36 ar kernel: RIP: 0010:memtype_reserve_io+0xfd/0x110
Mar 25 21:04:36 ar kernel: Code: fb ff ff b8 f0 ff ff ff eb 88 8b 54 24 04 4c 89 ee 48 89 df e8 04 fe ff ff 85 c0 75 db 8b 54 24 04 41 89 16 e9 69 ff ff ff 90 <0f> 0b 90 e9 49 ff ff ff e8 96 54 d8 00 66 0f 1f 44 00 00 90 90 90
Mar 25 21:04:36 ar kernel: RSP: 0018:ffffd3cc0401ba40 EFLAGS: 00010286
Mar 25 21:04:36 ar kernel: RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000027
Mar 25 21:04:36 ar kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffba9204f0
Mar 25 21:04:36 ar kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffdfff
Mar 25 21:04:36 ar kernel: R10: ffffffffb9e60fc0 R11: ffffd3cc0401b8c8 R12: 0000000000000001
Mar 25 21:04:36 ar kernel: R13: 0000000000000000 R14: ffffd3cc0401ba8c R15: ffff8b9ef56d0d38
Mar 25 21:04:36 ar kernel: FS:  00007f61c58d2480(0000) GS:ffff8ba634ed0000(0000) knlGS:0000000000000000
Mar 25 21:04:36 ar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 21:04:36 ar kernel: CR2: 00007f61c5d20000 CR3: 0000000146c7d004 CR4: 0000000000f70ef0
Mar 25 21:04:36 ar kernel: PKRU: 55555554
Mar 25 21:04:36 ar kernel: Call Trace:
Mar 25 21:04:36 ar kernel:  <TASK>
Mar 25 21:04:36 ar kernel:  arch_io_reserve_memtype_wc+0x31/0x50
Mar 25 21:04:36 ar kernel:  amdgpu_bo_init+0x3e/0x90 [amdgpu]
Mar 25 21:04:36 ar kernel:  ? amdgpu_gmc_get_vbios_allocations+0xa9/0x140 [amdgpu]
Mar 25 21:04:36 ar kernel:  gmc_v10_0_sw_init+0x352/0x5c0 [amdgpu]
Mar 25 21:04:36 ar kernel:  amdgpu_device_init.cold+0x17ce/0x251a [amdgpu]
Mar 25 21:04:36 ar kernel:  ? pci_bus_read_config_word+0x4c/0x80
Mar 25 21:04:36 ar kernel:  amdgpu_driver_load_kms+0x13/0x70 [amdgpu]
Mar 25 21:04:36 ar kernel:  amdgpu_pci_probe+0x1e2/0x4a0 [amdgpu]
Mar 25 21:04:36 ar kernel:  local_pci_probe+0x3f/0x80
Mar 25 21:04:36 ar kernel:  pci_device_probe+0xd6/0x270
Mar 25 21:04:36 ar kernel:  ? sysfs_do_create_link_sd+0x6d/0xd0
Mar 25 21:04:36 ar kernel:  really_probe+0xde/0x340
Mar 25 21:04:36 ar kernel:  ? pm_runtime_barrier+0x55/0x90
Mar 25 21:04:36 ar kernel:  __driver_probe_device+0x78/0x140
Mar 25 21:04:36 ar kernel:  driver_probe_device+0x1f/0xa0
Mar 25 21:04:36 ar kernel:  ? __pfx___driver_attach+0x10/0x10
Mar 25 21:04:36 ar kernel:  __driver_attach+0xcb/0x1e0
Mar 25 21:04:36 ar kernel:  bus_for_each_dev+0x85/0xd0
Mar 25 21:04:36 ar kernel:  bus_add_driver+0x111/0x2a0
Mar 25 21:04:36 ar kernel:  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
Mar 25 21:04:36 ar kernel:  driver_register+0x75/0xe0
Mar 25 21:04:36 ar kernel:  ? amdgpu_init+0x36/0xff0 [amdgpu]
Mar 25 21:04:36 ar kernel:  do_one_initcall+0x5b/0x310
Mar 25 21:04:36 ar kernel:  do_init_module+0xb1/0x2d0
Mar 25 21:04:36 ar kernel:  __do_sys_init_module+0x1a8/0x1e0
Mar 25 21:04:36 ar kernel:  do_syscall_64+0xb6/0x7e0
Mar 25 21:04:36 ar kernel:  ? clear_bhb_loop+0x50/0xa0
Mar 25 21:04:36 ar kernel:  entry_SYSCALL_64_after_hwframe+0x77/0x7f
Mar 25 21:04:36 ar kernel: RIP: 0033:0x7f61c57267ce
Mar 25 21:04:36 ar kernel: Code: 48 8b 0d 35 56 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 02 56 0d 00 f7 d8 64 89 01 48
Mar 25 21:04:36 ar kernel: RSP: 002b:00007ffeeb35a288 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Mar 25 21:04:36 ar kernel: RAX: ffffffffffffffda RBX: 00005583ff8424e0 RCX: 00007f61c57267ce
Mar 25 21:04:36 ar kernel: RDX: 00007f61c5853304 RSI: 0000000001ebae98 RDI: 00007f61c0d2c010
Mar 25 21:04:36 ar kernel: RBP: 00007ffeeb35a2d0 R08: 0000000000000000 R09: 0000000000000000
Mar 25 21:04:36 ar kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f61c0d2c010
Mar 25 21:04:36 ar kernel: R13: 0000000000020000 R14: 00007f61c5853304 R15: 00005583ff9df9d0
Mar 25 21:04:36 ar kernel:  </TASK>
Mar 25 21:04:36 ar kernel: ---[ end trace 0000000000000000 ]---
Mar 25 21:04:36 ar kernel: [drm:amdgpu_bo_init [amdgpu]] *ERROR* Unable to set WC memtype for the aperture base
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: sw_init of IP block <gmc_v10_0> failed -22
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: amdgpu_device_ip_init failed
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: Fatal error during GPU init
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: amdgpu: amdgpu: finishing device.
Mar 25 21:04:36 ar kernel: amdgpu 0000:3e:00.0: probe with driver amdgpu failed with error -22

[-- Attachment #3: amdgpu-egpu-good-excerpt.dmesg --]
[-- Type: application/octet-stream, Size: 11928 bytes --]

[   75.926298] amdgpu: Virtual CRAT table created for CPU
[   75.926351] amdgpu: Topology: Add CPU node
[   75.926621] amdgpu 0000:3e:00.0: enabling device (0000 -> 0003)
[   75.926839] amdgpu 0000:3e:00.0: amdgpu: initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1DA2:0xE445 0xC5).
[   75.930099] amdgpu 0000:3e:00.0: amdgpu: register mmio base: 0x74000000
[   75.930105] amdgpu 0000:3e:00.0: amdgpu: register mmio size: 1048576
[   75.939097] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (nv_common)
[   75.939107] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 1 <gmc_v10_0_0> (gmc_v10_0)
[   75.939114] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 2 <ih_v5_0_0> (navi10_ih)
[   75.939120] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 3 <psp_v11_0_0> (psp)
[   75.939125] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 4 <smu_v11_0_0> (smu)
[   75.939130] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
[   75.939136] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 6 <gfx_v10_0_0> (gfx_v10_0)
[   75.939141] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 7 <sdma_v5_2_0> (sdma_v5_2)
[   75.939146] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 8 <vcn_v3_0_0> (vcn_v3_0)
[   75.939151] amdgpu 0000:3e:00.0: amdgpu: detected ip block number 9 <jpeg_v3_0_0> (jpeg_v3_0)
[   76.024083] amdgpu 0000:3e:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   76.024090] amdgpu: ATOM BIOS: 113-D5122200-S05
[   76.046087] amdgpu 0000:3e:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[   76.046106] amdgpu 0000:3e:00.0: amdgpu: PCIE atomic ops is not supported
[   76.046116] amdgpu 0000:3e:00.0: amdgpu: GPU posting now...
[   76.046184] amdgpu 0000:3e:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   76.046216] amdgpu 0000:3e:00.0: BAR 2 [mem 0x6040000000-0x60401fffff 64bit pref]: releasing
[   76.046220] amdgpu 0000:3e:00.0: BAR 0 [mem 0x6030000000-0x603fffffff 64bit pref]: releasing
[   76.046248] pcieport 0000:3d:00.0: bridge window [mem 0x6030000000-0x60401fffff 64bit pref]: releasing
[   76.046251] pcieport 0000:3c:00.0: bridge window [mem 0x6030000000-0x60401fffff 64bit pref]: releasing
[   76.046254] pcieport 0000:3b:01.0: bridge window [mem 0x6030000000-0x60401fffff 64bit pref]: releasing
[   76.046256] pcieport 0000:3a:00.0: bridge window [mem 0x6030000000-0x604fffffff 64bit pref]: was not released (still contains assigned resources)
[   76.046259] pcieport 0000:00:07.2: bridge window [mem 0x6030000000-0x6051ffffff 64bit pref]: was not released (still contains assigned resources)
[   76.046264] pcieport 0000:3a:00.0: Assigned bridge window [mem 0x6030000000-0x604fffffff 64bit pref] to [bus 3b-71] cannot fit 0x600000000 required for 0000:3d:00.0 bridging to [bus 3e]
[   76.046270] pcieport 0000:3d:00.0: bridge window [mem size 0x400200000 64bit pref] to [bus 3e] requires relaxed alignment rules
[   76.046274] pcieport 0000:3a:00.0: Assigned bridge window [mem 0x6030000000-0x604fffffff 64bit pref] to [bus 3b-71] cannot fit 0x800000000 required for 0000:3c:00.0 bridging to [bus 3d-3e]
[   76.046277] pcieport 0000:3c:00.0: bridge window [mem size 0x400200000 64bit pref] to [bus 3d-3e] requires relaxed alignment rules
[   76.046280] pcieport 0000:3a:00.0: Assigned bridge window [mem 0x6030000000-0x604fffffff 64bit pref] to [bus 3b-71] cannot fit 0x800000000 required for 0000:3b:01.0 bridging to [bus 3c-56]
[   76.046283] pcieport 0000:3b:01.0: bridge window [mem size 0x400200000 64bit pref] to [bus 3c-56] requires relaxed alignment rules
[   76.046285] pcieport 0000:3a:00.0: Assigned bridge window [mem 0x6030000000-0x604fffffff 64bit pref] to [bus 3b-71] cannot fit 0x800000000 required for 0000:3b:01.0 bridging to [bus 3c-56]
[   76.046288] pcieport 0000:3b:01.0: bridge window [mem size 0x400200000 64bit pref] to [bus 3c-56] requires relaxed alignment rules
[   76.046299] pcieport 0000:3b:01.0: bridge window [mem size 0x400200000 64bit pref]: can't assign; no space
[   76.046301] pcieport 0000:3b:01.0: bridge window [mem size 0x400200000 64bit pref]: failed to assign
[   76.046305] pcieport 0000:3b:01.0: bridge window [mem size 0x400200000 64bit pref]: can't assign; no space
[   76.046307] pcieport 0000:3b:01.0: bridge window [mem size 0x400200000 64bit pref]: failed to assign
[   76.046310] pcieport 0000:3c:00.0: bridge window [mem size 0x400200000 64bit pref]: can't assign; no space
[   76.046312] pcieport 0000:3c:00.0: bridge window [mem size 0x400200000 64bit pref]: failed to assign
[   76.046315] pcieport 0000:3c:00.0: bridge window [mem size 0x400200000 64bit pref]: can't assign; no space
[   76.046316] pcieport 0000:3c:00.0: bridge window [mem size 0x400200000 64bit pref]: failed to assign
[   76.046319] pcieport 0000:3d:00.0: bridge window [mem size 0x400200000 64bit pref]: can't assign; no space
[   76.046321] pcieport 0000:3d:00.0: bridge window [mem size 0x400200000 64bit pref]: failed to assign
[   76.046323] pcieport 0000:3d:00.0: bridge window [mem size 0x400200000 64bit pref]: can't assign; no space
[   76.046325] pcieport 0000:3d:00.0: bridge window [mem size 0x400200000 64bit pref]: failed to assign
[   76.046329] amdgpu 0000:3e:00.0: BAR 0 [mem size 0x400000000 64bit pref]: can't assign; no space
[   76.046331] amdgpu 0000:3e:00.0: BAR 0 [mem size 0x400000000 64bit pref]: failed to assign
[   76.046334] amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned
[   76.046355] amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: releasing
[   76.046357] amdgpu 0000:3e:00.0: BAR 0 [mem size 0x400000000 64bit pref]: can't assign; no space
[   76.046359] amdgpu 0000:3e:00.0: BAR 0 [mem size 0x400000000 64bit pref]: failed to assign
[   76.046362] amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned
[   76.046383] pcieport 0000:00:07.2: PCI bridge to [bus 3a-71]
[   76.046386] pcieport 0000:00:07.2:   bridge window [io  0x5000-0x7fff]
[   76.046391] pcieport 0000:00:07.2:   bridge window [mem 0x74000000-0x8a3fffff]
[   76.046394] pcieport 0000:00:07.2:   bridge window [mem 0x6030000000-0x6051ffffff 64bit pref]
[   76.046400] pcieport 0000:3b:01.0: PCI bridge to [bus 3c-56]
[   76.046404] pcieport 0000:3b:01.0:   bridge window [io  0x5000-0x5fff]
[   76.046411] pcieport 0000:3b:01.0:   bridge window [mem 0x74000000-0x7f1fffff]
[   76.046417] pcieport 0000:3b:01.0:   bridge window [mem 0x6030000000-0x60401fffff 64bit pref]
[   76.046427] pcieport 0000:3c:00.0: PCI bridge to [bus 3d-3e]
[   76.046431] pcieport 0000:3c:00.0:   bridge window [io  0x5000-0x5fff]
[   76.046441] pcieport 0000:3c:00.0:   bridge window [mem 0x74000000-0x7f0fffff]
[   76.046447] pcieport 0000:3c:00.0:   bridge window [mem 0x6030000000-0x60401fffff 64bit pref]
[   76.046460] pcieport 0000:3d:00.0: PCI bridge to [bus 3e]
[   76.046464] pcieport 0000:3d:00.0:   bridge window [io  0x5000-0x5fff]
[   76.046473] pcieport 0000:3d:00.0:   bridge window [mem 0x74000000-0x7f0fffff]
[   76.046480] pcieport 0000:3d:00.0:   bridge window [mem 0x6030000000-0x60401fffff 64bit pref]
[   76.046502] amdgpu 0000:3e:00.0: amdgpu: Not enough PCI address space for a large BAR.
[   76.046505] amdgpu 0000:3e:00.0: BAR 0 [mem 0x6030000000-0x603fffffff 64bit pref]: assigned
[   76.046531] amdgpu 0000:3e:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   76.046534] amdgpu 0000:3e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   76.046567] [drm] Detected VRAM RAM=12272M, BAR=256M
[   76.046570] [drm] RAM width 192bits GDDR6
[   76.046919] amdgpu 0000:3e:00.0: amdgpu: amdgpu: 12272M of VRAM memory ready
[   76.046923] amdgpu 0000:3e:00.0: amdgpu: amdgpu: 15905M of GTT memory ready.
[   76.046957] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   76.047164] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   78.266783] amdgpu 0000:3e:00.0: amdgpu: STB initialized to 2048 entries
[   78.266958] amdgpu 0000:3e:00.0: amdgpu: [drm] Loading DMUB firmware via PSP: version=0x02020021
[   78.267759] [drm] use_doorbell being set to: [true]
[   78.267802] [drm] use_doorbell being set to: [true]
[   78.267835] amdgpu 0000:3e:00.0: amdgpu: [VCN instance 0] Found VCN firmware Version ENC: 1.33 DEC: 4 VEP: 0 Revision: 14
[   78.299944] thunderbolt 1-1: new device found, vendor=0x8086 device=0x2
[   78.299957] thunderbolt 1-1: Intel Tamales Module 2
[   78.335698] amdgpu 0000:3e:00.0: amdgpu: reserve 0xa00000 from 0x82fd000000 for PSP TMR
[   78.439353] amdgpu 0000:3e:00.0: amdgpu: RAS: optional ras ta ucode is not available
[   78.453657] amdgpu 0000:3e:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[   78.453698] amdgpu 0000:3e:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413f00 (65.63.0)
[   78.453704] amdgpu 0000:3e:00.0: amdgpu: SMU driver if version not matched
[   78.453755] amdgpu 0000:3e:00.0: amdgpu: use vbios provided pptable
[   78.512775] amdgpu 0000:3e:00.0: amdgpu: SMU is initialized successfully!
[   78.514425] amdgpu 0000:3e:00.0: amdgpu: [drm] Display Core v3.2.351 initialized on DCN 3.0
[   78.514432] amdgpu 0000:3e:00.0: amdgpu: [drm] DP-HDMI FRL PCON supported
[   78.516222] amdgpu 0000:3e:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x02020021
[   78.542712] snd_hda_intel 0000:3e:00.1: bound 0000:3e:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[   78.793741] [drm] DM_MST: Differing MST start on aconnector: 00000000973e8d11 [id: 124]
[   78.798456] amdgpu 0000:3e:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
[   78.889208] amdgpu: HMM registered 12272MB device memory
[   78.891864] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   78.891919] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[   78.892495] amdgpu: Virtual CRAT table created for GPU
[   78.892876] amdgpu: Topology: Add dGPU node [0x73df:0x1002]
[   78.892883] kfd kfd: amdgpu: added device 1002:73df
[   78.892931] amdgpu 0000:3e:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 40
[   78.892942] amdgpu 0000:3e:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[   78.892948] amdgpu 0000:3e:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[   78.892952] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[   78.892955] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[   78.892959] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[   78.892963] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[   78.892966] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[   78.892970] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[   78.892973] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[   78.892977] amdgpu 0000:3e:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[   78.892981] amdgpu 0000:3e:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[   78.892985] amdgpu 0000:3e:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[   78.892989] amdgpu 0000:3e:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[   78.892993] amdgpu 0000:3e:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[   78.892997] amdgpu 0000:3e:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[   78.893002] amdgpu 0000:3e:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[   78.893006] amdgpu 0000:3e:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[   78.903957] amdgpu 0000:3e:00.0: amdgpu: Using BOCO for runtime pm
[   78.906118] amdgpu 0000:3e:00.0: [drm] Registered 6 planes with drm panic
[   78.906125] [drm] Initialized amdgpu 3.64.0 for 0000:3e:00.0 on minor 0


* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-03-27 23:02 [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code Jonas Höglund
@ 2026-03-28  8:46 ` Thorsten Leemhuis
  2026-03-28 16:09   ` Jonas Höglund
  2026-03-30  7:21 ` Thorsten Leemhuis
  1 sibling, 1 reply; 11+ messages in thread
From: Thorsten Leemhuis @ 2026-03-28  8:46 UTC
  To: Jonas Höglund; +Cc: linux-pci, regressions, stable

On 3/28/26 00:02, Jonas Höglund wrote:
> 
> I have an AMD GPU in an external Thunderbolt enclosure that recently
> stopped working with the latest longterm kernel release.  The GPU in
> question is an AMD RX 6750 XT.

Thx for the report. One important piece of information is missing
afaics: Does the problem happen with the latest mainline (say 7.0-rc5)
as well? The answer determines how this will be dealt with.

Ciao, Thorsten

>     [...]
>     amdgpu 0000:3e:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
>     amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: releasing
>     amdgpu 0000:3e:00.0: amdgpu: Problem resizing BAR0 (-16).
>     amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned
>     amdgpu 0000:3e:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
>     amdgpu 0000:3e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
>     resource: resource sanity check: requesting [mem 0x0000000000000000-0xffffffffffffffff], which spans more than PCI Bus 0000:00 [mem 0x000a0000-0x000bffff window]
>     ------------[ cut here ]------------
>     WARNING: CPU: 7 PID: 2260 at arch/x86/mm/pat/memtype.c:720 memtype_reserve_io+0xfd/0x110
>     [...]
> 
> Searching for the issue I found this very similar report from last year:
> https://lkml.org/lkml/2025/6/9/88
> 
> so I suppose this might be a re-regression of the same issue(?).
> 
> 
> I checked out latest longterm (v6.18.20) and bisected, which took me to
> commit b855d99 (upstream commit 3958bf16), which is
> "PCI: Stop over-estimating bridge window size".
> 
> 
> After the bisect I tried reverting the commit on top of v6.18.20 (going
> back to the old way of calculating alignment), and this is sufficient
> for the eGPU to dock properly again.
> 
> I've attached a longer excerpt of the kernel logs from a failing boot
> (let me know if a full dmesg would be helpful and I can find somewhere
> to upload it).  I've also attached an excerpt when connecting the eGPU
> in a "good" case, since I figured the memory address ranges could be
> useful.
> 
> 
> Hardware:
>   Machine: Dell XPS 13 9310 (0991)
>   GPU: AMD RX 6750 XT
>   Dock: EXP GDC TH3P4G3
> 
> Distribution: NixOS
> Architecture: x86-64
> 
> #regzbot introduced: b855d99
> 
> Let me know if any additional information would be helpful.
> 
> Thanks,
> Jonas



* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-03-28  8:46 ` Thorsten Leemhuis
@ 2026-03-28 16:09   ` Jonas Höglund
  0 siblings, 0 replies; 11+ messages in thread
From: Jonas Höglund @ 2026-03-28 16:09 UTC
  To: Thorsten Leemhuis; +Cc: linux-pci, regressions, stable

On Sat, 28 Mar 2026, at 08:46, Thorsten Leemhuis wrote:
> Thx for the report. One important piece of information is missing
> afaics: Does the problem happen with the latest mainline (say 7.0-rc5)
> as well? The answer determines how this will be dealt with.

My bad! I've tested against 7.0.0-rc5 now, and it seems the problem
persists there.

Thanks,
Jonas


* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-03-27 23:02 [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code Jonas Höglund
  2026-03-28  8:46 ` Thorsten Leemhuis
@ 2026-03-30  7:21 ` Thorsten Leemhuis
  2026-03-30 14:33   ` Ilpo Järvinen
  1 sibling, 1 reply; 11+ messages in thread
From: Thorsten Leemhuis @ 2026-03-30  7:21 UTC
  To: Ilpo Järvinen, Bjorn Helgaas
  Cc: linux-pci, regressions, Jonas Höglund

[adding author and committer to list of recipients while dropping stable
from CC, as the problem seems to happen in mainline, too]

On 3/28/26 00:02, Jonas Höglund wrote:
> 
> I have an AMD GPU in an external Thunderbolt enclosure that recently
> stopped working with the latest longterm kernel release.  The GPU in
> question is an AMD RX 6750 XT.
> 
>     [...]
>     amdgpu 0000:3e:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
>     amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: releasing
>     amdgpu 0000:3e:00.0: amdgpu: Problem resizing BAR0 (-16).
>     amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned
>     amdgpu 0000:3e:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
>     amdgpu 0000:3e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
>     resource: resource sanity check: requesting [mem 0x0000000000000000-0xffffffffffffffff], which spans more than PCI Bus 0000:00 [mem 0x000a0000-0x000bffff window]
>     ------------[ cut here ]------------
>     WARNING: CPU: 7 PID: 2260 at arch/x86/mm/pat/memtype.c:720 memtype_reserve_io+0xfd/0x110
>     [...]
> 
> Searching for the issue I found this very similar report from last year:
> https://lkml.org/lkml/2025/6/9/88
> 
> so I suppose this might be a re-regression of the same issue(?).
> 
> 
> I checked out latest longterm (v6.18.20) and bisected, which took me to
> commit b855d99 (upstream commit 3958bf16), which is
> "PCI: Stop over-estimating bridge window size".

Ilpo, Bjorn: Jonas in a subthread later confirmed that this problem
happens with 7.0-rc5, too, so it's likely something for you.

Side note: I noticed the "PCI: Improve head free space usage" series
referred to the commit in question; wondering if that is related:
https://lore.kernel.org/all/20260324165633.4583-1-ilpo.jarvinen@linux.intel.com/

Ciao, Thorsten

> After the bisect I tried reverting the commit on top of v6.18.20 (going
> back to the old way of calculating alignment), and this is sufficient
> for the eGPU to dock properly again.
> 
> I've attached a longer excerpt of the kernel logs from a failing boot
> (let me know if a full dmesg would be helpful and I can find somewhere
> to upload it).  I've also attached an excerpt when connecting the eGPU
> in a "good" case, since I figured the memory address ranges could be
> useful.
> 
> 
> Hardware:
>   Machine: Dell XPS 13 9310 (0991)
>   GPU: AMD RX 6750 XT
>   Dock: EXP GDC TH3P4G3
> 
> Distribution: NixOS
> Architecture: x86-64
> 
> #regzbot introduced: b855d99
> 
> Let me know if any additional information would be helpful.
> 
> Thanks,
> Jonas


* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-03-30  7:21 ` Thorsten Leemhuis
@ 2026-03-30 14:33   ` Ilpo Järvinen
  2026-03-30 15:50     ` Jonas Höglund
  0 siblings, 1 reply; 11+ messages in thread
From: Ilpo Järvinen @ 2026-03-30 14:33 UTC (permalink / raw)
  To: Thorsten Leemhuis, Jonas Höglund
  Cc: Bjorn Helgaas, linux-pci, regressions

On Mon, 30 Mar 2026, Thorsten Leemhuis wrote:

> [adding author and committer to list of recipients while dropping stable
> from CC, as the problem seems to happen in mainline, too]
> 
> On 3/28/26 00:02, Jonas Höglund wrote:
> > 
> > I have an AMD GPU in an external Thunderbolt enclosure that recently
> > stopped working with the latest longterm kernel release.  The GPU in
> > question is an AMD RX 6750 XT.
> > 
> >     [...]
> >     amdgpu 0000:3e:00.0: amdgpu: vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
> >     amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: releasing
> >     amdgpu 0000:3e:00.0: amdgpu: Problem resizing BAR0 (-16).
> >     amdgpu 0000:3e:00.0: BAR 2 [mem 0x74200000-0x743fffff 64bit pref]: assigned
> >     amdgpu 0000:3e:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
> >     amdgpu 0000:3e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
> >     resource: resource sanity check: requesting [mem 0x0000000000000000-0xffffffffffffffff], which spans more than PCI Bus 0000:00 [mem 0x000a0000-0x000bffff window]
> >     ------------[ cut here ]------------
> >     WARNING: CPU: 7 PID: 2260 at arch/x86/mm/pat/memtype.c:720 memtype_reserve_io+0xfd/0x110
> >     [...]
> > 
> > Searching for the issue I found this very similar report from last year:
> > https://lkml.org/lkml/2025/6/9/88
> > 
> > so I suppose this might be a re-regression of the same issue(?).

I'm skeptical it's exactly the same issue even if the end result is the 
same.

The resource fitting algorithm has been in a state of constant flux over
the past two years due to various fixes and improvements. Unfortunately,
fixing one thing (or even moving towards fixing an issue) may break
another thing because of how different resources interact.

> > I checked out latest longterm (v6.18.20) and bisected, which took me to
> > commit b855d99 (upstream commit 3958bf16), which is
> > "PCI: Stop over-estimating bridge window size".
> 
> Ilpo, Bjorn: Jonas in a subthread later confirmed that this problem
> happens with 7.0-rc5, too, so it's likely something for you.
> 
> Side note: I noticed the "PCI: Improve head free space usage" series
> referred to the commit in question; wondering if that is related:
> https://lore.kernel.org/all/20260324165633.4583-1-ilpo.jarvinen@linux.intel.com/

That "PCI: Improve head free space usage" series certainly fixes two
known corner cases with commit 3958bf16e2fe ("PCI: Stop over-estimating
bridge window size"), but with only heavily filtered logs I'm unable to
confirm whether it applies to this case as well.

From the limited logs, I suspect this is primarily a BAR resize rollback
failure which leaves the resources in a worse state than they were in
before the resize. Commit 337b1b566db0 ("PCI: Fix restoring BARs on BAR
resize rollback path") attempts to rectify that. The entire series is here
(not all of it went to stable):

https://lore.kernel.org/all/20251113162628.5946-1-ilpo.jarvinen@linux.intel.com/T/#m9b0e316c94f7abc0686e58f902d05ff35aeac3ac
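
For readers following the numbers in the quoted log, here is a minimal
sketch of the size arithmetic involved. It assumes (the thread does not
state this) that the driver wants BAR0 to cover all of the reported VRAM
and that resizable BAR sizes are powers of two; the helper names are
illustrative only.

```python
# Sketch only: the ranges and sizes come from the log excerpt quoted in
# this thread; the "cover all VRAM with a power-of-two BAR" policy is an
# assumption made for illustration.

MIB = 1 << 20


def range_size(start: int, end: int) -> int:
    """Size in bytes of an inclusive [start, end] resource range."""
    return end - start + 1


def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


# BAR 2 [mem 0x74200000-0x743fffff 64bit pref] from the log
bar2_bytes = range_size(0x74200000, 0x743FFFFF)

# "VRAM: 12272M" from the log
vram_bytes = 12272 * MIB
wanted_bar0_bytes = next_pow2(vram_bytes)

print(f"BAR 2 size: {bar2_bytes // MIB} MiB")
print(f"VRAM: {vram_bytes // MIB} MiB, "
      f"power-of-two BAR covering it: {wanted_bar0_bytes // MIB} MiB")
```

Under those assumptions the bridge windows upstream of the GPU would have
to make room for a much larger prefetchable range than the one assigned at
enumeration, which is why a failed resize has to be rolled back cleanly.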

The fixes to that series are here:

5528fd38f230 ("PCI: Fix Resizable BAR restore order")
08d9eae76b85 ("PCI: Fix BAR resize rollback path overwriting ret")

(I'm sorry about how complex this all is.)

-- 
 i.

> > After the bisect I tried reverting the commit on top of v6.18.20 (going
> > back to the old way of calculating alignment), and this is sufficient
> > for the eGPU to dock properly again.
> > 
> > I've attached a longer excerpt of the kernel logs from a failing boot
> > (let me know if a full dmesg would be helpful and I can find somewhere
> > to upload it).  I've also attached an excerpt when connecting the eGPU
> > in a "good" case, since I figured the memory address ranges could be
> > useful.
> > 
> > 
> > Hardware:
> >   Machine: Dell XPS 13 9310 (0991)
> >   GPU: AMD RX 6750 XT
> >   Dock: EXP GDC TH3P4G3
> > 
> > Distribution: NixOS
> > Architecture: x86-64
> > 
> > #regzbot introduced: b855d99
> > 
> > Let me know if any additional information would be helpful.
> > 
> > Thanks,
> > Jonas
> 

* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-03-30 14:33   ` Ilpo Järvinen
@ 2026-03-30 15:50     ` Jonas Höglund
  2026-03-30 16:32       ` Ilpo Järvinen
  0 siblings, 1 reply; 11+ messages in thread
From: Jonas Höglund @ 2026-03-30 15:50 UTC (permalink / raw)
  To: Ilpo Järvinen, Thorsten Leemhuis
  Cc: Bjorn Helgaas, linux-pci, regressions

On Mon, 30 Mar 2026, at 14:33, Ilpo Järvinen wrote:
> I'm skeptical it's exactly the same issue even if the end result is the 
> same.
>
> The resource fitting algorithm has been in a state of constant flux over
> the past two years due to various fixes and improvements. Unfortunately,
> fixing one thing (or even moving towards fixing an issue) may break
> another thing because of how different resources interact.

Ok, yeah, I don't envy having to deal with that.  You're probably right
that it's more BAR-related; I mostly keyed in on the very similar symptom.


> That "PCI: Improve head free space usage" series certainly fixes two
> known corner cases with commit 3958bf16e2fe ("PCI: Stop over-estimating
> bridge window size"), but with only heavily filtered logs I'm unable to
> confirm whether it applies to this case as well.

Sorry for not providing full logs from the get-go; I couldn't think of a
suitable location.  Here's a full dmesg for reference of the crash
manifesting on 7.0.0-rc5:

https://up.firefly.nu/pub/amdgpu-egpu-crash-7.0.0-rc5.dmesg.txt


> From the limited logs, I suspect this is primarily a BAR resize rollback
> failure which leaves the resources in a worse state than they were in
> before the resize. Commit 337b1b566db0 ("PCI: Fix restoring BARs on BAR
> resize rollback path") attempts to rectify that. The entire series is here
> (not all of it went to stable):
>
> https://lore.kernel.org/all/20251113162628.5946-1-ilpo.jarvinen@linux.intel.com/T/#m9b0e316c94f7abc0686e58f902d05ff35aeac3ac
>
> The fixes to that series are here:
>
> 5528fd38f230 ("PCI: Fix Resizable BAR restore order")
> 08d9eae76b85 ("PCI: Fix BAR resize rollback path overwriting ret")

Unless I misread something, they should both be included in the recently
tagged 7.0.0-rc6--I'll try building it and see if the issue is resolved.

I'll reply once I've tested 7.0.0-rc6.


> (I'm sorry about how complex this all is.)

All good.

Thanks,
Jonas

* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-03-30 15:50     ` Jonas Höglund
@ 2026-03-30 16:32       ` Ilpo Järvinen
  2026-04-02 16:51         ` Jonas Höglund
  0 siblings, 1 reply; 11+ messages in thread
From: Ilpo Järvinen @ 2026-03-30 16:32 UTC (permalink / raw)
  To: Jonas Höglund
  Cc: Thorsten Leemhuis, Bjorn Helgaas, linux-pci, regressions

On Mon, 30 Mar 2026, Jonas Höglund wrote:

> On Mon, 30 Mar 2026, at 14:33, Ilpo Järvinen wrote:
> > I'm skeptical it's exactly the same issue even if the end result is the 
> > same.
> >
> > The resource fitting algorithm has been in a state of constant flux over
> > the past two years due to various fixes and improvements. Unfortunately,
> > fixing one thing (or even moving towards fixing an issue) may break
> > another thing because of how different resources interact.
> 
> Ok, yeah, I don't envy having to deal with that.  You're probably right
> that it's more BAR-related; I mostly keyed in on the very similar symptom.

The GPU driver could definitely handle a resource issue better than by
calling something that triggers a sanity check somewhere, but that's a
secondary problem.

> > That "PCI: Improve head free space usage" series certainly fixes two
> > known corner cases with commit 3958bf16e2fe ("PCI: Stop over-estimating
> > bridge window size"), but with only heavily filtered logs I'm unable to
> > confirm whether it applies to this case as well.
> 
> Sorry for not providing full logs from the get-go; I couldn't think of a
> suitable location.  Here's a full dmesg for reference of the crash
> manifesting on 7.0.0-rc5:
> 
> https://up.firefly.nu/pub/amdgpu-egpu-crash-7.0.0-rc5.dmesg.txt
> 
> 
> > From the limited logs, I suspect this is primarily a BAR resize rollback
> > failure which leaves the resources in a worse state than they were in
> > before the resize. Commit 337b1b566db0 ("PCI: Fix restoring BARs on BAR
> > resize rollback path") attempts to rectify that. The entire series is here
> > (not all of it went to stable):
> >
> > https://lore.kernel.org/all/20251113162628.5946-1-ilpo.jarvinen@linux.intel.com/T/#m9b0e316c94f7abc0686e58f902d05ff35aeac3ac
> >
> > The fixes to that series are here:
> >
> > 5528fd38f230 ("PCI: Fix Resizable BAR restore order")
> > 08d9eae76b85 ("PCI: Fix BAR resize rollback path overwriting ret")
> 
> Unless I misread something, they should both be included in the recently
> tagged 7.0.0-rc6--I'll try building it and see if the issue is resolved.
> 
> I'll reply once I've tested 7.0.0-rc6.

Hi again,

Now that I've looked into the logs more closely, that probably won't
help. For some reason, it seems the resize is not even attempted, and the
errno is -EINVAL, which is a bit unexpected.
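
As a small aid for mapping the numeric codes in such log lines to their
symbolic names (the filtered excerpt earlier in the thread shows "(-16)",
and -EINVAL is the code mentioned above), here is a minimal sketch. It
assumes it is run on Linux, since Python's errno table reflects the host
platform; the function name is illustrative.

```python
import errno
import os


def describe(code: int) -> str:
    """Map a kernel-style negative errno (e.g. -16) to 'NAME: message'."""
    num = abs(code)
    name = errno.errorcode.get(num, "UNKNOWN")
    return f"{name}: {os.strerror(num)}"


# -16 is the value printed in "Problem resizing BAR0 (-16)";
# -22 is the numeric value of -EINVAL on Linux.
for code in (-16, -22):
    print(code, describe(code))
```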

I'm starting to wonder whether the problem fixed by this patch is rearing
its ugly head once again (it's currently in the pci/resource branch, so it
won't appear until 7.1-rc1):

https://lore.kernel.org/linux-pci/20260326200427.GA1340256@bhelgaas/

I still don't understand why pbus_select_window() would return NULL in
this case, but it looks like the most likely place the -EINVAL could come
from (nor do I understand what would have cleared the resource's flags, if
that's the case, but it still seems the best explanation).

Please take logs from this point on with dyndbg="file drivers/pci/*.c +p"
on the kernel's command line so there's a little bit of extra info (and
check that you are building with CONFIG_DYNAMIC_DEBUG).
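
A quick way to double-check both preconditions before capturing a log
could look like the sketch below. It only inspects text handed to it; on a
running system the command line would typically come from /proc/cmdline,
and the kernel config from wherever the distribution keeps it (that
location is an assumption left to the reader). The function names are
illustrative.

```python
import shlex


def has_pci_dyndbg(cmdline: str) -> bool:
    """True if the command line has a dyndbg= query mentioning drivers/pci."""
    # shlex keeps the quoted query together as one token.
    for token in shlex.split(cmdline):
        if token.startswith("dyndbg=") and "drivers/pci" in token:
            return True
    return False


def dynamic_debug_built_in(config_text: str) -> bool:
    """True if the kernel config text contains CONFIG_DYNAMIC_DEBUG=y."""
    return any(line.strip() == "CONFIG_DYNAMIC_DEBUG=y"
               for line in config_text.splitlines())


# Illustrative inputs only; "quiet" and "CONFIG_PCI=y" are placeholders.
sample_cmdline = 'quiet dyndbg="file drivers/pci/*.c +p"'
sample_config = "CONFIG_PCI=y\nCONFIG_DYNAMIC_DEBUG=y\n"

print("dyndbg query present:", has_pci_dyndbg(sample_cmdline))
print("CONFIG_DYNAMIC_DEBUG=y:", dynamic_debug_built_in(sample_config))
```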

-- 
 i.

* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-03-30 16:32       ` Ilpo Järvinen
@ 2026-04-02 16:51         ` Jonas Höglund
  2026-04-02 16:56           ` Jonas Höglund
  2026-04-07  7:26           ` Ilpo Järvinen
  0 siblings, 2 replies; 11+ messages in thread
From: Jonas Höglund @ 2026-04-02 16:51 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Thorsten Leemhuis, Bjorn Helgaas, linux-pci, regressions

On Mon, 30 Mar 2026, at 16:32, Ilpo Järvinen wrote:
> On Mon, 30 Mar 2026, Jonas Höglund wrote:
>> 
>> Unless I misread something, they should both be included in the recently
>> tagged 7.0.0-rc6--I'll try building it and see if the issue is resolved.
>> 
>> I'll reply once I've tested 7.0.0-rc6.
>
> Hi again,
>
> Now that I've looked into the logs more closely, that probably won't
> help. For some reason, it seems the resize is not even attempted, and the
> errno is -EINVAL, which is a bit unexpected.
>
> I'm starting to wonder whether the problem fixed by this patch is rearing
> its ugly head once again (it's currently in the pci/resource branch, so it
> won't appear until 7.1-rc1):
>
> https://lore.kernel.org/linux-pci/20260326200427.GA1340256@bhelgaas/

Seems your hunch was right--I've now gotten around to testing with
7.0-rc6 as well as the pci/resource branch.  The problem persists in the
former whereas it seems docking succeeds in the latter.


> I still don't understand why pbus_select_window() would return NULL in
> this case, but it looks like the most likely place the -EINVAL could come
> from (nor do I understand what would have cleared the resource's flags, if
> that's the case, but it still seems the best explanation).
>
> Please take logs from this point on with dyndbg="file drivers/pci/*.c +p"
> on the kernel's command line so there's a little bit of extra info (and
> check that you are building with CONFIG_DYNAMIC_DEBUG).

Here are dmesg logs (with the appropriate dyndbg cmdline flag) for both
cases, in case it's useful:

https://up.firefly.nu/pub/amdgpu-egpu-crash-7.0.0-rc6.dmesg.txt
https://up.firefly.nu/pub/amdgpu-egpu-good-pci-resource.dmesg.txt


That's good enough on my end, knowing the issue is addressed already
upstream and slated for 7.1.  I'm happy to test anything else if it'd
be useful (for eventual backports or so), but otherwise I think I'll
just pick those patches from the pci/resource tree for now.


Thanks,
Jonas

* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-04-02 16:51         ` Jonas Höglund
@ 2026-04-02 16:56           ` Jonas Höglund
  2026-04-07  7:37             ` Ilpo Järvinen
  2026-04-07  7:26           ` Ilpo Järvinen
  1 sibling, 1 reply; 11+ messages in thread
From: Jonas Höglund @ 2026-04-02 16:56 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Thorsten Leemhuis, Bjorn Helgaas, linux-pci, regressions

On Thu, 2 Apr 2026, at 16:51, Jonas Höglund wrote:
> Here are dmesg logs (with the appropriate dyndbg cmdline flag) for both
> cases, in case it's useful:
>
> https://up.firefly.nu/pub/amdgpu-egpu-crash-7.0.0-rc6.dmesg.txt
> https://up.firefly.nu/pub/amdgpu-egpu-good-pci-resource.dmesg.txt

Actually, now that I look closer at the good-case dmesg (just after
sending the email, typical), I guess I'm not sure how happy the driver
and dock really are.  Well, I'll let you weigh in since you're more
intimately familiar with the subsystem and expected debug messages.

Thanks

* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-04-02 16:51         ` Jonas Höglund
  2026-04-02 16:56           ` Jonas Höglund
@ 2026-04-07  7:26           ` Ilpo Järvinen
  1 sibling, 0 replies; 11+ messages in thread
From: Ilpo Järvinen @ 2026-04-07  7:26 UTC (permalink / raw)
  To: Jonas Höglund
  Cc: Thorsten Leemhuis, Bjorn Helgaas, linux-pci, regressions

On Thu, 2 Apr 2026, Jonas Höglund wrote:

> On Mon, 30 Mar 2026, at 16:32, Ilpo Järvinen wrote:
> > On Mon, 30 Mar 2026, Jonas Höglund wrote:
> >> 
> >> Unless I misread something, they should both be included in the recently
> >> tagged 7.0.0-rc6--I'll try building it and see if the issue is resolved.
> >> 
> >> I'll reply once I've tested 7.0.0-rc6.
> >
> > Hi again,
> >
> > Now that I've looked into the logs more closely, that probably won't
> > help. For some reason, it seems the resize is not even attempted, and the
> > errno is -EINVAL, which is a bit unexpected.
> >
> > I'm starting to wonder whether the problem fixed by this patch is rearing
> > its ugly head once again (it's currently in the pci/resource branch, so it
> > won't appear until 7.1-rc1):
> >
> > https://lore.kernel.org/linux-pci/20260326200427.GA1340256@bhelgaas/
> 
> Seems your hunch was right--I've now gotten around to testing with
> 7.0-rc6 as well as the pci/resource branch.  The problem persists in the
> former whereas it seems docking succeeds in the latter.
> 
> 
> > I still don't understand why pbus_select_window() would return NULL in
> > this case, but it looks like the most likely place the -EINVAL could come
> > from (nor do I understand what would have cleared the resource's flags, if
> > that's the case, but it still seems the best explanation).
> >
> > Please take logs from this point on with dyndbg="file drivers/pci/*.c +p"
> > on the kernel's command line so there's a little bit of extra info (and
> > check that you are building with CONFIG_DYNAMIC_DEBUG).
> 
> Here are dmesg logs (with the appropriate dyndbg cmdline flag) for both
> cases, in case it's useful:
> 
> https://up.firefly.nu/pub/amdgpu-egpu-crash-7.0.0-rc6.dmesg.txt
> https://up.firefly.nu/pub/amdgpu-egpu-good-pci-resource.dmesg.txt
> 
> 
> That's good enough on my end, knowing the issue is addressed already
> upstream and slated for 7.1.  I'm happy to test anything else if it'd
> be useful (for eventual backports or so), but otherwise I think I'll
> just pick those patches from the pci/resource tree for now.

Hi,

Thanks. It certainly looks like commit dc4b4d04e1ca ("PCI: Prevent
shrinking bridge window from its required size"), which I referred to
above, might indeed help here. (For Thorsten's convenience: as mentioned
above, it is currently in the pci/resource branch, slated for 7.1.)

With the extra debug enabled, "shrunken by" lines appear in the log. They
indicate that the hotplug memory distribution algorithm modifies the
calculated bridge window sizes between resource sizing and resource
assignment; my fix aims to prevent that from happening.

If that fix does not help or does not fully solve the issue, please take a
new log with that patch included in the kernel (preferably with all the
fixes that are currently in the pci/resource branch, so we don't hit yet
another issue that already has a fix). If you need to take more logs,
please also include a /proc/iomem dump (figuring out the iomem layout from
dmesg is pretty tedious and error-prone).
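
To show why an iomem dump is easier to work with than dmesg, here is a
minimal sketch that parses /proc/iomem-style text into nested regions. It
assumes the usual layout of two spaces of indentation per nesting level
and "start-end : name" on each line. The sample text is hypothetical: only
the 0x74200000-0x743fffff range is taken from the log in this thread, and
the enclosing bus window is made up for illustration. Note that reading
the real file without root privileges typically yields zeroed addresses.

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    depth: int   # nesting level (0 = top level)
    start: int
    end: int     # inclusive
    name: str


_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def parse_iomem(text: str) -> List[Region]:
    """Parse /proc/iomem-style text; lines that do not match are skipped."""
    regions = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue
        indent, start, end, name = m.groups()
        regions.append(
            Region(len(indent) // 2, int(start, 16), int(end, 16), name))
    return regions


# Hypothetical sample; the parent window below is invented for the example.
SAMPLE = """\
70000000-7fffffff : PCI Bus 0000:00
  74200000-743fffff : 0000:3e:00.0
"""

for r in parse_iomem(SAMPLE):
    size_kib = (r.end - r.start + 1) // 1024
    print(f"{'  ' * r.depth}{r.name}: {size_kib} KiB")
```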

Also, if you see this line, it's worth posting the log (even if things
appear to be working):

amdgpu 0000:3e:00.0: Not enough PCI address space for a large BAR.

...I'll see if I can somehow improve that as well (no guarantees, but it's
still worth taking a look; the line also appears in the case you labeled
"good").
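
For anyone checking their own logs for the two indicators mentioned in
this message, a minimal filter might look like the following sketch. The
marker strings are the ones quoted above; the function name and the sample
input are illustrative, with the sample lines taken from this thread.

```python
from typing import List

# Substrings quoted in this message as worth looking for.
MARKERS = (
    "shrunken by",
    "Not enough PCI address space for a large BAR",
)


def interesting_lines(dmesg_text: str) -> List[str]:
    """Return the log lines that contain any of the marker substrings."""
    return [line for line in dmesg_text.splitlines()
            if any(marker in line for marker in MARKERS)]


# Sample input built from lines that appear in this thread.
sample = (
    "amdgpu 0000:3e:00.0: amdgpu: Problem resizing BAR0 (-16).\n"
    "amdgpu 0000:3e:00.0: Not enough PCI address space for a large BAR.\n"
)

for line in interesting_lines(sample):
    print(line)
```

In practice the text would come from a saved dmesg file rather than an
inline string.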

-- 
 i.

* Re: [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code
  2026-04-02 16:56           ` Jonas Höglund
@ 2026-04-07  7:37             ` Ilpo Järvinen
  0 siblings, 0 replies; 11+ messages in thread
From: Ilpo Järvinen @ 2026-04-07  7:37 UTC (permalink / raw)
  To: Jonas Höglund
  Cc: Thorsten Leemhuis, Bjorn Helgaas, linux-pci, regressions

On Thu, 2 Apr 2026, Jonas Höglund wrote:

> On Thu, 2 Apr 2026, at 16:51, Jonas Höglund wrote:
> > Here are dmesg logs (with the appropriate dyndbg cmdline flag) for both
> > cases, in case it's useful:
> >
> > https://up.firefly.nu/pub/amdgpu-egpu-crash-7.0.0-rc6.dmesg.txt
> > https://up.firefly.nu/pub/amdgpu-egpu-good-pci-resource.dmesg.txt
> 
> Actually, now that I look closer at the good-case dmesg (just after
> sending the email, typical), I guess I'm not sure how happy the driver
> and dock really are.  Well, I'll let you weigh in since you're more
> intimately familiar with the subsystem and expected debug messages.

Hi,

Even the "good" case fails to enlarge the eGPU resource because some of
the upstream bridge windows are pinned by other resources.

This problem has come up multiple times over the past few years, but it is
not very easy to solve (I'm working on it). The sizing needs to be
corrected earlier, before the other resources pin those bridge windows; by
the time we discover that the BAR resize fails, it's already far too late
to correct things.

-- 
 i.

end of thread, other threads:[~2026-04-07  7:37 UTC | newest]

Thread overview: 11+ messages
2026-03-27 23:02 [REGRESSION] amdgpu with Thunderbolt eGPU bracket fails since new bridge window alignment calculation code Jonas Höglund
2026-03-28  8:46 ` Thorsten Leemhuis
2026-03-28 16:09   ` Jonas Höglund
2026-03-30  7:21 ` Thorsten Leemhuis
2026-03-30 14:33   ` Ilpo Järvinen
2026-03-30 15:50     ` Jonas Höglund
2026-03-30 16:32       ` Ilpo Järvinen
2026-04-02 16:51         ` Jonas Höglund
2026-04-02 16:56           ` Jonas Höglund
2026-04-07  7:37             ` Ilpo Järvinen
2026-04-07  7:26           ` Ilpo Järvinen
