* [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory
@ 2025-08-31 10:51 Steve Oswald
  2025-09-01 13:25 ` Ilpo Järvinen
  0 siblings, 1 reply; 11+ messages in thread
From: Steve Oswald @ 2025-08-31 10:51 UTC (permalink / raw)
  To: linux-pci

Hello,

I’ve encountered an issue with a Thunderbolt eGPU (a GPU connected
externally via Thunderbolt 4). The change from kernel 6.10.14 to 6.11.0
broke the PCI memory assignment of the external PCIe device. I figured out
which version broke it by using Ubuntu 25.04 and downgrading the kernel
(https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh).

From the dmesg output, on the broken 6.11.0 I see 'failed to assign'.
The issue almost never occurs on the previous kernel version, 6.10.14.
Using pci=realloc did not change the behavior (I can produce the dmesg
output if necessary).

The issue was tested with 2 eGPUs (Radeon Instinct MI50 32GB, NVIDIA
3080 10GB). Both the AMD and the NVIDIA driver fail to initialize the
device because they cannot write the PCIe messages.

System details:
- Kernel: Linux 6.10.14-061014-generic (Ubuntu build) -> 6.11.0-061100
- Laptop: TUXEDO InfinityBook Pro 16 - Gen8 with Thunderbolt 4
- eGPU: Radeon Instinct MI50 32GB, NVIDIA 3080 10GB

Steps to reproduce:
1. Boot the system with the eGPU.
2. Observe the PCI BAR message in `dmesg`.

Logs:
Both kernel messages and lspci output can be found here:
https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af
Raw files:
- dmesg_linux_6.11.0.log
  https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.11.0.log
- dmesg_linux_6.10.14.log
  https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.10.14.log

If additional info is needed, I'm happy to help.

Cheers,
Steve

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory 2025-08-31 10:51 [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory Steve Oswald @ 2025-09-01 13:25 ` Ilpo Järvinen 2025-09-01 15:50 ` Ilpo Järvinen 0 siblings, 1 reply; 11+ messages in thread From: Ilpo Järvinen @ 2025-09-01 13:25 UTC (permalink / raw) To: Steve Oswald; +Cc: linux-pci [-- Attachment #1: Type: text/plain, Size: 3454 bytes --] On Sun, 31 Aug 2025, Steve Oswald wrote: > Hello, > > I’ve encountered an issue with Thunderbolt eGPU (externally connected > gpu via thunderbolt 4). The change from kernel 6.10.14 to 6.11.0 broke > the pci memory assignment of the external pcie device. I figured out > which version broke it by using ubuntu 25.04 and downgrading the > kernel (https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh). > > >From the dmesg output, on the broken 6.11.0 I see 'failed to assign'. > The issue occurs (almost never) on previous kernel version 6.10.14. > Using pci=realloc did not change the behavior (I can produce the dmesg > output if necessary). > > The issue was tested with 2 egpus (Radeon Instinct MI50 32GB, NVIDIA > 3080 10GB). Both the amd and the nvidia driver fail to initialize the > device because they cannot write the pcie messages. > > System details: > - Kernel: Linux 6.10.14-061014-generic (Ubuntu build) > 6.11.0-061100 > - Laptop: TUXEDO InfinityBook Pro 16 - Gen8 with Thunderbolt 4 > - eGPU: Radeon Instinct MI50 32GB, NVIDIA 3080 10GB > > Steps to reproduce: > 1. Boot the system with the eGPU. > 2. Observe PCI BAR message in `dmesg`. 
>
> Logs:
> both kernel messages, lspci can be found here:
> https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af
> raw files:
> - dmesg_linux_6.11.0.log
> https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.11.0.log
> - dmesg_linux_6.10.14.log
> https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.10.14.log
>
> If additional info is needed, I'm happy to help.

Hi Steve,

Thanks for the report.

My analysis is that the problem boils down to the lack of this line with 6.11:

pcieport 0000:00:07.0: resource 15 [mem 0x6000000000-0x601bffffff 64bit pref] released

It means one of the upstream bridge windows could not be released for
resize, as it is printed from pci_reassign_bridge_resources(), which likely
occurs inside the pci_resize_resource() call from amdgpu(?).

The very likely cause is this check:

	/* Ignore BARs which are still in use */
	if (res->child)
		continue;

...which (until very recently) was entirely silent, so there is no warning
whatsoever about what the root cause is.

What this means is that there's some assigned resource underneath
0000:00:07.0 with 6.11 that wasn't there with 6.10. And it is because 6.11
tried harder to get your resources assigned and was successful here and
there, resulting in pinning the bridge window in its place, whereas 6.10
failed to assign the same resource.

Could you provide /proc/iomem (it's enough to do that for 6.11 for now)?

You could try to use hpmmioprefsize= on the kernel's command line to reserve
more space for the bridge windows. The default is only 2M and these GPUs
need orders of magnitude more (gigabytes). You can check from 6.10 what the
sizes of the BARs on the GPU are, and round the sum upwards to the next
power of two multiple.
I'd also be interested to see why pci=realloc failed to solve this problem,
as it should reconfigure the entire resource tree, so please provide the
logs with that as well. Please also capture lspci with -vvv.

-- 
 i.

^ permalink raw reply [flat|nested] 11+ messages in thread
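A minimal sketch of the sizing suggestion in the message above, assuming "round the sum upwards to the next power of two multiple" means rounding the summed BAR sizes up to the next power of two. The BAR sizes used here are hypothetical placeholders, not values read from the reporter's logs; the real ones would come from the 6.10 dmesg or the lspci -vvv output.

```python
GiB = 1 << 30
MiB = 1 << 20


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def suggested_window_size(bar_sizes: list[int]) -> int:
    """Sum the BAR sizes and round the total up to a power of two."""
    return next_power_of_two(sum(bar_sizes))


# Hypothetical BAR sizes for illustration only (not from the thread's logs).
example_bars = [32 * GiB, 2 * MiB, 512 * 1024]

size = suggested_window_size(example_bars)
print(f"suggested reservation: {size // GiB}G")
```

The resulting value would then be passed to the hpmmioprefsize= parameter mentioned above; whether that reservation is actually enough depends on the alignment needs of the individual BARs, which this sketch ignores.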
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory 2025-09-01 13:25 ` Ilpo Järvinen @ 2025-09-01 15:50 ` Ilpo Järvinen 2025-09-01 16:06 ` Ilpo Järvinen 2025-09-01 16:10 ` Steve Oswald 0 siblings, 2 replies; 11+ messages in thread From: Ilpo Järvinen @ 2025-09-01 15:50 UTC (permalink / raw) To: Steve Oswald; +Cc: linux-pci [-- Attachment #1: Type: text/plain, Size: 4666 bytes --] On Mon, 1 Sep 2025, Ilpo Järvinen wrote: > On Sun, 31 Aug 2025, Steve Oswald wrote: > > > I’ve encountered an issue with Thunderbolt eGPU (externally connected > > gpu via thunderbolt 4). The change from kernel 6.10.14 to 6.11.0 broke > > the pci memory assignment of the external pcie device. I figured out > > which version broke it by using ubuntu 25.04 and downgrading the > > kernel (https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh). > > > > >From the dmesg output, on the broken 6.11.0 I see 'failed to assign'. > > The issue occurs (almost never) on previous kernel version 6.10.14. > > Using pci=realloc did not change the behavior (I can produce the dmesg > > output if necessary). > > > > The issue was tested with 2 egpus (Radeon Instinct MI50 32GB, NVIDIA > > 3080 10GB). Both the amd and the nvidia driver fail to initialize the > > device because they cannot write the pcie messages. > > > > System details: > > - Kernel: Linux 6.10.14-061014-generic (Ubuntu build) > 6.11.0-061100 > > - Laptop: TUXEDO InfinityBook Pro 16 - Gen8 with Thunderbolt 4 > > - eGPU: Radeon Instinct MI50 32GB, NVIDIA 3080 10GB > > > > Steps to reproduce: > > 1. Boot the system with the eGPU. > > 2. Observe PCI BAR message in `dmesg`. 
> >
> > Logs:
> > both kernel messages, lspci can be found here:
> > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af
> > raw files:
> > - dmesg_linux_6.11.0.log
> > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.11.0.log
> > - dmesg_linux_6.10.14.log
> > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.10.14.log
> >
> > If additional info is needed, I'm happy to help.
>
> Hi Steve,
>
> Thanks for the report.
>
> My analysis is that the problem boils down to lack of this line with 6.11:
>
> pcieport 0000:00:07.0: resource 15 [mem 0x6000000000-0x601bffffff 64bit pref] released
>
> It means one of the upstream bridge windows could not be released for
> resize as it is printed from pci_reassign_bridge_resources() which likely
> occurs inside pci_resize_resource() call from amdgpu(?).
>
> The very likely cause is this check:
>
> 	/* Ignore BARs which are still in use */
> 	if (res->child)
> 		continue;
>
> ...which (until very recently) is entirely silent so there's no warning
> whatsover what is the root cause.

Hi again,

Actually, scratch most of that. It's not during resize, as the log would
then say "releasing" (I don't know how I got this confused). "released" is
from pci_bridge_release_resources(), which is called from
pci_bus_release_bridge_resources(); that path doesn't even try to walk
upwards.

But that raises the question of why the bridge windows didn't also fail
their assignments.

Resource fitting calculates the size for the bridge window:

pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] to [bus 04-2c] add_size 100000 add_align 100000

...but I cannot see an assignment for that even being attempted, as almost
immediately this occurs:

pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: assigned

...which is much less than 0x10003fffff-0x800000000.
I cannot think of anything that could make it shrink like that.

I'll have to think about this more. It might require a debug patch, but
I'll think until tomorrow to see if I can understand it from the code alone.

> What this means, is that there's some assigned resource underneath
> 0000:00:07.0 with 6.11 that wasn't there with 6.10. And it is because 6.11
> tried harder to get your resources assigned and was successful here and
> there resulting in pinning the bridge window in its place, whereas 6.10
> failed to assign the same resource.
>
> Could you provide /proc/iomem (it's enough to do that for 6.11 for now)?
>
>
> You could try to use hpmmioprefsize= on kernel's command line to reserve
> more space for the bridge windows, the default is only 2M and these GPUs
> need a magnitude more (gigabytes), you can check from 6.10 what the sizes
> of the BARs on the GPU are, and round the sum upwards to the next power of
> two multiple.
>
> I'd also be interested to see why pci=realloc failed to solve this problem
> as it should reconfigure the entire resource tree so if you could provide
> the logs with that. Please take lspci with -vvv.
>
>

-- 
 i.

^ permalink raw reply [flat|nested] 11+ messages in thread
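To make the size gap in the message above concrete, here is a small sketch that pulls the [mem start-end] range out of the two bridge window lines quoted there and prints each window's size. The helper name window_size() and the regular expression are illustrative assumptions; only the two log lines come from the thread.

```python
import re

MiB = 1 << 20

# Matches the "[mem 0xSTART-0xEND" part of a kernel resource message.
WINDOW_RE = re.compile(r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)")


def window_size(log_line: str) -> int:
    """Return the size in bytes of the memory range in a dmesg line."""
    match = WINDOW_RE.search(log_line)
    if match is None:
        raise ValueError("no [mem start-end] range found in line")
    start, end = (int(value, 16) for value in match.groups())
    return end - start + 1  # resource ranges are inclusive


calculated = (
    "pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] "
    "to [bus 04-2c] add_size 100000 add_align 100000"
)
assigned = (
    "pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: "
    "assigned"
)

print(window_size(calculated) // MiB, "MiB calculated by resource fitting")
print(window_size(assigned) // MiB, "MiB actually assigned")
```

The calculated window comes out at 32772 MiB and the assigned one at 448 MiB, which is the discrepancy the message is puzzling over.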
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory 2025-09-01 15:50 ` Ilpo Järvinen @ 2025-09-01 16:06 ` Ilpo Järvinen 2025-09-01 16:18 ` Steve Oswald 2025-09-01 16:10 ` Steve Oswald 1 sibling, 1 reply; 11+ messages in thread From: Ilpo Järvinen @ 2025-09-01 16:06 UTC (permalink / raw) To: Steve Oswald; +Cc: linux-pci [-- Attachment #1: Type: text/plain, Size: 5117 bytes --] On Mon, 1 Sep 2025, Ilpo Järvinen wrote: > On Mon, 1 Sep 2025, Ilpo Järvinen wrote: > > On Sun, 31 Aug 2025, Steve Oswald wrote: > > > > > I’ve encountered an issue with Thunderbolt eGPU (externally connected > > > gpu via thunderbolt 4). The change from kernel 6.10.14 to 6.11.0 broke > > > the pci memory assignment of the external pcie device. I figured out > > > which version broke it by using ubuntu 25.04 and downgrading the > > > kernel (https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh). > > > > > > >From the dmesg output, on the broken 6.11.0 I see 'failed to assign'. > > > The issue occurs (almost never) on previous kernel version 6.10.14. > > > Using pci=realloc did not change the behavior (I can produce the dmesg > > > output if necessary). > > > > > > The issue was tested with 2 egpus (Radeon Instinct MI50 32GB, NVIDIA > > > 3080 10GB). Both the amd and the nvidia driver fail to initialize the > > > device because they cannot write the pcie messages. > > > > > > System details: > > > - Kernel: Linux 6.10.14-061014-generic (Ubuntu build) > 6.11.0-061100 > > > - Laptop: TUXEDO InfinityBook Pro 16 - Gen8 with Thunderbolt 4 > > > - eGPU: Radeon Instinct MI50 32GB, NVIDIA 3080 10GB > > > > > > Steps to reproduce: > > > 1. Boot the system with the eGPU. > > > 2. Observe PCI BAR message in `dmesg`. 
> > > > > > Logs: > > > both kernel messages, lspci can be found here: > > > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af > > > raw files: > > > - dmesg_linux_6.11.0.log > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.11.0.log > > > - dmesg_linux_6.10.14.log > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.10.14.log > > > > > > If additional info is needed, I'm happy to help. > > > > Hi Steve, > > > > Thanks for the report. > > > > My analysis is that the problem boils down to lack of this line with 6.11: > > > > pcieport 0000:00:07.0: resource 15 [mem 0x6000000000-0x601bffffff 64bit pref] released > > > > It means one of the upstream bridge windows could not be released for > > resize as it is printed from pci_reassign_bridge_resources() which likely > > occurs inside pci_resize_resource() call from amdgpu(?). > > > > The very likely cause is this check: > > > > /* Ignore BARs which are still in use */ > > if (res->child) > > continue; > > > > ...which (until very recently) is entirely silent so there's no warning > > whatsover what is the root cause. > > Hi again, > > Actually, scratch most of that. It's not during resize as the log should > say "releasing" (I don't know how I got this confused). "released" is from > pci_bridge_release_resources() which is called from > pci_bus_release_bridge_resources() doesn't even try to walk upwards. > > But that begs question, why didn't also the bridge windows fail their > assignments. 
> > Resource fitting calculates size for the bridge window: > > pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] to [bus 04-2c] add_size 100000 add_align 100000 > > ...but I cannot see assignment for that even being attempted as almost > immediately, this occurs: > > pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: assigned > > ...which is much less than 0x10003fffff-0x800000000. I cannot think of > anything what could make it shrink like that. > > I'll have to think this more, it might require a debug patch but I'll > think until tomorrow to see if I can understand it from the code alone. Only thing I can think of is something going wrong in adjust_bridge_window(). Could you please provide a dmesg with dyndbg="file drivers/pci/* +p" on the kernel cmdline from 6.11. > > What this means, is that there's some assigned resource underneath > > 0000:00:07.0 with 6.11 that wasn't there with 6.10. And it is because 6.11 > > tried harder to get your resources assigned and was successful here and > > there resulting in pinning the bridge window in its place, whereas 6.10 > > failed to assign the same resource. > > > > Could you provide /proc/iomem (it's enough to do that for 6.11 for now)? > > > > > > You could try to use hpmmioprefsize= on kernel's command line to reserve > > more space for the bridge windows, the default is only 2M and these GPUs > > need a magnitude more (gigabytes), you can check from 6.10 what the sizes > > of the BARs on the GPU are, and round the sum upwards to the next power of > > two multiple. > > > > I'd also be interested to see why pci=realloc failed to solve this problem > > as it should reconfigure the entire resource tree so if you could provide > > the logs with that. Please take lspci with -vvv. > > > > > > > > -- i. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory
  2025-09-01 16:06 ` Ilpo Järvinen
@ 2025-09-01 16:18 ` Steve Oswald
  2025-09-01 16:28 ` Ilpo Järvinen
  2025-09-03 13:09 ` Ilpo Järvinen
  0 siblings, 2 replies; 11+ messages in thread
From: Steve Oswald @ 2025-09-01 16:18 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: linux-pci

I've added the dmesg output for dyndbg="file drivers/pci/*. I wasn't sure
if I added it with the escaped quotes correctly.
https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/9cf5fc3a8c4f13588a33d61865f804f85e50470a/dmesg_linux_6.11.0_dyndbg.log

On Mon, 1 Sept 2025 at 19:07, Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
>
> On Mon, 1 Sep 2025, Ilpo Järvinen wrote:
>
> > On Mon, 1 Sep 2025, Ilpo Järvinen wrote:
> > > On Sun, 31 Aug 2025, Steve Oswald wrote:
> > >
> > > > I’ve encountered an issue with Thunderbolt eGPU (externally connected
> > > > gpu via thunderbolt 4). The change from kernel 6.10.14 to 6.11.0 broke
> > > > the pci memory assignment of the external pcie device. I figured out
> > > > which version broke it by using ubuntu 25.04 and downgrading the
> > > > kernel (https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh).
> > > >
> > > > From the dmesg output, on the broken 6.11.0 I see 'failed to assign'.
> > > > The issue occurs (almost never) on previous kernel version 6.10.14.
> > > > Using pci=realloc did not change the behavior (I can produce the dmesg
> > > > output if necessary).
> > > >
> > > > The issue was tested with 2 egpus (Radeon Instinct MI50 32GB, NVIDIA
> > > > 3080 10GB). Both the amd and the nvidia driver fail to initialize the
> > > > device because they cannot write the pcie messages.
> > > > > > > > System details: > > > > - Kernel: Linux 6.10.14-061014-generic (Ubuntu build) > 6.11.0-061100 > > > > - Laptop: TUXEDO InfinityBook Pro 16 - Gen8 with Thunderbolt 4 > > > > - eGPU: Radeon Instinct MI50 32GB, NVIDIA 3080 10GB > > > > > > > > Steps to reproduce: > > > > 1. Boot the system with the eGPU. > > > > 2. Observe PCI BAR message in `dmesg`. > > > > > > > > Logs: > > > > both kernel messages, lspci can be found here: > > > > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af > > > > raw files: > > > > - dmesg_linux_6.11.0.log > > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.11.0.log > > > > - dmesg_linux_6.10.14.log > > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.10.14.log > > > > > > > > If additional info is needed, I'm happy to help. > > > > > > Hi Steve, > > > > > > Thanks for the report. > > > > > > My analysis is that the problem boils down to lack of this line with 6.11: > > > > > > pcieport 0000:00:07.0: resource 15 [mem 0x6000000000-0x601bffffff 64bit pref] released > > > > > > It means one of the upstream bridge windows could not be released for > > > resize as it is printed from pci_reassign_bridge_resources() which likely > > > occurs inside pci_resize_resource() call from amdgpu(?). > > > > > > The very likely cause is this check: > > > > > > /* Ignore BARs which are still in use */ > > > if (res->child) > > > continue; > > > > > > ...which (until very recently) is entirely silent so there's no warning > > > whatsover what is the root cause. > > > > Hi again, > > > > Actually, scratch most of that. It's not during resize as the log should > > say "releasing" (I don't know how I got this confused). 
"released" is from > > pci_bridge_release_resources() which is called from > > pci_bus_release_bridge_resources() doesn't even try to walk upwards. > > > > But that begs question, why didn't also the bridge windows fail their > > assignments. > > > > Resource fitting calculates size for the bridge window: > > > > pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] to [bus 04-2c] add_size 100000 add_align 100000 > > > > ...but I cannot see assignment for that even being attempted as almost > > immediately, this occurs: > > > > pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: assigned > > > > ...which is much less than 0x10003fffff-0x800000000. I cannot think of > > anything what could make it shrink like that. > > > > I'll have to think this more, it might require a debug patch but I'll > > think until tomorrow to see if I can understand it from the code alone. > > Only thing I can think of is something going wrong in > adjust_bridge_window(). > > Could you please provide a dmesg with dyndbg="file drivers/pci/* +p" on > the kernel cmdline from 6.11. > > > > What this means, is that there's some assigned resource underneath > > > 0000:00:07.0 with 6.11 that wasn't there with 6.10. And it is because 6.11 > > > tried harder to get your resources assigned and was successful here and > > > there resulting in pinning the bridge window in its place, whereas 6.10 > > > failed to assign the same resource. > > > > > > Could you provide /proc/iomem (it's enough to do that for 6.11 for now)? > > > > > > > > > You could try to use hpmmioprefsize= on kernel's command line to reserve > > > more space for the bridge windows, the default is only 2M and these GPUs > > > need a magnitude more (gigabytes), you can check from 6.10 what the sizes > > > of the BARs on the GPU are, and round the sum upwards to the next power of > > > two multiple. 
> > > > > > I'd also be interested to see why pci=realloc failed to solve this problem > > > as it should reconfigure the entire resource tree so if you could provide > > > the logs with that. Please take lspci with -vvv. > > > > > > > > > > > > > > > -- > i. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory 2025-09-01 16:18 ` Steve Oswald @ 2025-09-01 16:28 ` Ilpo Järvinen 2025-09-03 13:09 ` Ilpo Järvinen 1 sibling, 0 replies; 11+ messages in thread From: Ilpo Järvinen @ 2025-09-01 16:28 UTC (permalink / raw) To: Steve Oswald; +Cc: linux-pci [-- Attachment #1: Type: text/plain, Size: 6226 bytes --] On Mon, 1 Sep 2025, Steve Oswald wrote: > I've added the dmesg output fordyndbg="file drivers/pci/*. I wasn't > sure if I added it with the escaped quotes correctly. > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/9cf5fc3a8c4f13588a33d61865f804f85e50470a/dmesg_linux_6.11.0_dyndbg.log Thanks. It was correct and caught the reason for the problem: [ 9.090106] pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] shrunken by 0x00000007e4400000 I'll get back to this once I've tried to figure why it does what it does. -- i. > Am Mo., 1. Sept. 2025 um 19:07 Uhr schrieb Ilpo Järvinen > <ilpo.jarvinen@linux.intel.com>: > > > > On Mon, 1 Sep 2025, Ilpo Järvinen wrote: > > > > > On Mon, 1 Sep 2025, Ilpo Järvinen wrote: > > > > On Sun, 31 Aug 2025, Steve Oswald wrote: > > > > > > > > > I’ve encountered an issue with Thunderbolt eGPU (externally connected > > > > > gpu via thunderbolt 4). The change from kernel 6.10.14 to 6.11.0 broke > > > > > the pci memory assignment of the external pcie device. I figured out > > > > > which version broke it by using ubuntu 25.04 and downgrading the > > > > > kernel (https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh). > > > > > > > > > > >From the dmesg output, on the broken 6.11.0 I see 'failed to assign'. > > > > > The issue occurs (almost never) on previous kernel version 6.10.14. > > > > > Using pci=realloc did not change the behavior (I can produce the dmesg > > > > > output if necessary). 
> > > > > > > > > > The issue was tested with 2 egpus (Radeon Instinct MI50 32GB, NVIDIA > > > > > 3080 10GB). Both the amd and the nvidia driver fail to initialize the > > > > > device because they cannot write the pcie messages. > > > > > > > > > > System details: > > > > > - Kernel: Linux 6.10.14-061014-generic (Ubuntu build) > 6.11.0-061100 > > > > > - Laptop: TUXEDO InfinityBook Pro 16 - Gen8 with Thunderbolt 4 > > > > > - eGPU: Radeon Instinct MI50 32GB, NVIDIA 3080 10GB > > > > > > > > > > Steps to reproduce: > > > > > 1. Boot the system with the eGPU. > > > > > 2. Observe PCI BAR message in `dmesg`. > > > > > > > > > > Logs: > > > > > both kernel messages, lspci can be found here: > > > > > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af > > > > > raw files: > > > > > - dmesg_linux_6.11.0.log > > > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.11.0.log > > > > > - dmesg_linux_6.10.14.log > > > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.10.14.log > > > > > > > > > > If additional info is needed, I'm happy to help. > > > > > > > > Hi Steve, > > > > > > > > Thanks for the report. > > > > > > > > My analysis is that the problem boils down to lack of this line with 6.11: > > > > > > > > pcieport 0000:00:07.0: resource 15 [mem 0x6000000000-0x601bffffff 64bit pref] released > > > > > > > > It means one of the upstream bridge windows could not be released for > > > > resize as it is printed from pci_reassign_bridge_resources() which likely > > > > occurs inside pci_resize_resource() call from amdgpu(?). 
> > > > > > > > The very likely cause is this check: > > > > > > > > /* Ignore BARs which are still in use */ > > > > if (res->child) > > > > continue; > > > > > > > > ...which (until very recently) is entirely silent so there's no warning > > > > whatsover what is the root cause. > > > > > > Hi again, > > > > > > Actually, scratch most of that. It's not during resize as the log should > > > say "releasing" (I don't know how I got this confused). "released" is from > > > pci_bridge_release_resources() which is called from > > > pci_bus_release_bridge_resources() doesn't even try to walk upwards. > > > > > > But that begs question, why didn't also the bridge windows fail their > > > assignments. > > > > > > Resource fitting calculates size for the bridge window: > > > > > > pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] to [bus 04-2c] add_size 100000 add_align 100000 > > > > > > ...but I cannot see assignment for that even being attempted as almost > > > immediately, this occurs: > > > > > > pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: assigned > > > > > > ...which is much less than 0x10003fffff-0x800000000. I cannot think of > > > anything what could make it shrink like that. > > > > > > I'll have to think this more, it might require a debug patch but I'll > > > think until tomorrow to see if I can understand it from the code alone. > > > > Only thing I can think of is something going wrong in > > adjust_bridge_window(). > > > > Could you please provide a dmesg with dyndbg="file drivers/pci/* +p" on > > the kernel cmdline from 6.11. > > > > > > What this means, is that there's some assigned resource underneath > > > > 0000:00:07.0 with 6.11 that wasn't there with 6.10. And it is because 6.11 > > > > tried harder to get your resources assigned and was successful here and > > > > there resulting in pinning the bridge window in its place, whereas 6.10 > > > > failed to assign the same resource. 
> > > > > > > > Could you provide /proc/iomem (it's enough to do that for 6.11 for now)? > > > > > > > > > > > > You could try to use hpmmioprefsize= on kernel's command line to reserve > > > > more space for the bridge windows, the default is only 2M and these GPUs > > > > need a magnitude more (gigabytes), you can check from 6.10 what the sizes > > > > of the BARs on the GPU are, and round the sum upwards to the next power of > > > > two multiple. > > > > > > > > I'd also be interested to see why pci=realloc failed to solve this problem > > > > as it should reconfigure the entire resource tree so if you could provide > > > > the logs with that. Please take lspci with -vvv. > > > > > > > > > > > > > > > > > > > > > > -- > > i. > ^ permalink raw reply [flat|nested] 11+ messages in thread
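As a quick cross-check of the "shrunken by" message quoted in the reply above, the following sketch subtracts the reported shrink amount from the calculated window size and compares what is left with the Root Port window (0000:00:07.0) seen earlier in the thread. All three constants are copied from log lines in the thread; the variable names are illustrative.

```python
MiB = 1 << 20

# Window calculated by resource fitting for 0000:03:00.0 (inclusive range).
required = 0x10003FFFFF - 0x800000000 + 1
# Amount reported by the "shrunken by" debug message.
shrunk_by = 0x00000007E4400000
# Prefetchable window of the Root Port 0000:00:07.0 (inclusive range).
root_port_window = 0x601BFFFFFF - 0x6000000000 + 1

remaining = required - shrunk_by
print(
    f"{required // MiB} MiB required - {shrunk_by // MiB} MiB shrunk "
    f"= {remaining // MiB} MiB left"
)
print("matches Root Port window:", remaining == root_port_window)
```

In other words, the window was cut down to exactly what the Root Port already had available, which is consistent with the adjust_bridge_window() suspicion raised earlier.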
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory
  2025-09-01 16:18 ` Steve Oswald
  2025-09-01 16:28 ` Ilpo Järvinen
@ 2025-09-03 13:09 ` Ilpo Järvinen
  2025-10-08 10:43 ` Ilpo Järvinen
  1 sibling, 1 reply; 11+ messages in thread
From: Ilpo Järvinen @ 2025-09-03 13:09 UTC (permalink / raw)
  To: Steve Oswald; +Cc: linux-pci

[-- Attachment #1: Type: text/plain, Size: 9993 bytes --]

On Mon, 1 Sep 2025, Steve Oswald wrote:

> I've added the dmesg output fordyndbg="file drivers/pci/*. I wasn't
> sure if I added it with the escaped quotes correctly.
> https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/9cf5fc3a8c4f13588a33d61865f804f85e50470a/dmesg_linux_6.11.0_dyndbg.log

Hi again,

I think the patch below should solve your issue. If you manage to get it
tested and are confident enough that it fixed your issue, you may provide
your Tested-by tag so I can include it in the official submission of the
fix.

From 1c8ef31c6ac6616869b447a473e879b233df62db Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= <ilpo.jarvinen@linux.intel.com>
Date: Wed, 3 Sep 2025 15:53:48 +0300
Subject: [PATCH 1/1] PCI: Prevent shrinking bridge window from its required size
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

pci_bridge_distribute_available_resources() -> ... ->
adjust_bridge_window() is called in between __pci_bus_size_bridges() and
assigning the resources. Since the commit 948675736a77 ("PCI: Allow
adjust_bridge_window() to shrink resource if necessary")
adjust_bridge_window() can also shrink the bridge window to force it to a
smaller size. The shrunken size, however, conflicts with what
__pci_bus_size_bridges() -> pbus_size_mem() calculated as the required
bridge window size. By shrinking the resource size, adjust_bridge_window()
prevents the rest of the resource fitting algorithm from working as
intended.
Resource fitting logic is expecting assignment failures when bridge windows need resizing, but there are cases where failures are no longer happening after the commit 948675736a77 ("PCI: Allow adjust_bridge_window() to shrink resource if necessary"). The commit 948675736a77 ("PCI: Allow adjust_bridge_window() to shrink resource if necessary") justifies the change by the extra reservation made due to hpmemsize parameter, however, the kernel code contradicts with that statement. (For simplicity, finer-grained hpmmiosize and hpmmiopref parameters that can be used to the same effect as hpmemsize are ignored in this description.) pbus_size_mem() calls calculate_memsize() twice. First with add_size=0 to find out the minimal required resource size. The second call occurs with add_size=hpmemsize (effectively) but the result does not directly affect the resource size only resulting in an entry on the realloc_head list (a.k.a. add_list). Yet, adjust_bridge_window() directly changes the resource size which does not include what is reserved due to hpmemsize. Also, if the required size for the bridge window exceeds hpmemsize, the parameter does not have any effect even on the second size calculcation made by pbus_size_mem(); from calculate_memsize(): size = max(size, add_size) + children_add_size; The commit ae4611f1d7e9 ("PCI: Set resource size directly in adjust_bridge_window()") that precedes the commit 948675736a77 ("PCI: Allow adjust_bridge_window() to shrink resource if necessary") is also related to causing this problem. Its changelog explicitly states adjust_bridge_window() wants to "guarantee" allocation success. Guaranteed allocations, however, are incompatible with how the other parts of the resource fitting algorithm work. The given justification fails to explain why guaranteed allocations at this stage are required nor why forcing window to a smaller value than what was calculated by pbus_size_mem() is correct. 
While the change might have worked by chance in some test scenario, a
too small bridge window does not "guarantee" success from the point of
view of the endpoint device resource assignments. No issue is mentioned
within the changelog, so it's unclear whether the change was made to
fix some observed issue and what that issue was.

The unwanted shrinking of a bridge window occurs, e.g., when a device
with large BARs such as an eGPU is attached using Thunderbolt and the
Root Port holds less than enough resource space for the eGPU. The GPU
resources are in the order of GBs and the default hotplug allocation is
a mere 2M (DEFAULT_HOTPLUG_MMIO_PREF_SIZE). The problem is illustrated
by this log (filtered to the relevant content only):

  pci 0000:00:07.0: PCI bridge to [bus 03-2c]
  pci 0000:00:07.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
  pci 0000:03:00.0: PCI bridge to [bus 00]
  pci 0000:03:00.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
  pci 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
  pci 0000:03:00.0: PCI bridge to [bus 04-2c]
  pcieport 0000:00:07.0: Assigned bridge window [mem 0x6000000000-0x601bffffff 64bit pref] to [bus 03-2c] cannot fit 0xc00000000 required for 0000:03:00.0 bridging to [bus 04-2c]
  pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] to [bus 04-2c] add_size 100000 add_align 100000
  pcieport 0000:00:07.0: distributing available resources
  pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] shrunken by 0x00000007e4400000
  pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: assigned

The initial size of the Root Port's window is 448MB (0x601bffffff -
0x6000000000). __pci_bus_size_bridges() -> pbus_size_mem() calculates
the required size to be 32772 MB (0x10003fffff - 0x800000000) which
would fit the eGPU resources. adjust_bridge_window() then shrinks the
bridge window down to what is guaranteed to fit into the Root Port's
bridge window.
The bridge window for 03:00.0 is also eliminated from the add_list
(a.k.a. realloc_head) by adjust_bridge_window().

After adjustment, the resources are assigned and, as the bridge window
for 03:00.0 is assigned successfully, no failure is recorded. Without a
failure, no attempt to resize the window of the Root Port is required.
The end result is the eGPU not having large enough resources to work.

The commit 948675736a77 ("PCI: Allow adjust_bridge_window() to shrink
resource if necessary") also claims nested bridge windows are sized the
same, which is false. pbus_size_mem() calculates the size for the
parent bridge window by summing all the downstream resources so the
resource fitting calculates a larger bridge window for the parent to
accommodate the children. That is, hpmemsize does not result in the
same size for the case where there are nested bridge windows.

In order to fix the most immediate problem, don't shrink the resource
size as hpmemsize had nothing to do with it. When considering add_size,
only reduce it up to what is added due to hpmemsize (if the required
size is larger than hpmemsize, the parameter has no impact, see
calculate_memsize()).

This is not exactly a revert of the commits ae4611f1d7e9 ("PCI: Set
resource size directly in adjust_bridge_window()") and 948675736a77
("PCI: Allow adjust_bridge_window() to shrink resource if necessary")
as shrinking still remains in place but is implemented differently,
and the end result behaves very differently.

It is possible that those two commits fixed some other issue that is
not described with enough detail in the changelog and undoing parts of
them results in another regression due to behavioral change.
Nonetheless, as described above, the solution by those two commits was
flawed and the issue, if one exists, should be solved in a way that is
compatible with the rest of the resource fitting algorithm instead of
working against it.

Besides shrinking, the case where adjust_bridge_window() expands the
bridge window is likely somewhat wrong as well because it removes the
entry from add_list (a.k.a. realloc_head), but it is less damaging as
that only impacts optional resources and may have no impact if
expanding by hpmemsize is larger than what add_size was. Fixing it is
left as further work.

Reported-by: Steve Oswald <stevepeter.oswald@gmail.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
 drivers/pci/setup-bus.c | 42 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 23082bc0ca37..4dd618bc4196 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1828,6 +1828,7 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
 				 resource_size_t new_size)
 {
 	resource_size_t add_size, size = resource_size(res);
+	struct pci_dev_resource *dev_res;
 
 	if (res->parent)
 		return;
@@ -1840,9 +1841,46 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
 		pci_dbg(bridge, "bridge window %pR extended by %pa\n", res,
 			&add_size);
 	} else if (new_size < size) {
+		int idx = pci_resource_num(bridge, res);
+
+		/*
+		 * hpio/mmio/mmioprefsize hasn't been included at all? See the
+		 * add_size param at the callsites of calculate_memsize().
+		 */
+		if (!add_list)
+			return;
+
+		/* Only shrink if the hotplug extra relates to window size. */
+		switch (idx) {
+		case PCI_BRIDGE_IO_WINDOW:
+			if (size > pci_hotplug_io_size)
+				return;
+			break;
+		case PCI_BRIDGE_MEM_WINDOW:
+			if (size > pci_hotplug_mmio_size)
+				return;
+			break;
+		case PCI_BRIDGE_PREF_MEM_WINDOW:
+			if (size > pci_hotplug_mmio_pref_size)
+				return;
+			break;
+		default:
+			break;
+		}
+
+		dev_res = res_to_dev_res(add_list, res);
 		add_size = size - new_size;
-		pci_dbg(bridge, "bridge window %pR shrunken by %pa\n", res,
-			&add_size);
+		if (add_size < dev_res->add_size) {
+			dev_res->add_size -= add_size;
+			pci_dbg(bridge, "bridge window %pR optional size shrunken by %pa\n",
+				res, &add_size);
+		} else {
+			pci_dbg(bridge, "bridge window %pR optional size removed\n",
+				res);
+			remove_from_list(add_list, res);
+		}
+		return;
+
 	} else {
 		return;
 	}

base-commit: f6d41443f54856ceece0d5b584f47f681513bde4
-- 
2.39.5

^ permalink raw reply related [flat|nested] 11+ messages in thread
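
The window sizes quoted in the patch description above can be
double-checked with a small C sketch. This is only an illustration
under stated assumptions: window_size(), bytes_to_mib() and
sketch_memsize() are hypothetical helper names that do not exist in the
kernel, and sketch_memsize() models nothing but the single max() line
quoted from calculate_memsize(), leaving out whatever else the real
function does.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): size in bytes of an
 * inclusive [start, end] address range, i.e. end - start + 1.
 */
static uint64_t window_size(uint64_t start, uint64_t end)
{
	return end - start + 1;
}

/* Hypothetical helper: convert a byte count to whole MiB. */
static uint64_t bytes_to_mib(uint64_t bytes)
{
	return bytes >> 20;
}

/*
 * Simplified model of the one line quoted from calculate_memsize():
 *	size = max(size, add_size) + children_add_size;
 * Everything else the real function does is deliberately left out.
 */
static uint64_t sketch_memsize(uint64_t size, uint64_t add_size,
			       uint64_t children_add_size)
{
	uint64_t base = size > add_size ? size : add_size;

	return base + children_add_size;
}
```

With the ranges from the log, window_size(0x6000000000, 0x601bffffff)
is 0x1c000000 (448 MiB) and window_size(0x800000000, 0x10003fffff) is
0x800400000 (32772 MiB); their difference is 0x7e4400000, which matches
the "shrunken by" value. Likewise, sketch_memsize(0x800400000, 0x200000,
0) returns 0x800400000, so a 2M hotplug reservation changes nothing
once the required size exceeds it.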
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory
  2025-09-03 13:09 ` Ilpo Järvinen
@ 2025-10-08 10:43 ` Ilpo Järvinen
  2025-10-11 14:12 ` Steve Oswald
  0 siblings, 1 reply; 11+ messages in thread
From: Ilpo Järvinen @ 2025-10-08 10:43 UTC (permalink / raw)
To: Steve Oswald; +Cc: linux-pci

[-- Attachment #1: Type: text/plain, Size: 10507 bytes --]

On Wed, 3 Sep 2025, Ilpo Järvinen wrote:

> On Mon, 1 Sep 2025, Steve Oswald wrote:
>
> > I've added the dmesg output for dyndbg="file drivers/pci/*. I wasn't
> > sure if I added it with the escaped quotes correctly.
> > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/9cf5fc3a8c4f13588a33d61865f804f85e50470a/dmesg_linux_6.11.0_dyndbg.log
>
> Hi again,
>
> I think the patch below should solve your issue. If you manage to get it
> tested and are confident enough it fixed your issue, you may provide your
> Tested-by tag so I can include it into the official submission of the fix.

Hi Steve,

Did you manage to test this patch?

-- 
 i.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory
  2025-10-08 10:43 ` Ilpo Järvinen
@ 2025-10-11 14:12 ` Steve Oswald
  2025-11-07 16:22 ` Ilpo Järvinen
  0 siblings, 1 reply; 11+ messages in thread
From: Steve Oswald @ 2025-10-11 14:12 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: linux-pci

Hi Ilpo,
sorry for the late reply. It took me quite a while.
The patch you sent was probably not based off of
f6d41443f54856ceece0d5b584f47f681513bde4, because I used the
linux-stable git and there was no pci_resource_num.
I used git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
@ f6d41443f548 and added the missing function from the mainline repo.

Hotplugging works now,
it still crashes the gpu driver on unplugging
at runtime though, might still be related or just a driver bug.
Here is the dmesg output:
https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/048752be295efe536c4f453938aac6e47e2429a3/patched%2520kernel

Thanks for the fix!

Best
Steve

On Wed, 8 Oct 2025 at 13:43, Ilpo Järvinen
<ilpo.jarvinen@linux.intel.com> wrote:
>
> On Wed, 3 Sep 2025, Ilpo Järvinen wrote:
>
> > On Mon, 1 Sep 2025, Steve Oswald wrote:
> >
> > > I've added the dmesg output for dyndbg="file drivers/pci/*. I wasn't
> > > sure if I added it with the escaped quotes correctly.
> > > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/9cf5fc3a8c4f13588a33d61865f804f85e50470a/dmesg_linux_6.11.0_dyndbg.log
> >
> > Hi again,
> >
> > I think the patch below should solve your issue. If you manage to get it
> > tested and are confident enough it fixed your issue, you may provide your
> > Tested-by tag so I can include it into the official submission of the fix.
>
> Hi Steve,
>
> Did you manage to test this patch?
>
> --
> i.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory
  2025-10-11 14:12 ` Steve Oswald
@ 2025-11-07 16:22 ` Ilpo Järvinen
  0 siblings, 0 replies; 11+ messages in thread
From: Ilpo Järvinen @ 2025-11-07 16:22 UTC (permalink / raw)
To: Steve Oswald; +Cc: linux-pci

[-- Attachment #1: Type: text/plain, Size: 13442 bytes --]

On Sat, 11 Oct 2025, Steve Oswald wrote:

> Hi Ilpo,
> sorry for the late reply. It took me quite a while.
> The patch you sent was probably not based off of
> f6d41443f54856ceece0d5b584f47f681513bde4, because I used the
> linux-stable git and there was no pci_resource_num.
> I used git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
> @ f6d41443f548 and added the missing function from the mainline repo.
>
> Hotplugging works now,

Thanks for testing.

> it still crashes the gpu driver on unplugging
> at runtime though, might still be related or just a driver bug.
> Here is the dmesg output:
> https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/048752be295efe536c4f453938aac6e47e2429a3/patched%2520kernel

Those warnings look like an amdgpu driver state machine problem, which
has nothing to do with PCI core code.

> Thanks for the fix!

I finally remembered to submit the fix officially. If you feel like it,
you can give your Tested-by tag to be credited for your testing efforts
(in the thread where the official patch submission is, so the
maintainer's tools will pick it up automatically into the commit
message).

-- 
 i.

> On Wed, 8 Oct 2025 at 13:43, Ilpo Järvinen
> <ilpo.jarvinen@linux.intel.com> wrote:
> >
> > On Wed, 3 Sep 2025, Ilpo Järvinen wrote:
> >
> > > On Mon, 1 Sep 2025, Steve Oswald wrote:
> > >
> > > > I've added the dmesg output for dyndbg="file drivers/pci/*. I wasn't
> > > > sure if I added it with the escaped quotes correctly.
> > > > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/9cf5fc3a8c4f13588a33d61865f804f85e50470a/dmesg_linux_6.11.0_dyndbg.log > > > > > > Hi again, > > > > > > I think the patch below should solve your issue. If you manage to get it > > > tested and are confident enough it fixed your issue, you may provide your > > > Tested-by tag so I can include it into the official submission of the fix. > > > > Hi Steve, > > > > Did you manage to test this patch? > > > > -- > > i. > > > > > > > > > > > From 1c8ef31c6ac6616869b447a473e879b233df62db Mon Sep 17 00:00:00 2001 > > > From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= <ilpo.jarvinen@linux.intel.com> > > > Date: Wed, 3 Sep 2025 15:53:48 +0300 > > > Subject: [PATCH 1/1] PCI: Prevent shrinking bridge window from its required > > > size > > > MIME-Version: 1.0 > > > Content-Type: text/plain; charset=UTF-8 > > > Content-Transfer-Encoding: 8bit > > > > > > pci_bridge_distribute_available_resources() -> ... -> > > > adjust_bridge_window() is called in between __pci_bus_size_bridges() > > > and assigning the resources. Since the commit 948675736a77 ("PCI: Allow > > > adjust_bridge_window() to shrink resource if necessary") > > > adjust_bridge_window() can also shrink the bridge window to force it to > > > a smaller size. The shrunken size, however, conflicts what > > > __pci_bus_size_bridges() -> pbus_size_mem() calculated as the required > > > bridge window size. By shrinking the resource size, > > > adjust_bridge_window() prevents rest of the resource fitting algorithm > > > working as intended. Resource fitting logic is expecting assignment > > > failures when bridge windows need resizing, but there are cases where > > > failures are no longer happening after the commit 948675736a77 ("PCI: > > > Allow adjust_bridge_window() to shrink resource if necessary"). 
> > > > > > The commit 948675736a77 ("PCI: Allow adjust_bridge_window() to shrink > > > resource if necessary") justifies the change by the extra reservation > > > made due to hpmemsize parameter, however, the kernel code contradicts > > > that statement. (For simplicity, finer-grained hpmmiosize and > > > hpmmiopref parameters that can be used to the same effect as hpmemsize > > > are ignored in this description.) > > > > > > pbus_size_mem() calls calculate_memsize() twice. First with add_size=0 > > > to find out the minimal required resource size. The second call occurs > > > with add_size=hpmemsize (effectively) but the result does not directly > > > affect the resource size, only resulting in an entry on the realloc_head > > > list (a.k.a. add_list). Yet, adjust_bridge_window() directly changes > > > the resource size which does not include what is reserved due to > > > hpmemsize. Also, if the required size for the bridge window exceeds > > > hpmemsize, the parameter does not have any effect even on the second > > > size calculation made by pbus_size_mem(); from calculate_memsize(): > > > > > > size = max(size, add_size) + children_add_size; > > > > > > The commit ae4611f1d7e9 ("PCI: Set resource size directly in > > > adjust_bridge_window()") that precedes the commit 948675736a77 ("PCI: > > > Allow adjust_bridge_window() to shrink resource if necessary") is also > > > related to causing this problem. Its changelog explicitly states > > > adjust_bridge_window() wants to "guarantee" allocation success. > > > Guaranteed allocations, however, are incompatible with how the other > > > parts of the resource fitting algorithm work. The given justification > > > fails to explain why guaranteed allocations at this stage are required > > > nor why forcing window to a smaller value than what was calculated by > > > pbus_size_mem() is correct. 
While the change might have worked by chance > > > in some test scenario, a too small bridge window does not "guarantee" > > > success from the point of view of the endpoint device resource > > > assignments. No issue is mentioned within the changelog so it's unclear > > > if the change was made to fix some observed issue and what that > > > issue was. > > > > > > The unwanted shrinking of a bridge window occurs, e.g., when a device > > > with large BARs such as eGPU is attached using Thunderbolt and the Root > > > Port holds less than enough resource space for the eGPU. The GPU > > > resources are in the order of GBs and the default hotplug allocation is > > > a mere 2M (DEFAULT_HOTPLUG_MMIO_PREF_SIZE). The problem is illustrated by > > > this log (filtered to the relevant content only): > > > > > > pci 0000:00:07.0: PCI bridge to [bus 03-2c] > > > pci 0000:00:07.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref] > > > pci 0000:03:00.0: PCI bridge to [bus 00] > > > pci 0000:03:00.0: bridge window [mem 0x00000000-0x000fffff 64bit pref] > > > pci 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring > > > pci 0000:03:00.0: PCI bridge to [bus 04-2c] > > > pcieport 0000:00:07.0: Assigned bridge window [mem 0x6000000000-0x601bffffff 64bit pref] to [bus 03-2c] cannot fit 0xc00000000 required for 0000:03:00.0 bridging to [bus 04-2c] > > > pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] to [bus 04-2c] add_size 100000 add_align 100000 > > > pcieport 0000:00:07.0: distributing available resources > > > pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] shrunken by 0x00000007e4400000 > > > pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: assigned > > > > > > The initial size of the Root Port's window is 448MB (0x601bffffff - > > > 0x6000000000). 
__pci_bus_size_bridges() -> pbus_size_mem() calculates > > > the required size to be 32772 MB (0x10003fffff - 0x800000000) which > > > would fit the eGPU resources. adjust_bridge_window() then shrinks the > > > bridge window down to what is guaranteed to fit into the Root Port's > > > bridge window. The bridge window for 03:00.0 is also eliminated from > > > the add_list (a.k.a. realloc_head) list by adjust_bridge_window(). > > > > > > After adjustment, the resources are assigned and as the bridge window > > > for 03:00.0 is assigned successfully, no failure is recorded. Without a > > > failure, no attempt to resize the window of the Root Port is required. > > > The end result is eGPU not having large enough resources to work. > > > > > > The commit 948675736a77 ("PCI: Allow adjust_bridge_window() to shrink > > > resource if necessary") also claims nested bridge windows are sized the > > > same, which is false. pbus_size_mem() calculates the size for the > > > parent bridge window by summing all the downstream resources so the > > > resource fitting calculates larger bridge window for the parent to > > > accommodate the children. That is, hpmemsize does not result in the same > > > size for the case where there are nested bridge windows. > > > > > > In order to fix the most immediate problem, don't shrink the resource > > > size as hpmemsize had nothing to do with it. When considering add_size, > > > only reduce it up to what is added due to hpmemsize (if required size > > > is larger than hpmemsize, the parameter has no impact, see > > > calculate_memsize()). > > > > > > This is not exactly a revert of the commits ae4611f1d7e9 ("PCI: Set > > > resource size directly in adjust_bridge_window()") and 948675736a77 > > > ("PCI: Allow adjust_bridge_window() to shrink resource if necessary") > > > as shrinking still remains in place but is implemented differently, > > > and the end result behaves very differently. 
> > > > > > It is possible that those two commits fixed some other issue that is > > > not described with enough detail in the changelog and undoing parts of > > > them results in another regression due to behavioral change. > > > Nonetheless, as described above, the solution by those two commits was > > > flawed and the issue, if one exists, should be solved in a way that is > > > compatible with the rest of the resource fitting algorithm instead of > > > working against it. > > > > > > Besides shrinking, the case where adjust_bridge_window() expands the > > > bridge window is likely somewhat wrong as well because it removes the > > > entry from add_list (a.k.a. realloc_head), but it is less damaging as > > > that only impacts optional resources and may have no impact if > > > expanding by hpmemsize is larger than what add_size was. Fixing it is > > > left as further work. > > > > > > Reported-by: Steve Oswald <stevepeter.oswald@gmail.com> > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> > > > --- > > > drivers/pci/setup-bus.c | 42 +++++++++++++++++++++++++++++++++++++++-- > > > 1 file changed, 40 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c > > > index 23082bc0ca37..4dd618bc4196 100644 > > > --- a/drivers/pci/setup-bus.c > > > +++ b/drivers/pci/setup-bus.c > > > @@ -1828,6 +1828,7 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res, > > > resource_size_t new_size) > > > { > > > resource_size_t add_size, size = resource_size(res); > > > + struct pci_dev_resource *dev_res; > > > > > > if (res->parent) > > > return; > > > @@ -1840,9 +1841,46 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res, > > > pci_dbg(bridge, "bridge window %pR extended by %pa\n", res, > > > &add_size); > > > } else if (new_size < size) { > > > + int idx = pci_resource_num(bridge, res); > > > + > > > + /* > > > + * hpio/mmio/mmioprefsize hasn't been included 
at all? See the > > > + * add_size param at the callsites of calculate_memsize(). > > > + */ > > > + if (!add_list) > > > + return; > > > + > > > + /* Only shrink if the hotplug extra relates to window size. */ > > > + switch (idx) { > > > + case PCI_BRIDGE_IO_WINDOW: > > > + if (size > pci_hotplug_io_size) > > > + return; > > > + break; > > > + case PCI_BRIDGE_MEM_WINDOW: > > > + if (size > pci_hotplug_mmio_size) > > > + return; > > > + break; > > > + case PCI_BRIDGE_PREF_MEM_WINDOW: > > > + if (size > pci_hotplug_mmio_pref_size) > > > + return; > > > + break; > > > + default: > > > + break; > > > + } > > > + > > > + dev_res = res_to_dev_res(add_list, res); > > > add_size = size - new_size; > > > - pci_dbg(bridge, "bridge window %pR shrunken by %pa\n", res, > > > - &add_size); > > > + if (add_size < dev_res->add_size) { > > > + dev_res->add_size -= add_size; > > > + pci_dbg(bridge, "bridge window %pR optional size shrunken by %pa\n", > > > + res, &add_size); > > > + } else { > > > + pci_dbg(bridge, "bridge window %pR optional size removed\n", > > > + res); > > > + remove_from_list(add_list, res); > > > + } > > > + return; > > > + > > > } else { > > > return; > > > } > > > > > > base-commit: f6d41443f54856ceece0d5b584f47f681513bde4 > > > > ^ permalink raw reply [flat|nested] 11+ messages in thread
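The window sizes quoted in the changelog above can be checked with a little arithmetic. The following is only a minimal userspace sketch, not kernel code: `window_size()`, `to_mib()` and `sized_with_hotplug()` are hypothetical helper names introduced here for illustration, and `sized_with_hotplug()` models nothing beyond the single line quoted from calculate_memsize() (alignment and the other inputs of the real function are left out on purpose). The addresses are taken from the dmesg excerpt in the patch description.

```c
#include <stdint.h>

/*
 * Size in bytes of an inclusive [start, end] range, which is how
 * dmesg prints a bridge window. Hypothetical helper, not a kernel API.
 */
uint64_t window_size(uint64_t start, uint64_t end)
{
	return end - start + 1;
}

/* Convert a byte count to whole MiB (hypothetical helper). */
uint64_t to_mib(uint64_t bytes)
{
	return bytes >> 20;
}

/*
 * Simplified model of the one line quoted from calculate_memsize():
 *     size = max(size, add_size) + children_add_size;
 * Assumption: everything else the real function does is ignored here.
 */
uint64_t sized_with_hotplug(uint64_t size, uint64_t add_size,
			    uint64_t children_add_size)
{
	uint64_t base = size > add_size ? size : add_size;

	return base + children_add_size;
}
```

With the ranges from the log, `window_size(0x6000000000, 0x601bffffff)` comes to 448 MiB and `window_size(0x800000000, 0x10003fffff)` to 32772 MiB; their difference is 0x7e4400000, matching the "shrunken by" value in the log. Passing the required size together with the 2M default as `add_size` to `sized_with_hotplug()` returns the required size unchanged, which is the "parameter has no effect" case the changelog describes.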
* Re: [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory 2025-09-01 15:50 ` Ilpo Järvinen 2025-09-01 16:06 ` Ilpo Järvinen @ 2025-09-01 16:10 ` Steve Oswald 1 sibling, 0 replies; 11+ messages in thread From: Steve Oswald @ 2025-09-01 16:10 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: linux-pci Hi Ilpo, thank you very much for taking a look at this. I've added the dmesg output with pcie=realloc and with hpmmioprefsize kernel parameters, both run into the same problem. - gist with separate log files: https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af - raw dmesg and lspci -vvv from pcie=realloc https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/647a94a1d5d40166a065343aab1be869039fe694/dmesg_linux_6.11.0_pcie=realloc.log (please note that the # cat /proc/iomem output is at the very end, around line 3259.) - raw dmesg from hpmmioprefsize https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/647a94a1d5d40166a065343aab1be869039fe694/dmesg_linux_6.11.0_hpmm.log I'll try to be as helpful as possible, however I haven't had success compiling the kernel from source in the past. I'll be sure to give it a try if you provide a patch file. If something is missing, please let me know. Cheers, Steve On Mon, 1 Sep 2025 at 18:50, Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote: > > On Mon, 1 Sep 2025, Ilpo Järvinen wrote: > > On Sun, 31 Aug 2025, Steve Oswald wrote: > > > > > I’ve encountered an issue with Thunderbolt eGPU (externally connected > > > gpu via thunderbolt 4). The change from kernel 6.10.14 to 6.11.0 broke > > > the pci memory assignment of the external pcie device. I figured out > > > which version broke it by using ubuntu 25.04 and downgrading the > > > kernel (https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh). > > > > > > From the dmesg output, on the broken 6.11.0 I see 'failed to assign'. 
> > > The issue occurs (almost never) on previous kernel version 6.10.14. > > > Using pci=realloc did not change the behavior (I can produce the dmesg > > > output if necessary). > > > > > > The issue was tested with 2 egpus (Radeon Instinct MI50 32GB, NVIDIA > > > 3080 10GB). Both the amd and the nvidia driver fail to initialize the > > > device because they cannot write the pcie messages. > > > > > > System details: > > > - Kernel: Linux 6.10.14-061014-generic (Ubuntu build) > 6.11.0-061100 > > > - Laptop: TUXEDO InfinityBook Pro 16 - Gen8 with Thunderbolt 4 > > > - eGPU: Radeon Instinct MI50 32GB, NVIDIA 3080 10GB > > > > > > Steps to reproduce: > > > 1. Boot the system with the eGPU. > > > 2. Observe PCI BAR message in `dmesg`. > > > > > > Logs: > > > both kernel messages, lspci can be found here: > > > https://gist.github.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af > > > raw files: > > > - dmesg_linux_6.11.0.log > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.11.0.log > > > - dmesg_linux_6.10.14.log > > > https://gist.githubusercontent.com/stepeos/cd060c7d66ab195f51ab4d5675b4e4af/raw/f9470a06ff929d386c50ec6b5d07e0ff3f053dcf/dmesg_linux_6.10.14.log > > > > > > If additional info is needed, I'm happy to help. > > > > Hi Steve, > > > > Thanks for the report. > > > > My analysis is that the problem boils down to lack of this line with 6.11: > > > > pcieport 0000:00:07.0: resource 15 [mem 0x6000000000-0x601bffffff 64bit pref] released > > > > It means one of the upstream bridge windows could not be released for > > resize as it is printed from pci_reassign_bridge_resources() which likely > > occurs inside pci_resize_resource() call from amdgpu(?). 
> > > > The very likely cause is this check: > > > > /* Ignore BARs which are still in use */ > > if (res->child) > > continue; > > > > ...which (until very recently) is entirely silent so there's no warning > > whatsoever about what the root cause is. > > Hi again, > > Actually, scratch most of that. It's not during resize as the log should > say "releasing" (I don't know how I got this confused). "released" is from > pci_bridge_release_resources() which is called from > pci_bus_release_bridge_resources() and doesn't even try to walk upwards. > > But that raises the question: why didn't the bridge windows also fail their > assignments? > > Resource fitting calculates size for the bridge window: > > pci 0000:03:00.0: bridge window [mem 0x800000000-0x10003fffff 64bit pref] to [bus 04-2c] add_size 100000 add_align 100000 > > ...but I cannot see assignment for that even being attempted as almost > immediately, this occurs: > > pci 0000:03:00.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]: assigned > > ...which is much less than 0x10003fffff-0x800000000. I cannot think of > anything that could make it shrink like that. > > I'll have to think about this more, it might require a debug patch but I'll > think until tomorrow to see if I can understand it from the code alone. > > > What this means, is that there's some assigned resource underneath > > 0000:00:07.0 with 6.11 that wasn't there with 6.10. And it is because 6.11 > > tried harder to get your resources assigned and was successful here and > > there resulting in pinning the bridge window in its place, whereas 6.10 > > failed to assign the same resource. > > > > Could you provide /proc/iomem (it's enough to do that for 6.11 for now)? 
> > > > > > You could try to use hpmmioprefsize= on kernel's command line to reserve > > more space for the bridge windows, the default is only 2M and these GPUs > > need a magnitude more (gigabytes), you can check from 6.10 what the sizes > > of the BARs on the GPU are, and round the sum upwards to the next power of > > two multiple. > > > > I'd also be interested to see why pci=realloc failed to solve this problem > > as it should reconfigure the entire resource tree so if you could provide > > the logs with that. Please take lspci with -vvv. > > > > > > > > -- > i. ^ permalink raw reply [flat|nested] 11+ messages in thread
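The sizing advice quoted above (sum the GPU's BAR sizes, then round up to the next power of two) can be sketched as follows. This is a hedged illustration only: `sum_bar_sizes()` and `round_up_pow2()` are hypothetical helper names, and any BAR sizes fed into them are placeholders that would have to be replaced with the real values read from lspci on the working 6.10 kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a list of BAR sizes in bytes (hypothetical helper). */
uint64_t sum_bar_sizes(const uint64_t *sizes, size_t count)
{
	uint64_t total = 0;

	for (size_t i = 0; i < count; i++)
		total += sizes[i];
	return total;
}

/*
 * Round v up to the next power of two; a value that already is a
 * power of two is returned unchanged. Returns 0 if the result would
 * not fit into 64 bits.
 */
uint64_t round_up_pow2(uint64_t v)
{
	uint64_t p = 1;

	if (v > (UINT64_C(1) << 63))
		return 0;
	while (p < v)
		p <<= 1;
	return p;
}
```

For example, made-up BAR sizes of 32 GiB, 2 MiB and 512 KiB sum to 0x800280000 bytes, which rounds up to 0x1000000000 (64 GiB). Assuming the usual size-suffix syntax of the pci= options, that would correspond to something like pci=hpmmioprefsize=64G on the kernel command line.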
end of thread, other threads:[~2025-11-07 16:22 UTC | newest]
Thread overview: 11+ messages
2025-08-31 10:51 [BUG] Thunderbolt eGPU PCI BARs incorrectly assigned, fails to assign memory Steve Oswald
2025-09-01 13:25 ` Ilpo Järvinen
2025-09-01 15:50 ` Ilpo Järvinen
2025-09-01 16:06 ` Ilpo Järvinen
2025-09-01 16:18 ` Steve Oswald
2025-09-01 16:28 ` Ilpo Järvinen
2025-09-03 13:09 ` Ilpo Järvinen
2025-10-08 10:43 ` Ilpo Järvinen
2025-10-11 14:12 ` Steve Oswald
2025-11-07 16:22 ` Ilpo Järvinen
2025-09-01 16:10 ` Steve Oswald