linux-pci.vger.kernel.org archive mirror
From: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
To: "Alex Bennée" <alex.bennee@linaro.org>
Cc: linux-pci@vger.kernel.org, "Ard Biesheuvel" <ardb@kernel.org>,
	"Lorenzo Pieralisi" <lorenzo.pieralisi@linaro.org>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org,
	"Bjorn Helgaas" <bhelgaas@google.com>,
	"D Scott Phillips" <scott@os.amperecomputing.com>
Subject: Re: 2499f53 (PCI: Rework optional resource handling) regression with AMDGPU on Arm AVA platform
Date: Thu, 23 Oct 2025 20:24:02 +0300 (EEST)
Message-ID: <6a8559ea-1528-f09c-21c1-822e8d7569d2@linux.intel.com>
In-Reply-To: <874irqop6b.fsf@draig.linaro.org>

On Wed, 22 Oct 2025, Alex Bennée wrote:

> I've been tracking a regression on my Arm64 (Altra) AVA platform between
> 6.14 and 6.15. It looks like the rework commit broke the ability of the
> amdgpu driver to resize it's bar, resulting in an SError and failure to
> boot:
> 
>   [   15.348097] amdgpu 000d:03:00.0: amdgpu: detected ip block number 8 <vcn_v4_0>
>   [   15.355901] amdgpu 000d:03:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0>
>   [   15.363202] amdgpu 000d:03:00.0: amdgpu: detected ip block number 10 <mes_v11_0>
>   [   15.384163] amdgpu 000d:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
>   [   15.390434] amdgpu: ATOM BIOS: 113-4481LHS-UC1
>   [   15.400079] amdgpu 000d:03:00.0: amdgpu: CP RS64 enable
>   [   15.411830] amdgpu 000d:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
>   [   15.419932] amdgpu 000d:03:00.0: amdgpu: PCIE atomic ops is not supported
>   [   15.426719] [drm] GPU posting now...
>   [   15.430329] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
>   [   15.438871] amdgpu 000d:03:00.0: BAR 2 [mem 0x340010000000-0x3400101fffff 64bit pref]: releasing
>   [   15.447648] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x34000fffffff 64bit pref]: releasing
>   [   15.456452] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.466095] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.475738] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: releasing
>   [   15.485386] pcieport 000d:00:01.0: bridge window [io  0x1000-0x0fff] to [bus 01-03] add_size 1000
>   [   15.494252] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.503809] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.512063] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.519796] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: can't assign; no space
>   [   15.528049] pcieport 000d:00:01.0: bridge window [io  size 0x1000]: failed to assign
>   [   15.535787] pcieport 000d:01:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.545349] pcieport 000d:02:00.0: bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]: assigned
>   [   15.554911] amdgpu 000d:03:00.0: BAR 0 [mem 0x340000000000-0x3401ffffffff 64bit pref]: assigned
>   [   15.563612] amdgpu 000d:03:00.0: BAR 2 [mem 0x340200000000-0x3402001fffff 64bit pref]: assigned
>   [   15.572313] pcieport 000d:00:01.0: PCI bridge to [bus 01-03]
>   [   15.577962] pcieport 000d:00:01.0:   bridge window [mem 0x50000000-0x502fffff]
>   [   15.585175] pcieport 000d:00:01.0:   bridge window [mem 0x340000000000-0x3402ffffffff 64bit pref]
>   [   15.594038] pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref]
> 
> Failure to claim space for the bridge window...

Thanks for the report.

I was just looking at a similar oddity from another reporter, and thanks
to this second case with an "impossible" claim conflict, I was finally
able to zero in on a bug in the resize code that has been there since BAR
resizing was introduced.
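
To put numbers on the quoted log, here is a minimal userspace sketch in C
(not kernel code). It computes the sizes of the ranges before and after
the resize and applies a plain interval-overlap test to the two ranges
named in the final "can't claim" line. struct range, the constant names
and both helpers are hypothetical names used only for illustration; the
addresses are copied from the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an inclusive [start, end] address range. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Values copied from the quoted log. */
static const struct range bar0_before   = { 0x340000000000ULL, 0x34000fffffffULL };
static const struct range bar0_after    = { 0x340000000000ULL, 0x3401ffffffffULL };
static const struct range window_before = { 0x340000000000ULL, 0x340017ffffffULL };
static const struct range window_after  = { 0x340000000000ULL, 0x3402ffffffffULL };
/* "PCI Bus 000d:01" range named in the "can't claim" line. */
static const struct range bus01_in_conflict = { 0x340000000000ULL, 0x340017ffffffULL };

/* Size in bytes of an inclusive range. */
static uint64_t range_size(struct range r)
{
	assert(r.start <= r.end);
	return r.end - r.start + 1;
}

/* True if two inclusive ranges share at least one address. */
static bool ranges_overlap(struct range a, struct range b)
{
	return a.start <= b.end && b.start <= a.end;
}
```

With these values BAR 0 grows from 256 MiB to 8 GiB and the bridge window
from 384 MiB to 12 GiB. The two ranges in the conflict message are
identical, so ranges_overlap(window_before, bus01_in_conflict) is
trivially true.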

It will take me a few days to come up with fixes that also address the
problems you'd likely hit next once this claim conflict bug is fixed.

> From discussions with Ard it seems if the firmware had resized the BAR first,
> and then assigned the resources, there would be no issue. However there
> is no latter firmware for the platform.

We want to make the kernel capable of considering BARs at their maximum
sizes eventually, so it wouldn't matter what the FW does. I've been working
in that direction for a while now but I keep getting distracted by
fixing all these other bugs in the existing code. :-)

> While the PCI change has provoked this regression I suspect the amdgpu code
> could handle the failure to resize the BAR better and if it can't get
> what it wants just not initialise the driver. I did hit some cases while
> bisecting where the GPU just wasn't visible.

Indeed, things could be better on multiple levels.

Also, the entire pci_resize_resource() API is flawed in that it currently
isn't able to restore all of the device's resources as they were in case of
a failure. It seems I might have to fix that now, as there seems to be no
other way to fix this claim conflict problem.

...And the fix will be a bit invasive, as I need to merge
pbus_reassign_bridge_resources() and pci_resize_resource() into a new
pci_release_and_resize_resource() API that handles rollback properly
in case of an error.
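
The rollback idea can be illustrated with a small userspace model in C.
This is only a sketch under stated assumptions, not the planned kernel
implementation: struct dev_model, assign_all() and model_resize_bar() are
hypothetical names, the bridge window is treated as fixed rather than
being released and resized, and BAR sizes are assumed to be powers of two.
The point it shows is the pattern of taking a snapshot, releasing and
reassigning, and restoring the snapshot if the reassignment fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS 6

/* Hypothetical model of one resource; size 0 means "not present". */
struct res {
	uint64_t start;
	uint64_t size;
};

/* Hypothetical model: a device's BARs under one upstream bridge window. */
struct dev_model {
	struct res bar[NUM_BARS];
	struct res window;
};

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Place every present BAR inside the window, in index order. */
static bool assign_all(struct dev_model *d)
{
	uint64_t cursor = d->window.start;
	uint64_t end = d->window.start + d->window.size;

	for (int i = 0; i < NUM_BARS; i++) {
		if (!d->bar[i].size)
			continue;
		uint64_t start = align_up(cursor, d->bar[i].size);
		if (start + d->bar[i].size > end)
			return false;	/* may leave partial state behind */
		d->bar[i].start = start;
		cursor = start + d->bar[i].size;
	}
	return true;
}

/* Resize one BAR; on failure restore every resource exactly as it was. */
static bool model_resize_bar(struct dev_model *d, unsigned int idx,
			     uint64_t new_size)
{
	assert(idx < NUM_BARS);

	struct dev_model snapshot = *d;	/* saved state for rollback */

	d->bar[idx].size = new_size;
	for (int i = 0; i < NUM_BARS; i++)
		d->bar[i].start = 0;	/* "release" all BARs */

	if (assign_all(d))
		return true;

	*d = snapshot;			/* roll back everything */
	return false;
}
```

In this model, asking for an 8 GiB BAR 0 under a 384 MiB window fails and
leaves the original 256 MiB and 2 MiB assignments untouched, while the
same request under a 12 GiB window succeeds and places BAR 2 directly
after BAR 0.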

> I'm available to test patches and generate additional debug info so do
> let me know if there is anything I can do to help.

Thanks, I'll send the fix series for testing once it is ready.

--
 i.
