Hello Ritesh/Dan,

Here is the motivation for my patch, and my thoughts on the issue.

Before my patch, there were two scenarios where coherent allocations were
being mapped from the 2GB default DMA window even though the memory was
already pre-mapped for DMA. When memory is pre-mapped, allocations should
not be directed towards the 2GB default DMA window.

1. An AMD GPU whose device DMA mask is > 32 bits but less than 64 bits.
   In this case the PHB is put into Limited Addressability mode.
   This scenario doesn't have vPMEM.

2. A device that supports a 64-bit DMA mask, on an LPAR with vPMEM assigned.

In both of the above scenarios, the IOMMU has pre-mapped RAM from the DDW
(the 64-bit PPC DMA window).

Let's walk through the code paths for both cases, before my patch:

1. AMD GPU

   dev->dma_ops_bypass = true
   dev->bus_dma_limit  = 0

   - Here the AMD controller shows 3 functions on the PHB.
   - After the first function is probed, it sees that the memory is pre-mapped
     and doesn't direct DMA allocations towards the 2GB default window.
     So dma_go_direct() worked as expected.
   - The AMD GPU driver then adds device memory to the system pages. The
     stack is as below:

     add_pages+0x118/0x130 (unreliable)
     pagemap_range+0x404/0x5e0
     memremap_pages+0x15c/0x3d0
     devm_memremap_pages+0x38/0xa0
     kgd2kfd_init_zone_device+0x110/0x210 [amdgpu]
     amdgpu_device_ip_init+0x648/0x6d8 [amdgpu]
     amdgpu_device_init+0xb10/0x10c0 [amdgpu]
     amdgpu_driver_load_kms+0x2c/0xb0 [amdgpu]
     amdgpu_pci_probe+0x2e4/0x790 [amdgpu]

   - This changed max_pfn to some high value beyond the top of RAM.
   - Subsequently, for each of the other functions on the PHB, the call to
     dma_go_direct() returns false, which then directs DMA allocations
     towards the 2GB default DMA window even though the memory is pre-mapped.
     Even with dev->dma_ops_bypass set to true,
     dma_direct_get_required_mask() now yields a large mask (due to the
     changed max_pfn) which is beyond the AMD GPU's device DMA mask.

2. A device that supports a 64-bit DMA mask.
   The LPAR has vPMEM assigned.

   dev->dma_ops_bypass = false
   dev->bus_dma_limit  = some value depending on the size of RAM
                         (e.g. 0x0800001000000000)

   - Here the call to dma_go_direct() returns false since
     dev->dma_ops_bypass = false.

I crafted the solution to cover both cases. I tested today on an LPAR with
7.0-rc4 and it works with the AMD GPU.

With my patch, allocations go direct only when dev->dma_ops_bypass = true,
which is the case for "pre-mapped" RAM.

Ritesh mentioned that this is PowerNV. I need to revisit this patch and see
why it is failing on PowerNV. From the logs, I do see an issue: the log
indicates dev->bus_dma_limit is set to 0. This is incorrect. For pre-mapped
RAM, with my patch, bus_dma_limit should always be set to some non-zero
value.

bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by *0*

Thanks,
Gaurav

On 3/15/26 4:50 AM, Dan Horák wrote:
> Hi Ritesh,
>
> On Sun, 15 Mar 2026 09:55:11 +0530
> Ritesh Harjani (IBM) wrote:
>
>> Dan Horák writes:
>>
>> +cc Gaurav,
>>
>>> Hi,
>>>
>>> starting with 7.0-rc1 (meaning 6.19 is OK) the amdgpu driver fails to
>>> initialize on my Linux/ppc64le Power9 based system (with Radeon Pro WX4100)
>>> with the following in the log
>>>
>>> ...
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
>> ^^^^
>> So looks like this is a PowerNV (Power9) machine.
> correct :-)
>
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Detected VRAM RAM=4096M, BAR=4096M
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] RAM width 128bits GDDR5
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by 0
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 4096M of VRAM memory ready
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 32570M of GTT memory ready.
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Debug VRAM access will use slowpath MM access
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] GART: num cpu pages 4096, num gpu pages 65536
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] PCIE GART of 256M enabled (table at 0x000000F4FFF80000).
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) create WB bo failed
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu_device_wb_init failed -12
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: Fatal error during GPU init
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: finishing device.
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: probe with driver amdgpu failed with error -12
>>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: ttm finalized
>>> ...
>>>
>>> After some hints from Alex and bisecting and other investigation I have
>>> found that https://github.com/torvalds/linux/commit/1471c517cf7dae1a6342fb821d8ed501af956dd0
>>> is the culprit and reverting it makes amdgpu load (and work) again.
>> Thanks for confirming this. Yes, this was recently added [1]
>>
>> [1]: https://lore.kernel.org/linuxppc-dev/20251107161105.85999-1-gbatra@linux.ibm.com/
>>
>>
>> @Gaurav,
>>
>> I am not too familiar with this area, but looking at the logs shared
>> by Dan, it looks like we might always be going down the DMA direct
>> allocation path, and maybe the device doesn't support this address limit.
>>
>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by 0
>> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
> a complete kernel log is at
> https://gitlab.freedesktop.org/-/project/4522/uploads/c4935bca6f37bbd06bb4045c07d00b5b/kernel.log
>
> Please let me know if you need more info.
>
>
> Dan
>
>
>> Looking at the code..
>>
>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>> index fe7472f13b10..d5743b3c3ab3 100644
>> --- a/kernel/dma/mapping.c
>> +++ b/kernel/dma/mapping.c
>> @@ -654,7 +654,7 @@ void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
>>  	/* let the implementation decide on the zone to allocate from: */
>>  	flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);
>>
>> -	if (dma_alloc_direct(dev, ops)) {
>> +	if (dma_alloc_direct(dev, ops) || arch_dma_alloc_direct(dev)) {
>>  		cpu_addr = dma_direct_alloc(dev, size, dma_handle, flag, attrs);
>>  	} else if (use_dma_iommu(dev)) {
>>  		cpu_addr = iommu_dma_alloc(dev, size, dma_handle, flag, attrs);
>>
>> Now, do we need arch_dma_alloc_direct() here? It always returns true if
>> dev->dma_ops_bypass is set to true, without the checks that
>> dma_go_direct() performs.
>>
>> whereas...
>>
>> /*
>>  * Check if the devices uses a direct mapping for streaming DMA operations.
>>  * This allows IOMMU drivers to set a bypass mode if the DMA mask is large
>>  * enough.
>>  */
>> static inline bool
>> dma_alloc_direct(struct device *dev, const struct dma_map_ops *ops)
>> 	...dma_go_direct(dev, dev->coherent_dma_mask, ops);
>> ...
>> #ifdef CONFIG_DMA_OPS_BYPASS
>> 	if (dev->dma_ops_bypass)
>> 		return min_not_zero(mask, dev->bus_dma_limit) >=
>> 			dma_direct_get_required_mask(dev);
>> #endif
>>
>> dma_alloc_direct() already checks for dma_ops_bypass and also whether
>> dev->coherent_dma_mask >= dma_direct_get_required_mask(). So...
>>
>> ... do we really need the machinery of arch_dma_{alloc|free}_direct()?
>> Aren't dma_alloc_direct()'s checks sufficient?
>>
>> Thoughts?
>>
>> -ritesh
>>
>>
>>> for the record, I have originally opened https://gitlab.freedesktop.org/drm/amd/-/issues/5039
>>>
>>>
>>> With regards,
>>>
>>> Dan