Hello Ritesh,

I think what you are proposing, adding dev->bus_dma_limit to the check,
might work. In the case of PowerNV, bus_dma_limit is not set, but
dev->dma_ops_bypass is set. So, for PowerNV, it will fall back to how it
was before. Also, since both of these are set in LPAR mode, the current
patch as-is will work.

Dan, can you please try Ritesh's proposed fix on your PowerNV box? I am
not able to lay my hands on a PowerNV box yet.

Thanks,

Gaurav

On 3/25/26 7:12 AM, Ritesh Harjani (IBM) wrote:
> Gaurav Batra writes:
>
> Hi Gaurav,
>
>> Hello Ritesh/Dan,
>>
>> Here is the motivation for my patch and my thoughts on the issue.
>>
>> Before my patch, there were two scenarios to consider where, even when
>> the memory was pre-mapped for DMA, coherent allocations were getting
>> mapped from the 2GB default DMA window. When memory is pre-mapped, the
>> allocations should not be directed towards the 2GB default DMA window.
>>
>> 1. AMD GPU, which has a device DMA mask > 32 bits but less than 64
>>    bits. In this case the PHB is put into Limited Addressability mode.
>>
>>    This scenario doesn't have vPMEM.
>>
>> 2. Device that supports a 64-bit DMA mask. The LPAR has vPMEM assigned.
>>
>> In both of the above scenarios, the IOMMU has pre-mapped RAM from the
>> DDW (64-bit PPC DMA window).
>>
>> Let's consider the code paths for both cases, before my patch:
>>
>> 1. AMD GPU
>>
>>    dev->dma_ops_bypass = true
>>    dev->bus_dma_limit = 0
>>
>> - Here the AMD controller shows 3 functions on the PHB.
>>
>> - After the first function is probed, it sees that the memory is
>>   pre-mapped and doesn't direct DMA allocations towards the 2GB default
>>   window. So, dma_go_direct() worked as expected.
>>
>> - The AMD GPU driver adds device memory to system pages.
>>   The stack is as below:
>>
>>   add_pages+0x118/0x130 (unreliable)
>>   pagemap_range+0x404/0x5e0
>>   memremap_pages+0x15c/0x3d0
>>   devm_memremap_pages+0x38/0xa0
>>   kgd2kfd_init_zone_device+0x110/0x210 [amdgpu]
>>   amdgpu_device_ip_init+0x648/0x6d8 [amdgpu]
>>   amdgpu_device_init+0xb10/0x10c0 [amdgpu]
>>   amdgpu_driver_load_kms+0x2c/0xb0 [amdgpu]
>>   amdgpu_pci_probe+0x2e4/0x790 [amdgpu]
>>
>> - This changed max_pfn to some high value beyond max RAM.
>>
>> - Subsequently, for each of the other functions on the PHB, the call to
>>   dma_go_direct() returns false, which then directs DMA allocations
>>   towards the 2GB default DMA window even though the memory is
>>   pre-mapped.
>>
>>   dev->dma_ops_bypass is true, but dma_direct_get_required_mask()
>>   resulted in a large value for the mask (due to the changed max_pfn),
>>   which is beyond the AMD GPU device DMA mask.
>>
>> 2. Device supports a 64-bit DMA mask. The LPAR has vPMEM assigned.
>>
>>    dev->dma_ops_bypass = false
>>    dev->bus_dma_limit = some value depending on the size of RAM
>>    (e.g. 0x0800001000000000)
>>
>> - Here the call to dma_go_direct() returns false since
>>   dev->dma_ops_bypass = false.
>>
>> I crafted the solution to cover both cases. I tested today on an LPAR
>> with 7.0-rc4 and it works with AMDGPU.
>>
>> With my patch, allocations go direct only when dev->dma_ops_bypass =
>> true, which will be the case for pre-mapped RAM.
>>
>> Ritesh mentioned that this is PowerNV. I need to revisit this patch and
>> see why it is failing on PowerNV.
>> ...
>> From the logs, I do see some issue. The log indicates
>> dev->bus_dma_limit is set to 0. This is incorrect. For pre-mapped RAM,
>> with my patch, bus_dma_limit should always be set to some value.
>>
> In that case, do you think adding an extra check for dev->bus_dma_limit
> would help? I am sure you already would have thought of this and
> probably are still working to find the correct fix?
>
> +bool arch_dma_alloc_direct(struct device *dev)
> +{
> +	if (dev->dma_ops_bypass && dev->bus_dma_limit)
> +		return true;
> +
> +	return false;
> +}
> +
> +bool arch_dma_free_direct(struct device *dev, dma_addr_t dma_handle)
> +{
> +	if (!dev->dma_ops_bypass || !dev->bus_dma_limit)
> +		return false;
> +
> +	return is_direct_handle(dev, dma_handle);
> +}
>
>> introduced a serious regression into the kernel for a large number of
>> active users of the PowerNV platform, I would kindly ask that it be
>> reverted until it can be reworked not to break PowerNV support. Bear
>> in mind there are other devices that are 40-bit DMA limited, and they
>> are also likely to break on Linux 7.0.
>
> Looks like more people are facing an issue with this now.
>
> -ritesh