From: Donet Tom <donettom@linux.ibm.com>
To: "Yat Sin, David" <David.YatSin@amd.com>,
Alex Deucher <alexdeucher@gmail.com>
Cc: "Koenig, Christian" <Christian.Koenig@amd.com>,
"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Kuehling, Felix" <Felix.Kuehling@amd.com>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Russell, Kent" <Kent.Russell@amd.com>,
Vaidyanathan Srinivasan <svaidy@linux.ibm.com>,
Mukesh Kumar Chaurasiya <mkchauras@linux.ibm.com>
Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
Date: Sat, 3 Jan 2026 00:23:08 +0530
Message-ID: <c5f6fec0-3b35-472f-ad81-211fa680f132@linux.ibm.com>
In-Reply-To: <DM6PR12MB5021DE8E1ECC352D5B9D92AC95ABA@DM6PR12MB5021.namprd12.prod.outlook.com>
On 12/18/25 3:01 AM, Yat Sin, David wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> HDP flush is done in ROCm using these 3 methods:
>
> 1. For AQL packets, this is done by setting the system-scope acquire and release fences in the packet header.
> For example, it is set here:
> https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_kernel.cpp#L878
>
> And the headers are defined here:
> https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L85
>
>
> 2. Via an SDMA packet. This is done before doing a memory copy:
> The function is called here:
> https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L484
> And the packet (POLL_REGMEM) is generated here:
> https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L1154
>
>
> 3. By writing to a MMIO remapped address:
> The address is stored in rocclr here:
> https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocdevice.cpp#L607
>
> And the flush is triggered by writing a 1, e.g. here:
> https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L3831
Thank you, David.
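For my own understanding, method 3 boils down to a single 32-bit store. A minimal sketch in C, where a plain variable stands in for the real mmap()ed page (on actual hardware the pointer comes from mapping the remapped MMIO page via the KFD/DRM device; the function name here is just illustrative):

```c
#include <stdint.h>

/* Sketch of method 3: user space maps the 4K MMIO-remapped page the
 * kernel exposes, then flushes the HDP write FIFO by storing 1 into
 * the remapped flush register.  A plain uint32_t stands in for the
 * mapped register here, since we cannot touch real HW in a sketch. */
static void hdp_flush_mmio(volatile uint32_t *hdp_flush_reg)
{
    /* A single 32-bit write triggers the flush; ROCr writes 1. */
    *hdp_flush_reg = 1;
}
```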
>
>
> Regards,
> David
>
>
>> -----Original Message-----
>> From: Alex Deucher <alexdeucher@gmail.com>
>> Sent: Wednesday, December 17, 2025 9:23 AM
>> To: Donet Tom <donettom@linux.ibm.com>; Yat Sin, David
>> <David.YatSin@amd.com>
>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Ritesh Harjani (IBM)
>> <ritesh.list@gmail.com>; amd-gfx@lists.freedesktop.org; Kuehling, Felix
>> <Felix.Kuehling@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; Russell, Kent <Kent.Russell@amd.com>;
>> Vaidyanathan Srinivasan <svaidy@linux.ibm.com>; Mukesh Kumar Chaurasiya
>> <mkchauras@linux.ibm.com>
>> Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page
>> size systems
>>
>> On Wed, Dec 17, 2025 at 4:03 AM Donet Tom <donettom@linux.ibm.com> wrote:
>>>
>>> On 12/16/25 7:32 PM, Alex Deucher wrote:
>>>> On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <donettom@linux.ibm.com> wrote:
>>>>> On 12/15/25 7:39 PM, Alex Deucher wrote:
>>>>>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>>>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>>>>>>> Setup details:
>>>>>>>>>>>> ============
>>>>>>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>>>>>>> AMD GPU:
>>>>>>>>>>>> Name: gfx90a
>>>>>>>>>>>> Marketing Name: AMD Instinct MI210
>>>>>>>>>>>>
>>>>>>>>>>>> Queries:
>>>>>>>>>>>> =======
>>>>>>>>>>>> 1. We currently ran rocr-debug agent tests [1] and rccl unit tests [2] to test these changes. Is there anything else that you would suggest we run to shake out any other page-size-related issues w.r.t. the kernel driver?
>>>>>>>>>>> The ROCm team needs to answer that.
>>>>>>>>>>>
>>>>>>>>>> Is there any separate mailing list or list of people whom we
>>>>>>>>>> can cc then?
>>>>>>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>>>>>>> I can check on Monday if some people are still around who could answer a couple of questions, but in general don't expect a quick response.
>>>>>>>>>>>> 2. Patch 1/8: We have a query regarding the EOP buffer size. Is this EOP ring buffer size HW dependent? Should it be made PAGE_SIZE?
>>>>>>>>>>> Yes and no.
>>>>>>>>>>>
>>>>>>>>>> Could you elaborate more on this, please? I am assuming you would anyway respond with more context / details on Patch 1 itself. If yes, that would be great!
>>>>>>>>> Well, in general the EOP (End of Pipe) buffer is a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>>>>>>> The size depends on the HW generation and configuration of the GPU etc., but don't ask me for details of how that is calculated.
>>>>>>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
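To make the point concrete for our own notes: the EOP ring size is a property of the GPU generation, so deriving it from the CPU's PAGE_SIZE is the wrong fix. A sketch, where the per-ASIC sizes are invented for illustration (the real values are hard-coded per generation in the driver):

```c
#include <stdint.h>

enum asic_gen { GFX9, GFX10 };

/* The EOP ring size comes from the GPU generation; the numbers below
 * are placeholders for illustration only, not the real HW values. */
static uint32_t eop_ring_size(enum asic_gen gen)
{
    switch (gen) {
    case GFX9:  return 4096;  /* hypothetical */
    case GFX10: return 2048;  /* hypothetical */
    }
    return 4096;
}
/* The wrong fix would be `return PAGE_SIZE;` -- on a 64K-page host
 * that grows the ring 16x with no HW justification. */
```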
>>>>>>>>>>>> 3. Patch 5/8: We also have a query w.r.t. the error paths when the system page size is > 4K. Do we need to lift this restriction and add MMIO remap support for systems with non-4K page sizes?
>>>>>>>>>>> The problem is the HW can't do this.
>>>>>>>>>>>
>>>>>>>>>> We aren't that familiar with the HW / SW stack here. We wanted to understand what functionality will be unsupported due to this HW limitation.
>>>>>>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application, and you run into security issues when you map more than 4K at a time.
>>>>>>>> Right. There are some 4K pages within the MMIO register BAR which are empty, and registers can be remapped into them. In this case we remap the HDP flush registers into one of those register pages.
>>>>>>>> This allows applications to flush the HDP write FIFO from either the CPU or another device. This is needed to flush data written by the CPU or another device to the VRAM BAR out to VRAM (i.e., so the GPU can see it).
>>>>>>>> This is flushed internally as part of the shader dispatch packets,
>>>>>>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
>>>>>> There is an explicit PM4 packet to flush the HDP cache for userqs
>>>>>> and for AQL the flush is handled via one of the flags in the
>>>>>> dispatch packet. The MMIO remap is needed for more fine grained
>>>>>> use cases where you might have the CPU or another device operating
>>>>>> in a gang like scenario with the GPU.
>>>>> Thank you, Alex.
>>>>>
>>>>> We were encountering an issue while running the RCCL unit tests.
>>>>> With 2 GPUs, all tests passed successfully; however, when running
>>>>> with more than 2 GPUs, the tests began to fail at random points
>>>>> with the following
>>>>> errors:
>>>>>
>>>>> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
>>>>> [  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>>>>> [  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
>>>>> [  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
>>>>> [  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>>>>> [  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>>>>>
>>>>>
>>>>> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
>>>>>
>>>>> One question I have is: we only started observing this problem when
>>>>> the number of GPUs increased. Could this be related to MMIO
>>>>> remapping not being available?
>>>> It could be. E.g., if the CPU or a GPU writes data to VRAM on
>>>> another GPU, you will need to flush the HDP to make sure that data
>>>> hits VRAM before the GPU attached to the VRAM can see it.
>>>
>>> Thanks Alex
>>>
>>> I am now suspecting that the queue preemption issue may be related to
>>> the unavailability of MMIO remapping. I am not very familiar with this area.
>>>
>>> Could you please point me to the relevant code path where the PM4
>>> packet is issued to flush the HDP cache?
>> + David who is more familiar with the ROCm runtime.
>>
>> PM4 has a packet called HDP_FLUSH which flushes the HDP. For AQL, it's
>> handled by one of the flags I think. Most things in ROCm use AQL.
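For reference, the AQL flag mentioned here is the fence-scope field in the packed packet header; setting both acquire and release scopes to system scope is what gives the system-visibility (and hence HDP) flush semantics. A sketch using the HSA 1.x packed-header bit layout (positions per the HSA spec; treat the constants as illustrative):

```c
#include <stdint.h>

/* Packed AQL header per the HSA spec: packet type in bits 0-7,
 * barrier at bit 8, scacquire fence scope at bits 9-10, screlease
 * fence scope at bits 11-12.  System scope (2) on both fences is
 * what requests system-level cache coherence. */
static uint16_t aql_header_system_scope(void)
{
    const uint16_t kernel_dispatch = 2; /* HSA_PACKET_TYPE_KERNEL_DISPATCH */
    const uint16_t system_scope    = 2; /* HSA_FENCE_SCOPE_SYSTEM          */
    return (uint16_t)(kernel_dispatch
                      | (1u << 8)             /* barrier bit             */
                      | (system_scope << 9)   /* scacquire fence scope   */
                      | (system_scope << 11)); /* screlease fence scope  */
}
```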
>>
>> @David Yat Sin Can you point to how HDP flushes are handled in the ROCm
>> runtime?
>>
>> Alex
>>
>>> I am consistently able to reproduce this issue on my system when using
>>> more than three GPUs if patches 7/8 and 8/8 are not applied. In your
>>> opinion, is there anything that can be done to speed up the HDP flush
>>> or to avoid this situation altogether?
>>>
>>>
>>>
>>>> Alex
>>>>
>>>>>> Alex
>>>>>>
>>>>>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>>>>>
>>>>>>>> but there are certain cases where an application may want more
>>>>>>>> control. This is probably not a showstopper for most ROCm apps.
>>>>>>> Well, the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>>>>>>> Apart from making the flush/invalidate an IOCTL, I think we could also just use a global dummy page in VRAM.
>>>>>>> If you make two 32-bit writes which are apart from each other and then read back a 32-bit value from VRAM, that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap, but still much better than going through an IOCTL.
>>>>>>> The only tricky part is that you need to get the HW barriers around the doorbell write right.
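The dummy-page fallback described here is essentially two 32-bit writes at distinct offsets followed by one 32-bit read-back. A rough illustration, where an ordinary array stands in for the VRAM BAR mapping and the required HW barriers around the doorbell write are elided (offsets and sizes are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Fallback HDP invalidate without the MMIO remap: write two 32-bit
 * values far apart in a dummy VRAM page, then read one back.  On
 * real HW the read-back is what forces the HDP write FIFO to drain;
 * here a plain array simulates the BAR mapping. */
static uint32_t hdp_flush_dummy_page(volatile uint32_t *page, size_t words)
{
    page[0]         = 0x1;  /* first 32-bit write               */
    page[words - 1] = 0x1;  /* second write, far from the first */
    return page[0];         /* read-back drains the FIFO        */
}
```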
>>>>>>>> That said, the region is only 4K so if you allow applications to
>>>>>>>> map a larger region they would get access to GPU register pages
>>>>>>>> which they shouldn't have access to.
>>>>>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4K as well, or is it OK to over-allocate there?
>>>>>>> Thinking more about it, there is also a major problem with page tables. Those are 4K by default on modern systems as well, and while over-allocating them to 64K is possible, that not only wastes some VRAM but can also result in OOM situations, because we can't allocate the necessary page tables to switch from 2MiB to 4K pages in some cases.
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>>>>> [1] ROCr debug agent tests:
>>>>>>>>>>>> https://github.com/ROCm/rocr_debug_agent
>>>>>>>>>>>> [2] RCCL tests:
>>>>>>>>>>>> https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please note that the changes in this series are on a best-effort basis from our end. We therefore request the amd-gfx community (who have deeper knowledge of the HW & SW stack) to kindly help with the review and provide feedback / comments on these patches. The idea is to have non-4K page sizes (e.g. 64K) well supported by the amdgpu kernel driver.
>>>>>>>>>>> Well, this is generally nice to have, but there are unfortunately some HW limitations which make ROCm pretty much unusable on non-4K page size systems.
>>>>>>>>>> That's a bummer :(
>>>>>>>>>> - Do we have some HW documentation around what these limitations with non-4K page sizes are? Any links, please?
>>>>>>>>> You already mentioned MMIO remap, which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>>>>>>> - Are there any newer AMD GPU versions which maybe lift such restrictions?
>>>>>>>>> Not that I know of.
>>>>>>>>>
>>>>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>>>>> - Maybe we should also document what will and won't work due to these HW limitations.
>>>>>>>>> Well, pretty much everything. I need to double-check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>>>>> Could be that there is already a fallback path, and that's the reason why this approach actually works at all.
>>>>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>>>>> So these patches helped us resolve most of the issues, like SDMA hangs and GPU kernel page faults, which we saw with the rocr and rccl tests at 64K page size. Meaning, we didn't see this working out of the box, perhaps due to the 64K page size.
>>>>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>>>>
>>>>>>>>> To be honest, I'm not sure how ROCm even works when you have 64K pages at the moment. I would expect many more issues lurking in the kernel driver.
>>>>>>>>>> AFAIU, some of these patches may require rework based on reviews, but at least with these changes we were able to see all the tests passing.
>>>>>>>>>>> I need to talk with Alex and the ROCm team about whether workarounds can be implemented for those issues.
>>>>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>> Thanks again for the quick response on the patch series.
>>>>>>>>> You are welcome, but since it's so near the end of the year, not all people are available any more.
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> -ritesh