Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems

From: Donet Tom <donettom@linux.ibm.com>
To: Alex Deucher <alexdeucher@gmail.com>
Cc: "Christian König" <christian.koenig@amd.com>,
	"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
	amd-gfx@lists.freedesktop.org,
	"Felix Kuehling" <Felix.Kuehling@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	Kent.Russell@amd.com,
	"Vaidyanathan Srinivasan" <svaidy@linux.ibm.com>,
	"Mukesh Kumar Chaurasiya" <mkchauras@linux.ibm.com>
Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
Date: Wed, 17 Dec 2025 14:33:24 +0530
Message-ID: <1f2a0b14-9cff-40cd-bdbc-71fae06c34b1@linux.ibm.com>
In-Reply-To: <CADnq5_Owfg0fG5mUo7NDZUNeB+QNas2EL+sK=42_deVSxiGfQQ@mail.gmail.com>


On 12/16/25 7:32 PM, Alex Deucher wrote:
> On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <donettom@linux.ibm.com> wrote:
>>
>> On 12/15/25 7:39 PM, Alex Deucher wrote:
>>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>>>> Setup details:
>>>>>>>>> ============
>>>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>>>> AMD GPU:
>>>>>>>>>     Name:                    gfx90a
>>>>>>>>>     Marketing Name:          AMD Instinct MI210
>>>>>>>>>
>>>>>>>>> Queries:
>>>>>>>>> =======
>>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>>>>>      these changes. Is there anything else that you would suggest us to run to
>>>>>>>>>      shake out any other page size related issues w.r.t the kernel driver?
>>>>>>>> The ROCm team needs to answer that.
>>>>>>>>
>>>>>>> Is there any separate mailing list or list of people whom we can cc
>>>>>>> then?
>>>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>>>>
>>>>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>>>>>
>>>>>>>>> 2. Patch 1/8: We have a query regarding the EOP buffer size. Is this EOP ring buffer
>>>>>>>>>      size HW dependent? Should it be made PAGE_SIZE?
>>>>>>>> Yes and no.
>>>>>>>>
>>>>>>> If you could more elaborate on this please? I am assuming you would
>>>>>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>>>>>> that would be great!
>>>>>> Well, in general the EOP (End of Pipe) buffer contains a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>>>>
>>>>>> The size depends on the HW generation, the configuration of the GPU, etc., but don't ask me for details on how that is calculated.
>>>>>>
>>>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>>>>>
>>>>>>>>> 3. Patch 5/8: We also have a query w.r.t. the error paths when system page size > 4K.
>>>>>>>>>      Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>>>>>      non-4K page sizes?
>>>>>>>> The problem is the HW can't do this.
>>>>>>>>
>>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>>>>>> what functionality will be unsupported due to this HW limitation then?
>>>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
>>>>> Right.  There are some 4K pages with the MMIO register BAR which are
>>>>> empty and registers can be remapped into them.  In this case we remap
>>>>> the HDP flush registers into one of those register pages.  This allows
>>>>> applications to flush the HDP write FIFO from either the CPU or
>>>>> another device.  This is needed to flush data written by the CPU or
>>>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
>>>>> it).  This is flushed internally as part of the shader dispatch
>>>>> packets,
>>>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
>>> There is an explicit PM4 packet to flush the HDP cache for userqs and
>>> for AQL the flush is handled via one of the flags in the dispatch
>>> packet.  The MMIO remap is needed for more fine grained use cases
>>> where you might have the CPU or another device operating in a gang
>>> like scenario with the GPU.
>>
>> Thank you, Alex.
>>
>> We were encountering an issue while running the RCCL unit tests. With 2
>> GPUs, all tests passed successfully; however, when running with more
>> than 2 GPUs, the tests began to fail at random points with the following
>> errors:
>>
>> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
>> [  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>> [  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
>> [  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
>> [  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>> [  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>>
>>
>> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
>>
>> One question I have is: we only started observing this problem when the
>> number of GPUs increased. Could this be related to MMIO remapping not
>> being available?
> It could be.  E.g., if the CPU or a GPU writes data to VRAM on another
> GPU, you will need to flush the HDP to make sure that data hits VRAM
> before the GPU attached to the VRAM can see it.


Thanks, Alex.

I am now suspecting that the queue preemption issue may be related to 
the unavailability of MMIO remapping. I am not very familiar with this area.

Could you please point me to the relevant code path where the PM4 packet 
is issued to flush the HDP cache?

I am consistently able to reproduce this issue on my system when using 
more than three GPUs if patches 7/8 and 8/8 are not applied. In your 
opinion, is there anything that can be done to speed up the HDP flush or 
to avoid this situation altogether?
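
For reference, here is a minimal sketch of the kind of size mismatch that patch 7/8 is about, going by its title (aligning to the GPU page size instead of the CPU page size). It is plain arithmetic, not driver code. The 4K GPU page size, the helper name align_up() and the example control stack size are assumptions chosen for illustration; only the 64K CPU page size comes from the setup described in this thread.

```python
GPU_PAGE_SIZE = 4 * 1024    # assumed GPU page granularity (4K)
CPU_PAGE_SIZE = 64 * 1024   # CPU page size on the Power10 LPAR in this thread


def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (size + alignment - 1) & ~(alignment - 1)


# Hypothetical control stack size, picked only to show the difference.
ctl_stack_size = 0x5000

gpu_aligned = align_up(ctl_stack_size, GPU_PAGE_SIZE)
cpu_aligned = align_up(ctl_stack_size, CPU_PAGE_SIZE)

# On a 4K CPU page system both results are identical, so the two
# alignments are interchangeable there. With 64K CPU pages they diverge.
print(hex(gpu_aligned), hex(cpu_aligned))
```

With these example values the two alignments give different sizes, so any consumer that expects a GPU-page-aligned size would see a different value on a 64K system than on a 4K one.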



>
> Alex
>
>>
>>> Alex
>>>
>>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>>
>>>>> but there are certain cases where an application may want
>>>>> more control.  This is probably not a showstopper for most ROCm apps.
>>>> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>>>>
>>>> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>>>>
>>>> If you make two 32bit writes which are apart from each other and then read back a 32bit value from VRAM, that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going through an IOCTL.
>>>>
>>>> The only tricky part is that you need to get the HW barriers with the doorbell write right.
>>>>
>>>>> That said, the region is only 4K so if you allow applications to map a
>>>>> larger region they would get access to GPU register pages which they
>>>>> shouldn't have access to.
>>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>>>
>>>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
>>>>
>>>> Christian.
>>>>
>>>>> Alex
>>>>>
>>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>>>>> supported with amd gpu kernel driver.
>>>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which make ROCm pretty much unusable on non 4k page size systems.
>>>>>>> That's a bummer :(
>>>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>>>
>>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>>>
>>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>>>
>>>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>>>>> Not that I know of.
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>>>
>>>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>>>>> Well, pretty much everything. I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>>
>>>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>>>>> due to 64K pagesize.
>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>
>>>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect many more issues lurking in the kernel driver.
>>>>>>
>>>>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>>>>> at least with these changes, we were able to see all the tests passing.
>>>>>>>
>>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>>>>
>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>> Thanks again for the quick response on the patch series.
>>>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> -ritesh
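
A minimal sketch of the mapping-granularity concern raised above for the 4K MMIO remap region: a CPU mapping can only be made in whole CPU pages, so on a 64K system a userspace mapping of a 4K register page would also cover the neighbouring register pages. This is plain arithmetic under that assumption, not driver code; the function name pages_exposed() and its parameters are hypothetical.

```python
def pages_exposed(region_size: int, cpu_page_size: int,
                  hw_page_size: int = 4096) -> tuple[int, int]:
    """Return (bytes actually mapped, extra HW pages exposed).

    The region is rounded up to whole CPU pages, because that is the
    smallest unit a CPU mapping can cover.
    """
    # Ceiling division, then back to bytes.
    mapped = -(-region_size // cpu_page_size) * cpu_page_size
    extra = (mapped - region_size) // hw_page_size
    return mapped, extra


# 4K remap region on a 4K page system: nothing extra is exposed.
print(pages_exposed(4096, 4 * 1024))

# Same region on a 64K page system: the mapping spans 64K.
print(pages_exposed(4096, 64 * 1024))
```

On a 4K system the mapping covers exactly the remap page. On a 64K system the same request would cover fifteen additional 4K register pages, which is the access the applications should not get.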

Thread overview: 44+ messages
2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
2025-12-15 20:25   ` Philip Yang
2025-12-16 10:12     ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes Donet Tom
2025-12-15 20:44   ` Philip Yang
2025-12-16 10:09     ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map() Donet Tom
2025-12-15 21:03   ` Philip Yang
2025-12-12  6:40 ` [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
2025-12-12  8:53   ` Christian König
2025-12-12 12:14     ` Donet Tom
2026-01-06 12:55     ` Donet Tom
2026-01-08 12:31       ` Christian König
2026-01-09 10:22         ` Pierre-Eric Pelloux-Prayer
2026-01-09 12:57         ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 5/8] amdkfd/kfd_chardev: Add error message for non-4k pagesize failures Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 6/8] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size Donet Tom
2025-12-12  9:04   ` Christian König
2025-12-12 12:29     ` Donet Tom
2025-12-19 10:27     ` Donet Tom
2026-01-06 13:01       ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 8/8] amdgpu: Fix MQD and control stack alignment for non-4K CPU page size systems Donet Tom
2025-12-12  9:01 ` [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K " Christian König
2025-12-12 10:45   ` Ritesh Harjani
2025-12-12 13:01     ` Christian König
2025-12-12 17:24       ` Alex Deucher
2025-12-15  9:47         ` Christian König
2025-12-15 10:11           ` Donet Tom
2025-12-15 16:11             ` Christian König
2025-12-16 10:08               ` Donet Tom
2025-12-16 16:06                 ` Christian König
2025-12-17  9:04                   ` Donet Tom
2025-12-17  9:46               ` Donet Tom
2025-12-17 10:10                 ` Christian König
2025-12-15 14:09           ` Alex Deucher
2025-12-16 13:54             ` Donet Tom
2025-12-16 14:02               ` Alex Deucher
2025-12-17  9:03                 ` Donet Tom [this message]
2025-12-17 14:23                   ` Alex Deucher
2025-12-17 21:31                     ` Yat Sin, David
2026-01-02 18:53                       ` Donet Tom
2026-01-06 12:58                       ` Donet Tom
