From: Donet Tom <donettom@linux.ibm.com>
To: amd-gfx@lists.freedesktop.org,
Felix Kuehling <Felix.Kuehling@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
christian.koenig@amd.com
Cc: Kent.Russell@amd.com, Ritesh Harjani <ritesh.list@gmail.com>,
Vaidyanathan Srinivasan <svaidy@linux.ibm.com>,
Mukesh Kumar Chaurasiya <mkchauras@linux.ibm.com>,
donettom@linux.ibm.com
Subject: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
Date: Fri, 12 Dec 2025 12:10:07 +0530
Message-ID: <cover.1765519875.git.donettom@linux.ibm.com>
This patch series addresses a few issues that we encountered while running the
ROCr debug agent and RCCL unit tests with an AMD GPU on Power10 (ppc64le), using
a 64K system page size.
Note that we do not observe any of these issues when booting with a 4K system
page size on Power. With a 64K system page size, what we have observed so far
is that in a few places the conversion between GPU PFN and CPU PFN (or vice
versa) may not be done correctly, because the AMD GPU page size (4K) differs
from the CPU page size (64K). This causes issues such as GPU page faults or GPU
hangs while running these tests.
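To illustrate the kind of conversion involved, here is a minimal user-space
sketch, assuming a 4K GPU page and a CPU page shift passed in by the caller.
The helper names are hypothetical and are not the driver's own functions; the
sketch only shows the arithmetic that goes wrong when a CPU PFN is used where
a GPU PFN is expected.

```c
#include <assert.h>
#include <stdint.h>

/* AMD GPU page size is 4K, as stated in the cover letter. */
#define GPU_PAGE_SHIFT 12u

/* Number of 4K GPU pages covered by one CPU page (16 for a 64K CPU page). */
static inline uint64_t gpu_pages_per_cpu_page(unsigned int cpu_page_shift)
{
	assert(cpu_page_shift >= GPU_PAGE_SHIFT);
	return UINT64_C(1) << (cpu_page_shift - GPU_PAGE_SHIFT);
}

/* Hypothetical helper: first GPU PFN backed by the given CPU PFN. */
static inline uint64_t cpu_pfn_to_gpu_pfn(uint64_t cpu_pfn,
					  unsigned int cpu_page_shift)
{
	assert(cpu_page_shift >= GPU_PAGE_SHIFT);
	return cpu_pfn << (cpu_page_shift - GPU_PAGE_SHIFT);
}

/* Hypothetical helper: CPU PFN that contains the given GPU PFN. */
static inline uint64_t gpu_pfn_to_cpu_pfn(uint64_t gpu_pfn,
					  unsigned int cpu_page_shift)
{
	assert(cpu_page_shift >= GPU_PAGE_SHIFT);
	return gpu_pfn >> (cpu_page_shift - GPU_PAGE_SHIFT);
}
```

With a 4K CPU page (shift 12) both conversions are the identity, which is why
mixing up the two units goes unnoticed there. With a 64K CPU page (shift 16)
the two numbers differ by a factor of 16.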
Changes so far in this series:
=============================
1. For now, during kfd queue creation, the series lifts the restriction that the
EOP buffer size must be the same as the buffer object mapping size.
2. Fix SVM range map/unmap operations to convert CPU page numbers to GPU page
numbers before calling amdgpu_vm_update_range(), which expects 4K GPU pages.
Without this the rocr-debug-agent tests and rccl unit tests were failing.
3. Fix GART PTE allocation in migration code to account for multiple GPU pages
per CPU page. The current code only allocates PTEs based on number of CPU
pages, but GART may need one PTE per 4K GPU page.
4. Adjust AMDGPU_GTT_MAX_TRANSFER_SIZE to respect the SDMA engine's 4MB hardware
limit regardless of CPU page size. The hardcoded 512 pages worked on 4K
systems but appears to exceed the limit with a 64K system page size.
5. In the current driver, MMIO remap is supported only when the system page
size is 4K. Error messages have been added to indicate that MMIO remap
is not supported on systems with a non-4K page size.
6. Fix the amdgpu page fault handler (for xnack) to pass the corresponding
system PFN (instead of the GPU PFN) when restoring the SVM range mapping.
7. Align ctl_stack_size and wg_data_size to GPU page size.
8. On systems where the CPU page size is larger than the GPU’s 4K page size,
the MQD and control stack are aligned to the CPU PAGE_SIZE, causing
multiple GPU pages to inherit the UC attribute incorrectly. This results
in the control-stack area being mis-mapped and leads to queue preemption
and eviction failures. Aligning both regions to the GPU page size
ensures the MQD is mapped UC and the control stack NC, restoring correct
behavior.
9. Apart from these 8 changes, we also needed this change [1]. Without it, the
kernel simply crashes when running the rocminfo command itself.
[1]: https://github.com/greenforce-project/chromeos-kernel-mirror/commit/2b33fad96c3129a2a53a42b9d90fb3b906145b98
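The arithmetic behind items 3, 4, 7 and 8 can be sketched in user space as
follows. This is only an illustration under stated assumptions: the 4K GPU
page and the 4MB SDMA limit are taken from the description above, while the
macro and function names are hypothetical and do not correspond to the actual
driver code or to the fixes in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GPU_PAGE_SHIFT	12u				/* 4K GPU page */
#define GPU_PAGE_SIZE	(UINT64_C(1) << GPU_PAGE_SHIFT)
#define SDMA_MAX_BYTES	(UINT64_C(4) << 20)		/* 4MB limit from item 4 */

/* Item 3: one GART PTE per 4K GPU page, not one per CPU page. */
static inline uint64_t gart_ptes_needed(uint64_t cpu_pages,
					unsigned int cpu_page_shift)
{
	assert(cpu_page_shift >= GPU_PAGE_SHIFT);
	return cpu_pages << (cpu_page_shift - GPU_PAGE_SHIFT);
}

/* Item 4: size in bytes of a transfer expressed in CPU pages. */
static inline uint64_t transfer_bytes(uint64_t cpu_pages,
				      unsigned int cpu_page_shift)
{
	return cpu_pages << cpu_page_shift;
}

/* Item 4: does a transfer of this many CPU pages exceed the SDMA limit? */
static inline bool transfer_exceeds_sdma_limit(uint64_t cpu_pages,
					       unsigned int cpu_page_shift)
{
	return transfer_bytes(cpu_pages, cpu_page_shift) > SDMA_MAX_BYTES;
}

/* Items 7 and 8: round a size up to a power-of-two alignment. */
static inline uint64_t align_up(uint64_t size, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}
```

For example, with a 64K CPU page, 4 CPU pages need 64 GART PTEs, and 512 CPU
pages amount to 32MB, well above the 4MB limit, whereas with 4K pages the same
512 pages are only 2MB. Similarly, a 5000-byte region rounds up to 8192 bytes
when aligned to the GPU page size but to 65536 bytes when aligned to a 64K CPU
page.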
Setup details:
============
System details: Power10 LPAR using 64K pagesize.
AMD GPU:
Name: gfx90a
Marketing Name: AMD Instinct MI210
Queries:
=======
1. So far we have run the rocr-debug-agent tests [1] and the RCCL unit tests [2]
to test these changes. Is there anything else you would suggest we run to
shake out any other page-size-related issues in the kernel driver?
2. Patch 1/8: We have a query regarding the EOP buffer size. Is the EOP ring
buffer size HW dependent? Should it be made PAGE_SIZE?
3. Patch 5/8: We also have a query about the error paths when the system page
size is larger than 4K. Do we need to lift this restriction and add MMIO remap
support for systems with non-4K page sizes?
[1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
[2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
Please note that the changes in this series are on a best-effort basis from our
end. We therefore request the amd-gfx community (who have deeper knowledge of
the HW and SW stack) to kindly help with the review and provide feedback and
comments on these patches. The goal is to have non-4K page sizes (e.g. 64K)
well supported by the AMD GPU kernel driver as well.
Donet Tom (7):
  drm/amdkfd: Relax size checking during queue buffer get
  amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page
    sizes
  amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in
    svm_migrate_gart_map()
  amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page
    size
  drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead
    of CPU page size
  amdgpu: Fix MQD and control stack alignment for non-4K CPU page size
    systems

Ritesh Harjani (IBM) (1):
  amdkfd/kfd_chardev: Add error message for non-4k pagesize failures

drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 29 ++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 16 ++--------
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++--
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 10 +++++--
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 15 +++++-----
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 ++++++-----
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 30 ++++++++++++++-----
10 files changed, 86 insertions(+), 43 deletions(-)
--
2.52.0
Thread overview: 44+ messages
2025-12-12 6:40 Donet Tom [this message]
2025-12-12 6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
2025-12-15 20:25 ` Philip Yang
2025-12-16 10:12 ` Donet Tom
2025-12-12 6:40 ` [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes Donet Tom
2025-12-15 20:44 ` Philip Yang
2025-12-16 10:09 ` Donet Tom
2025-12-12 6:40 ` [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map() Donet Tom
2025-12-15 21:03 ` Philip Yang
2025-12-12 6:40 ` [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
2025-12-12 8:53 ` Christian König
2025-12-12 12:14 ` Donet Tom
2026-01-06 12:55 ` Donet Tom
2026-01-08 12:31 ` Christian König
2026-01-09 10:22 ` Pierre-Eric Pelloux-Prayer
2026-01-09 12:57 ` Donet Tom
2025-12-12 6:40 ` [RFC PATCH v1 5/8] amdkfd/kfd_chardev: Add error message for non-4k pagesize failures Donet Tom
2025-12-12 6:40 ` [RFC PATCH v1 6/8] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
2025-12-12 6:40 ` [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size Donet Tom
2025-12-12 9:04 ` Christian König
2025-12-12 12:29 ` Donet Tom
2025-12-19 10:27 ` Donet Tom
2026-01-06 13:01 ` Donet Tom
2025-12-12 6:40 ` [RFC PATCH v1 8/8] amdgpu: Fix MQD and control stack alignment for non-4K CPU page size systems Donet Tom
2025-12-12 9:01 ` [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K " Christian König
2025-12-12 10:45 ` Ritesh Harjani
2025-12-12 13:01 ` Christian König
2025-12-12 17:24 ` Alex Deucher
2025-12-15 9:47 ` Christian König
2025-12-15 10:11 ` Donet Tom
2025-12-15 16:11 ` Christian König
2025-12-16 10:08 ` Donet Tom
2025-12-16 16:06 ` Christian König
2025-12-17 9:04 ` Donet Tom
2025-12-17 9:46 ` Donet Tom
2025-12-17 10:10 ` Christian König
2025-12-15 14:09 ` Alex Deucher
2025-12-16 13:54 ` Donet Tom
2025-12-16 14:02 ` Alex Deucher
2025-12-17 9:03 ` Donet Tom
2025-12-17 14:23 ` Alex Deucher
2025-12-17 21:31 ` Yat Sin, David
2026-01-02 18:53 ` Donet Tom
2026-01-06 12:58 ` Donet Tom