From: Donet Tom <donettom@linux.ibm.com>
To: amd-gfx@lists.freedesktop.org,
Felix Kuehling <Felix.Kuehling@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
Alex Deucher <alexdeucher@gmail.com>,
christian.koenig@amd.com, Philip Yang <yangp@amd.com>
Cc: David.YatSin@amd.com, Kent.Russell@amd.com,
Ritesh Harjani <ritesh.list@gmail.com>,
Vaidyanathan Srinivasan <svaidy@linux.ibm.com>,
donettom@linux.ibm.com
Subject: [RFC PATCH v2 0/5] drm/amd: Add support for non-4K page size systems – part 2
Date: Wed, 28 Jan 2026 21:03:01 +0530
Message-ID: <cover.1769612973.git.donettom@linux.ibm.com>
This is v2, part-2 of the patch series for enabling non-4K system page
size support in AMDGPU. v2, part-1 of this series [1] has already
been picked up for the upcoming release.
This second part addresses additional issues uncovered in AMDGPU when
running RCCL unit tests and rocr-debug-agent tests on the Power platform
with a 64KB system page size.
With this series applied, all RCCL unit tests and rocr-debug-agent
tests pass on systems using a 64KB system page size, across
multi-GPU configurations, with XNACK both enabled and disabled.
Note:
We believe Patch-1 in this series, i.e.
drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE
fixes a kernel crash observed when running rocminfo on systems with a
64KB page size. If patch-1 looks good to you, can it be picked up
independently of the rest of the series?
That patch, together with the previously picked-up part-1 series,
would allow the amdgpu driver to work on non-4K page size systems with
at least a two-GPU configuration.
The questions we had for the rest of this series are:
=====================================================
When the control stack size is aligned to 64 KB, queue preemption or
eviction failures are consistently observed on gfx9, on both 4 KB and
64 KB system page-size configurations.
The control stack size is calculated based on the number of CUs and
waves and is then aligned to PAGE_SIZE. On systems with a 64 KB system
page size, this alignment always results in a 64 KB–aligned control
stack size, after which queue preemption fails.
1. Is there any hardware-imposed limitation on gfx9 that prevents the
control stack size from being 64 KB? For gfx10, explicit hardware
limitations on the control stack size are present in the code [2].
Is there anything similar for gfx9?
2. What is the correct or recommended control stack size for gfx9?
With a 4 KB system page size, the observed control stack size is
around 44 KB. Can it grow beyond this? If the control stack size is
fixed for a given gfx version, are there any concerns with aligning
the control stack size to the GPU page size?
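To make the alignment effect described above concrete, here is a minimal
sketch. It is not driver code: the helper name align_up and the constants
below are made up for illustration, and the 44 KB input is the approximate
size reported above for 4 KB page-size systems.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants only; these names do not come from the driver. */
#define GPU_PAGE_SIZE_4K  (4u * 1024u)   /* GPU page size referred to in this letter */
#define CPU_PAGE_SIZE_4K  (4u * 1024u)   /* typical 4 KB system page size */
#define CPU_PAGE_SIZE_64K (64u * 1024u)  /* system page size on the test setup */

/*
 * Round size up to the next multiple of align.
 * align must be a non-zero power of two.
 */
static inline uint32_t align_up(uint32_t size, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}
```

Under these assumptions, align_up(44 * 1024, CPU_PAGE_SIZE_4K) stays at
44 KB, while align_up(44 * 1024, CPU_PAGE_SIZE_64K) rounds up to 64 KB,
which is the size at which preemption was seen to fail. Aligning to
GPU_PAGE_SIZE_4K gives 44 KB regardless of the system page size.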
Changes so far in this series:
==============================
1. AMDGPU_VA_RESERVED_TRAP_SIZE was hard-coded to 8 KB, while
KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. This matches on
4 KB page-size systems but results in a size mismatch on 64 KB
systems, leading to kernel crashes when running rocminfo or RCCL
unit tests.
This patch updates AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE so
that the reserved trap area matches the allocation size across all
system page sizes. This patch is required to enable minimal support
for 64 KB system page sizes.
2. Fix the AMDGPU page fault handler (for XNACK) to pass the
corresponding system PFN (instead of the GPU PFN) when restoring
SVM range mappings.
3. Update AMDGPU_GTT_MAX_TRANSFER_SIZE to always match the PMD size
across all system page sizes.
4. On systems where the CPU page size is larger than the GPU’s 4 KB
page size, the MQD and control stack were aligned to the CPU
PAGE_SIZE, causing multiple GPU pages to incorrectly inherit the
uncached (UC) memory attribute.
This change aligns both regions to the GPU page size, ensuring that
the MQD is mapped as UC and the control stack as non-coherent (NC),
restoring the correct behavior.
5. Queue preemption fails when the control stack size is aligned to
64 KB. This patch fixes the issue by aligning the control stack size
to the GPU page size.
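The arithmetic behind items 1 and 2 can be sketched as follows. This is
only an illustration under stated assumptions, not the code in the
patches: all function names are hypothetical, and the PFN conversion
assumes 4 KB GPU pages and a system page size that is at least as large.

```c
#include <assert.h>
#include <stdint.h>

#define GPU_PAGE_SHIFT 12u  /* assumes 4 KB GPU pages, as stated in item 4 */

/* Item 1: the old reservation was a fixed 8 KB (hypothetical helper). */
static inline uint64_t trap_reserved_fixed(void)
{
	return 8u * 1024u;
}

/* Item 1: the TBA/TMA area is sized as 2 * PAGE_SIZE (hypothetical helper). */
static inline uint64_t tba_tma_size(uint64_t cpu_page_size)
{
	return 2 * cpu_page_size;
}

/*
 * Item 2: translate a PFN counted in GPU pages into a PFN counted in
 * system pages. Hypothetical helper; cpu_page_shift is 12 for 4 KB
 * system pages and 16 for 64 KB system pages.
 */
static inline uint64_t gpu_pfn_to_sys_pfn(uint64_t gpu_pfn,
					  unsigned int cpu_page_shift)
{
	assert(cpu_page_shift >= GPU_PAGE_SHIFT);
	return gpu_pfn >> (cpu_page_shift - GPU_PAGE_SHIFT);
}
```

With a 4 KB system page size the two sizes agree (8 KB each) and the PFN
is unchanged. With a 64 KB system page size, tba_tma_size() is 128 KB
while the fixed reservation is still 8 KB, and sixteen consecutive GPU
PFNs map to a single system PFN.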
Setup details:
==============
System details: Power10 LPAR using 64K pagesize.
AMD GPU:
Name: gfx90a
Marketing Name: AMD Instinct MI210
[1] https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
[2] https://elixir.bootlin.com/linux/v6.19-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L457
RFC V1 - https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
Donet Tom (5):
  drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  drm/amd: Fix MQD and control stack alignment for non-4K
  drm/amdkfd: Fix queue preemption/eviction failures by aligning control
    stack size to GPU page size
drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 44 +++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h | 2 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 24 ++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
drivers/gpu/drm/amd/amdgpu/vce_v1_0.c | 3 +-
.../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 23 ++++++----
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 +--
9 files changed, 80 insertions(+), 33 deletions(-)
--
2.52.0