AMD-GFX Archive on lore.kernel.org
* [RFC PATCH v2 0/5] drm/amd: Add support for non-4K page size systems – part 2
@ 2026-01-28 15:33 Donet Tom
From: Donet Tom @ 2026-01-28 15:33 UTC
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom

This is v2, part-2 of the patch series for enabling non-4K system page
size support in AMDGPU. v2, part-1 of this series [1] has already
been picked up for the upcoming release.

This second part addresses additional issues uncovered in AMDGPU when
running RCCL unit tests and rocr-debug-agent tests on the Power platform
with a 64 KB system page size.

With this series applied, all RCCL unit tests and rocr-debug-agent
tests pass on systems using a 64KB system page size, across
multi-GPU configurations, with XNACK both enabled and disabled.

Note:
We believe patch 1 in this series, i.e.
     drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE
fixes a kernel crash observed when running rocminfo on systems with a
64 KB page size. If patch 1 looks good, could it be picked up
independently of the rest of the series?

Together with the previously picked-up part-1 series, this patch would
allow the amdgpu driver to work on non-4K page size systems with at
least a two-GPU configuration.

The questions we had for the rest of this series are:
=====================================================
When the control stack size is aligned to 64 KB, queue preemption or
eviction failures are consistently observed on gfx9, on both 4 KB and
64 KB system page-size configurations.

The control stack size is calculated based on the number of CUs and
waves and is then aligned to PAGE_SIZE. On systems with a 64 KB system
page size, this alignment always results in a 64 KB–aligned control
stack size, after which queue preemption fails.
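
The rounding described above can be sketched in a few lines of C. This
is only an illustration: align_up(), KB() and ctl_stack_aligned() are
hypothetical names, not driver symbols, and the 44 KB input is simply
the size observed with a 4 KB system page size, used here as an example.

```c
#include <assert.h>
#include <stdint.h>

#define KB(n) ((uint64_t)(n) * 1024)

/* The rounding below only works for power-of-two alignments. */
static_assert((KB(64) & (KB(64) - 1)) == 0, "64 KB must be a power of two");

/*
 * Hypothetical helper modelled on the usual round-up-to-alignment
 * idiom; `a` must be a power of two.
 */
static inline uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Control stack size after rounding up to a page size. Passing the
 * system page size models the current behaviour; passing the 4 KB GPU
 * page size models the alignment proposed in this series.
 */
static inline uint64_t ctl_stack_aligned(uint64_t raw_size, uint64_t page_size)
{
	return align_up(raw_size, page_size);
}
```

With these definitions, ctl_stack_aligned(KB(44), KB(4)) stays at 44 KB,
while ctl_stack_aligned(KB(44), KB(64)) grows to 64 KB, which is the
case where preemption fails.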

1. Is there any hardware-imposed limitation on gfx9 that prevents the
   control stack size from being 64 KB? For gfx10, explicit hardware
   limitations on the control stack size are present in the code [2].
   Is there anything similar for gfx9?

2. What is the correct or recommended control stack size for gfx9?
   With a 4 KB system page size, the observed control stack size is
   around 44 KB. Can it grow beyond this? If the control stack size is
   fixed for a given gfx version, are there any concerns with aligning
   the control stack size to the GPU page size?

Changes so far in this series:
==============================
1. AMDGPU_VA_RESERVED_TRAP_SIZE was hard-coded to 8 KB, while
   KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. This matches on
   4 KB page-size systems but results in a size mismatch on 64 KB
   systems, leading to kernel crashes when running rocminfo or RCCL
   unit tests.

   This patch updates AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE so
   that the reserved trap area matches the allocation size across all
   system page sizes. This patch is required to enable minimal support
   for 64 KB system page sizes.

2. Fix the AMDGPU page fault handler (for XNACK) to pass the
   corresponding system PFN (instead of the GPU PFN) when restoring
   SVM range mappings.

3. Update AMDGPU_GTT_MAX_TRANSFER_SIZE to always match the PMD size
   across all system page sizes.

4. On systems where the CPU page size is larger than the GPU’s 4 KB
   page size, the MQD and control stack were aligned to the CPU
   PAGE_SIZE, causing multiple GPU pages to incorrectly inherit the
   uncached (UC) memory attribute.

   This change aligns both regions to the GPU page size, ensuring that
   the MQD is mapped as UC and the control stack as non-coherent (NC),
   restoring the correct behavior.

5. Queue preemption fails when the control stack size is aligned to
   64 KB. This patch fixes the issue by aligning the control stack size
   to the GPU page size.
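
The arithmetic behind items 1 and 2 can be sketched as below. This is a
simplified model, not the driver code: GPU_PAGE_SHIFT and the three
helper functions are hypothetical names, and the PFN conversion assumes
the system page size is never smaller than the 4 KB GPU page size.

```c
#include <assert.h>
#include <stdint.h>

/* GPU page size is 4 KB, as stated in the cover letter. */
#define GPU_PAGE_SHIFT 12

static_assert((1u << GPU_PAGE_SHIFT) == 4096, "GPU page must be 4 KB");

/* Item 1, before the patch: reservation hard-coded to 8 KB. */
static inline uint64_t trap_size_fixed(void)
{
	return 2ull << GPU_PAGE_SHIFT;
}

/* Item 1, allocation size and patched reservation: 2 * PAGE_SIZE. */
static inline uint64_t trap_size_scaled(unsigned int sys_page_shift)
{
	return 2ull << sys_page_shift;
}

/*
 * Item 2: convert a PFN counted in 4 KB GPU pages into a PFN counted
 * in system pages. Assumes sys_page_shift >= GPU_PAGE_SHIFT.
 */
static inline uint64_t gpu_pfn_to_sys_pfn(uint64_t gpu_pfn,
					  unsigned int sys_page_shift)
{
	return gpu_pfn >> (sys_page_shift - GPU_PAGE_SHIFT);
}
```

With a 4 KB system page (shift 12) both trap sizes are 8 KB and the PFN
is unchanged. With a 64 KB system page (shift 16) the scaled size is
128 KB against a fixed 8 KB, and GPU PFN 0x123 maps to system PFN 0x12.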

Setup details:
==============
System details: Power10 LPAR using a 64 KB page size.
AMD GPU:
Name:                    gfx90a
Marketing Name:          AMD Instinct MI210

[1] https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
[2] https://elixir.bootlin.com/linux/v6.19-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L457

RFC V1 - https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/

Donet Tom (5):
  drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  drm/amd: Fix MQD and control stack alignment for non-4K
  drm/amdkfd: Fix queue preemption/eviction failures by aligning control
    stack size to GPU page size

 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 24 ++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  6 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |  2 +-
 drivers/gpu/drm/amd/amdgpu/vce_v1_0.c         |  3 +-
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c        |  7 +--
 9 files changed, 80 insertions(+), 33 deletions(-)

-- 
2.52.0

