[RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems

From: Donet Tom <donettom@linux.ibm.com>
To: amd-gfx@lists.freedesktop.org,
	Felix Kuehling <Felix.Kuehling@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Alex Deucher <alexdeucher@gmail.com>,
	christian.koenig@amd.com, Philip Yang <yangp@amd.com>
Cc: David.YatSin@amd.com, Kent.Russell@amd.com,
	Ritesh Harjani <ritesh.list@gmail.com>,
	Vaidyanathan Srinivasan <svaidy@linux.ibm.com>,
	donettom@linux.ibm.com
Subject: [RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems
Date: Sat, 21 Feb 2026 12:39:18 +0530
Message-ID: <cover.1771656655.git.donettom@linux.ibm.com>

This is v3 of the patch series enabling 64 KB system page size support
in AMDGPU. v2, part 1 of this series [1] has already been merged
upstream and provides the minimal infrastructure required for 64 KB
page support.

This series addresses additional issues uncovered in AMDGPU when
running RCCL unit tests and rocr-debug-agent tests on 64 KB page-size
systems.

With this series applied, all RCCL unit tests and rocr-debug-agent
tests pass on systems using a 64 KB system page size, across
multi-GPU configurations, with XNACK both enabled and disabled.

Patch 1 in this series (drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE
to 2 * PAGE_SIZE) fixes a kernel crash observed when running rocminfo
on systems with a 64 KB page size. This patch is required to enable
minimal support for 64 KB system page sizes.

Since RFC v2, we observed AQL queue creation failures while running
certain workloads on 64K page-size systems due to an expected queue size
mismatch. This issue is addressed in patch 2 of this series.

The questions we have in this series are:
=========================================
1. When the control stack size is aligned to 64 KB, we consistently
  observe queue preemption or eviction failures on gfx9, on both
  4 KB and 64 KB system page-size configurations.

  The control stack size is calculated based on the number of CUs and
  waves and is then aligned to PAGE_SIZE. On systems with a 64 KB
  system page size, this alignment always results in a 64 KB-aligned
  control stack size, after which queue preemption fails.

  Is there any hardware-imposed limitation on gfx9 that prevents the
  control stack size from being 64 KB? For gfx10, I see explicit
  hardware limitations on the control stack size in the code [2].
  Is there anything similar for gfx9?

  What is the correct or recommended control stack size for gfx9?
  With a 4 KB system page size, I observe a control stack size of
  around 44 KB—can it grow beyond this? If the control stack size
  is fixed for a given gfx version, do you see any issues with
  aligning the control stack size to the GPU page size?

This series has 6 patches
=========================
1. AMDGPU_VA_RESERVED_TRAP_SIZE was hard-coded to 8 KB while
   KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE, which matches on
   4 KB page-size systems but results in a size mismatch on 64 KB
   systems, leading to kernel crashes when running rocminfo or RCCL
   unit tests.
   This patch updates AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE so
   that the reserved trap area matches the allocation size across all
   system page sizes. This patch is required to enable minimal support
   for 64 KB system page sizes.

2. Align expected_queue_size to PAGE_SIZE to fix AQL queue creation
   failures.

3. Fix the amdgpu page fault handler (for XNACK) to pass the
   corresponding system PFN (instead of the GPU PFN) when restoring
   the SVM range mapping.

4. Update AMDGPU_GTT_MAX_TRANSFER_SIZE to always match the PMD size
   across all page sizes.

5. On systems where the CPU page size is larger than the GPU’s 4 KB page
   size, the MQD and control stack were aligned to the CPU PAGE_SIZE,
   causing multiple GPU pages to incorrectly inherit the UC attribute.
   This change aligns both regions to the GPU page size, ensuring that
   the MQD is mapped as UC and the control stack as NC, restoring the
   correct behavior.

6. Queue preemption fails when the control stack size is aligned to
   64 KB. This patch fixes the issue by aligning the control stack
   size to the GPU page size.

Setup details:
==============
System details: Power10 LPAR using 64 KB page size.
AMD GPU:
Name:                    gfx90a
Marketing Name:          AMD Instinct MI210

[1] https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
[2] https://elixir.bootlin.com/linux/v6.19-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L457

RFC V2 - https://lore.kernel.org/all/cover.1769612973.git.donettom@linux.ibm.com/
RFC V1 - https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/

Donet Tom (6):
  drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  drm/amdkfd: Align expected_queue_size to PAGE_SIZE
  drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  drm/amd: Fix MQD and control stack alignment for non-4K
  drm/amdkfd: Fix queue preemption/eviction failures by aligning control
    stack size to GPU page size

 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 24 ++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  6 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |  2 +-
 drivers/gpu/drm/amd/amdgpu/vce_v1_0.c         |  3 +-
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c        | 11 ++---
 9 files changed, 82 insertions(+), 35 deletions(-)

-- 
2.52.0
