[RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems

public inbox for amd-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed

* [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems
@ 2026-03-23  4:28 Donet Tom
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages Donet Tom
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Donet Tom @ 2026-03-23  4:28 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom

This is v3 of the patch series enabling 64 KB system page size support
in AMDGPU. v2, part 1 of this series [1] has already been merged
upstream and provides the minimal infrastructure required for 64 KB
page support.

This series addresses additional issues uncovered in AMDGPU when
running rccl unit tests and rocr-debug-agent tessts on 64KB page-size
systems.

With this series applied, all RCCL unit tests and rocr-debug-agent
tests pass on systems using a 64 KB system page size, across
multi-GPU configurations, with XNACK both enabled and disabled.

Patch 1 in this series (drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE
to 2 * PAGE_SIZE) fixes a kernel crash observed when running rocminfo
on systems with a 64 KB page size. This patch is required to enable
minimal support for 64 KB system page sizes.

Since RFC v2, we observed AQL queue creation failures while running
certain workloads on 64K page-size systems due to an expected queue size
mismatch. This issue is addressed in patch 2 of this series.

The questions we had in this seres are:
=======================================
1 When the control stack size is aligned to 64 KB, we consistently
  observe queue preemption or eviction failures on gfx9, on both
  4 KB and 64 KB system page-size configurations.

  The control stack size is calculated based on the number of CUs and
  waves and is then aligned to PAGE_SIZE. On systems with a 64 KB
  system page size, this alignment always results in a 64 KB-aligned
  control stack size, after which queue preemption fails.

  Is there any hardware-imposed limitation on gfx9 that prevents the
  control stack size from being 64 KB? For gfx10, I see explicit
  hardware limitations on the control stack size in the code [2].
  Is there anything similar for gfx9?

  What is the correct or recommended control stack size for gfx9?
  With a 4 KB system page size, I observe a control stack size of
  around 44 KB—can it grow beyond this? If the control stack size
  is fixed for a given gfx version, do you see any issues with
  aligning the control stack size to the GPU page size?

This series has 5 patches
=========================
1. AMDGPU_VA_RESERVED_TRAP_SIZE was hard-coded to 8 KB while
   KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE, which matches on
   4 KB page-size systems but results in a size mismatch on 64 KB
   systems, leading to kernel crashes when running rocminfo or RCCL
   unit tests.
   This patch updates AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE so
   that the reserved trap area matches the allocation size across all
   system page sizes. This is a must needed patch to enable minimal
   support for 64 KB system page sizes.

2. Aligned expected_queue_size to PAGE_SIZE to fix AQL queue creation
   failure.

3. Fix amdgpu page fault handler (for xnack) to pass the corresponding
   system pfn (instead of gpu pfn) for restoring SVM range mapping.

4. Updated AMDGPU_GTT_MAX_TRANSFER_SIZE to always match the PMD size
   across all page sizes.

5. On systems where the CPU page size is larger than the GPU’s 4 KB page
   size, the MQD and control stack were aligned to the CPU PAGE_SIZE,
   causing multiple GPU pages to incorrectly inherit the UC attribute.
   This change aligns both regions to the GPU page size, ensuring that
   the MQD is mapped as UC and the control stack as NC, restoring the
   correct behavior.

6. Queue preemption fails when the control stack size is aligned to
   64 KB. This patch fixes this issue by aligning the control stack
   size to gpu page size.

Setup details:
============
System details: Power10 LPAR using 64K pagesize.
AMD GPU:
Name:                    gfx90a
Marketing Name:          AMD Instinct MI210

[1] https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
[2] https://elixir.bootlin.com/linux/v6.19-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L457

RFC V3 - https://lore.kernel.org/all/cover.1771656655.git.donettom@linux.ibm.com/
RFC V2 - https://lore.kernel.org/all/cover.1769612973.git.donettom@linux.ibm.com/
RFC V1 - https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/

Donet Tom (6):
  drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  drm/amdkfd: Align expected_queue_size to PAGE_SIZE
  drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  drm/amd: Fix MQD and control stack alignment for non-4K
  drm/amdkfd: Fix queue preemption/eviction failures by aligning control
    stack size to GPU page size

 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 24 ++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  6 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |  2 +-
 drivers/gpu/drm/amd/amdgpu/vce_v1_0.c         |  3 +-
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c        | 11 ++---
 9 files changed, 82 insertions(+), 35 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
@ 2026-03-23  4:28 ` Donet Tom
  2026-03-23 10:11   ` Christian König
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE Donet Tom
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-23  4:28 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom, stable

Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
4K pages, both values match (8KB), so allocation and reserved space
are consistent.

However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
while the reserved trap area remains 8KB. This mismatch causes the
kernel to crash when running rocminfo or rccl unit tests.

Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
BUG: Kernel NULL pointer dereference on read at 0x00000002
Faulting instruction address: 0xc0000000002c8a64
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
Tainted: [E]=UNSIGNED_MODULE
Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
XER: 00000036
CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
IRQMASK: 1
GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
c00000013d814540
GPR04: 0000000000000002 c00000013d814550 0000000000000045
0000000000000000
GPR08: c00000013444d000 c00000013d814538 c00000013d814538
0000000084002268
GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
0000000000020000
GPR16: 0000000000000000 0000000000000002 c00000015f653000
0000000000000000
GPR20: c000000138662400 c00000013d814540 0000000000000000
c00000013d814500
GPR24: 0000000000000000 0000000000000002 c0000001e0957888
c0000001e0957878
GPR28: c00000013d814548 0000000000000000 c00000013d814540
c0000001e0957888
NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
Call Trace:
0xc0000001e0957890 (unreliable)
__mutex_lock.constprop.0+0x58/0xd00
amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
kfd_ioctl+0x514/0x670 [amdgpu]
sys_ioctl+0x134/0x180
system_call_exception+0x114/0x300
system_call_vectored_common+0x15c/0x2ec

This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
ensuring that the reserved trap area matches the allocation size
across all page sizes.

cc: stable@vger.kernel.org
Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 139642eacdd0..a5eae49f9471 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
 #define AMDGPU_VA_RESERVED_SEQ64_SIZE		(2ULL << 20)
 #define AMDGPU_VA_RESERVED_SEQ64_START(adev)	(AMDGPU_VA_RESERVED_CSA_START(adev) \
 						 - AMDGPU_VA_RESERVED_SEQ64_SIZE)
-#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << 12)
+#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << PAGE_SHIFT)
 #define AMDGPU_VA_RESERVED_TRAP_START(adev)	(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
 						 - AMDGPU_VA_RESERVED_TRAP_SIZE)
 #define AMDGPU_VA_RESERVED_BOTTOM		(1ULL << 16)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages Donet Tom
@ 2026-03-23 10:11   ` Christian König
  2026-03-23 11:50     ` Donet Tom
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2026-03-23 10:11 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable

On 3/23/26 05:28, Donet Tom wrote:
> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
> 4K pages, both values match (8KB), so allocation and reserved space
> are consistent.
> 
> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
> while the reserved trap area remains 8KB. This mismatch causes the
> kernel to crash when running rocminfo or rccl unit tests.
> 
> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
> BUG: Kernel NULL pointer dereference on read at 0x00000002
> Faulting instruction address: 0xc0000000002c8a64
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
> XER: 00000036
> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
> IRQMASK: 1
> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
> c00000013d814540
> GPR04: 0000000000000002 c00000013d814550 0000000000000045
> 0000000000000000
> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
> 0000000084002268
> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
> 0000000000020000
> GPR16: 0000000000000000 0000000000000002 c00000015f653000
> 0000000000000000
> GPR20: c000000138662400 c00000013d814540 0000000000000000
> c00000013d814500
> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
> c0000001e0957878
> GPR28: c00000013d814548 0000000000000000 c00000013d814540
> c0000001e0957888
> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
> Call Trace:
> 0xc0000001e0957890 (unreliable)
> __mutex_lock.constprop.0+0x58/0xd00
> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
> kfd_ioctl+0x514/0x670 [amdgpu]
> sys_ioctl+0x134/0x180
> system_call_exception+0x114/0x300
> system_call_vectored_common+0x15c/0x2ec
> 
> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
> ensuring that the reserved trap area matches the allocation size
> across all page sizes.
> 
> cc: stable@vger.kernel.org
> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 139642eacdd0..a5eae49f9471 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>  #define AMDGPU_VA_RESERVED_SEQ64_SIZE		(2ULL << 20)
>  #define AMDGPU_VA_RESERVED_SEQ64_START(adev)	(AMDGPU_VA_RESERVED_CSA_START(adev) \
>  						 - AMDGPU_VA_RESERVED_SEQ64_SIZE)
> -#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << 12)
> +#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << PAGE_SHIFT)

Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.

That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have.

Where is KFD_CWSR_TBA_TMA_SIZE defined?

Regards,
Christian.

>  #define AMDGPU_VA_RESERVED_TRAP_START(adev)	(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>  						 - AMDGPU_VA_RESERVED_TRAP_SIZE)
>  #define AMDGPU_VA_RESERVED_BOTTOM		(1ULL << 16)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-23 10:11   ` Christian König
@ 2026-03-23 11:50     ` Donet Tom
  2026-03-23 13:12       ` Christian König
  0 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-23 11:50 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher,
	Alex Deucher, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable

[-- Attachment #1: Type: text/plain, Size: 4562 bytes --]


On 3/23/26 3:41 PM, Christian König wrote:

Hi Christian

> On 3/23/26 05:28, Donet Tom wrote:
>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>> 4K pages, both values match (8KB), so allocation and reserved space
>> are consistent.
>>
>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
>> while the reserved trap area remains 8KB. This mismatch causes the
>> kernel to crash when running rocminfo or rccl unit tests.
>>
>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>> Faulting instruction address: 0xc0000000002c8a64
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>> Tainted: [E]=UNSIGNED_MODULE
>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>> XER: 00000036
>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>> IRQMASK: 1
>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>> c00000013d814540
>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>> 0000000000000000
>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>> 0000000084002268
>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>> 0000000000020000
>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>> 0000000000000000
>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>> c00000013d814500
>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>> c0000001e0957878
>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>> c0000001e0957888
>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>> Call Trace:
>> 0xc0000001e0957890 (unreliable)
>> __mutex_lock.constprop.0+0x58/0xd00
>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>> kfd_ioctl+0x514/0x670 [amdgpu]
>> sys_ioctl+0x134/0x180
>> system_call_exception+0x114/0x300
>> system_call_vectored_common+0x15c/0x2ec
>>
>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>> ensuring that the reserved trap area matches the allocation size
>> across all page sizes.
>>
>> cc:stable@vger.kernel.org
>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
>> Reviewed-by: Ritesh Harjani (IBM)<ritesh.list@gmail.com>
>> Signed-off-by: Donet Tom<donettom@linux.ibm.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index 139642eacdd0..a5eae49f9471 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>   #define AMDGPU_VA_RESERVED_SEQ64_SIZE		(2ULL << 20)
>>   #define AMDGPU_VA_RESERVED_SEQ64_START(adev)	(AMDGPU_VA_RESERVED_CSA_START(adev) \
>>   						 - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << 12)
>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << PAGE_SHIFT)
> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>
> That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have.
>
> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>

Thanks Christian for reviewing this patch.

It is defined in kfd_priv.h.

/* * Size of the per-process TBA+TMA buffer: 2 pages * * The first chunk 
is the TBA used for the CWSR ISA code. The second * chunk is used as TMA 
for user-mode trap handler setup in daisy-chain mode. */ #define 
KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)


Could you please suggest the correct way to fix this issue? -Donet

> Regards,
> Christian.
>
>>   #define AMDGPU_VA_RESERVED_TRAP_START(adev)	(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>   						 - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>   #define AMDGPU_VA_RESERVED_BOTTOM		(1ULL << 16)

[-- Attachment #2: Type: text/html, Size: 6414 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-23 11:50     ` Donet Tom
@ 2026-03-23 13:12       ` Christian König
  2026-03-24 18:19         ` Donet Tom
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2026-03-23 13:12 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable

On 3/23/26 12:50, Donet Tom wrote:
> 
> On 3/23/26 3:41 PM, Christian König wrote:
> 
> Hi Christian
> 
>> On 3/23/26 05:28, Donet Tom wrote:
>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>>> 4K pages, both values match (8KB), so allocation and reserved space
>>> are consistent.
>>>
>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
>>> while the reserved trap area remains 8KB. This mismatch causes the
>>> kernel to crash when running rocminfo or rccl unit tests.
>>>
>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>>> Faulting instruction address: 0xc0000000002c8a64
>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>>> Tainted: [E]=UNSIGNED_MODULE
>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>>> XER: 00000036
>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>>> IRQMASK: 1
>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>>> c00000013d814540
>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>>> 0000000000000000
>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>>> 0000000084002268
>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>>> 0000000000020000
>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>>> 0000000000000000
>>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>>> c00000013d814500
>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>>> c0000001e0957878
>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>>> c0000001e0957888
>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>>> Call Trace:
>>> 0xc0000001e0957890 (unreliable)
>>> __mutex_lock.constprop.0+0x58/0xd00
>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>>> kfd_ioctl+0x514/0x670 [amdgpu]
>>> sys_ioctl+0x134/0x180
>>> system_call_exception+0x114/0x300
>>> system_call_vectored_common+0x15c/0x2ec
>>>
>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>>> ensuring that the reserved trap area matches the allocation size
>>> across all page sizes.
>>>
>>> cc: stable@vger.kernel.org
>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
>>> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> index 139642eacdd0..a5eae49f9471 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>  #define AMDGPU_VA_RESERVED_SEQ64_SIZE		(2ULL << 20)
>>>  #define AMDGPU_VA_RESERVED_SEQ64_START(adev)	(AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>  						 - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << 12)
>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << PAGE_SHIFT)
>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>>
>> That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have.
>>
>> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>>
> 
> Thanks Christian for reviewing this patch.
> 
> It is defined in kfd_priv.h.
> 
> /*
>  * Size of the per-process TBA+TMA buffer: 2 pages
>  *
>  * The first chunk is the TBA used for the CWSR ISA code. The second
>  * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>  */
> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
> 
> 
> 
> Could you please suggest the correct way to fix this issue?

I'm only looking from the POV of the VM code on this, but my educated guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the CPU page size.

Background is that this is written by the shader trap handler and that byte code doesn't care what CPU architecture you have.

But I think only the engineers working on that trap handler can really answer this. @Felix / @Philip?

Regards,
Christian.

> 
> -Donet
> 
>> Regards,
>> Christian.
>>
>>>  #define AMDGPU_VA_RESERVED_TRAP_START(adev)	(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>  						 - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>  #define AMDGPU_VA_RESERVED_BOTTOM		(1ULL << 16)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-23 13:12       ` Christian König
@ 2026-03-24 18:19         ` Donet Tom
  2026-03-25  2:26           ` Kuehling, Felix
  0 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-24 18:19 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher,
	Alex Deucher, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable, Donet Tom


On 3/23/26 6:42 PM, Christian König wrote:
> On 3/23/26 12:50, Donet Tom wrote:
>> On 3/23/26 3:41 PM, Christian König wrote:
>>
>> Hi Christian
>>
>>> On 3/23/26 05:28, Donet Tom wrote:
>>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>>>> 4K pages, both values match (8KB), so allocation and reserved space
>>>> are consistent.
>>>>
>>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
>>>> while the reserved trap area remains 8KB. This mismatch causes the
>>>> kernel to crash when running rocminfo or rccl unit tests.
>>>>
>>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>>>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>>>> Faulting instruction address: 0xc0000000002c8a64
>>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>>>> Tainted: [E]=UNSIGNED_MODULE
>>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>>>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>>>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>>>> XER: 00000036
>>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>>>> IRQMASK: 1
>>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>>>> c00000013d814540
>>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>>>> 0000000000000000
>>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>>>> 0000000084002268
>>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>>>> 0000000000020000
>>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>>>> 0000000000000000
>>>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>>>> c00000013d814500
>>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>>>> c0000001e0957878
>>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>>>> c0000001e0957888
>>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>>>> Call Trace:
>>>> 0xc0000001e0957890 (unreliable)
>>>> __mutex_lock.constprop.0+0x58/0xd00
>>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>>>> kfd_ioctl+0x514/0x670 [amdgpu]
>>>> sys_ioctl+0x134/0x180
>>>> system_call_exception+0x114/0x300
>>>> system_call_vectored_common+0x15c/0x2ec
>>>>
>>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>>>> ensuring that the reserved trap area matches the allocation size
>>>> across all page sizes.
>>>>
>>>> cc: stable@vger.kernel.org
>>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
>>>> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> index 139642eacdd0..a5eae49f9471 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>>   #define AMDGPU_VA_RESERVED_SEQ64_SIZE		(2ULL << 20)
>>>>   #define AMDGPU_VA_RESERVED_SEQ64_START(adev)	(AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>>   						 - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << 12)
>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << PAGE_SHIFT)
>>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>>>
>>> That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have.
>>>
>>> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>>>
>> Thanks Christian for reviewing this patch.
>>
>> It is defined in kfd_priv.h.
>>
>> /*
>>   * Size of the per-process TBA+TMA buffer: 2 pages
>>   *
>>   * The first chunk is the TBA used for the CWSR ISA code. The second
>>   * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>>   */
>> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>>
>>
>>
>> Could you please suggest the correct way to fix this issue?
> I'm only looking from the POV of the VM code on this, but my educated guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the CPU page size.
>
> Background is that this is written by the shader trap handler and that byte code doesn't care what CPU architecture you have.
>
> But I think only the engineers working on that trap handler can really answer this. @Felix / @Philip?


Hi @christian @Felix @Philip

To remove the dependency on CPU page size, can we use

+#define AMDGPU_VA_RESERVED_TRAP_SIZE    (2ULL << 16)

During reservation, we reserve 128 bytes, but during
allocation, we use 2 * PAGE_SIZE.


-Donet

>
> Regards,
> Christian.
>
>> -Donet
>>
>>> Regards,
>>> Christian.
>>>
>>>>   #define AMDGPU_VA_RESERVED_TRAP_START(adev)	(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>>   						 - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>>   #define AMDGPU_VA_RESERVED_BOTTOM		(1ULL << 16)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-24 18:19         ` Donet Tom
@ 2026-03-25  2:26           ` Kuehling, Felix
  2026-03-25  9:34             ` Christian König
  0 siblings, 1 reply; 30+ messages in thread
From: Kuehling, Felix @ 2026-03-25  2:26 UTC (permalink / raw)
  To: Donet Tom, Christian König, amd-gfx, Alex Deucher,
	Alex Deucher, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable


On 2026-03-24 14:19, Donet Tom wrote:
>
> On 3/23/26 6:42 PM, Christian König wrote:
>> On 3/23/26 12:50, Donet Tom wrote:
>>> On 3/23/26 3:41 PM, Christian König wrote:
>>>
>>> Hi Christian
>>>
>>>> On 3/23/26 05:28, Donet Tom wrote:
>>>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>>>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>>>>> 4K pages, both values match (8KB), so allocation and reserved space
>>>>> are consistent.
>>>>>
>>>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 
>>>>> 128KB,
>>>>> while the reserved trap area remains 8KB. This mismatch causes the
>>>>> kernel to crash when running rocminfo or rccl unit tests.
>>>>>
>>>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>>>>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>>>>> Faulting instruction address: 0xc0000000002c8a64
>>>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>>>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>>>>> Tainted: [E]=UNSIGNED_MODULE
>>>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>>>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>>>>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>>>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>>>>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>>>>> XER: 00000036
>>>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>>>>> IRQMASK: 1
>>>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>>>>> c00000013d814540
>>>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>>>>> 0000000000000000
>>>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>>>>> 0000000084002268
>>>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>>>>> 0000000000020000
>>>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>>>>> 0000000000000000
>>>>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>>>>> c00000013d814500
>>>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>>>>> c0000001e0957878
>>>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>>>>> c0000001e0957888
>>>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>>>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>>>>> Call Trace:
>>>>> 0xc0000001e0957890 (unreliable)
>>>>> __mutex_lock.constprop.0+0x58/0xd00
>>>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>>>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>>>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>>>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>>>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>>>>> kfd_ioctl+0x514/0x670 [amdgpu]
>>>>> sys_ioctl+0x134/0x180
>>>>> system_call_exception+0x114/0x300
>>>>> system_call_vectored_common+0x15c/0x2ec
>>>>>
>>>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>>>>> ensuring that the reserved trap area matches the allocation size
>>>>> across all page sizes.
>>>>>
>>>>> cc: stable@vger.kernel.org
>>>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite 
>>>>> side of VM hole")
>>>>> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> index 139642eacdd0..a5eae49f9471 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>>>   #define AMDGPU_VA_RESERVED_SEQ64_SIZE        (2ULL << 20)
>>>>>   #define AMDGPU_VA_RESERVED_SEQ64_START(adev) 
>>>>> (AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>>>                            - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << 12)
>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << PAGE_SHIFT)
>>>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>>>>
>>>> That makes the GPU VA reservation depend on the CPU page size and 
>>>> that is clearly not something we want to have.
>>>>
>>>> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>>>>
>>> Thanks Christian for reviewing this patch.
>>>
>>> It is defined in kfd_priv.h.
>>>
>>> /*
>>>   * Size of the per-process TBA+TMA buffer: 2 pages
>>>   *
>>>   * The first chunk is the TBA used for the CWSR ISA code. The second
>>>   * chunk is used as TMA for user-mode trap handler setup in 
>>> daisy-chain mode.
>>>   */
>>> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>>>
>>>
>>>
>>> Could you please suggest the correct way to fix this issue?
>> I'm only looking from the POV of the VM code on this, but my educated 
>> guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the 
>> CPU page size.
>>
>> Background is that this is written by the shader trap handler and 
>> that byte code doesn't care what CPU architecture you have.
>>
>> But I think only the engineers working on that trap handler can 
>> really answer this. @Felix / @Philip?
>
>
> Hi @christian @Felix @Philip
>
> To remove the dependency on CPU page size, can we use
>
> +#define AMDGPU_VA_RESERVED_TRAP_SIZE    (2ULL << 16)
>
> During reservation, we reserve 128 bytes, but during
> allocation, we use 2 * PAGE_SIZE.

We only need two GPU pages here. I think what Christian is objecting to 
is, that the GPU VM layout should not depend on the CPU page size. 
@Christian, it sounds like the BO allocations happen with 64KB 
granularity, but the mapping is still using 4KB granularity. Is the 
right solution to GPU-map only the first 8KB of the trap handler BO to 
keep the layout the same across CPU architectures?

I guess then the "correct" solution would be to change 
amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu and 
amdgpu_amdkfd_gpuvm_map_memory_to_gpu to support mapping of the 
requested size with GPU page size granularity regardless of the CPU page 
size. But that would increase complexity for a very niche uses case.

An easier solution would be to PAGE_ALIGN 8KB to the system page size. 
But that results in the virtual address space layout to depend on the 
system page size.

If that's objectionable, then the next best solution is to round up the 
trap handler size to 64KB byte unconditionally, so its the same with 4KB 
or 64KB system page size. But that would mean unnecessarily wasting a 
little memory per process/GPU on x86.

Regards,
   Felix


>
>
> -Donet
>
>>
>> Regards,
>> Christian.
>>
>>> -Donet
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>   #define AMDGPU_VA_RESERVED_TRAP_START(adev) 
>>>>> (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>>>                            - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>>>   #define AMDGPU_VA_RESERVED_BOTTOM        (1ULL << 16)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-25  2:26           ` Kuehling, Felix
@ 2026-03-25  9:34             ` Christian König
  2026-03-25 10:26               ` Donet Tom
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2026-03-25  9:34 UTC (permalink / raw)
  To: Kuehling, Felix, Donet Tom, amd-gfx, Alex Deucher, Alex Deucher,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable

On 3/25/26 03:26, Kuehling, Felix wrote:
> 
> On 2026-03-24 14:19, Donet Tom wrote:
>>
>> On 3/23/26 6:42 PM, Christian König wrote:
>>> On 3/23/26 12:50, Donet Tom wrote:
>>>> On 3/23/26 3:41 PM, Christian König wrote:
>>>>
>>>> Hi Christian
>>>>
>>>>> On 3/23/26 05:28, Donet Tom wrote:
>>>>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>>>>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>>>>>> 4K pages, both values match (8KB), so allocation and reserved space
>>>>>> are consistent.
>>>>>>
>>>>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
>>>>>> while the reserved trap area remains 8KB. This mismatch causes the
>>>>>> kernel to crash when running rocminfo or rccl unit tests.
>>>>>>
>>>>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>>>>>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>>>>>> Faulting instruction address: 0xc0000000002c8a64
>>>>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>>>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>>>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>>>>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>>>>>> Tainted: [E]=UNSIGNED_MODULE
>>>>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>>>>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>>>>>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>>>>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>>>>>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>>>>>> XER: 00000036
>>>>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>>>>>> IRQMASK: 1
>>>>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>>>>>> c00000013d814540
>>>>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>>>>>> 0000000000000000
>>>>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>>>>>> 0000000084002268
>>>>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>>>>>> 0000000000020000
>>>>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>>>>>> 0000000000000000
>>>>>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>>>>>> c00000013d814500
>>>>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>>>>>> c0000001e0957878
>>>>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>>>>>> c0000001e0957888
>>>>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>>>>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>>>>>> Call Trace:
>>>>>> 0xc0000001e0957890 (unreliable)
>>>>>> __mutex_lock.constprop.0+0x58/0xd00
>>>>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>>>>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>>>>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>>>>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>>>>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>>>>>> kfd_ioctl+0x514/0x670 [amdgpu]
>>>>>> sys_ioctl+0x134/0x180
>>>>>> system_call_exception+0x114/0x300
>>>>>> system_call_vectored_common+0x15c/0x2ec
>>>>>>
>>>>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>>>>>> ensuring that the reserved trap area matches the allocation size
>>>>>> across all page sizes.
>>>>>>
>>>>>> cc: stable@vger.kernel.org
>>>>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
>>>>>> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> index 139642eacdd0..a5eae49f9471 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>>>>   #define AMDGPU_VA_RESERVED_SEQ64_SIZE        (2ULL << 20)
>>>>>>   #define AMDGPU_VA_RESERVED_SEQ64_START(adev) (AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>>>>                            - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>>>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << 12)
>>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << PAGE_SHIFT)
>>>>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>>>>>
>>>>> That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have.
>>>>>
>>>>> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>>>>>
>>>> Thanks Christian for reviewing this patch.
>>>>
>>>> It is defined in kfd_priv.h.
>>>>
>>>> /*
>>>>   * Size of the per-process TBA+TMA buffer: 2 pages
>>>>   *
>>>>   * The first chunk is the TBA used for the CWSR ISA code. The second
>>>>   * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>>>>   */
>>>> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>>>>
>>>>
>>>>
>>>> Could you please suggest the correct way to fix this issue?
>>> I'm only looking from the POV of the VM code on this, but my educated guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the CPU page size.
>>>
>>> Background is that this is written by the shader trap handler and that byte code doesn't care what CPU architecture you have.
>>>
>>> But I think only the engineers working on that trap handler can really answer this. @Felix / @Philip?
>>
>>
>> Hi @christian @Felix @Philip
>>
>> To remove the dependency on CPU page size, can we use
>>
>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE    (2ULL << 16)
>>
>> During reservation, we reserve 128 bytes, but during
>> allocation, we use 2 * PAGE_SIZE.
> 
> We only need two GPU pages here. I think what Christian is objecting to is, that the GPU VM layout should not depend on the CPU page size.

Yes, exactly that was my concern.

> @Christian, it sounds like the BO allocations happen with 64KB granularity, but the mapping is still using 4KB granularity. Is the right solution to GPU-map only the first 8KB of the trap handler BO to keep the layout the same across CPU architectures?

Well that would work technically, but I agree that it also sounds a bit questionable as well.

> I guess then the "correct" solution would be to change amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu and amdgpu_amdkfd_gpuvm_map_memory_to_gpu to support mapping of the requested size with GPU page size granularity regardless of the CPU page size. But that would increase complexity for a very niche uses case.
> 
> An easier solution would be to PAGE_ALIGN 8KB to the system page size. But that results in the virtual address space layout to depend on the system page size.

Yeah, that dependency is certainly undesirable. We could easily end up with issues which can only be reproduced on systems with 64k page size.

> If that's objectionable, then the next best solution is to round up the trap handler size to 64KB byte unconditionally, so its the same with 4KB or 64KB system page size. But that would mean unnecessarily wasting a little memory per process/GPU on x86.

How about we always reserve 64KiB address space (or maybe even more, if you reserve 2MiB or 64KiB doesn't matter), but only map as large as the allocated buffer actually is?

I think that this would be my preferred solution.

Regards,
Christian.

> 
> Regards,
>   Felix
> 
> 
>>
>>
>> -Donet
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>> -Donet
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>   #define AMDGPU_VA_RESERVED_TRAP_START(adev) (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>>>>                            - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>>>>   #define AMDGPU_VA_RESERVED_BOTTOM        (1ULL << 16)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-25  9:34             ` Christian König
@ 2026-03-25 10:26               ` Donet Tom
  2026-03-25 10:29                 ` Christian König
  0 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-25 10:26 UTC (permalink / raw)
  To: Christian König, Kuehling, Felix, amd-gfx, Alex Deucher,
	Alex Deucher, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable


On 3/25/26 3:04 PM, Christian König wrote:
> On 3/25/26 03:26, Kuehling, Felix wrote:
>> On 2026-03-24 14:19, Donet Tom wrote:
>>> On 3/23/26 6:42 PM, Christian König wrote:
>>>> On 3/23/26 12:50, Donet Tom wrote:
>>>>> On 3/23/26 3:41 PM, Christian König wrote:
>>>>>
>>>>> Hi Christian
>>>>>
>>>>>> On 3/23/26 05:28, Donet Tom wrote:
>>>>>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>>>>>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>>>>>>> 4K pages, both values match (8KB), so allocation and reserved space
>>>>>>> are consistent.
>>>>>>>
>>>>>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
>>>>>>> while the reserved trap area remains 8KB. This mismatch causes the
>>>>>>> kernel to crash when running rocminfo or rccl unit tests.
>>>>>>>
>>>>>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>>>>>>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>>>>>>> Faulting instruction address: 0xc0000000002c8a64
>>>>>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>>>>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>>>>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>>>>>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>>>>>>> Tainted: [E]=UNSIGNED_MODULE
>>>>>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>>>>>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>>>>>>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>>>>>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>>>>>>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>>>>>>> XER: 00000036
>>>>>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>>>>>>> IRQMASK: 1
>>>>>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>>>>>>> c00000013d814540
>>>>>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>>>>>>> 0000000000000000
>>>>>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>>>>>>> 0000000084002268
>>>>>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>>>>>>> 0000000000020000
>>>>>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>>>>>>> 0000000000000000
>>>>>>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>>>>>>> c00000013d814500
>>>>>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>>>>>>> c0000001e0957878
>>>>>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>>>>>>> c0000001e0957888
>>>>>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>>>>>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>>>>>>> Call Trace:
>>>>>>> 0xc0000001e0957890 (unreliable)
>>>>>>> __mutex_lock.constprop.0+0x58/0xd00
>>>>>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>>>>>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>>>>>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>>>>>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>>>>>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>>>>>>> kfd_ioctl+0x514/0x670 [amdgpu]
>>>>>>> sys_ioctl+0x134/0x180
>>>>>>> system_call_exception+0x114/0x300
>>>>>>> system_call_vectored_common+0x15c/0x2ec
>>>>>>>
>>>>>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>>>>>>> ensuring that the reserved trap area matches the allocation size
>>>>>>> across all page sizes.
>>>>>>>
>>>>>>> cc: stable@vger.kernel.org
>>>>>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
>>>>>>> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>> index 139642eacdd0..a5eae49f9471 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>>>>>    #define AMDGPU_VA_RESERVED_SEQ64_SIZE        (2ULL << 20)
>>>>>>>    #define AMDGPU_VA_RESERVED_SEQ64_START(adev) (AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>>>>>                             - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>>>>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << 12)
>>>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << PAGE_SHIFT)
>>>>>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>>>>>>
>>>>>> That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have.
>>>>>>
>>>>>> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>>>>>>
>>>>> Thanks Christian for reviewing this patch.
>>>>>
>>>>> It is defined in kfd_priv.h.
>>>>>
>>>>> /*
>>>>>    * Size of the per-process TBA+TMA buffer: 2 pages
>>>>>    *
>>>>>    * The first chunk is the TBA used for the CWSR ISA code. The second
>>>>>    * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>>>>>    */
>>>>> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>>>>>
>>>>>
>>>>>
>>>>> Could you please suggest the correct way to fix this issue?
>>>> I'm only looking from the POV of the VM code on this, but my educated guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the CPU page size.
>>>>
>>>> Background is that this is written by the shader trap handler and that byte code doesn't care what CPU architecture you have.
>>>>
>>>> But I think only the engineers working on that trap handler can really answer this. @Felix / @Philip?
>>>
>>> Hi @christian @Felix @Philip
>>>
>>> To remove the dependency on CPU page size, can we use
>>>
>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE    (2ULL << 16)
>>>
>>> During reservation, we reserve 128 bytes, but during
>>> allocation, we use 2 * PAGE_SIZE.
>> We only need two GPU pages here. I think what Christian is objecting to is, that the GPU VM layout should not depend on the CPU page size.
> Yes, exactly that was my concern.
>
>> @Christian, it sounds like the BO allocations happen with 64KB granularity, but the mapping is still using 4KB granularity. Is the right solution to GPU-map only the first 8KB of the trap handler BO to keep the layout the same across CPU architectures?
> Well that would work technically, but I agree that it also sounds a bit questionable as well.
>
>> I guess then the "correct" solution would be to change amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu and amdgpu_amdkfd_gpuvm_map_memory_to_gpu to support mapping of the requested size with GPU page size granularity regardless of the CPU page size. But that would increase complexity for a very niche uses case.
>>
>> An easier solution would be to PAGE_ALIGN 8KB to the system page size. But that results in the virtual address space layout to depend on the system page size.
> Yeah, that dependency is certainly undesirable. We could easily end up with issues which can only be reproduced on systems with 64k page size.
>
>> If that's objectionable, then the next best solution is to round up the trap handler size to 64KB byte unconditionally, so its the same with 4KB or 64KB system page size. But that would mean unnecessarily wasting a little memory per process/GPU on x86.
> How about we always reserve 64KiB address space (or maybe even more, if you reserve 2MiB or 64KiB doesn't matter), but only map as large as the allocated buffer actually is?
>
> I think that this would be my preferred solution.


Hi @Christian @Felix

Thanks for the review.

I have made the suggested change. I am now reserving 64 KB
in the  address space for the trap, while allocating
only 8 KB for both 4K and 64K page sizes. With this change,
I am no longer seeing crashes on either 4K or 64K systems.

Does this approach look reasonable to you?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index bb276c0ad06d..d5b7061556ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
  #define AMDGPU_VA_RESERVED_SEQ64_SIZE          (2ULL << 20)
  #define AMDGPU_VA_RESERVED_SEQ64_START(adev) 
  (AMDGPU_VA_RESERVED_CSA_START(adev) \
                                                  - 
AMDGPU_VA_RESERVED_SEQ64_SIZE)
-#define AMDGPU_VA_RESERVED_TRAP_SIZE           (2ULL << 12)
+#define AMDGPU_VA_RESERVED_TRAP_SIZE           (1ULL << 16)
  #define AMDGPU_VA_RESERVED_TRAP_START(adev) 
(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
                                                  - 
AMDGPU_VA_RESERVED_TRAP_SIZE)
  #define AMDGPU_VA_RESERVED_BOTTOM              (1ULL << 16)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index e5b56412931b..035687a17d89 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -102,8 +102,8 @@
   * The first chunk is the TBA used for the CWSR ISA code. The second
   * chunk is used as TMA for user-mode trap handler setup in 
daisy-chain mode.
   */
-#define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
-#define KFD_CWSR_TMA_OFFSET (PAGE_SIZE + 2048)
+#define KFD_CWSR_TBA_TMA_SIZE (AMDGPU_GPU_PAGE_SIZE * 2)
+#define KFD_CWSR_TMA_OFFSET (AMDGPU_GPU_PAGE_SIZE + 2048)

  #define KFD_MAX_NUM_OF_QUEUES_PER_DEVICE               \
         (KFD_MAX_NUM_OF_PROCESSES *



>
> Regards,
> Christian.
>
>> Regards,
>>    Felix
>>
>>
>>>
>>> -Donet
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> -Donet
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>    #define AMDGPU_VA_RESERVED_TRAP_START(adev) (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>>>>>                             - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>>>>>    #define AMDGPU_VA_RESERVED_BOTTOM        (1ULL << 16)

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-25 10:26               ` Donet Tom
@ 2026-03-25 10:29                 ` Christian König
  2026-03-25 17:54                   ` Kuehling, Felix
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2026-03-25 10:29 UTC (permalink / raw)
  To: Donet Tom, Kuehling, Felix, amd-gfx, Alex Deucher, Alex Deucher,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable

On 3/25/26 11:26, Donet Tom wrote:
> 
> On 3/25/26 3:04 PM, Christian König wrote:
>> On 3/25/26 03:26, Kuehling, Felix wrote:
>>> On 2026-03-24 14:19, Donet Tom wrote:
>>>> On 3/23/26 6:42 PM, Christian König wrote:
>>>>> On 3/23/26 12:50, Donet Tom wrote:
>>>>>> On 3/23/26 3:41 PM, Christian König wrote:
>>>>>>
>>>>>> Hi Christian
>>>>>>
>>>>>>> On 3/23/26 05:28, Donet Tom wrote:
>>>>>>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>>>>>>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>>>>>>>> 4K pages, both values match (8KB), so allocation and reserved space
>>>>>>>> are consistent.
>>>>>>>>
>>>>>>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
>>>>>>>> while the reserved trap area remains 8KB. This mismatch causes the
>>>>>>>> kernel to crash when running rocminfo or rccl unit tests.
>>>>>>>>
>>>>>>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>>>>>>>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>>>>>>>> Faulting instruction address: 0xc0000000002c8a64
>>>>>>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>>>>>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>>>>>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>>>>>>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>>>>>>>> Tainted: [E]=UNSIGNED_MODULE
>>>>>>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>>>>>>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>>>>>>>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>>>>>>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>>>>>>>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>>>>>>>> XER: 00000036
>>>>>>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>>>>>>>> IRQMASK: 1
>>>>>>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>>>>>>>> c00000013d814540
>>>>>>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>>>>>>>> 0000000000000000
>>>>>>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>>>>>>>> 0000000084002268
>>>>>>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>>>>>>>> 0000000000020000
>>>>>>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>>>>>>>> 0000000000000000
>>>>>>>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>>>>>>>> c00000013d814500
>>>>>>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>>>>>>>> c0000001e0957878
>>>>>>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>>>>>>>> c0000001e0957888
>>>>>>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>>>>>>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>>>>>>>> Call Trace:
>>>>>>>> 0xc0000001e0957890 (unreliable)
>>>>>>>> __mutex_lock.constprop.0+0x58/0xd00
>>>>>>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>>>>>>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>>>>>>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>>>>>>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>>>>>>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>>>>>>>> kfd_ioctl+0x514/0x670 [amdgpu]
>>>>>>>> sys_ioctl+0x134/0x180
>>>>>>>> system_call_exception+0x114/0x300
>>>>>>>> system_call_vectored_common+0x15c/0x2ec
>>>>>>>>
>>>>>>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>>>>>>>> ensuring that the reserved trap area matches the allocation size
>>>>>>>> across all page sizes.
>>>>>>>>
>>>>>>>> cc: stable@vger.kernel.org
>>>>>>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
>>>>>>>> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>>>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>>>>> ---
>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>>>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>>> index 139642eacdd0..a5eae49f9471 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>>>>>>    #define AMDGPU_VA_RESERVED_SEQ64_SIZE        (2ULL << 20)
>>>>>>>>    #define AMDGPU_VA_RESERVED_SEQ64_START(adev) (AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>>>>>>                             - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>>>>>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << 12)
>>>>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << PAGE_SHIFT)
>>>>>>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>>>>>>>
>>>>>>> That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have.
>>>>>>>
>>>>>>> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>>>>>>>
>>>>>> Thanks Christian for reviewing this patch.
>>>>>>
>>>>>> It is defined in kfd_priv.h.
>>>>>>
>>>>>> /*
>>>>>>    * Size of the per-process TBA+TMA buffer: 2 pages
>>>>>>    *
>>>>>>    * The first chunk is the TBA used for the CWSR ISA code. The second
>>>>>>    * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>>>>>>    */
>>>>>> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Could you please suggest the correct way to fix this issue?
>>>>> I'm only looking from the POV of the VM code on this, but my educated guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the CPU page size.
>>>>>
>>>>> Background is that this is written by the shader trap handler and that byte code doesn't care what CPU architecture you have.
>>>>>
>>>>> But I think only the engineers working on that trap handler can really answer this. @Felix / @Philip?
>>>>
>>>> Hi @christian @Felix @Philip
>>>>
>>>> To remove the dependency on CPU page size, can we use
>>>>
>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE    (2ULL << 16)
>>>>
>>>> During reservation, we reserve 128 bytes, but during
>>>> allocation, we use 2 * PAGE_SIZE.
>>> We only need two GPU pages here. I think what Christian is objecting to is, that the GPU VM layout should not depend on the CPU page size.
>> Yes, exactly that was my concern.
>>
>>> @Christian, it sounds like the BO allocations happen with 64KB granularity, but the mapping is still using 4KB granularity. Is the right solution to GPU-map only the first 8KB of the trap handler BO to keep the layout the same across CPU architectures?
>> Well that would work technically, but I agree that it also sounds a bit questionable as well.
>>
>>> I guess then the "correct" solution would be to change amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu and amdgpu_amdkfd_gpuvm_map_memory_to_gpu to support mapping of the requested size with GPU page size granularity regardless of the CPU page size. But that would increase complexity for a very niche uses case.
>>>
>>> An easier solution would be to PAGE_ALIGN 8KB to the system page size. But that results in the virtual address space layout to depend on the system page size.
>> Yeah, that dependency is certainly undesirable. We could easily end up with issues which can only be reproduced on systems with 64k page size.
>>
>>> If that's objectionable, then the next best solution is to round up the trap handler size to 64KB byte unconditionally, so its the same with 4KB or 64KB system page size. But that would mean unnecessarily wasting a little memory per process/GPU on x86.
>> How about we always reserve 64KiB address space (or maybe even more, if you reserve 2MiB or 64KiB doesn't matter), but only map as large as the allocated buffer actually is?
>>
>> I think that this would be my preferred solution.
> 
> 
> Hi @Christian @Felix
> 
> Thanks for the review.
> 
> I have made the suggested change. I am now reserving 64 KB
> in the  address space for the trap, while allocating
> only 8 KB for both 4K and 64K page sizes. With this change,
> I am no longer seeing crashes on either 4K or 64K systems.
> 
> Does this approach look reasonable to you?

Looks correct to me, but Felix clearly has the last word on that.

Regards,
Christian.

> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index bb276c0ad06d..d5b7061556ba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>  #define AMDGPU_VA_RESERVED_SEQ64_SIZE          (2ULL << 20)
>  #define AMDGPU_VA_RESERVED_SEQ64_START(adev)  (AMDGPU_VA_RESERVED_CSA_START(adev) \
>                                                  - AMDGPU_VA_RESERVED_SEQ64_SIZE)
> -#define AMDGPU_VA_RESERVED_TRAP_SIZE           (2ULL << 12)
> +#define AMDGPU_VA_RESERVED_TRAP_SIZE           (1ULL << 16)
>  #define AMDGPU_VA_RESERVED_TRAP_START(adev) (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>                                                  - AMDGPU_VA_RESERVED_TRAP_SIZE)
>  #define AMDGPU_VA_RESERVED_BOTTOM              (1ULL << 16)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index e5b56412931b..035687a17d89 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -102,8 +102,8 @@
>   * The first chunk is the TBA used for the CWSR ISA code. The second
>   * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>   */
> -#define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
> -#define KFD_CWSR_TMA_OFFSET (PAGE_SIZE + 2048)
> +#define KFD_CWSR_TBA_TMA_SIZE (AMDGPU_GPU_PAGE_SIZE * 2)
> +#define KFD_CWSR_TMA_OFFSET (AMDGPU_GPU_PAGE_SIZE + 2048)
> 
>  #define KFD_MAX_NUM_OF_QUEUES_PER_DEVICE               \
>         (KFD_MAX_NUM_OF_PROCESSES *
> 
> 
> 
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>    Felix
>>>
>>>
>>>>
>>>> -Donet
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> -Donet
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>    #define AMDGPU_VA_RESERVED_TRAP_START(adev) (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>>>>>>                             - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>>>>>>    #define AMDGPU_VA_RESERVED_BOTTOM        (1ULL << 16)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-25 10:29                 ` Christian König
@ 2026-03-25 17:54                   ` Kuehling, Felix
  2026-03-25 17:59                     ` Donet Tom
  0 siblings, 1 reply; 30+ messages in thread
From: Kuehling, Felix @ 2026-03-25 17:54 UTC (permalink / raw)
  To: Christian König, Donet Tom, amd-gfx, Alex Deucher,
	Alex Deucher, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable


On 2026-03-25 06:29, Christian König wrote:
>> Hi @Christian @Felix
>>
>> Thanks for the review.
>>
>> I have made the suggested change. I am now reserving 64 KB
>> in the  address space for the trap, while allocating
>> only 8 KB for both 4K and 64K page sizes. With this change,
>> I am no longer seeing crashes on either 4K or 64K systems.
>>
>> Does this approach look reasonable to you?
> Looks correct to me, but Felix clearly has the last word on that.

That works for me as well.

Thanks,
   Felix


>
> Regards,
> Christian.
>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index bb276c0ad06d..d5b7061556ba 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>   #define AMDGPU_VA_RESERVED_SEQ64_SIZE          (2ULL << 20)
>>   #define AMDGPU_VA_RESERVED_SEQ64_START(adev)  (AMDGPU_VA_RESERVED_CSA_START(adev) \
>>                                                   - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE           (2ULL << 12)
>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE           (1ULL << 16)
>>   #define AMDGPU_VA_RESERVED_TRAP_START(adev) (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>                                                   - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>   #define AMDGPU_VA_RESERVED_BOTTOM              (1ULL << 16)
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index e5b56412931b..035687a17d89 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -102,8 +102,8 @@
>>    * The first chunk is the TBA used for the CWSR ISA code. The second
>>    * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>>    */
>> -#define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>> -#define KFD_CWSR_TMA_OFFSET (PAGE_SIZE + 2048)
>> +#define KFD_CWSR_TBA_TMA_SIZE (AMDGPU_GPU_PAGE_SIZE * 2)
>> +#define KFD_CWSR_TMA_OFFSET (AMDGPU_GPU_PAGE_SIZE + 2048)
>>
>>   #define KFD_MAX_NUM_OF_QUEUES_PER_DEVICE               \
>>          (KFD_MAX_NUM_OF_PROCESSES *

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
  2026-03-25 17:54                   ` Kuehling, Felix
@ 2026-03-25 17:59                     ` Donet Tom
  0 siblings, 0 replies; 30+ messages in thread
From: Donet Tom @ 2026-03-25 17:59 UTC (permalink / raw)
  To: Kuehling, Felix, Christian König, amd-gfx, Alex Deucher,
	Alex Deucher, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, stable


On 3/25/26 11:24 PM, Kuehling, Felix wrote:
>
> On 2026-03-25 06:29, Christian König wrote:
>>> Hi @Christian @Felix
>>>
>>> Thanks for the review.
>>>
>>> I have made the suggested change. I am now reserving 64 KB
>>> in the  address space for the trap, while allocating
>>> only 8 KB for both 4K and 64K page sizes. With this change,
>>> I am no longer seeing crashes on either 4K or 64K systems.
>>>
>>> Does this approach look reasonable to you?
>> Looks correct to me, but Felix clearly has the last word on that.
>
> That works for me as well.


Thank you , Felix. I will incorporate this change and post an updated 
version.

-Donet


>
> Thanks,
>   Felix
>
>
>>
>> Regards,
>> Christian.
>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> index bb276c0ad06d..d5b7061556ba 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>   #define AMDGPU_VA_RESERVED_SEQ64_SIZE          (2ULL << 20)
>>>   #define AMDGPU_VA_RESERVED_SEQ64_START(adev) 
>>>  (AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>                                                   - 
>>> AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE           (2ULL << 12)
>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE           (1ULL << 16)
>>>   #define AMDGPU_VA_RESERVED_TRAP_START(adev) 
>>> (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>                                                   - 
>>> AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>   #define AMDGPU_VA_RESERVED_BOTTOM              (1ULL << 16)
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> index e5b56412931b..035687a17d89 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> @@ -102,8 +102,8 @@
>>>    * The first chunk is the TBA used for the CWSR ISA code. The second
>>>    * chunk is used as TMA for user-mode trap handler setup in 
>>> daisy-chain mode.
>>>    */
>>> -#define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>>> -#define KFD_CWSR_TMA_OFFSET (PAGE_SIZE + 2048)
>>> +#define KFD_CWSR_TBA_TMA_SIZE (AMDGPU_GPU_PAGE_SIZE * 2)
>>> +#define KFD_CWSR_TMA_OFFSET (AMDGPU_GPU_PAGE_SIZE + 2048)
>>>
>>>   #define KFD_MAX_NUM_OF_QUEUES_PER_DEVICE               \
>>>          (KFD_MAX_NUM_OF_PROCESSES *

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RESEND RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE
  2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages Donet Tom
@ 2026-03-23  4:28 ` Donet Tom
  2026-03-25  2:28   ` Kuehling, Felix
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-23  4:28 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom

The AQL queue size can be 4K, but the minimum buffer object (BO)
allocation size is PAGE_SIZE. On systems with a page size larger
than 4K, the expected queue size does not match the allocated BO
size, causing queue creation to fail.

Align the expected queue size to PAGE_SIZE so that it matches the
allocated BO size and allows queue creation to succeed.

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index d1978e3f68be..572b21e39e83 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -249,10 +249,10 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 	    topo_dev->node_props.gfx_target_version < 90000)
 		/* metadata_queue_size not supported on GFX7/GFX8 */
 		expected_queue_size =
-			properties->queue_size / 2;
+			PAGE_ALIGN(properties->queue_size / 2);
 	else
 		expected_queue_size =
-			properties->queue_size + properties->metadata_queue_size;
+			PAGE_ALIGN(properties->queue_size + properties->metadata_queue_size);
 
 	vm = drm_priv_to_vm(pdd->drm_priv);
 	err = amdgpu_bo_reserve(vm->root.bo, false);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE Donet Tom
@ 2026-03-25  2:28   ` Kuehling, Felix
  2026-03-25 18:33     ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Kuehling, Felix @ 2026-03-25  2:28 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Alex Deucher, Alex Deucher, christian.koenig,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

On 2026-03-23 00:28, Donet Tom wrote:
> The AQL queue size can be 4K, but the minimum buffer object (BO)
> allocation size is PAGE_SIZE. On systems with a page size larger
> than 4K, the expected queue size does not match the allocated BO
> size, causing queue creation to fail.
>
> Align the expected queue size to PAGE_SIZE so that it matches the
> allocated BO size and allows queue creation to succeed.
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> index d1978e3f68be..572b21e39e83 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> @@ -249,10 +249,10 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
>   	    topo_dev->node_props.gfx_target_version < 90000)
>   		/* metadata_queue_size not supported on GFX7/GFX8 */
>   		expected_queue_size =
> -			properties->queue_size / 2;
> +			PAGE_ALIGN(properties->queue_size / 2);
>   	else
>   		expected_queue_size =
> -			properties->queue_size + properties->metadata_queue_size;
> +			PAGE_ALIGN(properties->queue_size + properties->metadata_queue_size);
>   
>   	vm = drm_priv_to_vm(pdd->drm_priv);
>   	err = amdgpu_bo_reserve(vm->root.bo, false);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE
  2026-03-25  2:28   ` Kuehling, Felix
@ 2026-03-25 18:33     ` Alex Deucher
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Deucher @ 2026-03-25 18:33 UTC (permalink / raw)
  To: Kuehling, Felix
  Cc: Donet Tom, amd-gfx, Alex Deucher, christian.koenig, Philip Yang,
	David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

Applied.  Thanks!

On Tue, Mar 24, 2026 at 10:28 PM Kuehling, Felix <felix.kuehling@amd.com> wrote:
>
> On 2026-03-23 00:28, Donet Tom wrote:
> > The AQL queue size can be 4K, but the minimum buffer object (BO)
> > allocation size is PAGE_SIZE. On systems with a page size larger
> > than 4K, the expected queue size does not match the allocated BO
> > size, causing queue creation to fail.
> >
> > Align the expected queue size to PAGE_SIZE so that it matches the
> > allocated BO size and allows queue creation to succeed.
> >
> > Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
>
>
> > ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > index d1978e3f68be..572b21e39e83 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > @@ -249,10 +249,10 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
> >           topo_dev->node_props.gfx_target_version < 90000)
> >               /* metadata_queue_size not supported on GFX7/GFX8 */
> >               expected_queue_size =
> > -                     properties->queue_size / 2;
> > +                     PAGE_ALIGN(properties->queue_size / 2);
> >       else
> >               expected_queue_size =
> > -                     properties->queue_size + properties->metadata_queue_size;
> > +                     PAGE_ALIGN(properties->queue_size + properties->metadata_queue_size);
> >
> >       vm = drm_priv_to_vm(pdd->drm_priv);
> >       err = amdgpu_bo_reserve(vm->root.bo, false);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages Donet Tom
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE Donet Tom
@ 2026-03-23  4:28 ` Donet Tom
  2026-03-23 13:04   ` Christian König
  2026-03-25  2:30   ` Kuehling, Felix
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 4/6] drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Donet Tom @ 2026-03-23  4:28 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom

During a GPU page fault, the driver restores the SVM range and then maps it
into the GPU page tables. The current implementation passes a GPU-page-size
(4K-based) PFN to svm_range_restore_pages() to restore the range.

SVM ranges are tracked using system-page-size PFNs. On systems where the
system page size is larger than 4K, using GPU-page-size PFNs to restore the
range causes two problems:

Range lookup fails:
Because the restore function receives PFNs in GPU (4K) units, the SVM
range lookup does not find the existing range. This will result in a
duplicate SVM range being created.

VMA lookup failure:
The restore function also tries to locate the VMA for the faulting address.
It converts the GPU-page-size PFN into an address using the system page
size, which results in an incorrect address on non-4K page-size systems.
As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
removed".

This patch passes the system-page-size PFN to svm_range_restore_pages() so
that the SVM range is restored correctly on non-4K page systems.

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6a2ea200d90c..7a3cb0057ac5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2985,14 +2985,14 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 	if (!root)
 		return false;
 
-	addr /= AMDGPU_GPU_PAGE_SIZE;
-
 	if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
-	    node_id, addr, ts, write_fault)) {
+	    node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
 		amdgpu_bo_unref(&root);
 		return true;
 	}
 
+	addr /= AMDGPU_GPU_PAGE_SIZE;
+
 	r = amdgpu_bo_reserve(root, true);
 	if (r)
 		goto error_unref;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
@ 2026-03-23 13:04   ` Christian König
  2026-03-24 13:10     ` Alex Deucher
  2026-03-25  2:30   ` Kuehling, Felix
  1 sibling, 1 reply; 30+ messages in thread
From: Christian König @ 2026-03-23 13:04 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

On 3/23/26 05:28, Donet Tom wrote:
> During a GPU page fault, the driver restores the SVM range and then maps it
> into the GPU page tables. The current implementation passes a GPU-page-size
> (4K-based) PFN to svm_range_restore_pages() to restore the range.
> 
> SVM ranges are tracked using system-page-size PFNs. On systems where the
> system page size is larger than 4K, using GPU-page-size PFNs to restore the
> range causes two problems:
> 
> Range lookup fails:
> Because the restore function receives PFNs in GPU (4K) units, the SVM
> range lookup does not find the existing range. This will result in a
> duplicate SVM range being created.
> 
> VMA lookup failure:
> The restore function also tries to locate the VMA for the faulting address.
> It converts the GPU-page-size PFN into an address using the system page
> size, which results in an incorrect address on non-4K page-size systems.
> As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
> removed".
> 
> This patch passes the system-page-size PFN to svm_range_restore_pages() so
> that the SVM range is restored correctly on non-4K page systems.
> 
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 6a2ea200d90c..7a3cb0057ac5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2985,14 +2985,14 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>  	if (!root)
>  		return false;
>  
> -	addr /= AMDGPU_GPU_PAGE_SIZE;
> -
>  	if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
> -	    node_id, addr, ts, write_fault)) {
> +	    node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
>  		amdgpu_bo_unref(&root);
>  		return true;
>  	}
>  
> +	addr /= AMDGPU_GPU_PAGE_SIZE;
> +
>  	r = amdgpu_bo_reserve(root, true);
>  	if (r)
>  		goto error_unref;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  2026-03-23 13:04   ` Christian König
@ 2026-03-24 13:10     ` Alex Deucher
  2026-03-25 18:04       ` Donet Tom
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Deucher @ 2026-03-24 13:10 UTC (permalink / raw)
  To: Christian König
  Cc: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher, Philip Yang,
	David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

Applied.  Thanks!

Alex

On Mon, Mar 23, 2026 at 9:04 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 3/23/26 05:28, Donet Tom wrote:
> > During a GPU page fault, the driver restores the SVM range and then maps it
> > into the GPU page tables. The current implementation passes a GPU-page-size
> > (4K-based) PFN to svm_range_restore_pages() to restore the range.
> >
> > SVM ranges are tracked using system-page-size PFNs. On systems where the
> > system page size is larger than 4K, using GPU-page-size PFNs to restore the
> > range causes two problems:
> >
> > Range lookup fails:
> > Because the restore function receives PFNs in GPU (4K) units, the SVM
> > range lookup does not find the existing range. This will result in a
> > duplicate SVM range being created.
> >
> > VMA lookup failure:
> > The restore function also tries to locate the VMA for the faulting address.
> > It converts the GPU-page-size PFN into an address using the system page
> > size, which results in an incorrect address on non-4K page-size systems.
> > As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
> > removed".
> >
> > This patch passes the system-page-size PFN to svm_range_restore_pages() so
> > that the SVM range is restored correctly on non-4K page systems.
> >
> > Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> Acked-by: Christian König <christian.koenig@amd.com>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index 6a2ea200d90c..7a3cb0057ac5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -2985,14 +2985,14 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
> >       if (!root)
> >               return false;
> >
> > -     addr /= AMDGPU_GPU_PAGE_SIZE;
> > -
> >       if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
> > -         node_id, addr, ts, write_fault)) {
> > +         node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
> >               amdgpu_bo_unref(&root);
> >               return true;
> >       }
> >
> > +     addr /= AMDGPU_GPU_PAGE_SIZE;
> > +
> >       r = amdgpu_bo_reserve(root, true);
> >       if (r)
> >               goto error_unref;
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  2026-03-24 13:10     ` Alex Deucher
@ 2026-03-25 18:04       ` Donet Tom
  2026-03-25 18:36         ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-25 18:04 UTC (permalink / raw)
  To: Alex Deucher
  Cc: amd-gfx, Felix Kuehling, Alex Deucher, Philip Yang, David.YatSin,
	Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Christian König


On 3/24/26 6:40 PM, Alex Deucher wrote:
> Applied.  Thanks!

Hi @Alex

Thank you for applying this patch.


I am planning to send the next version for PATCH 1/6. For the
other patches that have already received Reviewed-by tags,
would you prefer to pick them from this series, or should I
include them again in the next version?

-Donet


>
> Alex
>
> On Mon, Mar 23, 2026 at 9:04 AM Christian König
> <christian.koenig@amd.com> wrote:
>> On 3/23/26 05:28, Donet Tom wrote:
>>> During a GPU page fault, the driver restores the SVM range and then maps it
>>> into the GPU page tables. The current implementation passes a GPU-page-size
>>> (4K-based) PFN to svm_range_restore_pages() to restore the range.
>>>
>>> SVM ranges are tracked using system-page-size PFNs. On systems where the
>>> system page size is larger than 4K, using GPU-page-size PFNs to restore the
>>> range causes two problems:
>>>
>>> Range lookup fails:
>>> Because the restore function receives PFNs in GPU (4K) units, the SVM
>>> range lookup does not find the existing range. This will result in a
>>> duplicate SVM range being created.
>>>
>>> VMA lookup failure:
>>> The restore function also tries to locate the VMA for the faulting address.
>>> It converts the GPU-page-size PFN into an address using the system page
>>> size, which results in an incorrect address on non-4K page-size systems.
>>> As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
>>> removed".
>>>
>>> This patch passes the system-page-size PFN to svm_range_restore_pages() so
>>> that the SVM range is restored correctly on non-4K page systems.
>>>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> Acked-by: Christian König <christian.koenig@amd.com>
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +++---
>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 6a2ea200d90c..7a3cb0057ac5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -2985,14 +2985,14 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>>>        if (!root)
>>>                return false;
>>>
>>> -     addr /= AMDGPU_GPU_PAGE_SIZE;
>>> -
>>>        if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
>>> -         node_id, addr, ts, write_fault)) {
>>> +         node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
>>>                amdgpu_bo_unref(&root);
>>>                return true;
>>>        }
>>>
>>> +     addr /= AMDGPU_GPU_PAGE_SIZE;
>>> +
>>>        r = amdgpu_bo_reserve(root, true);
>>>        if (r)
>>>                goto error_unref;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  2026-03-25 18:04       ` Donet Tom
@ 2026-03-25 18:36         ` Alex Deucher
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Deucher @ 2026-03-25 18:36 UTC (permalink / raw)
  To: Donet Tom
  Cc: amd-gfx, Felix Kuehling, Alex Deucher, Philip Yang, David.YatSin,
	Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Christian König

On Wed, Mar 25, 2026 at 2:05 PM Donet Tom <donettom@linux.ibm.com> wrote:
>
>
> On 3/24/26 6:40 PM, Alex Deucher wrote:
> > Applied.  Thanks!
>
> Hi @Alex
>
> Thank you for applying this patch.
>
>
> I am planning to send the next version for PATCH 1/6. For the
> other patches that have already received Reviewed-by tags,
> would you prefer to pick them from this series, or should I
> include them again in the next version?

I'll pick up the reviewed patches.  Feel free to include them in your
resend if that's easier for you.  I'll pick up whatever the delta is
once those are reviewed.

Alex

>
> -Donet
>
>
> >
> > Alex
> >
> > On Mon, Mar 23, 2026 at 9:04 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> On 3/23/26 05:28, Donet Tom wrote:
> >>> During a GPU page fault, the driver restores the SVM range and then maps it
> >>> into the GPU page tables. The current implementation passes a GPU-page-size
> >>> (4K-based) PFN to svm_range_restore_pages() to restore the range.
> >>>
> >>> SVM ranges are tracked using system-page-size PFNs. On systems where the
> >>> system page size is larger than 4K, using GPU-page-size PFNs to restore the
> >>> range causes two problems:
> >>>
> >>> Range lookup fails:
> >>> Because the restore function receives PFNs in GPU (4K) units, the SVM
> >>> range lookup does not find the existing range. This will result in a
> >>> duplicate SVM range being created.
> >>>
> >>> VMA lookup failure:
> >>> The restore function also tries to locate the VMA for the faulting address.
> >>> It converts the GPU-page-size PFN into an address using the system page
> >>> size, which results in an incorrect address on non-4K page-size systems.
> >>> As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
> >>> removed".
> >>>
> >>> This patch passes the system-page-size PFN to svm_range_restore_pages() so
> >>> that the SVM range is restored correctly on non-4K page systems.
> >>>
> >>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> >> Acked-by: Christian König <christian.koenig@amd.com>
> >>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +++---
> >>>   1 file changed, 3 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> index 6a2ea200d90c..7a3cb0057ac5 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> @@ -2985,14 +2985,14 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
> >>>        if (!root)
> >>>                return false;
> >>>
> >>> -     addr /= AMDGPU_GPU_PAGE_SIZE;
> >>> -
> >>>        if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
> >>> -         node_id, addr, ts, write_fault)) {
> >>> +         node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
> >>>                amdgpu_bo_unref(&root);
> >>>                return true;
> >>>        }
> >>>
> >>> +     addr /= AMDGPU_GPU_PAGE_SIZE;
> >>> +
> >>>        r = amdgpu_bo_reserve(root, true);
> >>>        if (r)
> >>>                goto error_unref;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
  2026-03-23 13:04   ` Christian König
@ 2026-03-25  2:30   ` Kuehling, Felix
  1 sibling, 0 replies; 30+ messages in thread
From: Kuehling, Felix @ 2026-03-25  2:30 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Alex Deucher, Alex Deucher, christian.koenig,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan


On 2026-03-23 00:28, Donet Tom wrote:
> During a GPU page fault, the driver restores the SVM range and then maps it
> into the GPU page tables. The current implementation passes a GPU-page-size
> (4K-based) PFN to svm_range_restore_pages() to restore the range.
>
> SVM ranges are tracked using system-page-size PFNs. On systems where the
> system page size is larger than 4K, using GPU-page-size PFNs to restore the
> range causes two problems:
>
> Range lookup fails:
> Because the restore function receives PFNs in GPU (4K) units, the SVM
> range lookup does not find the existing range. This will result in a
> duplicate SVM range being created.
>
> VMA lookup failure:
> The restore function also tries to locate the VMA for the faulting address.
> It converts the GPU-page-size PFN into an address using the system page
> size, which results in an incorrect address on non-4K page-size systems.
> As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
> removed".
>
> This patch passes the system-page-size PFN to svm_range_restore_pages() so
> that the SVM range is restored correctly on non-4K page systems.
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 6a2ea200d90c..7a3cb0057ac5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2985,14 +2985,14 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>   	if (!root)
>   		return false;
>   
> -	addr /= AMDGPU_GPU_PAGE_SIZE;
> -
>   	if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
> -	    node_id, addr, ts, write_fault)) {
> +	    node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
>   		amdgpu_bo_unref(&root);
>   		return true;
>   	}
>   
> +	addr /= AMDGPU_GPU_PAGE_SIZE;
> +
>   	r = amdgpu_bo_reserve(root, true);
>   	if (r)
>   		goto error_unref;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RESEND RFC PATCH v3 4/6] drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
                   ` (2 preceding siblings ...)
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
@ 2026-03-23  4:28 ` Donet Tom
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K Donet Tom
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Donet Tom @ 2026-03-23  4:28 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom

AMDGPU_GTT_MAX_TRANSFER_SIZE represented the maximum number of
system-page-sized pages that could be transferred in a single
operation. The effective maximum transfer size was intended to be
one PMD-sized mapping.

In the existing code, AMDGPU_GTT_MAX_TRANSFER_SIZE was hard-coded
to 512 pages. This corresponded to 2 MB on 4 KB page-size systems,
matching the PMD size. However, on systems with a non-4 KB page
size, this value no longer matched the PMD size.

This patch changed the calculation of AMDGPU_GTT_MAX_TRANSFER_SIZE
to derive it from PMD_SHIFT and PAGE_SHIFT, ensuring that the
maximum transfer size remained PMD-sized across all system page
sizes.

Additionally, in some places, AMDGPU_GTT_MAX_TRANSFER_SIZE was
implicitly assumed to be based on 4 KB pages. This resulted in
incorrect address offset calculations. This patch updated the
address calculations to correctly handle non-4 KB system page
sizes as well.

amdgpu_ttm_map_buffer() can create both GTT GART entries and
VRAM GART entries. For GTT mappings, amdgpu_gart_map() takes
system page–sized PFNs, and the mappings are created correctly.

However, for VRAM GART mappings, amdgpu_gart_map_vram_range() expects
GPU page–sized PFNs, but CPU page–sized PFNs were being passed,
resulting in incorrect mappings.

This patch updates the code to pass GPU page–sized PFNs to
amdgpu_gart_map_vram_range(), ensuring that VRAM GART mappings are
created correctly.

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 8 +++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
 drivers/gpu/drm/amd/amdgpu/vce_v1_0.c   | 3 ++-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 15d561e3d87f..67983955a124 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -204,7 +204,7 @@ static int amdgpu_ttm_map_buffer(struct amdgpu_ttm_buffer_entity *entity,
 	int r;
 
 	BUG_ON(adev->mman.buffer_funcs->copy_max_bytes <
-	       AMDGPU_GTT_MAX_TRANSFER_SIZE * 8);
+	       AMDGPU_GTT_MAX_TRANSFER_SIZE * AMDGPU_GPU_PAGES_IN_CPU_PAGE * 8);
 
 	if (WARN_ON(mem->mem_type == AMDGPU_PL_PREEMPT))
 		return -EINVAL;
@@ -230,7 +230,7 @@ static int amdgpu_ttm_map_buffer(struct amdgpu_ttm_buffer_entity *entity,
 
 	*addr = adev->gmc.gart_start;
 	*addr += (u64)window * AMDGPU_GTT_MAX_TRANSFER_SIZE *
-		AMDGPU_GPU_PAGE_SIZE;
+		AMDGPU_GPU_PAGES_IN_CPU_PAGE * AMDGPU_GPU_PAGE_SIZE;
 	*addr += offset;
 
 	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
@@ -248,7 +248,8 @@ static int amdgpu_ttm_map_buffer(struct amdgpu_ttm_buffer_entity *entity,
 	src_addr += job->ibs[0].gpu_addr;
 
 	dst_addr = amdgpu_bo_gpu_offset(adev->gart.bo);
-	dst_addr += window * AMDGPU_GTT_MAX_TRANSFER_SIZE * 8;
+	dst_addr += window * AMDGPU_GTT_MAX_TRANSFER_SIZE *
+		AMDGPU_GPU_PAGES_IN_CPU_PAGE * 8;
 	amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr,
 				dst_addr, num_bytes, 0);
 
@@ -266,6 +267,7 @@ static int amdgpu_ttm_map_buffer(struct amdgpu_ttm_buffer_entity *entity,
 	} else {
 		u64 pa = mm_cur->start + adev->vm_manager.vram_base_offset;
 
+		num_pages *= AMDGPU_GPU_PAGES_IN_CPU_PAGE;
 		amdgpu_gart_map_vram_range(adev, pa, 0, num_pages, flags, cpu_addr);
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 143201ecea3f..15aff225af1d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -38,7 +38,7 @@
 #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
 #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
 
-#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
+#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(1 << (PMD_SHIFT - PAGE_SHIFT))
 #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
 
 extern const struct attribute_group amdgpu_vram_mgr_attr_group;
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v1_0.c
index 9ae424618556..b2d4114c258c 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v1_0.c
@@ -48,7 +48,8 @@
 #define VCE_STATUS_VCPU_REPORT_FW_LOADED_MASK	0x02
 
 #define VCE_V1_0_GART_PAGE_START \
-	(AMDGPU_GTT_MAX_TRANSFER_SIZE * AMDGPU_GTT_NUM_TRANSFER_WINDOWS)
+	(AMDGPU_GTT_MAX_TRANSFER_SIZE * AMDGPU_GPU_PAGES_IN_CPU_PAGE * \
+	 AMDGPU_GTT_NUM_TRANSFER_WINDOWS)
 #define VCE_V1_0_GART_ADDR_START \
 	(VCE_V1_0_GART_PAGE_START * AMDGPU_GPU_PAGE_SIZE)
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RESEND RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K
  2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
                   ` (3 preceding siblings ...)
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 4/6] drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
@ 2026-03-23  4:28 ` Donet Tom
  2026-03-25  2:58   ` Kuehling, Felix
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size Donet Tom
  2026-03-25  2:27 ` [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Kuehling, Felix
  6 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-23  4:28 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom

For gfxV9, due to a hardware bug ("based on the comments in the code
here [1]"), the control stack of a user-mode compute queue must be
allocated immediately after the page boundary of its regular MQD buffer.
To handle this, we allocate an enlarged MQD buffer where the first page
is used as the MQD and the remaining pages store the control stack.
Although these regions share the same BO, they require different memory
types: the MQD must be UC (uncached), while the control stack must be
NC (non-coherent), matching the behavior when the control stack is
allocated in user space.

This logic works correctly on systems where the CPU page size matches
the GPU page size (4K). However, the current implementation aligns both
the MQD and the control stack to the CPU PAGE_SIZE. On systems with a
larger CPU page size, the entire first CPU page is marked UC—even though
that page may contain multiple GPU pages. The GPU treats the second 4K
GPU page inside that CPU page as part of the control stack, but it is
incorrectly mapped as UC.

This patch fixes the issue by aligning both the MQD and control stack
sizes to the GPU page size (4K). The first 4K page is correctly marked
as UC for the MQD, and the remaining GPU pages are marked NC for the
control stack. This ensures proper memory type assignment on systems
with larger CPU page sizes.

[1]: https://elixir.bootlin.com/linux/v6.18/source/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c#L118

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 16 ++-----
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
 4 files changed, 64 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index ec911dce345f..4d884180cf61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -403,6 +403,50 @@ void amdgpu_gart_map_vram_range(struct amdgpu_device *adev, uint64_t pa,
 	drm_dev_exit(idx);
 }
 
+/**
+ * amdgpu_gart_map_gfx9_mqd - map mqd and ctrl_stack dma_addresses into GART entries
+ *
+ * @adev: amdgpu_device pointer
+ * @offset: offset into the GPU's gart aperture
+ * @pages: number of pages to bind
+ * @dma_addr: DMA addresses of pages
+ * @flags: page table entry flags
+ *
+ * Map the MQD and control stack addresses into GART entries with the correct
+ * memory types on gfxv9. The MQD occupies the first 4KB and is followed by
+ * the control stack. The MQD uses UC (uncached) memory, while the control stack
+ * uses NC (non-coherent) memory.
+ */
+void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
+			int pages, dma_addr_t *dma_addr, uint64_t flags)
+{
+	uint64_t page_base;
+	unsigned int i, j, t;
+	int idx;
+	uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
+	void *dst;
+
+	if (!adev->gart.ptr)
+		return;
+
+	if (!drm_dev_enter(adev_to_drm(adev), &idx))
+		return;
+
+	t = offset / AMDGPU_GPU_PAGE_SIZE;
+	dst = adev->gart.ptr;
+	for (i = 0; i < pages; i++) {
+		page_base = dma_addr[i];
+		for (j = 0; j < AMDGPU_GPU_PAGES_IN_CPU_PAGE; j++, t++) {
+			if ((i == 0) && (j == 0))
+				amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, flags);
+			else
+				amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, ctrl_flags);
+			page_base += AMDGPU_GPU_PAGE_SIZE;
+		}
+	}
+	drm_dev_exit(idx);
+}
+
 /**
  * amdgpu_gart_bind - bind pages into the gart page table
  *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
index d3118275ddae..6ebd2da32ea6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
@@ -62,6 +62,8 @@ void amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
 void amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
 		     int pages, dma_addr_t *dma_addr, uint64_t flags,
 		     void *dst);
+void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
+			int pages, dma_addr_t *dma_addr, uint64_t flags);
 void amdgpu_gart_bind(struct amdgpu_device *adev, uint64_t offset,
 		      int pages, dma_addr_t *dma_addr, uint64_t flags);
 void amdgpu_gart_map_vram_range(struct amdgpu_device *adev, uint64_t pa,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 67983955a124..e086eb1d2b24 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -855,25 +855,15 @@ static void amdgpu_ttm_gart_bind_gfx9_mqd(struct amdgpu_device *adev,
 	int num_xcc = max(1U, adev->gfx.num_xcc_per_xcp);
 	uint64_t page_idx, pages_per_xcc;
 	int i;
-	uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
 
 	pages_per_xcc = total_pages;
 	do_div(pages_per_xcc, num_xcc);
 
 	for (i = 0, page_idx = 0; i < num_xcc; i++, page_idx += pages_per_xcc) {
-		/* MQD page: use default flags */
-		amdgpu_gart_bind(adev,
+		amdgpu_gart_map_gfx9_mqd(adev,
 				gtt->offset + (page_idx << PAGE_SHIFT),
-				1, &gtt->ttm.dma_address[page_idx], flags);
-		/*
-		 * Ctrl pages - modify the memory type to NC (ctrl_flags) from
-		 * the second page of the BO onward.
-		 */
-		amdgpu_gart_bind(adev,
-				gtt->offset + ((page_idx + 1) << PAGE_SHIFT),
-				pages_per_xcc - 1,
-				&gtt->ttm.dma_address[page_idx + 1],
-				ctrl_flags);
+				pages_per_xcc, &gtt->ttm.dma_address[page_idx],
+				flags);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
index dcf4bbfa641b..ff0e483514da 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
@@ -42,9 +42,16 @@ static uint64_t mqd_stride_v9(struct mqd_manager *mm,
 				struct queue_properties *q)
 {
 	if (mm->dev->kfd->cwsr_enabled &&
-	    q->type == KFD_QUEUE_TYPE_COMPUTE)
-		return ALIGN(q->ctl_stack_size, PAGE_SIZE) +
-			ALIGN(sizeof(struct v9_mqd), PAGE_SIZE);
+	    q->type == KFD_QUEUE_TYPE_COMPUTE) {
+
+		/* On gfxv9, the MQD resides in the first 4K page,
+		 * followed by the control stack. Align both to
+		 * AMDGPU_GPU_PAGE_SIZE to maintain the required 4K boundary.
+		 */
+
+		return ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
+			ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE);
+	}
 
 	return mm->mqd_size;
 }
@@ -148,8 +155,8 @@ static struct kfd_mem_obj *allocate_mqd(struct mqd_manager *mm,
 		if (!mqd_mem_obj)
 			return NULL;
 		retval = amdgpu_amdkfd_alloc_kernel_mem(node->adev,
-			(ALIGN(q->ctl_stack_size, PAGE_SIZE) +
-			ALIGN(sizeof(struct v9_mqd), PAGE_SIZE)) *
+			(ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
+			ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE)) *
 			NUM_XCC(node->xcc_mask),
 			mqd_on_vram(node->adev) ? AMDGPU_GEM_DOMAIN_VRAM :
 						  AMDGPU_GEM_DOMAIN_GTT,
@@ -357,7 +364,7 @@ static int get_wave_state(struct mqd_manager *mm, void *mqd,
 	struct kfd_context_save_area_header header;
 
 	/* Control stack is located one page after MQD. */
-	void *mqd_ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
+	void *mqd_ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
 
 	m = get_mqd(mqd);
 
@@ -394,7 +401,7 @@ static void checkpoint_mqd(struct mqd_manager *mm, void *mqd, void *mqd_dst, voi
 {
 	struct v9_mqd *m;
 	/* Control stack is located one page after MQD. */
-	void *ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
+	void *ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
 
 	m = get_mqd(mqd);
 
@@ -440,7 +447,7 @@ static void restore_mqd(struct mqd_manager *mm, void **mqd,
 		*gart_addr = addr;
 
 	/* Control stack is located one page after MQD. */
-	ctl_stack = (void *)((uintptr_t)*mqd + PAGE_SIZE);
+	ctl_stack = (void *)((uintptr_t)*mqd + AMDGPU_GPU_PAGE_SIZE);
 	memcpy(ctl_stack, ctl_stack_src, ctl_stack_size);
 
 	m->cp_hqd_pq_doorbell_control =
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K Donet Tom
@ 2026-03-25  2:58   ` Kuehling, Felix
  2026-03-25 18:41     ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Kuehling, Felix @ 2026-03-25  2:58 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Alex Deucher, Alex Deucher, christian.koenig,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan


On 2026-03-23 00:28, Donet Tom wrote:
> For gfxV9, due to a hardware bug ("based on the comments in the code
> here [1]"), the control stack of a user-mode compute queue must be
> allocated immediately after the page boundary of its regular MQD buffer.
> To handle this, we allocate an enlarged MQD buffer where the first page
> is used as the MQD and the remaining pages store the control stack.
> Although these regions share the same BO, they require different memory
> types: the MQD must be UC (uncached), while the control stack must be
> NC (non-coherent), matching the behavior when the control stack is
> allocated in user space.
>
> This logic works correctly on systems where the CPU page size matches
> the GPU page size (4K). However, the current implementation aligns both
> the MQD and the control stack to the CPU PAGE_SIZE. On systems with a
> larger CPU page size, the entire first CPU page is marked UC—even though
> that page may contain multiple GPU pages. The GPU treats the second 4K
> GPU page inside that CPU page as part of the control stack, but it is
> incorrectly mapped as UC.
>
> This patch fixes the issue by aligning both the MQD and control stack
> sizes to the GPU page size (4K). The first 4K page is correctly marked
> as UC for the MQD, and the remaining GPU pages are marked NC for the
> control stack. This ensures proper memory type assignment on systems
> with larger CPU page sizes.
>
> [1]: https://elixir.bootlin.com/linux/v6.18/source/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c#L118
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>

Acked-by: Felix Kuehling <felix.kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 16 ++-----
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
>   4 files changed, 64 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> index ec911dce345f..4d884180cf61 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> @@ -403,6 +403,50 @@ void amdgpu_gart_map_vram_range(struct amdgpu_device *adev, uint64_t pa,
>   	drm_dev_exit(idx);
>   }
>   
> +/**
> + * amdgpu_gart_map_gfx9_mqd - map mqd and ctrl_stack dma_addresses into GART entries
> + *
> + * @adev: amdgpu_device pointer
> + * @offset: offset into the GPU's gart aperture
> + * @pages: number of pages to bind
> + * @dma_addr: DMA addresses of pages
> + * @flags: page table entry flags
> + *
> + * Map the MQD and control stack addresses into GART entries with the correct
> + * memory types on gfxv9. The MQD occupies the first 4KB and is followed by
> + * the control stack. The MQD uses UC (uncached) memory, while the control stack
> + * uses NC (non-coherent) memory.
> + */
> +void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
> +			int pages, dma_addr_t *dma_addr, uint64_t flags)
> +{
> +	uint64_t page_base;
> +	unsigned int i, j, t;
> +	int idx;
> +	uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
> +	void *dst;
> +
> +	if (!adev->gart.ptr)
> +		return;
> +
> +	if (!drm_dev_enter(adev_to_drm(adev), &idx))
> +		return;
> +
> +	t = offset / AMDGPU_GPU_PAGE_SIZE;
> +	dst = adev->gart.ptr;
> +	for (i = 0; i < pages; i++) {
> +		page_base = dma_addr[i];
> +		for (j = 0; j < AMDGPU_GPU_PAGES_IN_CPU_PAGE; j++, t++) {
> +			if ((i == 0) && (j == 0))
> +				amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, flags);
> +			else
> +				amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, ctrl_flags);
> +			page_base += AMDGPU_GPU_PAGE_SIZE;
> +		}
> +	}
> +	drm_dev_exit(idx);
> +}
> +
>   /**
>    * amdgpu_gart_bind - bind pages into the gart page table
>    *
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> index d3118275ddae..6ebd2da32ea6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> @@ -62,6 +62,8 @@ void amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
>   void amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
>   		     int pages, dma_addr_t *dma_addr, uint64_t flags,
>   		     void *dst);
> +void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
> +			int pages, dma_addr_t *dma_addr, uint64_t flags);
>   void amdgpu_gart_bind(struct amdgpu_device *adev, uint64_t offset,
>   		      int pages, dma_addr_t *dma_addr, uint64_t flags);
>   void amdgpu_gart_map_vram_range(struct amdgpu_device *adev, uint64_t pa,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 67983955a124..e086eb1d2b24 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -855,25 +855,15 @@ static void amdgpu_ttm_gart_bind_gfx9_mqd(struct amdgpu_device *adev,
>   	int num_xcc = max(1U, adev->gfx.num_xcc_per_xcp);
>   	uint64_t page_idx, pages_per_xcc;
>   	int i;
> -	uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
>   
>   	pages_per_xcc = total_pages;
>   	do_div(pages_per_xcc, num_xcc);
>   
>   	for (i = 0, page_idx = 0; i < num_xcc; i++, page_idx += pages_per_xcc) {
> -		/* MQD page: use default flags */
> -		amdgpu_gart_bind(adev,
> +		amdgpu_gart_map_gfx9_mqd(adev,
>   				gtt->offset + (page_idx << PAGE_SHIFT),
> -				1, &gtt->ttm.dma_address[page_idx], flags);
> -		/*
> -		 * Ctrl pages - modify the memory type to NC (ctrl_flags) from
> -		 * the second page of the BO onward.
> -		 */
> -		amdgpu_gart_bind(adev,
> -				gtt->offset + ((page_idx + 1) << PAGE_SHIFT),
> -				pages_per_xcc - 1,
> -				&gtt->ttm.dma_address[page_idx + 1],
> -				ctrl_flags);
> +				pages_per_xcc, &gtt->ttm.dma_address[page_idx],
> +				flags);
>   	}
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> index dcf4bbfa641b..ff0e483514da 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> @@ -42,9 +42,16 @@ static uint64_t mqd_stride_v9(struct mqd_manager *mm,
>   				struct queue_properties *q)
>   {
>   	if (mm->dev->kfd->cwsr_enabled &&
> -	    q->type == KFD_QUEUE_TYPE_COMPUTE)
> -		return ALIGN(q->ctl_stack_size, PAGE_SIZE) +
> -			ALIGN(sizeof(struct v9_mqd), PAGE_SIZE);
> +	    q->type == KFD_QUEUE_TYPE_COMPUTE) {
> +
> +		/* On gfxv9, the MQD resides in the first 4K page,
> +		 * followed by the control stack. Align both to
> +		 * AMDGPU_GPU_PAGE_SIZE to maintain the required 4K boundary.
> +		 */
> +
> +		return ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
> +			ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE);
> +	}
>   
>   	return mm->mqd_size;
>   }
> @@ -148,8 +155,8 @@ static struct kfd_mem_obj *allocate_mqd(struct mqd_manager *mm,
>   		if (!mqd_mem_obj)
>   			return NULL;
>   		retval = amdgpu_amdkfd_alloc_kernel_mem(node->adev,
> -			(ALIGN(q->ctl_stack_size, PAGE_SIZE) +
> -			ALIGN(sizeof(struct v9_mqd), PAGE_SIZE)) *
> +			(ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
> +			ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE)) *
>   			NUM_XCC(node->xcc_mask),
>   			mqd_on_vram(node->adev) ? AMDGPU_GEM_DOMAIN_VRAM :
>   						  AMDGPU_GEM_DOMAIN_GTT,
> @@ -357,7 +364,7 @@ static int get_wave_state(struct mqd_manager *mm, void *mqd,
>   	struct kfd_context_save_area_header header;
>   
>   	/* Control stack is located one page after MQD. */
> -	void *mqd_ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
> +	void *mqd_ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
>   
>   	m = get_mqd(mqd);
>   
> @@ -394,7 +401,7 @@ static void checkpoint_mqd(struct mqd_manager *mm, void *mqd, void *mqd_dst, voi
>   {
>   	struct v9_mqd *m;
>   	/* Control stack is located one page after MQD. */
> -	void *ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
> +	void *ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
>   
>   	m = get_mqd(mqd);
>   
> @@ -440,7 +447,7 @@ static void restore_mqd(struct mqd_manager *mm, void **mqd,
>   		*gart_addr = addr;
>   
>   	/* Control stack is located one page after MQD. */
> -	ctl_stack = (void *)((uintptr_t)*mqd + PAGE_SIZE);
> +	ctl_stack = (void *)((uintptr_t)*mqd + AMDGPU_GPU_PAGE_SIZE);
>   	memcpy(ctl_stack, ctl_stack_src, ctl_stack_size);
>   
>   	m->cp_hqd_pq_doorbell_control =

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K
  2026-03-25  2:58   ` Kuehling, Felix
@ 2026-03-25 18:41     ` Alex Deucher
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Deucher @ 2026-03-25 18:41 UTC (permalink / raw)
  To: Kuehling, Felix
  Cc: Donet Tom, amd-gfx, Alex Deucher, christian.koenig, Philip Yang,
	David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

Applied.  Thanks!

Alex

On Tue, Mar 24, 2026 at 10:58 PM Kuehling, Felix <felix.kuehling@amd.com> wrote:
>
>
> On 2026-03-23 00:28, Donet Tom wrote:
> > For gfxV9, due to a hardware bug ("based on the comments in the code
> > here [1]"), the control stack of a user-mode compute queue must be
> > allocated immediately after the page boundary of its regular MQD buffer.
> > To handle this, we allocate an enlarged MQD buffer where the first page
> > is used as the MQD and the remaining pages store the control stack.
> > Although these regions share the same BO, they require different memory
> > types: the MQD must be UC (uncached), while the control stack must be
> > NC (non-coherent), matching the behavior when the control stack is
> > allocated in user space.
> >
> > This logic works correctly on systems where the CPU page size matches
> > the GPU page size (4K). However, the current implementation aligns both
> > the MQD and the control stack to the CPU PAGE_SIZE. On systems with a
> > larger CPU page size, the entire first CPU page is marked UC—even though
> > that page may contain multiple GPU pages. The GPU treats the second 4K
> > GPU page inside that CPU page as part of the control stack, but it is
> > incorrectly mapped as UC.
> >
> > This patch fixes the issue by aligning both the MQD and control stack
> > sizes to the GPU page size (4K). The first 4K page is correctly marked
> > as UC for the MQD, and the remaining GPU pages are marked NC for the
> > control stack. This ensures proper memory type assignment on systems
> > with larger CPU page sizes.
> >
> > [1]: https://elixir.bootlin.com/linux/v6.18/source/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c#L118
> >
> > Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> Acked-by: Felix Kuehling <felix.kuehling@amd.com>
>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 16 ++-----
> >   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
> >   4 files changed, 64 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> > index ec911dce345f..4d884180cf61 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> > @@ -403,6 +403,50 @@ void amdgpu_gart_map_vram_range(struct amdgpu_device *adev, uint64_t pa,
> >       drm_dev_exit(idx);
> >   }
> >
> > +/**
> > + * amdgpu_gart_map_gfx9_mqd - map mqd and ctrl_stack dma_addresses into GART entries
> > + *
> > + * @adev: amdgpu_device pointer
> > + * @offset: offset into the GPU's gart aperture
> > + * @pages: number of pages to bind
> > + * @dma_addr: DMA addresses of pages
> > + * @flags: page table entry flags
> > + *
> > + * Map the MQD and control stack addresses into GART entries with the correct
> > + * memory types on gfxv9. The MQD occupies the first 4KB and is followed by
> > + * the control stack. The MQD uses UC (uncached) memory, while the control stack
> > + * uses NC (non-coherent) memory.
> > + */
> > +void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
> > +                     int pages, dma_addr_t *dma_addr, uint64_t flags)
> > +{
> > +     uint64_t page_base;
> > +     unsigned int i, j, t;
> > +     int idx;
> > +     uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
> > +     void *dst;
> > +
> > +     if (!adev->gart.ptr)
> > +             return;
> > +
> > +     if (!drm_dev_enter(adev_to_drm(adev), &idx))
> > +             return;
> > +
> > +     t = offset / AMDGPU_GPU_PAGE_SIZE;
> > +     dst = adev->gart.ptr;
> > +     for (i = 0; i < pages; i++) {
> > +             page_base = dma_addr[i];
> > +             for (j = 0; j < AMDGPU_GPU_PAGES_IN_CPU_PAGE; j++, t++) {
> > +                     if ((i == 0) && (j == 0))
> > +                             amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, flags);
> > +                     else
> > +                             amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, ctrl_flags);
> > +                     page_base += AMDGPU_GPU_PAGE_SIZE;
> > +             }
> > +     }
> > +     drm_dev_exit(idx);
> > +}
> > +
> >   /**
> >    * amdgpu_gart_bind - bind pages into the gart page table
> >    *
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> > index d3118275ddae..6ebd2da32ea6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> > @@ -62,6 +62,8 @@ void amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
> >   void amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
> >                    int pages, dma_addr_t *dma_addr, uint64_t flags,
> >                    void *dst);
> > +void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
> > +                     int pages, dma_addr_t *dma_addr, uint64_t flags);
> >   void amdgpu_gart_bind(struct amdgpu_device *adev, uint64_t offset,
> >                     int pages, dma_addr_t *dma_addr, uint64_t flags);
> >   void amdgpu_gart_map_vram_range(struct amdgpu_device *adev, uint64_t pa,
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > index 67983955a124..e086eb1d2b24 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > @@ -855,25 +855,15 @@ static void amdgpu_ttm_gart_bind_gfx9_mqd(struct amdgpu_device *adev,
> >       int num_xcc = max(1U, adev->gfx.num_xcc_per_xcp);
> >       uint64_t page_idx, pages_per_xcc;
> >       int i;
> > -     uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
> >
> >       pages_per_xcc = total_pages;
> >       do_div(pages_per_xcc, num_xcc);
> >
> >       for (i = 0, page_idx = 0; i < num_xcc; i++, page_idx += pages_per_xcc) {
> > -             /* MQD page: use default flags */
> > -             amdgpu_gart_bind(adev,
> > +             amdgpu_gart_map_gfx9_mqd(adev,
> >                               gtt->offset + (page_idx << PAGE_SHIFT),
> > -                             1, &gtt->ttm.dma_address[page_idx], flags);
> > -             /*
> > -              * Ctrl pages - modify the memory type to NC (ctrl_flags) from
> > -              * the second page of the BO onward.
> > -              */
> > -             amdgpu_gart_bind(adev,
> > -                             gtt->offset + ((page_idx + 1) << PAGE_SHIFT),
> > -                             pages_per_xcc - 1,
> > -                             &gtt->ttm.dma_address[page_idx + 1],
> > -                             ctrl_flags);
> > +                             pages_per_xcc, &gtt->ttm.dma_address[page_idx],
> > +                             flags);
> >       }
> >   }
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> > index dcf4bbfa641b..ff0e483514da 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
> > @@ -42,9 +42,16 @@ static uint64_t mqd_stride_v9(struct mqd_manager *mm,
> >                               struct queue_properties *q)
> >   {
> >       if (mm->dev->kfd->cwsr_enabled &&
> > -         q->type == KFD_QUEUE_TYPE_COMPUTE)
> > -             return ALIGN(q->ctl_stack_size, PAGE_SIZE) +
> > -                     ALIGN(sizeof(struct v9_mqd), PAGE_SIZE);
> > +         q->type == KFD_QUEUE_TYPE_COMPUTE) {
> > +
> > +             /* On gfxv9, the MQD resides in the first 4K page,
> > +              * followed by the control stack. Align both to
> > +              * AMDGPU_GPU_PAGE_SIZE to maintain the required 4K boundary.
> > +              */
> > +
> > +             return ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
> > +                     ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE);
> > +     }
> >
> >       return mm->mqd_size;
> >   }
> > @@ -148,8 +155,8 @@ static struct kfd_mem_obj *allocate_mqd(struct mqd_manager *mm,
> >               if (!mqd_mem_obj)
> >                       return NULL;
> >               retval = amdgpu_amdkfd_alloc_kernel_mem(node->adev,
> > -                     (ALIGN(q->ctl_stack_size, PAGE_SIZE) +
> > -                     ALIGN(sizeof(struct v9_mqd), PAGE_SIZE)) *
> > +                     (ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
> > +                     ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE)) *
> >                       NUM_XCC(node->xcc_mask),
> >                       mqd_on_vram(node->adev) ? AMDGPU_GEM_DOMAIN_VRAM :
> >                                                 AMDGPU_GEM_DOMAIN_GTT,
> > @@ -357,7 +364,7 @@ static int get_wave_state(struct mqd_manager *mm, void *mqd,
> >       struct kfd_context_save_area_header header;
> >
> >       /* Control stack is located one page after MQD. */
> > -     void *mqd_ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
> > +     void *mqd_ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
> >
> >       m = get_mqd(mqd);
> >
> > @@ -394,7 +401,7 @@ static void checkpoint_mqd(struct mqd_manager *mm, void *mqd, void *mqd_dst, voi
> >   {
> >       struct v9_mqd *m;
> >       /* Control stack is located one page after MQD. */
> > -     void *ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
> > +     void *ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
> >
> >       m = get_mqd(mqd);
> >
> > @@ -440,7 +447,7 @@ static void restore_mqd(struct mqd_manager *mm, void **mqd,
> >               *gart_addr = addr;
> >
> >       /* Control stack is located one page after MQD. */
> > -     ctl_stack = (void *)((uintptr_t)*mqd + PAGE_SIZE);
> > +     ctl_stack = (void *)((uintptr_t)*mqd + AMDGPU_GPU_PAGE_SIZE);
> >       memcpy(ctl_stack, ctl_stack_src, ctl_stack_size);
> >
> >       m->cp_hqd_pq_doorbell_control =

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RESEND RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size
  2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
                   ` (4 preceding siblings ...)
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K Donet Tom
@ 2026-03-23  4:28 ` Donet Tom
  2026-03-25  3:00   ` Kuehling, Felix
  2026-03-25  2:27 ` [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Kuehling, Felix
  6 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2026-03-23  4:28 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan, donettom

The control stack size is calculated based on the number of CUs and
waves, and is then aligned to PAGE_SIZE. When the resulting control
stack size is aligned to 64 KB, GPU hangs and queue preemption
failures are observed while running RCCL unit tests on systems with
more than two GPUs.

amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
doorbell_id: 80030008
amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
doorbell_id: 80030008
amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues

This issue is observed on both 4 KB and 64 KB system page-size
configurations.

This patch fixes the issue by aligning the control stack size to
AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size
will not be 64 KB on systems with a 64 KB page size and queue
preemption works correctly.

Additionally, In the current code, wg_data_size is aligned to PAGE_SIZE,
which can waste memory if the system page size is large. In this patch,
wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated
from wg_data_size and the control stack size, is aligned to PAGE_SIZE.

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index 572b21e39e83..9d4838461168 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -492,10 +492,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 	cu_num = props->simd_count / props->simd_per_cu / NUM_XCC(dev->gpu->xcc_mask);
 	wave_num = get_num_waves(props, gfxv, cu_num);

-	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
+	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
+				AMDGPU_GPU_PAGE_SIZE);
 	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
 	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
-			       PAGE_SIZE);
+			       AMDGPU_GPU_PAGE_SIZE);

 	if ((gfxv / 10000 * 10000) == 100000) {
 		/* HW design limits control stack size to 0x7000.
@@ -507,7 +508,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)

 	props->ctl_stack_size = ctl_stack_size;
 	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
-	props->cwsr_size = ctl_stack_size + wg_data_size;
+	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);

 	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
 		props->eop_buffer_size = 0x8000;
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size Donet Tom
@ 2026-03-25  3:00   ` Kuehling, Felix
  2026-03-25 18:42     ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Kuehling, Felix @ 2026-03-25  3:00 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Alex Deucher, Alex Deucher, christian.koenig,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan


On 2026-03-23 00:28, Donet Tom wrote:
> The control stack size is calculated based on the number of CUs and
> waves, and is then aligned to PAGE_SIZE. When the resulting control
> stack size is aligned to 64 KB, GPU hangs and queue preemption
> failures are observed while running RCCL unit tests on systems with
> more than two GPUs.
>
> amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
> doorbell_id: 80030008
> amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
> amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
> doorbell_id: 80030008
> amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>
> This issue is observed on both 4 KB and 64 KB system page-size
> configurations.
>
> This patch fixes the issue by aligning the control stack size to
> AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size
> will not be 64 KB on systems with a 64 KB page size and queue
> preemption works correctly.
>
> Additionally, In the current code, wg_data_size is aligned to PAGE_SIZE,
> which can waste memory if the system page size is large. In this patch,
> wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated
> from wg_data_size and the control stack size, is aligned to PAGE_SIZE.
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> index 572b21e39e83..9d4838461168 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> @@ -492,10 +492,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>   	cu_num = props->simd_count / props->simd_per_cu / NUM_XCC(dev->gpu->xcc_mask);
>   	wave_num = get_num_waves(props, gfxv, cu_num);
>   
> -	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
> +	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
> +				AMDGPU_GPU_PAGE_SIZE);
>   	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
>   	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
> -			       PAGE_SIZE);
> +			       AMDGPU_GPU_PAGE_SIZE);
>   
>   	if ((gfxv / 10000 * 10000) == 100000) {
>   		/* HW design limits control stack size to 0x7000.
> @@ -507,7 +508,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>   
>   	props->ctl_stack_size = ctl_stack_size;
>   	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
> -	props->cwsr_size = ctl_stack_size + wg_data_size;
> +	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
>   
>   	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
>   		props->eop_buffer_size = 0x8000;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size
  2026-03-25  3:00   ` Kuehling, Felix
@ 2026-03-25 18:42     ` Alex Deucher
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Deucher @ 2026-03-25 18:42 UTC (permalink / raw)
  To: Kuehling, Felix
  Cc: Donet Tom, amd-gfx, Alex Deucher, christian.koenig, Philip Yang,
	David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

Applied.  Thanks!

Alex

On Tue, Mar 24, 2026 at 11:00 PM Kuehling, Felix <felix.kuehling@amd.com> wrote:
>
>
> On 2026-03-23 00:28, Donet Tom wrote:
> > The control stack size is calculated based on the number of CUs and
> > waves, and is then aligned to PAGE_SIZE. When the resulting control
> > stack size is aligned to 64 KB, GPU hangs and queue preemption
> > failures are observed while running RCCL unit tests on systems with
> > more than two GPUs.
> >
> > amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
> > doorbell_id: 80030008
> > amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> > amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
> > amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
> > doorbell_id: 80030008
> > amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> > amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
> >
> > This issue is observed on both 4 KB and 64 KB system page-size
> > configurations.
> >
> > This patch fixes the issue by aligning the control stack size to
> > AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size
> > will not be 64 KB on systems with a 64 KB page size and queue
> > preemption works correctly.
> >
> > Additionally, In the current code, wg_data_size is aligned to PAGE_SIZE,
> > which can waste memory if the system page size is large. In this patch,
> > wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated
> > from wg_data_size and the control stack size, is aligned to PAGE_SIZE.
> >
> > Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
>
>
> > ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
> >   1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > index 572b21e39e83..9d4838461168 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > @@ -492,10 +492,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
> >       cu_num = props->simd_count / props->simd_per_cu / NUM_XCC(dev->gpu->xcc_mask);
> >       wave_num = get_num_waves(props, gfxv, cu_num);
> >
> > -     wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
> > +     wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
> > +                             AMDGPU_GPU_PAGE_SIZE);
> >       ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
> >       ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
> > -                            PAGE_SIZE);
> > +                            AMDGPU_GPU_PAGE_SIZE);
> >
> >       if ((gfxv / 10000 * 10000) == 100000) {
> >               /* HW design limits control stack size to 0x7000.
> > @@ -507,7 +508,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
> >
> >       props->ctl_stack_size = ctl_stack_size;
> >       props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
> > -     props->cwsr_size = ctl_stack_size + wg_data_size;
> > +     props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
> >
> >       if (gfxv == 80002)      /* GFX_VERSION_TONGA */
> >               props->eop_buffer_size = 0x8000;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems
  2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
                   ` (5 preceding siblings ...)
  2026-03-23  4:28 ` [RESEND RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size Donet Tom
@ 2026-03-25  2:27 ` Kuehling, Felix
  2026-03-25  8:02   ` Donet Tom
  6 siblings, 1 reply; 30+ messages in thread
From: Kuehling, Felix @ 2026-03-25  2:27 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Alex Deucher, Alex Deucher, christian.koenig,
	Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

[-- Attachment #1: Type: text/plain, Size: 6455 bytes --]


On 2026-03-23 00:28, Donet Tom wrote:
> This is v3 of the patch series enabling 64 KB system page size support
> in AMDGPU. v2, part 1 of this series [1] has already been merged
> upstream and provides the minimal infrastructure required for 64 KB
> page support.
>
> This series addresses additional issues uncovered in AMDGPU when
> running rccl unit tests and rocr-debug-agent tessts on 64KB page-size
> systems.
>
> With this series applied, all RCCL unit tests and rocr-debug-agent
> tests pass on systems using a 64 KB system page size, across
> multi-GPU configurations, with XNACK both enabled and disabled.
>
> Patch 1 in this series (drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE
> to 2 * PAGE_SIZE) fixes a kernel crash observed when running rocminfo
> on systems with a 64 KB page size. This patch is required to enable
> minimal support for 64 KB system page sizes.
>
> Since RFC v2, we observed AQL queue creation failures while running
> certain workloads on 64K page-size systems due to an expected queue size
> mismatch. This issue is addressed in patch 2 of this series.
>
> The questions we had in this seres are:
> =======================================
> 1 When the control stack size is aligned to 64 KB, we consistently
>    observe queue preemption or eviction failures on gfx9, on both
>    4 KB and 64 KB system page-size configurations.
>
>    The control stack size is calculated based on the number of CUs and
>    waves and is then aligned to PAGE_SIZE. On systems with a 64 KB
>    system page size, this alignment always results in a 64 KB-aligned
>    control stack size, after which queue preemption fails.
>
>    Is there any hardware-imposed limitation on gfx9 that prevents the
>    control stack size from being 64 KB? For gfx10, I see explicit
>    hardware limitations on the control stack size in the code [2].
>    Is there anything similar for gfx9?
>
>    What is the correct or recommended control stack size for gfx9?
>    With a 4 KB system page size, I observe a control stack size of
>    around 44 KB—can it grow beyond this? If the control stack size
>    is fixed for a given gfx version, do you see any issues with
>    aligning the control stack size to the GPU page size?

I think there is a bug in user mode that uses its own calculation of the 
ctl_stack_size to calculate the total context save area size. If kernel 
mode increases the ctl_stack_size, the context save are allocated by 
user mode will be too small.

This is in 
https://github.com/ROCm/rocm-systems/blob/3a8bafb6a60f4cfa1047a5516fa7212beef4c98f/projects/rocr-runtime/libhsakmt/src/queues.c#L349

                 /* Keep calculating it in case we are using an older kernel, but if we have
                  * the CtlStackSize and CwsrSize from KFD, use that as the definitive value
                  */
                 q->ctx_save_restore_size = node.CwsrSize > 0 ? node.CwsrSize :
                                            q->ctl_stack_size + PAGE_ALIGN_UP(wg_data_size);
                 q->ctl_stack_size = node.CtlStackSize > 0 ? node.CtlStackSize : q->ctl_stack_size;

ctx_save_restore_size should be calculated after correcting 
ctl_stack_size with the one from the kernel mode driver.

Regards,
   Felix

>
> This series has 5 patches
> =========================
> 1. AMDGPU_VA_RESERVED_TRAP_SIZE was hard-coded to 8 KB while
>     KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE, which matches on
>     4 KB page-size systems but results in a size mismatch on 64 KB
>     systems, leading to kernel crashes when running rocminfo or RCCL
>     unit tests.
>     This patch updates AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE so
>     that the reserved trap area matches the allocation size across all
>     system page sizes. This is a must needed patch to enable minimal
>     support for 64 KB system page sizes.
>
> 2. Aligned expected_queue_size to PAGE_SIZE to fix AQL queue creation
>     failure.
>
> 3. Fix amdgpu page fault handler (for xnack) to pass the corresponding
>     system pfn (instead of gpu pfn) for restoring SVM range mapping.
>
> 4. Updated AMDGPU_GTT_MAX_TRANSFER_SIZE to always match the PMD size
>     across all page sizes.
>
> 5. On systems where the CPU page size is larger than the GPU’s 4 KB page
>     size, the MQD and control stack were aligned to the CPU PAGE_SIZE,
>     causing multiple GPU pages to incorrectly inherit the UC attribute.
>     This change aligns both regions to the GPU page size, ensuring that
>     the MQD is mapped as UC and the control stack as NC, restoring the
>     correct behavior.
>
> 6. Queue preemption fails when the control stack size is aligned to
>     64 KB. This patch fixes this issue by aligning the control stack
>     size to gpu page size.
>
> Setup details:
> ============
> System details: Power10 LPAR using 64K pagesize.
> AMD GPU:
> Name:                    gfx90a
> Marketing Name:          AMD Instinct MI210
>
> [1]https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
> [2]https://elixir.bootlin.com/linux/v6.19-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L457
>
> RFC V3 -https://lore.kernel.org/all/cover.1771656655.git.donettom@linux.ibm.com/
> RFC V2 -https://lore.kernel.org/all/cover.1769612973.git.donettom@linux.ibm.com/
> RFC V1 -https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
>
>
> Donet Tom (6):
>    drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
>    drm/amdkfd: Align expected_queue_size to PAGE_SIZE
>    drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
>    drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
>    drm/amd: Fix MQD and control stack alignment for non-4K
>    drm/amdkfd: Fix queue preemption/eviction failures by aligning control
>      stack size to GPU page size
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 24 ++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  6 +--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |  2 +-
>   drivers/gpu/drm/amd/amdgpu/vce_v1_0.c         |  3 +-
>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c        | 11 ++---
>   9 files changed, 82 insertions(+), 35 deletions(-)
>

[-- Attachment #2: Type: text/html, Size: 7583 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems
  2026-03-25  2:27 ` [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Kuehling, Felix
@ 2026-03-25  8:02   ` Donet Tom
  0 siblings, 0 replies; 30+ messages in thread
From: Donet Tom @ 2026-03-25  8:02 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx, Alex Deucher, Alex Deucher,
	christian.koenig, Philip Yang
  Cc: David.YatSin, Kent.Russell, Ritesh Harjani,
	Vaidyanathan Srinivasan

[-- Attachment #1: Type: text/plain, Size: 7487 bytes --]


On 3/25/26 7:57 AM, Kuehling, Felix wrote:
> On 2026-03-23 00: 28, Donet Tom wrote: This is v3 of the patch series 
> enabling 64 KB system page size support in AMDGPU. v2, part 1 of this 
> series [1] has already been merged upstream and provides the minimal 
> infrastructure required for 64 KB
> 
>
>
> On 2026-03-23 00:28, Donet Tom wrote:
>> This is v3 of the patch series enabling 64 KB system page size support
>> in AMDGPU. v2, part 1 of this series [1] has already been merged
>> upstream and provides the minimal infrastructure required for 64 KB
>> page support.
>>
>> This series addresses additional issues uncovered in AMDGPU when
>> running rccl unit tests and rocr-debug-agent tessts on 64KB page-size
>> systems.
>>
>> With this series applied, all RCCL unit tests and rocr-debug-agent
>> tests pass on systems using a 64 KB system page size, across
>> multi-GPU configurations, with XNACK both enabled and disabled.
>>
>> Patch 1 in this series (drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE
>> to 2 * PAGE_SIZE) fixes a kernel crash observed when running rocminfo
>> on systems with a 64 KB page size. This patch is required to enable
>> minimal support for 64 KB system page sizes.
>>
>> Since RFC v2, we observed AQL queue creation failures while running
>> certain workloads on 64K page-size systems due to an expected queue size
>> mismatch. This issue is addressed in patch 2 of this series.
>>
>> The questions we had in this seres are:
>> =======================================
>> 1 When the control stack size is aligned to 64 KB, we consistently
>>    observe queue preemption or eviction failures on gfx9, on both
>>    4 KB and 64 KB system page-size configurations.
>>
>>    The control stack size is calculated based on the number of CUs and
>>    waves and is then aligned to PAGE_SIZE. On systems with a 64 KB
>>    system page size, this alignment always results in a 64 KB-aligned
>>    control stack size, after which queue preemption fails.
>>
>>    Is there any hardware-imposed limitation on gfx9 that prevents the
>>    control stack size from being 64 KB? For gfx10, I see explicit
>>    hardware limitations on the control stack size in the code [2].
>>    Is there anything similar for gfx9?
>>
>>    What is the correct or recommended control stack size for gfx9?
>>    With a 4 KB system page size, I observe a control stack size of
>>    around 44 KB—can it grow beyond this? If the control stack size
>>    is fixed for a given gfx version, do you see any issues with
>>    aligning the control stack size to the GPU page size?


Thank you, Felix, for your time and for reviewing this patch


> I think there is a bug in user mode that uses its own calculation of 
> the ctl_stack_size to calculate the total context save area size. If 
> kernel mode increases the ctl_stack_size, the context save are 
> allocated by user mode will be too small.
>
> This is in 
> https://github.com/ROCm/rocm-systems/blob/3a8bafb6a60f4cfa1047a5516fa7212beef4c98f/projects/rocr-runtime/libhsakmt/src/queues.c#L349
>
>                  /* Keep calculating it in case we are using an older kernel, but if we have
>                   * the CtlStackSize and CwsrSize from KFD, use that as the definitive value
>                   */
>                  q->ctx_save_restore_size = node.CwsrSize > 0 ? node.CwsrSize :
>                                             q->ctl_stack_size + PAGE_ALIGN_UP(wg_data_size);
>                  q->ctl_stack_size = node.CtlStackSize > 0 ? node.CtlStackSize : q->ctl_stack_size;
>
> ctx_save_restore_size should be calculated after correcting 
> ctl_stack_size with the one from the kernel mode driver.
>

Yes, we also need a fix in rocr-runtime. In rocr-runtime, I used the 
same approach as in the kernel (patch 6/6) to calculate ctl_stack_size 
and ctx_save_restore_size. Without this library change, I was 
encountering queue creation failures. With the library change and with 
this series all rccl tests are passing on both 4K and 64K page sizes.


-Donet


> Regards,
>    Felix
>
>> This series has 5 patches
>> =========================
>> 1. AMDGPU_VA_RESERVED_TRAP_SIZE was hard-coded to 8 KB while
>>     KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE, which matches on
>>     4 KB page-size systems but results in a size mismatch on 64 KB
>>     systems, leading to kernel crashes when running rocminfo or RCCL
>>     unit tests.
>>     This patch updates AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE so
>>     that the reserved trap area matches the allocation size across all
>>     system page sizes. This is a must needed patch to enable minimal
>>     support for 64 KB system page sizes.
>>
>> 2. Aligned expected_queue_size to PAGE_SIZE to fix AQL queue creation
>>     failure.
>>
>> 3. Fix amdgpu page fault handler (for xnack) to pass the corresponding
>>     system pfn (instead of gpu pfn) for restoring SVM range mapping.
>>
>> 4. Updated AMDGPU_GTT_MAX_TRANSFER_SIZE to always match the PMD size
>>     across all page sizes.
>>
>> 5. On systems where the CPU page size is larger than the GPU’s 4 KB page
>>     size, the MQD and control stack were aligned to the CPU PAGE_SIZE,
>>     causing multiple GPU pages to incorrectly inherit the UC attribute.
>>     This change aligns both regions to the GPU page size, ensuring that
>>     the MQD is mapped as UC and the control stack as NC, restoring the
>>     correct behavior.
>>
>> 6. Queue preemption fails when the control stack size is aligned to
>>     64 KB. This patch fixes this issue by aligning the control stack
>>     size to gpu page size.
>>
>> Setup details:
>> ============
>> System details: Power10 LPAR using 64K pagesize.
>> AMD GPU:
>> Name:                    gfx90a
>> Marketing Name:          AMD Instinct MI210
>>
>> [1]https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
>> [2]https://elixir.bootlin.com/linux/v6.19-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L457
>>
>> RFC V3 -https://lore.kernel.org/all/cover.1771656655.git.donettom@linux.ibm.com/
>> RFC V2 -https://lore.kernel.org/all/cover.1769612973.git.donettom@linux.ibm.com/
>> RFC V1 -https://lore.kernel.org/all/cover.1765519875.git.donettom@linux.ibm.com/
>>
>>
>> Donet Tom (6):
>>    drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
>>    drm/amdkfd: Align expected_queue_size to PAGE_SIZE
>>    drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
>>    drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
>>    drm/amd: Fix MQD and control stack alignment for non-4K
>>    drm/amdkfd: Fix queue preemption/eviction failures by aligning control
>>      stack size to GPU page size
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 44 +++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 24 ++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  6 +--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/vce_v1_0.c         |  3 +-
>>   .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 23 ++++++----
>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c        | 11 ++---
>>   9 files changed, 82 insertions(+), 35 deletions(-)
>>

[-- Attachment #2: Type: text/html, Size: 11043 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2026-03-25 18:43 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-23  4:28 [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
2026-03-23  4:28 ` [RESEND RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages Donet Tom
2026-03-23 10:11   ` Christian König
2026-03-23 11:50     ` Donet Tom
2026-03-23 13:12       ` Christian König
2026-03-24 18:19         ` Donet Tom
2026-03-25  2:26           ` Kuehling, Felix
2026-03-25  9:34             ` Christian König
2026-03-25 10:26               ` Donet Tom
2026-03-25 10:29                 ` Christian König
2026-03-25 17:54                   ` Kuehling, Felix
2026-03-25 17:59                     ` Donet Tom
2026-03-23  4:28 ` [RESEND RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE Donet Tom
2026-03-25  2:28   ` Kuehling, Felix
2026-03-25 18:33     ` Alex Deucher
2026-03-23  4:28 ` [RESEND RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
2026-03-23 13:04   ` Christian König
2026-03-24 13:10     ` Alex Deucher
2026-03-25 18:04       ` Donet Tom
2026-03-25 18:36         ` Alex Deucher
2026-03-25  2:30   ` Kuehling, Felix
2026-03-23  4:28 ` [RESEND RFC PATCH v3 4/6] drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
2026-03-23  4:28 ` [RESEND RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K Donet Tom
2026-03-25  2:58   ` Kuehling, Felix
2026-03-25 18:41     ` Alex Deucher
2026-03-23  4:28 ` [RESEND RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size Donet Tom
2026-03-25  3:00   ` Kuehling, Felix
2026-03-25 18:42     ` Alex Deucher
2026-03-25  2:27 ` [RESEND RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Kuehling, Felix
2026-03-25  8:02   ` Donet Tom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox