AMD-GFX Archive on lore.kernel.org
* [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
@ 2025-12-12  6:40 Donet Tom
  2025-12-12  6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

This patch series addresses a few issues which we encountered while running
the rocr debug agent and rccl unit tests with an AMD GPU on Power10
(ppc64le), using a 64K system pagesize.

Note that we don't observe any of these issues while booting with a 4K system
pagesize on Power. With the 64K system pagesize, what we have observed so far
is that in a few places the conversion between gpu pfn and cpu pfn (or vice
versa) is not done correctly (due to the different page size of the AMD GPU
(4K) v/s the cpu pagesize (64K)), which causes issues like gpu page faults or
gpu hangs while running these tests.

Changes so far in this series:
=============================
1. For now, during kfd queue creation, lift the restriction that the EOP
   buffer size must exactly match the buffer object mapping size.

2. Fix SVM range map/unmap operations to convert CPU page numbers to GPU page
   numbers before calling amdgpu_vm_update_range(), which expects 4K GPU pages
   (see the conversion sketch after this list). Without this, the
   rocr-debug-agent tests and rccl unit tests were failing.

3. Fix GART PTE allocation in migration code to account for multiple GPU pages
   per CPU page. The current code only allocates PTEs based on the number of
   CPU pages, but the GART may need one PTE per 4K GPU page.

4. Adjust AMDGPU_GTT_MAX_TRANSFER_SIZE to respect the SDMA engine's 4MB hardware
   limit regardless of CPU page size. The hardcoded 512 pages worked on 4K
   systems but appears to exceed the limit with a 64K system page size.

5. In the current driver, MMIO remap is supported only when the system page
   size is 4K. Error messages have been added to indicate that MMIO remap
   is not supported on systems with a non-4K page size.

6. Fix amdgpu page fault handler (for xnack) to pass the corresponding system
   pfn (instead of gpu pfn) for restoring SVM range mapping.

7. Align ctl_stack_size and wg_data_size to GPU page size.

8. On systems where the CPU page size is larger than the GPU’s 4K page size,
   the MQD and control stack are aligned to the CPU PAGE_SIZE, causing
   multiple GPU pages to inherit the UC attribute incorrectly. This results
   in the control-stack area being mis-mapped and leads to queue preemption
   and eviction failures. Aligning both regions to the GPU page size
   ensures the MQD is mapped UC and the control stack NC, restoring correct
   behavior.

9. Apart from these 8 changes, we also needed this change [1]. Without this
   change the kernel simply crashes when running the rocminfo command itself.
   [1]: https://github.com/greenforce-project/chromeos-kernel-mirror/commit/2b33fad96c3129a2a53a42b9d90fb3b906145b98
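
For reference, most of the conversions in this series come down to the
existing AMDGPU_GPU_PAGES_IN_CPU_PAGE ratio. A minimal sketch of the math on
a 64K system (the pfn variables below are illustrative only; the macros are
the existing driver definitions):

	#define AMDGPU_GPU_PAGE_SIZE		4096
	#define AMDGPU_GPU_PAGE_SHIFT		12
	#define AMDGPU_GPU_PAGES_IN_CPU_PAGE	(PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE)

	/* 64K system page: 65536 / 4096 = 16 GPU pages per CPU page */
	gpu_pfn = cpu_pfn * AMDGPU_GPU_PAGES_IN_CPU_PAGE; /* CPU pfn -> first GPU pfn */
	cpu_pfn = gpu_pfn / AMDGPU_GPU_PAGES_IN_CPU_PAGE; /* GPU pfn -> owning CPU pfn */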

Setup details:
============
System details: Power10 LPAR using 64K pagesize.
AMD GPU:
  Name:                    gfx90a
  Marketing Name:          AMD Instinct MI210

Queries:
=======
1. We currently ran the rocr-debug-agent tests [1] and rccl unit tests [2] to
   test these changes. Is there anything else that you would suggest we run to
   shake out any other page size related issues w.r.t. the kernel driver?

2. Patch 1/8: We have a query regarding the EOP buffer size. Is this EOP ring
   buffer size HW dependent? Should it be made PAGE_SIZE?

3. Patch 5/8: We also have a query w.r.t. the error paths when the system page
   size is > 4K. Do we need to lift this restriction and add MMIO remap support
   for systems with non-4K page sizes?

[1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
[2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test


Please note that the changes in this series are on a best-effort basis from
our end. We therefore request the amd-gfx community (who have deeper knowledge
of the HW & SW stack) to kindly help with the review and provide feedback /
comments on these patches. The idea here is to also have non-4K pagesizes
(e.g. 64K) well supported by the amd gpu kernel driver.

Donet Tom (7):
  drm/amdkfd: Relax size checking during queue buffer get
  amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page
    sizes
  amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in
    svm_migrate_gart_map()
  amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page
    size
  drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead
    of CPU page size
  amdgpu: Fix MQD and control stack alignment for non-4K CPU page size
    systems

Ritesh Harjani (IBM) (1):
  amdkfd/kfd_chardev: Add error message for non-4k pagesize failures

 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 29 ++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 16 ++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  6 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 10 +++++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  2 +-
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 15 +++++-----
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c        | 17 ++++++-----
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 30 ++++++++++++++-----
 10 files changed, 86 insertions(+), 43 deletions(-)

-- 
2.52.0



* [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-15 20:25   ` Philip Yang
  2025-12-12  6:40 ` [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes Donet Tom
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

HW-supported EOP buffer sizes are 4K and 32K. On systems that do not
use 4K pages, the minimum buffer object (BO) allocation size is
PAGE_SIZE (for example, 64K). During queue buffer acquisition, the driver
currently checks the allocated BO size against the supported EOP buffer
size. Since the allocated BO is larger than the expected size, this check
fails, preventing queue creation.

Relax the strict size validation and allow PAGE_SIZE-sized BOs to be used.
Only the required EOP-sized region of the buffer will be used as the EOP
buffer, which avoids queue creation failures on non-4K page systems.
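
To illustrate with made-up numbers on a 64K system (sizes below are in
GPU-page units, i.e. after the AMDGPU_GPU_PAGE_SHIFT conversion done by
kfd_queue_buffer_get()):

	/* expected EOP size 32K   -> size    = 32K >> 12 = 0x8  GPU pages
	 * smallest BO mapping 64K -> bo_size = 64K >> 12 = 0x10 GPU pages
	 * old check: bo_size must equal size -> fails, queue creation aborted
	 * new check: bo_size >= size         -> passes, only the EOP region is used
	 */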

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index f1e7583650c4..dc857450fa16 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -199,6 +199,7 @@ int kfd_queue_buffer_get(struct amdgpu_vm *vm, void __user *addr, struct amdgpu_
 	struct amdgpu_bo_va_mapping *mapping;
 	u64 user_addr;
 	u64 size;
+	u64 bo_size;
 
 	user_addr = (u64)addr >> AMDGPU_GPU_PAGE_SHIFT;
 	size = expected_size >> AMDGPU_GPU_PAGE_SHIFT;
@@ -207,11 +208,12 @@ int kfd_queue_buffer_get(struct amdgpu_vm *vm, void __user *addr, struct amdgpu_
 	if (!mapping)
 		goto out_err;
 
-	if (user_addr != mapping->start ||
-	    (size != 0 && user_addr + size - 1 != mapping->last)) {
-		pr_debug("expected size 0x%llx not equal to mapping addr 0x%llx size 0x%llx\n",
+	bo_size = mapping->last - mapping->start + 1;
+
+	if (user_addr != mapping->start || (size != 0 && bo_size < size)) {
+		pr_debug("expected size 0x%llx greater than mapping addr 0x%llx size 0x%llx\n",
 			expected_size, mapping->start << AMDGPU_GPU_PAGE_SHIFT,
-			(mapping->last - mapping->start + 1) << AMDGPU_GPU_PAGE_SHIFT);
+			bo_size << AMDGPU_GPU_PAGE_SHIFT);
 		goto out_err;
 	}
 
-- 
2.52.0



* [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
  2025-12-12  6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-15 20:44   ` Philip Yang
  2025-12-12  6:40 ` [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map() Donet Tom
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

SVM range size is tracked using the system page size. The range start and
end are aligned to system page-sized PFNs, so the total SVM range size
equals the total number of pages in the SVM range multiplied by the system
page size.

The SVM range map/unmap functions pass these system-page-sized PFNs
to amdgpu_vm_update_range(), which expects PFNs based on the GPU page size
(4K). On non-4K page systems, this mismatch causes only part of the SVM
range to be mapped in the GPU page table, while the rest remains unmapped.
If the GPU accesses an unmapped address within the same range, it results
in a GPU page fault.

To fix this, the required conversion has been added in both
svm_range_map_to_gpu() and svm_range_unmap_from_gpu(), ensuring that all
pages in the SVM range are correctly mapped on non-4K systems.
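
As an illustration (made-up pfns, 64K system, so AMDGPU_GPU_PAGES_IN_CPU_PAGE
is 16):

	/* CPU pfn range of the prange:      [0x10, 0x1f]      (16 CPU pages = 1M)
	 * converted range for amdgpu_vm_update_range():
	 *   gpu_start = 0x10 * 16          = 0x100
	 *   gpu_end   = (0x1f + 1) * 16 - 1 = 0x1ff           (256 GPU pages = 1M)
	 * Without the conversion only GPU pfns [0x10, 0x1f] (64K of the 1M range)
	 * get mapped and the GPU faults on the rest.
	 */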

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 30 ++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 74a1d3e1d52b..a2636f2d6c71 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1314,11 +1314,16 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 			 struct dma_fence **fence)
 {
 	uint64_t init_pte_value = 0;
+	uint64_t gpu_start, gpu_end;
 
-	pr_debug("[0x%llx 0x%llx]\n", start, last);
+	/* Convert CPU page range to GPU page range */
+	gpu_start = start * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
+	gpu_end = (last + 1) * AMDGPU_GPU_PAGES_IN_CPU_PAGE - 1;
 
-	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, start,
-				      last, init_pte_value, 0, 0, NULL, NULL,
+	pr_debug("%s: CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", __func__,
+		 start, last, gpu_start, gpu_end);
+	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, gpu_start,
+				      gpu_end, init_pte_value, 0, 0, NULL, NULL,
 				      fence);
 }
 
@@ -1398,9 +1403,13 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
 		 last_start, last_start + npages - 1, readonly);
 
 	for (i = offset; i < offset + npages; i++) {
+		uint64_t gpu_start;
+		uint64_t gpu_end;
+
 		last_domain = dma_addr[i] & SVM_RANGE_VRAM_DOMAIN;
 		dma_addr[i] &= ~SVM_RANGE_VRAM_DOMAIN;
 
+
 		/* Collect all pages in the same address range and memory domain
 		 * that can be mapped with a single call to update mapping.
 		 */
@@ -1415,17 +1424,22 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
 		if (readonly)
 			pte_flags &= ~AMDGPU_PTE_WRITEABLE;
 
-		pr_debug("svms 0x%p map [0x%lx 0x%llx] vram %d PTE 0x%llx\n",
-			 prange->svms, last_start, prange->start + i,
-			 (last_domain == SVM_RANGE_VRAM_DOMAIN) ? 1 : 0,
-			 pte_flags);
 
 		/* For dGPU mode, we use same vm_manager to allocate VRAM for
 		 * different memory partition based on fpfn/lpfn, we should use
 		 * same vm_manager.vram_base_offset regardless memory partition.
 		 */
+		gpu_start = last_start * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
+		gpu_end = (prange->start + i + 1) * AMDGPU_GPU_PAGES_IN_CPU_PAGE - 1;
+
+		pr_debug("svms 0x%p map CPU[0x%lx 0x%llx] GPU[0x%llx 0x%llx] vram %d PTE 0x%llx\n",
+			 prange->svms, last_start, prange->start + i,
+			 gpu_start, gpu_end,
+			 (last_domain == SVM_RANGE_VRAM_DOMAIN) ? 1 : 0,
+			 pte_flags);
+
 		r = amdgpu_vm_update_range(adev, vm, false, false, flush_tlb, true,
-					   NULL, last_start, prange->start + i,
+					   NULL, gpu_start, gpu_end,
 					   pte_flags,
 					   (last_start - prange->start) << PAGE_SHIFT,
 					   bo_adev ? bo_adev->vm_manager.vram_base_offset : 0,
-- 
2.52.0



* [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map()
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
  2025-12-12  6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
  2025-12-12  6:40 ` [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-15 21:03   ` Philip Yang
  2025-12-12  6:40 ` [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

In svm_migrate_gart_map(), when setting up the GART mapping for migration,
the number of bytes copied for the GART table only accounts for CPU pages.
systems, each CPU page can contain multiple GPU pages, and the GART
requires one 8-byte PTE per GPU page. As a result, an incorrect size was
passed to the DMA, causing only a partial update of the GART table.

Fix this function to work correctly on non-4K page-size systems by
accounting for the number of GPU pages per CPU page when calculating the
number of bytes to be copied.
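
As an illustration with made-up numbers on a 64K system
(AMDGPU_GPU_PAGES_IN_CPU_PAGE = 16):

	/* npages (CPU pages)        = 512
	 * GART PTEs actually needed = 512 * 16     = 8192  (one per 4K GPU page)
	 * old: num_bytes = 512 * 8                 = 4096  -> partial GART update
	 * new: num_bytes = 512 * 8 * 16            = 65536 -> all PTEs copied
	 */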

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 59a5a3fea65d..ea8377071c39 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -62,7 +62,7 @@ svm_migrate_gart_map(struct amdgpu_ring *ring, u64 npages,
 	*gart_addr = adev->gmc.gart_start;
 
 	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
-	num_bytes = npages * 8;
+	num_bytes = npages * 8 * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
 
 	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.high_pr,
 				     AMDGPU_FENCE_OWNER_UNDEFINED,
-- 
2.52.0



* [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
                   ` (2 preceding siblings ...)
  2025-12-12  6:40 ` [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map() Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-12  8:53   ` Christian König
  2025-12-12  6:40 ` [RFC PATCH v1 5/8] amdkfd/kfd_chardev: Add error message for non-4k pagesize failures Donet Tom
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

The SDMA engine has a hardware limitation of 4 MB maximum transfer
size per operation. AMDGPU_GTT_MAX_TRANSFER_SIZE was hardcoded to
512 pages, which worked correctly on systems with 4K pages but fails
on systems with larger page sizes.

This patch divides the maximum transfer size by AMDGPU_GPU_PAGES_IN_CPU_PAGE
so that the limit is also respected on non-4K page size systems.
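
The resulting per-window transfer size works out as follows (illustrative
arithmetic only, AMDGPU_GTT_MAX_TRANSFER_SIZE * PAGE_SIZE):

	/* 4K  pages, 512 (current):   512 * 4K  =  2M
	 * 64K pages, 512 (current):   512 * 64K = 32M
	 * 64K pages, 512/16 (patch):   32 * 64K =  2M
	 */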

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 0be2728aa872..9d038feb25b0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -37,7 +37,7 @@
 #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
 #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
 
-#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
+#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(512 / AMDGPU_GPU_PAGES_IN_CPU_PAGE)
 #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
 
 extern const struct attribute_group amdgpu_vram_mgr_attr_group;
-- 
2.52.0



* [RFC PATCH v1 5/8] amdkfd/kfd_chardev: Add error message for non-4k pagesize failures
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
                   ` (3 preceding siblings ...)
  2025-12-12  6:40 ` [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-12  6:40 ` [RFC PATCH v1 6/8] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

In the current driver, MMIO remap is supported only when the system page
size is 4K. Error messages have been added to indicate that MMIO remap
is not supported on systems with a non-4K page size.

Do we need to lift this restriction and add MMIO remap support for systems
with non-4K page sizes?

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 0f0719528bcc..19632795c389 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1134,6 +1134,8 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
 		offset = dev->adev->rmmio_remap.bus_addr;
 		if (!offset || (PAGE_SIZE > 4096)) {
 			err = -ENOMEM;
+			pr_err("%s: Failed MMIO remap off:0x%llx PAGE_SIZE:%lu (requires 4K)\n",
+				__func__, offset, PAGE_SIZE);
 			goto err_unlock;
 		}
 	}
@@ -2317,7 +2319,8 @@ static int criu_restore_memory_of_gpu(struct kfd_process_device *pdd,
 		}
 		offset = pdd->dev->adev->rmmio_remap.bus_addr;
 		if (!offset || (PAGE_SIZE > 4096)) {
-			pr_err("amdgpu_amdkfd_get_mmio_remap_phys_addr failed\n");
+			pr_err("%s: amdgpu_amdkfd_get_mmio_remap_phys_addr failed off:0x%llx, page_size:%lu\n",
+				__func__, offset, PAGE_SIZE);
 			return -ENOMEM;
 		}
 	} else if (bo_bucket->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) {
@@ -3367,8 +3370,11 @@ static int kfd_mmio_mmap(struct kfd_node *dev, struct kfd_process *process,
 	if (vma->vm_end - vma->vm_start != PAGE_SIZE)
 		return -EINVAL;
 
-	if (PAGE_SIZE > 4096)
+	if (PAGE_SIZE > 4096) {
+		pr_err("%s: MMIO mmap not supported with PAGE_SIZE=%lu (requires 4K)\n",
+			__func__, PAGE_SIZE);
 		return -EINVAL;
+	}
 
 	address = dev->adev->rmmio_remap.bus_addr;
 
-- 
2.52.0



* [RFC PATCH v1 6/8] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
                   ` (4 preceding siblings ...)
  2025-12-12  6:40 ` [RFC PATCH v1 5/8] amdkfd/kfd_chardev: Add error message for non-4k pagesize failures Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-12  6:40 ` [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size Donet Tom
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

During a GPU page fault, the driver restores the SVM range and then maps it
into the GPU page tables. The current implementation passes a GPU-page-size
(4K-based) PFN to svm_range_restore_pages() to restore the range.

SVM ranges are tracked using system-page-size PFNs. On systems where the
system page size is larger than 4K, using GPU-page-size PFNs to restore the
range causes two problems:

Range lookup failure:
Because the restore function receives PFNs in GPU (4K) units, the SVM
range lookup does not find the existing range. This results in a
duplicate SVM range being created.

VMA lookup failure:
The restore function also tries to locate the VMA for the faulting address.
It converts the GPU-page-size PFN into an address using the system page
size, which results in an incorrect address on non-4K page-size systems.
As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
removed".

This patch passes the system-page-size PFN to svm_range_restore_pages() so
that the SVM range is restored correctly on non-4K page systems.
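
As an illustration with a made-up fault address on a 64K system:

	/* addr = 0x7fff00010000
	 * GPU-page pfn:    addr / AMDGPU_GPU_PAGE_SIZE = addr >> 12 = 0x7fff00010
	 * system-page pfn: addr >> PAGE_SHIFT          = addr >> 16 = 0x7fff0001
	 * svm_range_restore_pages() looks up ranges tracked in system-page pfns,
	 * so only the second value finds the existing SVM range and its VMA.
	 */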

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 676e24fb8864..6a11c9093e0c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2958,14 +2958,14 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 	if (!root)
 		return false;
 
-	addr /= AMDGPU_GPU_PAGE_SIZE;
-
 	if (is_compute_context && !svm_range_restore_pages(adev, pasid, vmid,
-	    node_id, addr, ts, write_fault)) {
+	    node_id, addr >> PAGE_SHIFT, ts, write_fault)) {
 		amdgpu_bo_unref(&root);
 		return true;
 	}
 
+	addr /= AMDGPU_GPU_PAGE_SIZE;
+
 	r = amdgpu_bo_reserve(root, true);
 	if (r)
 		goto error_unref;
-- 
2.52.0



* [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
                   ` (5 preceding siblings ...)
  2025-12-12  6:40 ` [RFC PATCH v1 6/8] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-12  9:04   ` Christian König
  2025-12-12  6:40 ` [RFC PATCH v1 8/8] amdgpu: Fix MQD and control stack alignment for non-4K CPU page size systems Donet Tom
  2025-12-12  9:01 ` [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K " Christian König
  8 siblings, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

The ctl_stack_size and wg_data_size values are used to compute the total
context save/restore buffer size and the control stack size. These buffers
are programmed into the GPU and are used to store the queue state during
context save and restore.

Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary
memory waste because the GPU internally calculates and uses buffer sizes
aligned to a fixed 4K GPU page size.

Since the control stack and context save/restore buffers are consumed by
the GPU, their sizes should be aligned to the GPU page size (4K), not the
CPU page size. This patch updates the alignment of ctl_stack_size and
wg_data_size to prevent over-allocation on systems with larger CPU page
sizes.
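
As an illustration with a made-up control stack size on a 64K system:

	/* raw ctl_stack_size                        = 0x1e08
	 * ALIGN(0x1e08, PAGE_SIZE = 64K) (current)  = 0x10000
	 * ALIGN(0x1e08, AMDGPU_GPU_PAGE_SIZE = 4K)  = 0x2000
	 * cwsr_size is still rounded up to PAGE_SIZE at the end, so the final
	 * allocation stays CPU-page aligned while the per-buffer padding shrinks.
	 */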

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index dc857450fa16..00ab941c3e86 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 		    min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
 		    : cu_num * 32;
 
-	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
+	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
+				AMDGPU_GPU_PAGE_SIZE);
 	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
 	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
-			       PAGE_SIZE);
+			       AMDGPU_GPU_PAGE_SIZE);
 
 	if ((gfxv / 10000 * 10000) == 100000) {
 		/* HW design limits control stack size to 0x7000.
@@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 
 	props->ctl_stack_size = ctl_stack_size;
 	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
-	props->cwsr_size = ctl_stack_size + wg_data_size;
+	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
 
 	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
 		props->eop_buffer_size = 0x8000;
-- 
2.52.0



* [RFC PATCH v1 8/8] amdgpu: Fix MQD and control stack alignment for non-4K CPU page size systems
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
                   ` (6 preceding siblings ...)
  2025-12-12  6:40 ` [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size Donet Tom
@ 2025-12-12  6:40 ` Donet Tom
  2025-12-12  9:01 ` [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K " Christian König
  8 siblings, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-12  6:40 UTC (permalink / raw)
  To: amd-gfx, Felix Kuehling, Alex Deucher, christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya, donettom

For GFX v9, due to a hardware bug (based on the comments in the code
here [1]), the control stack of a user-mode compute queue must be
allocated immediately after the page boundary of its regular MQD buffer.
To handle this, we allocate an enlarged MQD buffer where the first page
is used as the MQD and the remaining pages store the control stack.
Although these regions share the same BO, they require different memory
types: the MQD must be UC (uncached), while the control stack must be
NC (non-coherent), matching the behavior when the control stack is
allocated in user space.

This logic works correctly on systems where the CPU page size matches
the GPU page size (4K). However, the current implementation aligns both
the MQD and the control stack to the CPU PAGE_SIZE. On systems with a
larger CPU page size, the entire first CPU page is marked UC—even though
that page may contain multiple GPU pages. The GPU treats the second 4K
GPU page inside that CPU page as part of the control stack, but it is
incorrectly mapped as UC. This misalignment leads to queue preemption
and eviction failures.

This patch fixes the issue by aligning both the MQD and control stack
sizes to the GPU page size (4K). The first 4K page is correctly marked
as UC for the MQD, and the remaining GPU pages are marked NC for the
control stack. This ensures proper memory type assignment on systems
with larger CPU page sizes and prevents incorrect behavior in queue
preemption and eviction.

[1]: https://elixir.bootlin.com/linux/v6.18/source/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c#L118
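
As an illustration, the intended layout of one XCC's MQD BO on a 64K system
(16 GPU pages per CPU page):

	/* GPU page 0 (first 4K):  MQD            -> mapped UC
	 * GPU pages 1..N:         control stack  -> mapped NC
	 *
	 * With CPU-PAGE_SIZE granularity the whole first 64K CPU page was bound
	 * UC, so the control stack pages inside it inherited UC and queue
	 * preemption/eviction failed. amdgpu_gart_map_gfx9_mqd() now selects the
	 * PTE flags per 4K GPU page instead.
	 */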

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 29 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 16 ++--------
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 15 +++++-----
 4 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index b2033f8352f5..0e1c017d10dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -368,6 +368,35 @@ void amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
 	}
 	drm_dev_exit(idx);
 }
+void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
+		    int pages, dma_addr_t *dma_addr, uint64_t flags)
+{
+	uint64_t page_base;
+	unsigned int i, j, t;
+	int idx;
+	uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
+	void *dst;
+
+	if (!adev->gart.ptr)
+		return;
+
+	if (!drm_dev_enter(adev_to_drm(adev), &idx))
+		return;
+
+	t = offset / AMDGPU_GPU_PAGE_SIZE;
+	dst = adev->gart.ptr;
+	for (i = 0; i < pages; i++) {
+		page_base = dma_addr[i];
+		for (j = 0; j < AMDGPU_GPU_PAGES_IN_CPU_PAGE; j++, t++) {
+			if ((i == 0) && (j == 0))
+				amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, flags);
+			else
+				amdgpu_gmc_set_pte_pde(adev, dst, t, page_base, ctrl_flags);
+			page_base += AMDGPU_GPU_PAGE_SIZE;
+		}
+	}
+	drm_dev_exit(idx);
+}
 
 /**
  * amdgpu_gart_bind - bind pages into the gart page table
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
index 7cc980bf4725..1beef780936d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
@@ -62,6 +62,8 @@ void amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
 void amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
 		     int pages, dma_addr_t *dma_addr, uint64_t flags,
 		     void *dst);
+void amdgpu_gart_map_gfx9_mqd(struct amdgpu_device *adev, uint64_t offset,
+		     int pages, dma_addr_t *dma_addr, uint64_t flags);
 void amdgpu_gart_bind(struct amdgpu_device *adev, uint64_t offset,
 		      int pages, dma_addr_t *dma_addr, uint64_t flags);
 void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 9d568c16beb1..c2e0d2518345 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -882,25 +882,15 @@ static void amdgpu_ttm_gart_bind_gfx9_mqd(struct amdgpu_device *adev,
 	int num_xcc = max(1U, adev->gfx.num_xcc_per_xcp);
 	uint64_t page_idx, pages_per_xcc;
 	int i;
-	uint64_t ctrl_flags = AMDGPU_PTE_MTYPE_VG10(flags, AMDGPU_MTYPE_NC);
 
 	pages_per_xcc = total_pages;
 	do_div(pages_per_xcc, num_xcc);
 
 	for (i = 0, page_idx = 0; i < num_xcc; i++, page_idx += pages_per_xcc) {
-		/* MQD page: use default flags */
-		amdgpu_gart_bind(adev,
+		amdgpu_gart_map_gfx9_mqd(adev,
 				gtt->offset + (page_idx << PAGE_SHIFT),
-				1, &gtt->ttm.dma_address[page_idx], flags);
-		/*
-		 * Ctrl pages - modify the memory type to NC (ctrl_flags) from
-		 * the second page of the BO onward.
-		 */
-		amdgpu_gart_bind(adev,
-				gtt->offset + ((page_idx + 1) << PAGE_SHIFT),
-				pages_per_xcc - 1,
-				&gtt->ttm.dma_address[page_idx + 1],
-				ctrl_flags);
+				pages_per_xcc, &gtt->ttm.dma_address[page_idx],
+				flags);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
index f2dee320fada..e45e39cd65fe 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
@@ -43,8 +43,8 @@ static uint64_t mqd_stride_v9(struct mqd_manager *mm,
 {
 	if (mm->dev->kfd->cwsr_enabled &&
 	    q->type == KFD_QUEUE_TYPE_COMPUTE)
-		return ALIGN(q->ctl_stack_size, PAGE_SIZE) +
-			ALIGN(sizeof(struct v9_mqd), PAGE_SIZE);
+		return ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
+			ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE);
 
 	return mm->mqd_size;
 }
@@ -136,13 +136,12 @@ static struct kfd_mem_obj *allocate_mqd(struct kfd_node *node,
 		if (!mqd_mem_obj)
 			return NULL;
 		retval = amdgpu_amdkfd_alloc_gtt_mem(node->adev,
-			(ALIGN(q->ctl_stack_size, PAGE_SIZE) +
-			ALIGN(sizeof(struct v9_mqd), PAGE_SIZE)) *
+			(ALIGN(ALIGN(q->ctl_stack_size, AMDGPU_GPU_PAGE_SIZE) +
+			ALIGN(sizeof(struct v9_mqd), AMDGPU_GPU_PAGE_SIZE), PAGE_SIZE)) *
 			NUM_XCC(node->xcc_mask),
 			&(mqd_mem_obj->gtt_mem),
 			&(mqd_mem_obj->gpu_addr),
 			(void *)&(mqd_mem_obj->cpu_ptr), true);
-
 		if (retval) {
 			kfree(mqd_mem_obj);
 			return NULL;
@@ -343,7 +342,7 @@ static int get_wave_state(struct mqd_manager *mm, void *mqd,
 	struct kfd_context_save_area_header header;
 
 	/* Control stack is located one page after MQD. */
-	void *mqd_ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
+	void *mqd_ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
 
 	m = get_mqd(mqd);
 
@@ -380,7 +379,7 @@ static void checkpoint_mqd(struct mqd_manager *mm, void *mqd, void *mqd_dst, voi
 {
 	struct v9_mqd *m;
 	/* Control stack is located one page after MQD. */
-	void *ctl_stack = (void *)((uintptr_t)mqd + PAGE_SIZE);
+	void *ctl_stack = (void *)((uintptr_t)mqd + AMDGPU_GPU_PAGE_SIZE);
 
 	m = get_mqd(mqd);
 
@@ -426,7 +425,7 @@ static void restore_mqd(struct mqd_manager *mm, void **mqd,
 		*gart_addr = addr;
 
 	/* Control stack is located one page after MQD. */
-	ctl_stack = (void *)((uintptr_t)*mqd + PAGE_SIZE);
+	ctl_stack = (void *)((uintptr_t)*mqd + AMDGPU_GPU_PAGE_SIZE);
 	memcpy(ctl_stack, ctl_stack_src, ctl_stack_size);
 
 	m->cp_hqd_pq_doorbell_control =
-- 
2.52.0



* Re: [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2025-12-12  6:40 ` [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
@ 2025-12-12  8:53   ` Christian König
  2025-12-12 12:14     ` Donet Tom
  2026-01-06 12:55     ` Donet Tom
  0 siblings, 2 replies; 44+ messages in thread
From: Christian König @ 2025-12-12  8:53 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

On 12/12/25 07:40, Donet Tom wrote:
> The SDMA engine has a hardware limitation of 4 MB maximum transfer
> size per operation.

That is not correct. This is only true on ancient HW.

What problems are you seeing here?

> AMDGPU_GTT_MAX_TRANSFER_SIZE was hardcoded to
> 512 pages, which worked correctly on systems with 4K pages but fails
> on systems with larger page sizes.
> 
> This patch divides the max transfer size / AMDGPU_GPU_PAGES_IN_CPU_PAGE
> to match with non-4K page size systems.

That is actually a bad idea. The value was meant to match the PMD size.

Regards,
Christian.

> 
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> index 0be2728aa872..9d038feb25b0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> @@ -37,7 +37,7 @@
>  #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
>  #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
>  
> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(512 / AMDGPU_GPU_PAGES_IN_CPU_PAGE)
>  #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
>  
>  extern const struct attribute_group amdgpu_vram_mgr_attr_group;



* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
                   ` (7 preceding siblings ...)
  2025-12-12  6:40 ` [RFC PATCH v1 8/8] amdgpu: Fix MQD and control stack alignment for non-4K CPU page size systems Donet Tom
@ 2025-12-12  9:01 ` Christian König
  2025-12-12 10:45   ` Ritesh Harjani
  8 siblings, 1 reply; 44+ messages in thread
From: Christian König @ 2025-12-12  9:01 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

On 12/12/25 07:40, Donet Tom wrote:
> This patch series addresses few issues which we encountered while running rocr
> debug agent and rccl unit tests with AMD GPU on Power10 (ppc64le), using 64k
> system pagesize.
> 
> Note that we don't observe any of these issues while booting with 4k system
> pagesize on Power. So with the 64K system pagesize what we observed so far is,
> at few of the places, the conversion between gpu pfn to cpu pfn (or vice versa)
> may not be done correctly (due to different page size of AMD GPU (4K)
> v/s cpu pagesize (64K)) which causes issues like gpu page faults or gpu hang
> while running these tests.
> 
> Changes so far in this series:
> =============================
> 1. For now, during kfd queue creation, this patch lifts the restriction on EOP
>    buffer size to be same buffer object mapping size.
> 
> 2. Fix SVM range map/unmap operations to convert CPU page numbers to GPU page
>    numbers before calling amdgpu_vm_update_range(), which expects 4K GPU pages.
>    Without this the rocr-debug-agent tests and rccl unit  tests were failing.
> 
> 3. Fix GART PTE allocation in migration code to account for multiple GPU pages
>    per CPU page. The current code only allocates PTEs based on number of CPU
>    pages, but GART may need one PTE per 4K GPU page.
> 
> 4. Adjust AMDGPU_GTT_MAX_TRANSFER_SIZE to respect the SDMA engine's 4MB hardware
>    limit regardless of CPU page size. The hardcoded 512 pages worked on 4K
>    systems but seems to be exceeding the limit with 64K system page size.
> 
> 5. In the current driver, MMIO remap is supported only when the system page
>    size is 4K. Error messages have been added to indicate that MMIO remap
>    is not supported on systems with a non-4K page size.
> 
> 6. Fix amdgpu page fault handler (for xnack) to pass the corresponding system
>    pfn (instead of gpu pfn) for restoring SVM range mapping.
> 
> 7. Align ctl_stack_size and wg_data_size to GPU page size.
> 
> 8. On systems where the CPU page size is larger than the GPU’s 4K page size,
>    the MQD and control stack are aligned to the CPU PAGE_SIZE, causing
>    multiple GPU pages to inherit the UC attribute incorrectly. This results
>    in the control-stack area being mis-mapped and leads to queue preemption
>    and eviction failures. Aligning both regions to the GPU page size
>    ensures the MQD is mapped UC and the control stack NC, restoring correct
>    behavior.
> 
> 9. Apart from these 8 changes, we also needed this change [1]. Without this change
>    kernel simply crashes when running rocminfo command itself.
>    [1]: https://github.com/greenforce-project/chromeos-kernel-mirror/commit/2b33fad96c3129a2a53a42b9d90fb3b906145b98
> 
> Setup details:
> ============
> System details: Power10 LPAR using 64K pagesize.
> AMD GPU:
>   Name:                    gfx90a
>   Marketing Name:          AMD Instinct MI210
> 
> Queries:
> =======
> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>    these changes. Is there anything else that you would suggest us to run to
>    shake out any other page size related issues w.r.t the kernel driver?

The ROCm team needs to answer that.

> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>    size HW dependent? Should it be made PAGE_SIZE?

Yes and no.

> 
> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>    Do we need to lift this restriction and add MMIO remap support for systems with
>    non-4K page sizes?

The problem is the HW can't do this.

> 
> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
> 
> 
> Please note that the changes in this series are on a best effort basis from our
> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
> HW & SW stack) to kindly help with the review and provide feedback / comments on
> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
> supported with amd gpu kernel driver.

Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.

What we can do is to support graphics and MM, but that should already work out of the box.

I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.

Regards,
Christian.

> 
> Donet Tom (7):
>   drm/amdkfd: Relax size checking during queue buffer get
>   amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page
>     sizes
>   amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in
>     svm_migrate_gart_map()
>   amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page
>     size
>   drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
>   amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead
>     of CPU page size
>   amdgpu: Fix MQD and control stack alignment for non-4K CPU page size
>     systems
> 
> Ritesh Harjani (IBM) (1):
>   amdkfd/kfd_chardev: Add error message for non-4k pagesize failures
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 29 ++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 16 ++--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  6 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 10 +++++--
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  2 +-
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   | 15 +++++-----
>  drivers/gpu/drm/amd/amdkfd/kfd_queue.c        | 17 ++++++-----
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 30 ++++++++++++++-----
>  10 files changed, 86 insertions(+), 43 deletions(-)
> 



* Re: [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size
  2025-12-12  6:40 ` [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size Donet Tom
@ 2025-12-12  9:04   ` Christian König
  2025-12-12 12:29     ` Donet Tom
  2025-12-19 10:27     ` Donet Tom
  0 siblings, 2 replies; 44+ messages in thread
From: Christian König @ 2025-12-12  9:04 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

On 12/12/25 07:40, Donet Tom wrote:
> The ctl_stack_size and wg_data_size values are used to compute the total
> context save/restore buffer size and the control stack size. These buffers
> are programmed into the GPU and are used to store the queue state during
> context save and restore.
> 
> Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
> PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary
> memory waste because the GPU internally calculates and uses buffer sizes
> aligned to a fixed 4K GPU page size.
> 
> Since the control stack and context save/restore buffers are consumed by
> the GPU, their sizes should be aligned to the GPU page size (4K), not the
> CPU page size. This patch updates the alignment of ctl_stack_size and
> wg_data_size to prevent over-allocation on systems with larger CPU page
> sizes.

As far as I know the problem is that the debugger needs to consume that stuff on the CPU side as well.

I need to double check that, but I think the alignment is correct as it is.

Regards,
Christian.

> 
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> index dc857450fa16..00ab941c3e86 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> @@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>  		    min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
>  		    : cu_num * 32;
>  
> -	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
> +	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
> +				AMDGPU_GPU_PAGE_SIZE);
>  	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
>  	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
> -			       PAGE_SIZE);
> +			       AMDGPU_GPU_PAGE_SIZE);
>  
>  	if ((gfxv / 10000 * 10000) == 100000) {
>  		/* HW design limits control stack size to 0x7000.
> @@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>  
>  	props->ctl_stack_size = ctl_stack_size;
>  	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
> -	props->cwsr_size = ctl_stack_size + wg_data_size;
> +	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
>  
>  	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
>  		props->eop_buffer_size = 0x8000;



* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-12  9:01 ` [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K " Christian König
@ 2025-12-12 10:45   ` Ritesh Harjani
  2025-12-12 13:01     ` Christian König
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2025-12-12 10:45 UTC (permalink / raw)
  To: Christian König, Donet Tom, amd-gfx, Felix Kuehling,
	Alex Deucher
  Cc: Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya

Christian König <christian.koenig@amd.com> writes:

> On 12/12/25 07:40, Donet Tom wrote:
>> This patch series addresses few issues which we encountered while running rocr
>> debug agent and rccl unit tests with AMD GPU on Power10 (ppc64le), using 64k
>> system pagesize.
>> 
>> Note that we don't observe any of these issues while booting with 4k system
>> pagesize on Power. So with the 64K system pagesize what we observed so far is,
>> at few of the places, the conversion between gpu pfn to cpu pfn (or vice versa)
>> may not be done correctly (due to different page size of AMD GPU (4K)
>> v/s cpu pagesize (64K)) which causes issues like gpu page faults or gpu hang
>> while running these tests.
>> 
>> Changes so far in this series:
>> =============================
>> 1. For now, during kfd queue creation, this patch lifts the restriction on EOP
>>    buffer size to be same buffer object mapping size.
>> 
>> 2. Fix SVM range map/unmap operations to convert CPU page numbers to GPU page
>>    numbers before calling amdgpu_vm_update_range(), which expects 4K GPU pages.
>>    Without this the rocr-debug-agent tests and rccl unit  tests were failing.
>> 
>> 3. Fix GART PTE allocation in migration code to account for multiple GPU pages
>>    per CPU page. The current code only allocates PTEs based on number of CPU
>>    pages, but GART may need one PTE per 4K GPU page.
>> 
>> 4. Adjust AMDGPU_GTT_MAX_TRANSFER_SIZE to respect the SDMA engine's 4MB hardware
>>    limit regardless of CPU page size. The hardcoded 512 pages worked on 4K
>>    systems but seems to be exceeding the limit with 64K system page size.
>> 
>> 5. In the current driver, MMIO remap is supported only when the system page
>>    size is 4K. Error messages have been added to indicate that MMIO remap
>>    is not supported on systems with a non-4K page size.
>> 
>> 6. Fix amdgpu page fault handler (for xnack) to pass the corresponding system
>>    pfn (instead of gpu pfn) for restoring SVM range mapping.
>> 
>> 7. Align ctl_stack_size and wg_data_size to GPU page size.
>> 
>> 8. On systems where the CPU page size is larger than the GPU’s 4K page size,
>>    the MQD and control stack are aligned to the CPU PAGE_SIZE, causing
>>    multiple GPU pages to inherit the UC attribute incorrectly. This results
>>    in the control-stack area being mis-mapped and leads to queue preemption
>>    and eviction failures. Aligning both regions to the GPU page size
>>    ensures the MQD is mapped UC and the control stack NC, restoring correct
>>    behavior.
>> 
>> 9. Apart from these 8 changes, we also needed this change [1]. Without this change
>>    kernel simply crashes when running rocminfo command itself.
>>    [1]: https://github.com/greenforce-project/chromeos-kernel-mirror/commit/2b33fad96c3129a2a53a42b9d90fb3b906145b98
>> 
>> Setup details:
>> ============
>> System details: Power10 LPAR using 64K pagesize.
>> AMD GPU:
>>   Name:                    gfx90a
>>   Marketing Name:          AMD Instinct MI210
>> 
>> Queries:
>> =======
>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>    these changes. Is there anything else that you would suggest us to run to
>>    shake out any other page size related issues w.r.t the kernel driver?
>
> The ROCm team needs to answer that.
>

Is there any separate mailing list or list of people whom we can cc
then?

>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>    size HW dependent? Should it be made PAGE_SIZE?
>
> Yes and no.
>

Could you please elaborate a bit more on this? I am assuming you would
anyway respond with more context / details on Patch-1 itself. If yes,
that would be great!

>> 
>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>    Do we need to lift this restriction and add MMIO remap support for systems with
>>    non-4K page sizes?
>
> The problem is the HW can't do this.
>

We aren't that familiar with the HW / SW stack here. Wanted to understand
what functionality will be unsupported due to this HW limitation then?

>> 
>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>> 
>> 
>> Please note that the changes in this series are on a best effort basis from our
>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>> supported with amd gpu kernel driver.
>
> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.

That's a bummer :(
- Do we have some HW documentation around what these limitations are for non-4K pagesize? Any links to such, please?
- Are there any newer AMD GPU versions which perhaps lift such restrictions?

> What we can do is to support graphics and MM, but that should already work out of the box.
>

- Maybe we should also document what will work and what won't work due to these HW limitations.


> What we can do is to support graphics and MM, but that should already work out of the box.

These patches helped us resolve most of the issues, like SDMA hangs and
GPU kernel page faults, which we saw with the rocr and rccl tests at
64K pagesize. Meaning, we didn't see this working out of the box,
presumably due to the 64K pagesize.

AFAIU, some of these patches may require re-work based on reviews, but
at least with these changes, we were able to see all the tests passing.

> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>

Thanks a lot! That would be super helpful!


> Regards,
> Christian.
>

Thanks again for the quick response on the patch series.

-ritesh


* Re: [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2025-12-12  8:53   ` Christian König
@ 2025-12-12 12:14     ` Donet Tom
  2026-01-06 12:55     ` Donet Tom
  1 sibling, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-12 12:14 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/12/25 2:23 PM, Christian König wrote:
> On 12/12/25 07:40, Donet Tom wrote:
>> The SDMA engine has a hardware limitation of 4 MB maximum transfer
>> size per operation.
> That is not correct. This is only true on ancient HW.
>
> What problems are you seeing here?
>
>> AMDGPU_GTT_MAX_TRANSFER_SIZE was hardcoded to
>> 512 pages, which worked correctly on systems with 4K pages but fails
>> on systems with larger page sizes.
>>
>> This patch divides the max transfer size / AMDGPU_GPU_PAGES_IN_CPU_PAGE
>> to match with non-4K page size systems.
> That is actually a bad idea. The value was meant to match the PMD size.


Hi Christian

Thank you for the reply.

In svm_migrate_copy_memory_gart(), the number of bytes to copy passed to 
amdgpu_copy_buffer() is based on the PMD size. On systems with a 4K page 
size, the PMD size is calculated correctly because 
AMDGPU_GTT_MAX_TRANSFER_SIZE is 512, and 512 × 4K = 2MB.

On systems with a 64K page size, however, the calculation becomes 512 × 
64K = 32MB. As a result, amdgpu_copy_buffer() ends up copying data in 
4MB chunks instead of PMD-sized chunks. To ensure consistent behavior 
across both 64K and 4K page-size systems, we adjusted the transfer size 
so that the maximum transfer remains 2MB, matching the PMD size.

The issue we observed was that the rocr-debug-agent test triggered SDMA 
hangs. This happened because an incorrect size was being passed when 
copying the GART mapping in svm_migrate_gart_map(). That problem was 
addressed in patch 3/8. While root-causing that issue, we also 
identified this inconsistency between 4K and 64K systems, So we felt 
that this change was needed to align the behavior with 4K system page sizes.



>
> Regards,
> Christian.
>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> index 0be2728aa872..9d038feb25b0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> @@ -37,7 +37,7 @@
>>   #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
>>   #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
>>   
>> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
>> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(512 / AMDGPU_GPU_PAGES_IN_CPU_PAGE)
>>   #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
>>   
>>   extern const struct attribute_group amdgpu_vram_mgr_attr_group;


* Re: [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size
  2025-12-12  9:04   ` Christian König
@ 2025-12-12 12:29     ` Donet Tom
  2025-12-19 10:27     ` Donet Tom
  1 sibling, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-12 12:29 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/12/25 2:34 PM, Christian König wrote:
> On 12/12/25 07:40, Donet Tom wrote:
>> The ctl_stack_size and wg_data_size values are used to compute the total
>> context save/restore buffer size and the control stack size. These buffers
>> are programmed into the GPU and are used to store the queue state during
>> context save and restore.
>>
>> Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
>> PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary
>> memory waste because the GPU internally calculates and uses buffer sizes
>> aligned to a fixed 4K GPU page size.
>>
>> Since the control stack and context save/restore buffers are consumed by
>> the GPU, their sizes should be aligned to the GPU page size (4K), not the
>> CPU page size. This patch updates the alignment of ctl_stack_size and
>> wg_data_size to prevent over-allocation on systems with larger CPU page
>> sizes.
> As far as I know the problem is that the debugger needs to consume that stuff on the CPU side as well.
>
> I need to double check that, but I think the alignment is correct as it is.


Thanks Christian

We were observing the following errors during RCCL unit tests when 
running on more than two GPUs, which eventually led to a GPU hang:

[  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
[  606.696820] amdgpu 0048:0f:00.0: amdgpu: amdgpu: Failed to evict process queues
[  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source:  4
[  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
[  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
[  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues


After applying the alignment change and the change in patch 8/8, we are 
not seeing this issue.


>
> Regards,
> Christian.
>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> index dc857450fa16..00ab941c3e86 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> @@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>>   		    min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
>>   		    : cu_num * 32;
>>   
>> -	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
>> +	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
>> +				AMDGPU_GPU_PAGE_SIZE);
>>   	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
>>   	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
>> -			       PAGE_SIZE);
>> +			       AMDGPU_GPU_PAGE_SIZE);
>>   
>>   	if ((gfxv / 10000 * 10000) == 100000) {
>>   		/* HW design limits control stack size to 0x7000.
>> @@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>>   
>>   	props->ctl_stack_size = ctl_stack_size;
>>   	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
>> -	props->cwsr_size = ctl_stack_size + wg_data_size;
>> +	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
>>   
>>   	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
>>   		props->eop_buffer_size = 0x8000;


* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-12 10:45   ` Ritesh Harjani
@ 2025-12-12 13:01     ` Christian König
  2025-12-12 17:24       ` Alex Deucher
  0 siblings, 1 reply; 44+ messages in thread
From: Christian König @ 2025-12-12 13:01 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), Donet Tom, amd-gfx, Felix Kuehling,
	Alex Deucher
  Cc: Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya

On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> Christian König <christian.koenig@amd.com> writes:
>>> Setup details:
>>> ============
>>> System details: Power10 LPAR using 64K pagesize.
>>> AMD GPU:
>>>   Name:                    gfx90a
>>>   Marketing Name:          AMD Instinct MI210
>>>
>>> Queries:
>>> =======
>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>    these changes. Is there anything else that you would suggest us to run to
>>>    shake out any other page size related issues w.r.t the kernel driver?
>>
>> The ROCm team needs to answer that.
>>
> 
> Is there any separate mailing list or list of people whom we can cc
> then?

With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.

I can check on Monday if some people are still around who could answer a couple of questions, but in general don't expect a quick response.

>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>    size HW dependent? Should it be made PAGE_SIZE?
>>
>> Yes and no.
>>
> 
> If you could more elaborate on this please? I am assuming you would
> anyway respond with more context / details on Patch-1 itself. If yes,
> that would be great!

Well, in general the EOP (End of Pipe) buffer is a ring buffer containing all the events and actions the CP should execute when shaders and cache flushes finish.

The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for the details of how that is calculated.

The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.

>>>
>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>    Do we need to lift this restriction and add MMIO remap support for systems with
>>>    non-4K page sizes?
>>
>> The problem is the HW can't do this.
>>
> 
> We aren't that familiar with the HW / SW stack here. Wanted to understand
> what functionality will be unsupported due to this HW limitation then?

The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.

>>>
>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>
>>>
>>> Please note that the changes in this series are on a best effort basis from our
>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>> supported with amd gpu kernel driver.
>>
>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
> 
> That's a bummer :( 
> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?

You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.

This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.

Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.

> - Are there any latest AMD GPU versions which maybe lifts such restrictions?

Not that I know of any.

>> What we can do is to support graphics and MM, but that should already work out of the box.
>>
> 
> - Maybe we should also document, what will work and what won't work due to these HW limitations.

Well, pretty much everything. I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.

Could be that there is already a fallback path and that's the reason why this approach actually works at all.

>> What we can do is to support graphics and MM, but that should already work out of the box.> 
> So these patches helped us resolve most of the issues like SDMA hangs
> and GPU kernel page faults which we saw with rocr and rccl tests with
> 64K pagesize. Meaning, we didn't see this working out of box perhaps
> due to 64K pagesize.

Yeah, but this is all for ROCm and not the graphics side.

To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect many more issues lurking in the kernel driver.

> AFAIU, some of these patches may require re-work based on reviews, but
> at least with these changes, we were able to see all the tests passing.
> 
>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>
> 
> Thanks a lot! That would be super helpful!
> 
> 
>> Regards,
>> Christian.
>>
> 
> Thanks again for the quick response on the patch series.

You are welcome, but since it's so near to the end of the year not all people are available any more.

Regards,
Christian.

> 
> -ritesh



* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-12 13:01     ` Christian König
@ 2025-12-12 17:24       ` Alex Deucher
  2025-12-15  9:47         ` Christian König
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Deucher @ 2025-12-12 17:24 UTC (permalink / raw)
  To: Christian König
  Cc: Ritesh Harjani (IBM), Donet Tom, amd-gfx, Felix Kuehling,
	Alex Deucher, Kent.Russell, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

On Fri, Dec 12, 2025 at 8:19 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> > Christian König <christian.koenig@amd.com> writes:
> >>> Setup details:
> >>> ============
> >>> System details: Power10 LPAR using 64K pagesize.
> >>> AMD GPU:
> >>>   Name:                    gfx90a
> >>>   Marketing Name:          AMD Instinct MI210
> >>>
> >>> Queries:
> >>> =======
> >>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
> >>>    these changes. Is there anything else that you would suggest us to run to
> >>>    shake out any other page size related issues w.r.t the kernel driver?
> >>
> >> The ROCm team needs to answer that.
> >>
> >
> > Is there any separate mailing list or list of people whom we can cc
> > then?
>
> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>
> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>
> >>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
> >>>    size HW dependent? Should it be made PAGE_SIZE?
> >>
> >> Yes and no.
> >>
> >
> > If you could more elaborate on this please? I am assuming you would
> > anyway respond with more context / details on Patch-1 itself. If yes,
> > that would be great!
>
> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>
> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>
> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>
> >>>
> >>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
> >>>    Do we need to lift this restriction and add MMIO remap support for systems with
> >>>    non-4K page sizes?
> >>
> >> The problem is the HW can't do this.
> >>
> >
> > We aren't that familiar with the HW / SW stack here. Wanted to understand
> > what functionality will be unsupported due to this HW limitation then?
>
> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.

Right.  There are some 4K pages within the MMIO register BAR which are
empty, and registers can be remapped into them.  In this case we remap
the HDP flush registers into one of those register pages.  This allows
applications to flush the HDP write FIFO from either the CPU or
another device.  This is needed to flush data written by the CPU or
another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
it).  This is flushed internally as part of the shader dispatch
packets, but there are certain cases where an application may want
more control.  This is probably not a showstopper for most ROCm apps.
That said, the region is only 4K so if you allow applications to map a
larger region they would get access to GPU register pages which they
shouldn't have access to.

Alex
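
As an illustration of the remapped register page described above, here is a
minimal sketch of how a userspace HDP flush through that page could look.
The file descriptor and hdp_remap_offset are placeholders; the real mmap
offset comes from the KFD/amdgpu UAPI and is not spelled out in this thread.

#include <stdint.h>
#include <sys/mman.h>

static volatile uint32_t *map_hdp_flush_page(int fd, off_t hdp_remap_offset)
{
	/*
	 * Only a single 4K page may be exposed; mapping a larger region
	 * would leak neighbouring GPU register pages to userspace, which
	 * is the security problem mentioned above.
	 */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		       fd, hdp_remap_offset);

	return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

static void hdp_flush(volatile uint32_t *hdp_page)
{
	/*
	 * A 32-bit write to the remapped flush register kicks the HDP
	 * write FIFO so data written through the VRAM BAR becomes
	 * visible to the GPU.
	 */
	hdp_page[0] = 1;
}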

>
> >>>
> >>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
> >>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
> >>>
> >>>
> >>> Please note that the changes in this series are on a best effort basis from our
> >>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
> >>> HW & SW stack) to kindly help with the review and provide feedback / comments on
> >>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
> >>> supported with amd gpu kernel driver.
> >>
> >> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
> >
> > That's a bummer :(
> > - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>
> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>
> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>
> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>
> > - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>
> Not that I know off any.
>
> >> What we can do is to support graphics and MM, but that should already work out of the box.
> >>
> >
> > - Maybe we should also document, what will work and what won't work due to these HW limitations.
>
> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>
> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>
> >> What we can do is to support graphics and MM, but that should already work out of the box.>
> > So these patches helped us resolve most of the issues like SDMA hangs
> > and GPU kernel page faults which we saw with rocr and rccl tests with
> > 64K pagesize. Meaning, we didn't see this working out of box perhaps
> > due to 64K pagesize.
>
> Yeah, but this is all for ROCm and not the graphics side.
>
> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>
> > AFAIU, some of these patches may require re-work based on reviews, but
> > at least with these changes, we were able to see all the tests passing.
> >
> >> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
> >>
> >
> > Thanks a lot! That would be super helpful!
> >
> >
> >> Regards,
> >> Christian.
> >>
> >
> > Thanks again for the quick response on the patch series.
>
> You are welcome, but since it's so near to the end of the year not all people are available any more.
>
> Regards,
> Christian.
>
> >
> > -ritesh
>


* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-12 17:24       ` Alex Deucher
@ 2025-12-15  9:47         ` Christian König
  2025-12-15 10:11           ` Donet Tom
  2025-12-15 14:09           ` Alex Deucher
  0 siblings, 2 replies; 44+ messages in thread
From: Christian König @ 2025-12-15  9:47 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Ritesh Harjani (IBM), Donet Tom, amd-gfx, Felix Kuehling,
	Alex Deucher, Kent.Russell, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

On 12/12/25 18:24, Alex Deucher wrote:
> On Fri, Dec 12, 2025 at 8:19 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>> Christian König <christian.koenig@amd.com> writes:
>>>>> Setup details:
>>>>> ============
>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>> AMD GPU:
>>>>>   Name:                    gfx90a
>>>>>   Marketing Name:          AMD Instinct MI210
>>>>>
>>>>> Queries:
>>>>> =======
>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>    these changes. Is there anything else that you would suggest us to run to
>>>>>    shake out any other page size related issues w.r.t the kernel driver?
>>>>
>>>> The ROCm team needs to answer that.
>>>>
>>>
>>> Is there any separate mailing list or list of people whom we can cc
>>> then?
>>
>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>
>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>
>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>>>    size HW dependent? Should it be made PAGE_SIZE?
>>>>
>>>> Yes and no.
>>>>
>>>
>>> If you could more elaborate on this please? I am assuming you would
>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>> that would be great!
>>
>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>
>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>>
>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>
>>>>>
>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>>>    Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>    non-4K page sizes?
>>>>
>>>> The problem is the HW can't do this.
>>>>
>>>
>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>> what functionality will be unsupported due to this HW limitation then?
>>
>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
> 
> Right.  There are some 4K pages with the MMIO register BAR which are
> empty and registers can be remapped into them.  In this case we remap
> the HDP flush registers into one of those register pages.  This allows
> applications to flush the HDP write FIFO from either the CPU or
> another device.  This is needed to flush data written by the CPU or
> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
> it).  This is flushed internally as part of the shader dispatch
> packets,

As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.

That's the reason why ROCm needs the remapped MMIO register BAR.

> but there are certain cases where an application may want
> more control.  This is probably not a showstopper for most ROCm apps.

Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.

Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.

If you make two 32bit writes which are apart from each other and then read back a 32bit value from VRAM, that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going through an IOCTL.

The only tricky part is that you need to get the HW barriers with the doorbell write right.....
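
A minimal sketch of that dummy-page idea, assuming the application already
has a small VRAM buffer mapped at vram_dummy (a placeholder name). Two
well-separated 32-bit writes plus a 32-bit read back are meant to push and
invalidate the HDP without any MMIO access; the barrier before the
subsequent doorbell write is the part that still needs care, as noted above.

#include <stdint.h>

static inline void hdp_flush_via_vram_dummy(volatile uint32_t *vram_dummy)
{
	vram_dummy[0]    = 0xdeadbeef;	/* first 32-bit write to VRAM        */
	vram_dummy[1023] = 0xcafebabe;	/* second write, ~4KB apart          */
	(void)vram_dummy[0];		/* read back to invalidate the HDP   */
	__sync_synchronize();		/* full barrier before the doorbell write */
}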

> That said, the region is only 4K so if you allow applications to map a
> larger region they would get access to GPU register pages which they
> shouldn't have access to.

But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?

Thinking more about it, there is also a major problem with page tables. Those are 4k by default on modern systems as well, and while over-allocating them to 64k is possible, that not only wastes some VRAM but can also result in OOM situations, because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.

Christian.

> 
> Alex
> 
>>
>>>>>
>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>
>>>>>
>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>> supported with amd gpu kernel driver.
>>>>
>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>
>>> That's a bummer :(
>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>
>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>
>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>
>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>
>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>
>> Not that I know off any.
>>
>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>
>>>
>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>
>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>
>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>
>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>> So these patches helped us resolve most of the issues like SDMA hangs
>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>> due to 64K pagesize.
>>
>> Yeah, but this is all for ROCm and not the graphics side.
>>
>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>
>>> AFAIU, some of these patches may require re-work based on reviews, but
>>> at least with these changes, we were able to see all the tests passing.
>>>
>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>
>>>
>>> Thanks a lot! That would be super helpful!
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>
>>> Thanks again for the quick response on the patch series.
>>
>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>
>> Regards,
>> Christian.
>>
>>>
>>> -ritesh
>>



* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-15  9:47         ` Christian König
@ 2025-12-15 10:11           ` Donet Tom
  2025-12-15 16:11             ` Christian König
  2025-12-15 14:09           ` Alex Deucher
  1 sibling, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-15 10:11 UTC (permalink / raw)
  To: Christian König, Alex Deucher
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya


On 12/15/25 3:17 PM, Christian König wrote:
> On 12/12/25 18:24, Alex Deucher wrote:
>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>> <christian.koenig@amd.com> wrote:
>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>> Setup details:
>>>>>> ============
>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>> AMD GPU:
>>>>>>    Name:                    gfx90a
>>>>>>    Marketing Name:          AMD Instinct MI210
>>>>>>
>>>>>> Queries:
>>>>>> =======
>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>>     these changes. Is there anything else that you would suggest us to run to
>>>>>>     shake out any other page size related issues w.r.t the kernel driver?
>>>>> The ROCm team needs to answer that.
>>>>>
>>>> Is there any separate mailing list or list of people whom we can cc
>>>> then?
>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>
>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>>
>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>>>>     size HW dependent? Should it be made PAGE_SIZE?
>>>>> Yes and no.
>>>>>
>>>> If you could more elaborate on this please? I am assuming you would
>>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>>> that would be great!
>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>
>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>>>
>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>>
>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>>>>     Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>>     non-4K page sizes?
>>>>> The problem is the HW can't do this.
>>>>>
>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>>> what functionality will be unsupported due to this HW limitation then?
>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
>> Right.  There are some 4K pages with the MMIO register BAR which are
>> empty and registers can be remapped into them.  In this case we remap
>> the HDP flush registers into one of those register pages.  This allows
>> applications to flush the HDP write FIFO from either the CPU or
>> another device.  This is needed to flush data written by the CPU or
>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
>> it).  This is flushed internally as part of the shader dispatch
>> packets,
> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
>
> That's the reason why ROCm needs the remapped MMIO register BAR.
>
>> but there are certain cases where an application may want
>> more control.  This is probably not a showstopper for most ROCm apps.
> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>
> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>
> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
>
> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
>
>> That said, the region is only 4K so if you allow applications to map a
>> larger region they would get access to GPU register pages which they
>> shouldn't have access to.
> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>
> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.


Sorry, Christian, I may be misunderstanding this point, so I would 
appreciate some clarification.

If the CPU page size is 64K and the GPU page size is 4K, then from the 
GPU side the page table entries are created and mapped at 4K 
granularity, while on the CPU side the pages remain 64K. To map a single 
CPU page to the GPU, we therefore need to create multiple GPU page table 
entries for that CPU page.

We found that this was not being handled correctly in the SVM path and 
addressed it with the change in patch 2/8.

Given this, if the memory is allocated and mapped in GPU page-size (4K) 
granularity on the GPU side, could you please clarify how memory waste 
occurs in this scenario?

Thank you for your time and guidance.
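
A small sketch of the CPU-page to GPU-page expansion described above; the
macros mirror the amdgpu definitions, and PAGE_SIZE is hardcoded to 64K
here purely for illustration. One 64K CPU pfn expands to sixteen
consecutive 4K GPU pfns, i.e. sixteen GPU PTEs per CPU page.

#define PAGE_SIZE			65536UL	/* 64K system page, illustration only */
#define AMDGPU_GPU_PAGE_SIZE		4096UL
#define AMDGPU_GPU_PAGES_IN_CPU_PAGE	(PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE)

static inline unsigned long first_gpu_pfn(unsigned long cpu_pfn)
{
	return cpu_pfn * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
}

static inline unsigned long last_gpu_pfn(unsigned long cpu_pfn)
{
	return (cpu_pfn + 1) * AMDGPU_GPU_PAGES_IN_CPU_PAGE - 1;
}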


>
> Christian.
>
>> Alex
>>
>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>
>>>>>>
>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>> supported with amd gpu kernel driver.
>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>> That's a bummer :(
>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>
>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>
>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>
>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>> Not that I know off any.
>>>
>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>
>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>
>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>
>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>> due to 64K pagesize.
>>> Yeah, but this is all for ROCm and not the graphics side.
>>>
>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>
>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>> at least with these changes, we were able to see all the tests passing.
>>>>
>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>
>>>> Thanks a lot! That would be super helpful!
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>> Thanks again for the quick response on the patch series.
>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> -ritesh


* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-15  9:47         ` Christian König
  2025-12-15 10:11           ` Donet Tom
@ 2025-12-15 14:09           ` Alex Deucher
  2025-12-16 13:54             ` Donet Tom
  1 sibling, 1 reply; 44+ messages in thread
From: Alex Deucher @ 2025-12-15 14:09 UTC (permalink / raw)
  To: Christian König
  Cc: Ritesh Harjani (IBM), Donet Tom, amd-gfx, Felix Kuehling,
	Alex Deucher, Kent.Russell, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

On Mon, Dec 15, 2025 at 4:47 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 12/12/25 18:24, Alex Deucher wrote:
> > On Fri, Dec 12, 2025 at 8:19 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> >>> Christian König <christian.koenig@amd.com> writes:
> >>>>> Setup details:
> >>>>> ============
> >>>>> System details: Power10 LPAR using 64K pagesize.
> >>>>> AMD GPU:
> >>>>>   Name:                    gfx90a
> >>>>>   Marketing Name:          AMD Instinct MI210
> >>>>>
> >>>>> Queries:
> >>>>> =======
> >>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
> >>>>>    these changes. Is there anything else that you would suggest us to run to
> >>>>>    shake out any other page size related issues w.r.t the kernel driver?
> >>>>
> >>>> The ROCm team needs to answer that.
> >>>>
> >>>
> >>> Is there any separate mailing list or list of people whom we can cc
> >>> then?
> >>
> >> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
> >>
> >> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
> >>
> >>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
> >>>>>    size HW dependent? Should it be made PAGE_SIZE?
> >>>>
> >>>> Yes and no.
> >>>>
> >>>
> >>> If you could more elaborate on this please? I am assuming you would
> >>> anyway respond with more context / details on Patch-1 itself. If yes,
> >>> that would be great!
> >>
> >> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
> >>
> >> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
> >>
> >> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
> >>
> >>>>>
> >>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
> >>>>>    Do we need to lift this restriction and add MMIO remap support for systems with
> >>>>>    non-4K page sizes?
> >>>>
> >>>> The problem is the HW can't do this.
> >>>>
> >>>
> >>> We aren't that familiar with the HW / SW stack here. Wanted to understand
> >>> what functionality will be unsupported due to this HW limitation then?
> >>
> >> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
> >
> > Right.  There are some 4K pages with the MMIO register BAR which are
> > empty and registers can be remapped into them.  In this case we remap
> > the HDP flush registers into one of those register pages.  This allows
> > applications to flush the HDP write FIFO from either the CPU or
> > another device.  This is needed to flush data written by the CPU or
> > another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
> > it).  This is flushed internally as part of the shader dispatch
> > packets,
>
> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.

There is an explicit PM4 packet to flush the HDP cache for userqs, and
for AQL the flush is handled via one of the flags in the dispatch
packet.  The MMIO remap is needed for more fine-grained use cases
where you might have the CPU or another device operating in a
gang-like scenario with the GPU.

Alex

>
> That's the reason why ROCm needs the remapped MMIO register BAR.
>
> > but there are certain cases where an application may want
> > more control.  This is probably not a showstopper for most ROCm apps.
>
> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>
> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>
> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
>
> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
>
> > That said, the region is only 4K so if you allow applications to map a
> > larger region they would get access to GPU register pages which they
> > shouldn't have access to.
>
> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>
> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
>
> Christian.
>
> >
> > Alex
> >
> >>
> >>>>>
> >>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
> >>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
> >>>>>
> >>>>>
> >>>>> Please note that the changes in this series are on a best effort basis from our
> >>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
> >>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
> >>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
> >>>>> supported with amd gpu kernel driver.
> >>>>
> >>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
> >>>
> >>> That's a bummer :(
> >>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
> >>
> >> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
> >>
> >> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
> >>
> >> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
> >>
> >>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
> >>
> >> Not that I know off any.
> >>
> >>>> What we can do is to support graphics and MM, but that should already work out of the box.
> >>>>
> >>>
> >>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
> >>
> >> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
> >>
> >> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
> >>
> >>>> What we can do is to support graphics and MM, but that should already work out of the box.>
> >>> So these patches helped us resolve most of the issues like SDMA hangs
> >>> and GPU kernel page faults which we saw with rocr and rccl tests with
> >>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
> >>> due to 64K pagesize.
> >>
> >> Yeah, but this is all for ROCm and not the graphics side.
> >>
> >> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
> >>
> >>> AFAIU, some of these patches may require re-work based on reviews, but
> >>> at least with these changes, we were able to see all the tests passing.
> >>>
> >>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
> >>>>
> >>>
> >>> Thanks a lot! That would be super helpful!
> >>>
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>
> >>> Thanks again for the quick response on the patch series.
> >>
> >> You are welcome, but since it's so near to the end of the year not all people are available any more.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> -ritesh
> >>
>


* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-15 10:11           ` Donet Tom
@ 2025-12-15 16:11             ` Christian König
  2025-12-16 10:08               ` Donet Tom
  2025-12-17  9:46               ` Donet Tom
  0 siblings, 2 replies; 44+ messages in thread
From: Christian König @ 2025-12-15 16:11 UTC (permalink / raw)
  To: Donet Tom, Alex Deucher
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya

On 12/15/25 11:11, Donet Tom wrote:
> On 12/15/25 3:17 PM, Christian König wrote:
>> On 12/12/25 18:24, Alex Deucher wrote:
>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>> Setup details:
>>>>>>> ============
>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>> AMD GPU:
>>>>>>>    Name:                    gfx90a
>>>>>>>    Marketing Name:          AMD Instinct MI210
>>>>>>>
>>>>>>> Queries:
>>>>>>> =======
>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>>>     these changes. Is there anything else that you would suggest us to run to
>>>>>>>     shake out any other page size related issues w.r.t the kernel driver?
>>>>>> The ROCm team needs to answer that.
>>>>>>
>>>>> Is there any separate mailing list or list of people whom we can cc
>>>>> then?
>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>>
>>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>>>
>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>>>>>     size HW dependent? Should it be made PAGE_SIZE?
>>>>>> Yes and no.
>>>>>>
>>>>> If you could more elaborate on this please? I am assuming you would
>>>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>>>> that would be great!
>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>>
>>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>>>>
>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>>>
>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>>>>>     Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>>>     non-4K page sizes?
>>>>>> The problem is the HW can't do this.
>>>>>>
>>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>>>> what functionality will be unsupported due to this HW limitation then?
>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
>>> Right.  There are some 4K pages with the MMIO register BAR which are
>>> empty and registers can be remapped into them.  In this case we remap
>>> the HDP flush registers into one of those register pages.  This allows
>>> applications to flush the HDP write FIFO from either the CPU or
>>> another device.  This is needed to flush data written by the CPU or
>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
>>> it).  This is flushed internally as part of the shader dispatch
>>> packets,
>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
>>
>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>
>>> but there are certain cases where an application may want
>>> more control.  This is probably not a showstopper for most ROCm apps.
>> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>>
>> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>>
>> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
>>
>> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
>>
>>> That said, the region is only 4K so if you allow applications to map a
>>> larger region they would get access to GPU register pages which they
>>> shouldn't have access to.
>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>
>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
> 
> 
> Sorry, Cristian — I may be misunderstanding this point, so I would appreciate some clarification.
> 
> If the CPU page size is 64K and the GPU page size is 4K, then from the GPU side the page table entries are created and mapped at 4K granularity, while on the CPU side the pages remain 64K. To map a single CPU page to the GPU, we therefore need to create multiple GPU page table entries for that CPU page.

The GPU page tables are 4k in size no matter what the CPU page size is, and there is some special handling so that we can allocate them even under memory pressure. The background is that you sometimes need to split up higher order pages (1G, 2M) into lower order pages (2M, 4k), for example to be able to swap things out to system memory, and for that you need an extra layer of page tables.

The problem is now that those 4k page tables are rounded up to your CPU page size, which both wastes quite some memory and messes up the special handling meant to avoid OOM situations when swapping things out to system memory.

What we could potentially do is to switch to 64k pages on the GPU as well (the HW is flexible enough to be re-configurable), but that is tons of changes and probably not easily testable.

Regards,
Christian.
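
A back-of-the-envelope sketch of the rounding waste described above, with
purely illustrative numbers: each last-level GPU page table is a 4K object
but gets rounded up to the 64K CPU page size when allocated as a buffer
object.

#include <stdio.h>

int main(void)
{
	unsigned long pt_size  = 4096;	/* one 4K GPU page table (covers 2MiB)   */
	unsigned long cpu_page = 65536;	/* 64K system page size                  */
	unsigned long num_pts  = 4096;	/* page tables to split 8GiB down to 4K  */
	unsigned long waste    = (cpu_page - pt_size) * num_pts;

	printf("wasted VRAM: %lu MiB\n", waste >> 20);	/* 240 MiB */
	return 0;
}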

> 
> We found that this was not being handled correctly in the SVM path and addressed it with the change in patch 2/8.
> 
> Given this, if the memory is allocated and mapped in GPU page-size (4K) granularity on the GPU side, could you please clarify how memory waste occurs in this scenario?
> 
> Thank you for your time and guidance.
> 
> 
>>
>> Christian.
>>
>>> Alex
>>>
>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>
>>>>>>>
>>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>>> supported with amd gpu kernel driver.
>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>>> That's a bummer :(
>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>
>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>
>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>
>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>>> Not that I know off any.
>>>>
>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>
>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>
>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>>
>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>>> due to 64K pagesize.
>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>
>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>
>>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>>> at least with these changes, we were able to see all the tests passing.
>>>>>
>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>>
>>>>> Thanks a lot! That would be super helpful!
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>> Thanks again for the quick response on the patch series.
>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> -ritesh



* Re: [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get
  2025-12-12  6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
@ 2025-12-15 20:25   ` Philip Yang
  2025-12-16 10:12     ` Donet Tom
  0 siblings, 1 reply; 44+ messages in thread
From: Philip Yang @ 2025-12-15 20:25 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher,
	christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya



On 2025-12-12 01:40, Donet Tom wrote:
> HW-supported EOP buffer sizes are 4K and 32K. On systems that do not
> use 4K pages, the minimum buffer object (BO) allocation size is
> PAGE_SIZE (for example, 64K). During queue buffer acquisition, the driver
> currently checks the allocated BO size against the supported EOP buffer
> size. Since the allocated BO is larger than the expected size, this check
> fails, preventing queue creation.
>
> Relax the strict size validation and allow PAGE_SIZE-sized BOs to be used.
> Only the required 4K region of the buffer will be used as the EOP buffer
> and avoids queue creation failures on non-4K page systems.
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> index f1e7583650c4..dc857450fa16 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> @@ -199,6 +199,7 @@ int kfd_queue_buffer_get(struct amdgpu_vm *vm, void __user *addr, struct amdgpu_
>   	struct amdgpu_bo_va_mapping *mapping;
>   	u64 user_addr;
>   	u64 size;
> +	u64 bo_size;
>   
>   	user_addr = (u64)addr >> AMDGPU_GPU_PAGE_SHIFT;
>   	size = expected_size >> AMDGPU_GPU_PAGE_SHIFT;
> @@ -207,11 +208,12 @@ int kfd_queue_buffer_get(struct amdgpu_vm *vm, void __user *addr, struct amdgpu_
>   	if (!mapping)
>   		goto out_err;
>   
> -	if (user_addr != mapping->start ||
> -	    (size != 0 && user_addr + size - 1 != mapping->last)) {
> -		pr_debug("expected size 0x%llx not equal to mapping addr 0x%llx size 0x%llx\n",
> +	bo_size = mapping->last - mapping->start + 1;
> +
> +	if (user_addr != mapping->start || (size != 0 && bo_size < size)) {
> +		pr_debug("expected size 0x%llx grater than mapping addr 0x%llx size 0x%llx\n",
>   			expected_size, mapping->start << AMDGPU_GPU_PAGE_SHIFT,
> -			(mapping->last - mapping->start + 1) << AMDGPU_GPU_PAGE_SHIFT);
> +			bo_size <<  AMDGPU_GPU_PAGE_SHIFT);
This change works, but it also relaxes the size validation for the ring 
buffer size etc., which may have side effects; for example, FW and user 
space should have the same ring buffer size.

Other buffers already use PAGE_SIZE as the expected size, or a size 
aligned to PAGE_SIZE, so maybe only relax the EOP buffer size check:

@@ -275,7 +275,7 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 
 	/* EOP buffer is not required for all ASICs */
 	if (properties->eop_ring_buffer_address) {
-		if (properties->eop_ring_buffer_size != topo_dev->node_props.eop_buffer_size) {
+		if (properties->eop_ring_buffer_size < topo_dev->node_props.eop_buffer_size) {
 			pr_debug("queue eop bo size 0x%x not equal to node eop buf size 0x%x\n",
 				properties->eop_ring_buffer_size,
 				topo_dev->node_props.eop_buffer_size);
@@ -284,7 +284,7 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 		}
 		err = kfd_queue_buffer_get(vm, (void *)properties->eop_ring_buffer_address,
 					   &properties->eop_buf_bo,
-					   properties->eop_ring_buffer_size);
+					   ALIGN(properties->eop_ring_buffer_size, PAGE_SIZE));
 		if (err)
 			goto out_err_unreserve;
 	}

Regards,
Philip
>   		goto out_err;
>   	}
>   



* Re: [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes
  2025-12-12  6:40 ` [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes Donet Tom
@ 2025-12-15 20:44   ` Philip Yang
  2025-12-16 10:09     ` Donet Tom
  0 siblings, 1 reply; 44+ messages in thread
From: Philip Yang @ 2025-12-15 20:44 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher,
	christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya



On 2025-12-12 01:40, Donet Tom wrote:
> SVM range size is tracked using the system page size. The range start and
> end are aligned to system page-sized PFNs, so the total SVM range size
> equals the total number of pages in the SVM range multiplied by the system
> page size.
>
> The SVM range map/unmap functions pass these system page-sized PFN numbers
> to amdgpu_vm_update_range(), which expects PFNs based on the GPU page size
> (4K). On non-4K page systems, this mismatch causes only part of the SVM
> range to be mapped in the GPU page table, while the rest remains unmapped.
> If the GPU accesses an unmapped address within the same range, it results
> in a GPU page fault.
>
> To fix this, the required conversion has been added in both
> svm_range_map_to_gpu() and svm_range_unmap_from_gpu(), ensuring that all
> pages in the SVM range are correctly mapped on non-4K systems.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 30 ++++++++++++++++++++--------
>   1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 74a1d3e1d52b..a2636f2d6c71 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1314,11 +1314,16 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   			 struct dma_fence **fence)
>   {
>   	uint64_t init_pte_value = 0;
> +	uint64_t gpu_start, gpu_end;
>   
> -	pr_debug("[0x%llx 0x%llx]\n", start, last);
> +	// Convert CPU page range to GPU page range
> +	gpu_start = start * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
> +	gpu_end = (last + 1) * AMDGPU_GPU_PAGES_IN_CPU_PAGE - 1;
>   
> -	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, start,
> -				      last, init_pte_value, 0, 0, NULL, NULL,
> +	pr_debug("%s: CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", __func__,
Dynamic debug can already enable printing the function name and line
number with +pfl, so don't add __func__.
> +		 start, last, gpu_start, gpu_end);
> +	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, gpu_start,
> +				      gpu_end, init_pte_value, 0, 0, NULL, NULL,
>   				      fence);
>   }
>   
> @@ -1398,9 +1403,13 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
>   		 last_start, last_start + npages - 1, readonly);
>   
>   	for (i = offset; i < offset + npages; i++) {
> +		uint64_t gpu_start;
> +		uint64_t gpu_end;
> +
>   		last_domain = dma_addr[i] & SVM_RANGE_VRAM_DOMAIN;
>   		dma_addr[i] &= ~SVM_RANGE_VRAM_DOMAIN;
>   
> +
remove extra blank line.
>   		/* Collect all pages in the same address range and memory domain
>   		 * that can be mapped with a single call to update mapping.
>   		 */
> @@ -1415,17 +1424,22 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
>   		if (readonly)
>   			pte_flags &= ~AMDGPU_PTE_WRITEABLE;
>   
> -		pr_debug("svms 0x%p map [0x%lx 0x%llx] vram %d PTE 0x%llx\n",
> -			 prange->svms, last_start, prange->start + i,
> -			 (last_domain == SVM_RANGE_VRAM_DOMAIN) ? 1 : 0,
> -			 pte_flags);
>   
>   		/* For dGPU mode, we use same vm_manager to allocate VRAM for
>   		 * different memory partition based on fpfn/lpfn, we should use
>   		 * same vm_manager.vram_base_offset regardless memory partition.
>   		 */
> +		gpu_start = last_start * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
> +		gpu_end = (prange->start + i + 1) * AMDGPU_GPU_PAGES_IN_CPU_PAGE - 1;
> +
> +		pr_debug("svms 0x%p map CPU[0x%lx 0x%llx] GPU[0x%llx 0x%llx] vram %d PTE 0x%llx\n",
> +			 prange->svms, last_start, prange->start + i,
> +			 gpu_start, gpu_end,
> +			 (last_domain == SVM_RANGE_VRAM_DOMAIN) ? 1 : 0,
> +			 pte_flags);
> +
>   		r = amdgpu_vm_update_range(adev, vm, false, false, flush_tlb, true,
> -					   NULL, last_start, prange->start + i,
> +					   NULL, gpu_start, gpu_end,
>   					   pte_flags,
With those fixed, this looks good to me.

Reviewed-by: Philip Yang <Philip.Yang@amd.com>
>   					   (last_start - prange->start) << PAGE_SHIFT,
>   					   bo_adev ? bo_adev->vm_manager.vram_base_offset : 0,
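
For reference, a minimal sketch of the PFN-range conversion applied above,
assuming 64K CPU pages over 4K GPU pages (so AMDGPU_GPU_PAGES_IN_CPU_PAGE is
16); the numbers and the stand-in macro are illustrative only:

    #include <stdio.h>

    #define GPU_PAGES_IN_CPU_PAGE 16ULL   /* 64K / 4K, stand-in for the driver macro */

    int main(void)
    {
            unsigned long long start = 0x10, last = 0x1f;   /* 16 CPU pages = 1 MiB */

            unsigned long long gpu_start = start * GPU_PAGES_IN_CPU_PAGE;
            unsigned long long gpu_end   = (last + 1) * GPU_PAGES_IN_CPU_PAGE - 1;

            /* CPU[0x10 0x1f] -> GPU[0x100 0x1ff]: 256 4K pages, the same 1 MiB */
            printf("CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n",
                   start, last, gpu_start, gpu_end);
            return 0;
    }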


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map()
  2025-12-12  6:40 ` [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map() Donet Tom
@ 2025-12-15 21:03   ` Philip Yang
  0 siblings, 0 replies; 44+ messages in thread
From: Philip Yang @ 2025-12-15 21:03 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher,
	christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya



On 2025-12-12 01:40, Donet Tom wrote:
> In svm_migrate_gart_map(), while setting up the GART mapping for migration, the number of
> bytes copied for the GART table only accounts for CPU pages. On non-4K
> systems, each CPU page can contain multiple GPU pages, and the GART
> requires one 8-byte PTE per GPU page. As a result, an incorrect size was
> passed to the DMA, causing only a partial update of the GART table.
>
> Fix this function to work correctly on non-4K page-size systems by
> accounting for the number of GPU pages per CPU page when calculating the
> number of bytes to be copied.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 59a5a3fea65d..ea8377071c39 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -62,7 +62,7 @@ svm_migrate_gart_map(struct amdgpu_ring *ring, u64 npages,
>   	*gart_addr = adev->gmc.gart_start;
>   
>   	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
> -	num_bytes = npages * 8;
> +	num_bytes = npages * 8 * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>   
>   	r = amdgpu_job_alloc_with_ib(adev, &adev->mman.high_pr,
>   				     AMDGPU_FENCE_OWNER_UNDEFINED,
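
For reference, a small sketch of the PTE accounting fixed above, assuming a
64K CPU page size (16 GPU pages per CPU page, one 8-byte GART PTE per 4K GPU
page); the numbers are illustrative only:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long npages      = 4;                 /* CPU pages to migrate */
            unsigned long long gpu_per_cpu = 64 * 1024 / 4096;  /* 16 */

            unsigned long long old_bytes = npages * 8;               /*  32 bytes ->  4 PTEs */
            unsigned long long new_bytes = npages * 8 * gpu_per_cpu; /* 512 bytes -> 64 PTEs */

            printf("old=%llu bytes, new=%llu bytes\n", old_bytes, new_bytes);
            return 0;
    }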


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-15 16:11             ` Christian König
@ 2025-12-16 10:08               ` Donet Tom
  2025-12-16 16:06                 ` Christian König
  2025-12-17  9:46               ` Donet Tom
  1 sibling, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-16 10:08 UTC (permalink / raw)
  To: Christian König, Alex Deucher
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya


On 12/15/25 9:41 PM, Christian König wrote:
> On 12/15/25 11:11, Donet Tom wrote:
>> On 12/15/25 3:17 PM, Christian König wrote:
>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>>> <christian.koenig@amd.com> wrote:
>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>>> Setup details:
>>>>>>>> ============
>>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>>> AMD GPU:
>>>>>>>>     Name:                    gfx90a
>>>>>>>>     Marketing Name:          AMD Instinct MI210
>>>>>>>>
>>>>>>>> Queries:
>>>>>>>> =======
>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>>>>      these changes. Is there anything else that you would suggest us to run to
>>>>>>>>      shake out any other page size related issues w.r.t the kernel driver?
>>>>>>> The ROCm team needs to answer that.
>>>>>>>
>>>>>> Is there any separate mailing list or list of people whom we can cc
>>>>>> then?
>>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>>>
>>>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>>>>
>>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>>>>>>      size HW dependent? Should it be made PAGE_SIZE?
>>>>>>> Yes and no.
>>>>>>>
>>>>>> If you could more elaborate on this please? I am assuming you would
>>>>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>>>>> that would be great!
>>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>>>
>>>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>>>>>
>>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>>>>
>>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>>>>>>      Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>>>>      non-4K page sizes?
>>>>>>> The problem is the HW can't do this.
>>>>>>>
>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>>>>> what functionality will be unsupported due to this HW limitation then?
>>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
>>>> Right.  There are some 4K pages with the MMIO register BAR which are
>>>> empty and registers can be remapped into them.  In this case we remap
>>>> the HDP flush registers into one of those register pages.  This allows
>>>> applications to flush the HDP write FIFO from either the CPU or
>>>> another device.  This is needed to flush data written by the CPU or
>>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
>>>> it).  This is flushed internally as part of the shader dispatch
>>>> packets,
>>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
>>>
>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>
>>>> but there are certain cases where an application may want
>>>> more control.  This is probably not a showstopper for most ROCm apps.
>>> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>>>
>>> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>>>
>>> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
>>>
>>> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
>>>
>>>> That said, the region is only 4K so if you allow applications to map a
>>>> larger region they would get access to GPU register pages which they
>>>> shouldn't have access to.
>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>>
>>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
>>
>> Sorry, Cristian — I may be misunderstanding this point, so I would appreciate some clarification.
>>
>> If the CPU page size is 64K and the GPU page size is 4K, then from the GPU side the page table entries are created and mapped at 4K granularity, while on the CPU side the pages remain 64K. To map a single CPU page to the GPU, we therefore need to create multiple GPU page table entries for that CPU page.
> The GPU page tables are 4k in size no matter what the CPU page size is, and there is some special handling so that we can allocate them even under memory pressure. Background is that you sometimes need to split up higher order pages (1G, 2M) into lower order pages (2M, 4k), to be able to swap things to system memory for example, and for that you need an extra layer of page tables.
>
> The problem is now that those 4k pages are rounded up to your CPU page size, which both wastes quite some memory and messes up the special handling that avoids OOM situations when swapping things to system memory....
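
(As a rough illustration of that rounding, with a 64K CPU page size assumed
and purely illustrative numbers: a 4K GPU page table holds 512 8-byte PTEs
and covers 2 MiB of 4K mappings, but the smallest allocation backing it is
64K.)

    #include <stdio.h>

    int main(void)
    {
            unsigned long pt_bytes    = 512 * 8;    /* one 4K GPU page table */
            unsigned long alloc_bytes = 64 * 1024;  /* minimum BO with 64K CPU pages */

            /* each 2 MiB -> 4K split carries this much unused VRAM */
            printf("wasted per page table: %lu KiB\n",
                   (alloc_bytes - pt_bytes) / 1024);
            return 0;
    }
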
>
> What we could potentially do is to switch to 64k pages on the GPU as well (the HW is flexible enough to be re-configurable), but that is tons of changes and probably not easily testable.


If possible, could you share the steps to change the hardware page size? 
I can try testing it on our system.


>
> Regards,
> Christian.
>
>> We found that this was not being handled correctly in the SVM path and addressed it with the change in patch 2/8.
>>
>> Given this, if the memory is allocated and mapped in GPU page-size (4K) granularity on the GPU side, could you please clarify how memory waste occurs in this scenario?
>>
>> Thank you for your time and guidance.
>>
>>
>>> Christian.
>>>
>>>> Alex
>>>>
>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>
>>>>>>>>
>>>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>>>> supported with amd gpu kernel driver.
>>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>>>> That's a bummer :(
>>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>>
>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>>
>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>>
>>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>>>> Not that I know off any.
>>>>>
>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>>
>>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>
>>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>>>
>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>>>> due to 64K pagesize.
>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>
>>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>>
>>>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>>>> at least with these changes, we were able to see all the tests passing.
>>>>>>
>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>>>
>>>>>> Thanks a lot! That would be super helpful!
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>> Thanks again for the quick response on the patch series.
>>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes
  2025-12-15 20:44   ` Philip Yang
@ 2025-12-16 10:09     ` Donet Tom
  0 siblings, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-16 10:09 UTC (permalink / raw)
  To: Philip Yang, amd-gfx, Felix Kuehling, Alex Deucher,
	christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/16/25 2:14 AM, Philip Yang wrote:
>
>
> On 2025-12-12 01:40, Donet Tom wrote:
>> SVM range size is tracked using the system page size. The range start 
>> and
>> end are aligned to system page-sized PFNs, so the total SVM range size
>> equals the total number of pages in the SVM range multiplied by the 
>> system
>> page size.
>>
>> The SVM range map/unmap functions pass these system page-sized PFN 
>> numbers
>> to amdgpu_vm_update_range(), which expects PFNs based on the GPU page 
>> size
>> (4K). On non-4K page systems, this mismatch causes only part of the SVM
>> range to be mapped in the GPU page table, while the rest remains 
>> unmapped.
>> If the GPU accesses an unmapped address within the same range, it 
>> results
>> in a GPU page fault.
>>
>> To fix this, the required conversion has been added in both
>> svm_range_map_to_gpu() and svm_range_unmap_from_gpu(), ensuring that all
>> pages in the SVM range are correctly mapped on non-4K systems.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 30 ++++++++++++++++++++--------
>>   1 file changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index 74a1d3e1d52b..a2636f2d6c71 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -1314,11 +1314,16 @@ svm_range_unmap_from_gpu(struct amdgpu_device 
>> *adev, struct amdgpu_vm *vm,
>>                struct dma_fence **fence)
>>   {
>>       uint64_t init_pte_value = 0;
>> +    uint64_t gpu_start, gpu_end;
>>   -    pr_debug("[0x%llx 0x%llx]\n", start, last);
>> +    // Convert CPU page range to GPU page range
>> +    gpu_start = start * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>> +    gpu_end = (last + 1) * AMDGPU_GPU_PAGES_IN_CPU_PAGE - 1;
>>   -    return amdgpu_vm_update_range(adev, vm, false, true, true, 
>> false, NULL, start,
>> -                      last, init_pte_value, 0, 0, NULL, NULL,
>> +    pr_debug("%s: CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", 
>> __func__,
> dynamic debug control can enable function name, linenum print with 
> +pfl, don't add __func__.
>> +         start, last, gpu_start, gpu_end);
>> +    return amdgpu_vm_update_range(adev, vm, false, true, true, 
>> false, NULL, gpu_start,
>> +                      gpu_end, init_pte_value, 0, 0, NULL, NULL,
>>                         fence);
>>   }
>>   @@ -1398,9 +1403,13 @@ svm_range_map_to_gpu(struct 
>> kfd_process_device *pdd, struct svm_range *prange,
>>            last_start, last_start + npages - 1, readonly);
>>         for (i = offset; i < offset + npages; i++) {
>> +        uint64_t gpu_start;
>> +        uint64_t gpu_end;
>> +
>>           last_domain = dma_addr[i] & SVM_RANGE_VRAM_DOMAIN;
>>           dma_addr[i] &= ~SVM_RANGE_VRAM_DOMAIN;
>>   +
> remove extra blank line.
>>           /* Collect all pages in the same address range and memory 
>> domain
>>            * that can be mapped with a single call to update mapping.
>>            */
>> @@ -1415,17 +1424,22 @@ svm_range_map_to_gpu(struct 
>> kfd_process_device *pdd, struct svm_range *prange,
>>           if (readonly)
>>               pte_flags &= ~AMDGPU_PTE_WRITEABLE;
>>   -        pr_debug("svms 0x%p map [0x%lx 0x%llx] vram %d PTE 0x%llx\n",
>> -             prange->svms, last_start, prange->start + i,
>> -             (last_domain == SVM_RANGE_VRAM_DOMAIN) ? 1 : 0,
>> -             pte_flags);
>>             /* For dGPU mode, we use same vm_manager to allocate VRAM 
>> for
>>            * different memory partition based on fpfn/lpfn, we should 
>> use
>>            * same vm_manager.vram_base_offset regardless memory 
>> partition.
>>            */
>> +        gpu_start = last_start * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>> +        gpu_end = (prange->start + i + 1) * 
>> AMDGPU_GPU_PAGES_IN_CPU_PAGE - 1;
>> +
>> +        pr_debug("svms 0x%p map CPU[0x%lx 0x%llx] GPU[0x%llx 0x%llx] 
>> vram %d PTE 0x%llx\n",
>> +             prange->svms, last_start, prange->start + i,
>> +             gpu_start, gpu_end,
>> +             (last_domain == SVM_RANGE_VRAM_DOMAIN) ? 1 : 0,
>> +             pte_flags);
>> +
>>           r = amdgpu_vm_update_range(adev, vm, false, false, 
>> flush_tlb, true,
>> -                       NULL, last_start, prange->start + i,
>> +                       NULL, gpu_start, gpu_end,
>>                          pte_flags,
> With those fixed, this looks good to me.


Thank you. I will fix them in the next version.


>
> Reviewed-by: Philip Yang <Philip.Yang@amd.com>
>>                          (last_start - prange->start) << PAGE_SHIFT,
>>                          bo_adev ? 
>> bo_adev->vm_manager.vram_base_offset : 0,
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get
  2025-12-15 20:25   ` Philip Yang
@ 2025-12-16 10:12     ` Donet Tom
  0 siblings, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-16 10:12 UTC (permalink / raw)
  To: Philip Yang, amd-gfx, Felix Kuehling, Alex Deucher,
	christian.koenig
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/16/25 1:55 AM, Philip Yang wrote:
>
>
> On 2025-12-12 01:40, Donet Tom wrote:
>> HW-supported EOP buffer sizes are 4K and 32K. On systems that do not
>> use 4K pages, the minimum buffer object (BO) allocation size is
>> PAGE_SIZE (for example, 64K). During queue buffer acquisition, the 
>> driver
>> currently checks the allocated BO size against the supported EOP buffer
>> size. Since the allocated BO is larger than the expected size, this 
>> check
>> fails, preventing queue creation.
>>
>> Relax the strict size validation and allow PAGE_SIZE-sized BOs to be 
>> used.
>> Only the required 4K region of the buffer will be used as the EOP buffer
>> and avoids queue creation failures on non-4K page systems.
>>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 10 ++++++----
>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> index f1e7583650c4..dc857450fa16 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> @@ -199,6 +199,7 @@ int kfd_queue_buffer_get(struct amdgpu_vm *vm, 
>> void __user *addr, struct amdgpu_
>>       struct amdgpu_bo_va_mapping *mapping;
>>       u64 user_addr;
>>       u64 size;
>> +    u64 bo_size;
>>         user_addr = (u64)addr >> AMDGPU_GPU_PAGE_SHIFT;
>>       size = expected_size >> AMDGPU_GPU_PAGE_SHIFT;
>> @@ -207,11 +208,12 @@ int kfd_queue_buffer_get(struct amdgpu_vm *vm, 
>> void __user *addr, struct amdgpu_
>>       if (!mapping)
>>           goto out_err;
>>   -    if (user_addr != mapping->start ||
>> -        (size != 0 && user_addr + size - 1 != mapping->last)) {
>> -        pr_debug("expected size 0x%llx not equal to mapping addr 
>> 0x%llx size 0x%llx\n",
>> +    bo_size = mapping->last - mapping->start + 1;
>> +
>> +    if (user_addr != mapping->start || (size != 0 && bo_size < size)) {
>> +        pr_debug("expected size 0x%llx greater than mapping addr 
>> 0x%llx size 0x%llx\n",
>>               expected_size, mapping->start << AMDGPU_GPU_PAGE_SHIFT,
>> -            (mapping->last - mapping->start + 1) << 
>> AMDGPU_GPU_PAGE_SHIFT);
>> +            bo_size <<  AMDGPU_GPU_PAGE_SHIFT);
> This change works, but it also relaxes the size validation for the ring
> buffer size etc., which may have side effects; for example, FW and user
> space should agree on the same ring buffer size.
>
> Other buffers already use PAGE_SIZE as the expected size, or a size
> aligned to PAGE_SIZE, so maybe only relax the EOP buffer size check:
>
> @@ -275,7 +275,7 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
>
>         /* EOP buffer is not required for all ASICs */
>         if (properties->eop_ring_buffer_address) {
> -               if (properties->eop_ring_buffer_size != topo_dev->node_props.eop_buffer_size) {
> +               if (properties->eop_ring_buffer_size < topo_dev->node_props.eop_buffer_size) {
>                         pr_debug("queue eop bo size 0x%x not equal to node eop buf size 0x%x\n",
>                                  properties->eop_ring_buffer_size,
>                                  topo_dev->node_props.eop_buffer_size);
> @@ -284,7 +284,7 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
>                 }
>                 err = kfd_queue_buffer_get(vm, (void *)properties->eop_ring_buffer_address,
>                                            &properties->eop_buf_bo,
> -                                          properties->eop_ring_buffer_size);
> +                                          ALIGN(properties->eop_ring_buffer_size, PAGE_SIZE));
>                 if (err)
>                         goto out_err_unreserve;
>         }


Thank you. I will make this change in the next version.


>
> Regards,
> Philip
>>           goto out_err;
>>       }
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-15 14:09           ` Alex Deucher
@ 2025-12-16 13:54             ` Donet Tom
  2025-12-16 14:02               ` Alex Deucher
  0 siblings, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-16 13:54 UTC (permalink / raw)
  To: Alex Deucher, Christian König
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya


On 12/15/25 7:39 PM, Alex Deucher wrote:
> On Mon, Dec 15, 2025 at 4:47 AM Christian König
> <christian.koenig@amd.com> wrote:
>> On 12/12/25 18:24, Alex Deucher wrote:
>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>> Setup details:
>>>>>>> ============
>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>> AMD GPU:
>>>>>>>    Name:                    gfx90a
>>>>>>>    Marketing Name:          AMD Instinct MI210
>>>>>>>
>>>>>>> Queries:
>>>>>>> =======
>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>>>     these changes. Is there anything else that you would suggest us to run to
>>>>>>>     shake out any other page size related issues w.r.t the kernel driver?
>>>>>> The ROCm team needs to answer that.
>>>>>>
>>>>> Is there any separate mailing list or list of people whom we can cc
>>>>> then?
>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>>
>>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>>>
>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>>>>>     size HW dependent? Should it be made PAGE_SIZE?
>>>>>> Yes and no.
>>>>>>
>>>>> If you could more elaborate on this please? I am assuming you would
>>>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>>>> that would be great!
>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>>
>>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>>>>
>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>>>
>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>>>>>     Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>>>     non-4K page sizes?
>>>>>> The problem is the HW can't do this.
>>>>>>
>>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>>>> what functionality will be unsupported due to this HW limitation then?
>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
>>> Right.  There are some 4K pages with the MMIO register BAR which are
>>> empty and registers can be remapped into them.  In this case we remap
>>> the HDP flush registers into one of those register pages.  This allows
>>> applications to flush the HDP write FIFO from either the CPU or
>>> another device.  This is needed to flush data written by the CPU or
>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
>>> it).  This is flushed internally as part of the shader dispatch
>>> packets,
>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
> There is an explicit PM4 packet to flush the HDP cache for userqs and
> for AQL the flush is handled via one of the flags in the dispatch
> packet.  The MMIO remap is needed for more fine grained use cases
> where you might have the CPU or another device operating in a gang
> like scenario with the GPU.


Thank you, Alex.

We were encountering an issue while running the RCCL unit tests. With 2 
GPUs, all tests passed successfully; however, when running with more 
than 2 GPUs, the tests began to fail at random points with the following 
errors:

[  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for 
queue with doorbell_id: 80030008
[  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
[  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
[  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for 
queue with doorbell_id: 80030008
[  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
[  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues


After applying patches 7/8 and 8/8, we are no longer seeing this issue.

One question I have is: we only started observing this problem when the 
number of GPUs increased. Could this be related to MMIO remapping not 
being available?


> Alex
>
>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>
>>> but there are certain cases where an application may want
>>> more control.  This is probably not a showstopper for most ROCm apps.
>> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>>
>> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>>
>> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
>>
>> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
>>
>>> That said, the region is only 4K so if you allow applications to map a
>>> larger region they would get access to GPU register pages which they
>>> shouldn't have access to.
>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>
>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
>>
>> Christian.
>>
>>> Alex
>>>
>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>
>>>>>>>
>>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>>> supported with amd gpu kernel driver.
>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>>> That's a bummer :(
>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>
>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>
>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>
>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>>> Not that I know off any.
>>>>
>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>
>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>
>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>>
>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>>> due to 64K pagesize.
>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>
>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>
>>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>>> at least with these changes, we were able to see all the tests passing.
>>>>>
>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>>
>>>>> Thanks a lot! That would be super helpful!
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>> Thanks again for the quick response on the patch series.
>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-16 13:54             ` Donet Tom
@ 2025-12-16 14:02               ` Alex Deucher
  2025-12-17  9:03                 ` Donet Tom
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Deucher @ 2025-12-16 14:02 UTC (permalink / raw)
  To: Donet Tom
  Cc: Christian König, Ritesh Harjani (IBM), amd-gfx,
	Felix Kuehling, Alex Deucher, Kent.Russell,
	Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya

On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <donettom@linux.ibm.com> wrote:
>
>
> On 12/15/25 7:39 PM, Alex Deucher wrote:
> > On Mon, Dec 15, 2025 at 4:47 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> On 12/12/25 18:24, Alex Deucher wrote:
> >>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> >>>>> Christian König <christian.koenig@amd.com> writes:
> >>>>>>> Setup details:
> >>>>>>> ============
> >>>>>>> System details: Power10 LPAR using 64K pagesize.
> >>>>>>> AMD GPU:
> >>>>>>>    Name:                    gfx90a
> >>>>>>>    Marketing Name:          AMD Instinct MI210
> >>>>>>>
> >>>>>>> Queries:
> >>>>>>> =======
> >>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
> >>>>>>>     these changes. Is there anything else that you would suggest us to run to
> >>>>>>>     shake out any other page size related issues w.r.t the kernel driver?
> >>>>>> The ROCm team needs to answer that.
> >>>>>>
> >>>>> Is there any separate mailing list or list of people whom we can cc
> >>>>> then?
> >>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
> >>>>
> >>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
> >>>>
> >>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
> >>>>>>>     size HW dependent? Should it be made PAGE_SIZE?
> >>>>>> Yes and no.
> >>>>>>
> >>>>> If you could more elaborate on this please? I am assuming you would
> >>>>> anyway respond with more context / details on Patch-1 itself. If yes,
> >>>>> that would be great!
> >>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
> >>>>
> >>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
> >>>>
> >>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
> >>>>
> >>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
> >>>>>>>     Do we need to lift this restriction and add MMIO remap support for systems with
> >>>>>>>     non-4K page sizes?
> >>>>>> The problem is the HW can't do this.
> >>>>>>
> >>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
> >>>>> what functionality will be unsupported due to this HW limitation then?
> >>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
> >>> Right.  There are some 4K pages with the MMIO register BAR which are
> >>> empty and registers can be remapped into them.  In this case we remap
> >>> the HDP flush registers into one of those register pages.  This allows
> >>> applications to flush the HDP write FIFO from either the CPU or
> >>> another device.  This is needed to flush data written by the CPU or
> >>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
> >>> it).  This is flushed internally as part of the shader dispatch
> >>> packets,
> >> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
> > There is an explicit PM4 packet to flush the HDP cache for userqs and
> > for AQL the flush is handled via one of the flags in the dispatch
> > packet.  The MMIO remap is needed for more fine grained use cases
> > where you might have the CPU or another device operating in a gang
> > like scenario with the GPU.
>
>
> Thank you, Alex.
>
> We were encountering an issue while running the RCCL unit tests. With 2
> GPUs, all tests passed successfully; however, when running with more
> than 2 GPUs, the tests began to fail at random points with the following
> errors:
>
> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for
> queue with doorbell_id: 80030008
> [  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> [  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
> [  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for
> queue with doorbell_id: 80030008
> [  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> [  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>
>
> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
>
> One question I have is: we only started observing this problem when the
> number of GPUs increased. Could this be related to MMIO remapping not
> being available?

It could be.  E.g., if the CPU or a GPU writes data to VRAM on another
GPU, you will need to flush the HDP to make sure that data hits VRAM
before the GPU attached to the VRAM can see it.

Alex

>
>
> > Alex
> >
> >> That's the reason why ROCm needs the remapped MMIO register BAR.
> >>
> >>> but there are certain cases where an application may want
> >>> more control.  This is probably not a showstopper for most ROCm apps.
> >> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
> >>
> >> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
> >>
> >> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
> >>
> >> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
> >>
> >>> That said, the region is only 4K so if you allow applications to map a
> >>> larger region they would get access to GPU register pages which they
> >>> shouldn't have access to.
> >> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
> >>
> >> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
> >>
> >> Christian.
> >>
> >>> Alex
> >>>
> >>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
> >>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
> >>>>>>>
> >>>>>>>
> >>>>>>> Please note that the changes in this series are on a best effort basis from our
> >>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
> >>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
> >>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
> >>>>>>> supported with amd gpu kernel driver.
> >>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
> >>>>> That's a bummer :(
> >>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
> >>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
> >>>>
> >>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
> >>>>
> >>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
> >>>>
> >>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
> >>>> Not that I know off any.
> >>>>
> >>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
> >>>>>>
> >>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
> >>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
> >>>>
> >>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
> >>>>
> >>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
> >>>>> So these patches helped us resolve most of the issues like SDMA hangs
> >>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
> >>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
> >>>>> due to 64K pagesize.
> >>>> Yeah, but this is all for ROCm and not the graphics side.
> >>>>
> >>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
> >>>>
> >>>>> AFAIU, some of these patches may require re-work based on reviews, but
> >>>>> at least with these changes, we were able to see all the tests passing.
> >>>>>
> >>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
> >>>>>>
> >>>>> Thanks a lot! That would be super helpful!
> >>>>>
> >>>>>
> >>>>>> Regards,
> >>>>>> Christian.
> >>>>>>
> >>>>> Thanks again for the quick response on the patch series.
> >>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-16 10:08               ` Donet Tom
@ 2025-12-16 16:06                 ` Christian König
  2025-12-17  9:04                   ` Donet Tom
  0 siblings, 1 reply; 44+ messages in thread
From: Christian König @ 2025-12-16 16:06 UTC (permalink / raw)
  To: Donet Tom, Alex Deucher
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya

On 12/16/25 11:08, Donet Tom wrote:
>> The GPU page tables are 4k in size no matter what the CPU page size is and there is some special handling so that we can allocate them even under memory pressure. Background is that you sometimes need to split up higher order pages (1G, 2M) into lower order pages (2M, 4k) to be able to swap things to system memory for example and for that you need some an extra layer of page tables.
>>
>> The problem is now that those 4k pages are rounded up to your CPU page size, resulting in both wasting quite some memory as well as messing up the special handling to not run into OOM situations when swapping things to system memory....
>>
>> What we could potentially do is to switch to 64k pages on the GPU as well (the HW is flexible enough to be re-configurable), but that is tons of changes and probably not easily testable.
> 
> 
> If possible, could you share the steps to change the hardware page size? I can try testing it on our system.

Just typing this down off the top of my head, so don't hold me to 100% correctness.

Modern HW, e.g. gfx9/Vega and newer including all MI* products, has a maximum of 48 bits of address space.

Those 48 bits are divided across multiple page directories (PDs) and a leaf page table (PT).

IIRC the vm_block_size module parameter controls the size of the PDs. If you set that to 13 instead of the default 9, you should already get 64k PDs instead of 4k PDs. But take that with a grain of salt; I think we haven't tested that parameter in the last 10 years or so.

Then each page directory entry on level 0 (PDE0) has a field called block fragment size (see AMDGPU_PDE_BFS for MI products). This controls how much memory each page table entry (PTE) finally points to.

So putting it all together, you should be able to have a configuration with two levels of PDs, each covering 13 bits of address space and 64k in size, plus a PT covering 18 bits of address space and 2M in size, where each PTE points to a 64k block.

Here are the relevant bits from function amdgpu_vm_adjust_size():
...
        tmp = roundup_pow_of_two(adev->vm_manager.max_pfn);
        if (amdgpu_vm_block_size != -1)
                tmp >>= amdgpu_vm_block_size - 9;
        tmp = DIV_ROUND_UP(fls64(tmp) - 1, 9) - 1;
        adev->vm_manager.num_level = min_t(unsigned int, max_level, tmp);
        switch (adev->vm_manager.num_level) {
        case 3:
                adev->vm_manager.root_level = AMDGPU_VM_PDB2;
                break;
        case 2:
                adev->vm_manager.root_level = AMDGPU_VM_PDB1;
                break;
        case 1:
                adev->vm_manager.root_level = AMDGPU_VM_PDB0;
                break;
        default:
                dev_err(adev->dev, "VMPT only supports 2~4+1 levels\n");
        }
        /* block size depends on vm size and hw setup*/
        if (amdgpu_vm_block_size != -1)
                adev->vm_manager.block_size =
                        min((unsigned)amdgpu_vm_block_size, max_bits
                            - AMDGPU_GPU_PAGE_SHIFT
                            - 9 * adev->vm_manager.num_level);
        else if (adev->vm_manager.num_level > 1)
                adev->vm_manager.block_size = 9;
        else
                adev->vm_manager.block_size = amdgpu_vm_get_block_size(tmp);

        if (amdgpu_vm_fragment_size == -1)
                adev->vm_manager.fragment_size = fragment_size_default;
        else
                adev->vm_manager.fragment_size = amdgpu_vm_fragment_size;
...

But again, that is probably tons of work, since the AMDGPU_PAGE_SIZE macro needs to change as well and I'm not sure whether the FW internally assumes that we have 4k pages somewhere.

Regards,
Christian.
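
As a quick sanity check of the per-table sizes mentioned above, a tiny sketch
(block_bits here is just the entry-count exponent used for illustration, not
a driver symbol):

    #include <stdio.h>

    int main(void)
    {
            /* each page-table level holds 2^block_bits entries of 8 bytes */
            for (int block_bits = 9; block_bits <= 13; block_bits += 4)
                    printf("%d bits -> %lu KiB per table\n",
                           block_bits, (1UL << block_bits) * 8 / 1024);
            /* 9 bits -> 4 KiB tables (the default), 13 bits -> 64 KiB tables */
            return 0;
    }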

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-16 14:02               ` Alex Deucher
@ 2025-12-17  9:03                 ` Donet Tom
  2025-12-17 14:23                   ` Alex Deucher
  0 siblings, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-17  9:03 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, Ritesh Harjani (IBM), amd-gfx,
	Felix Kuehling, Alex Deucher, Kent.Russell,
	Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya


On 12/16/25 7:32 PM, Alex Deucher wrote:
> On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <donettom@linux.ibm.com> wrote:
>>
>> On 12/15/25 7:39 PM, Alex Deucher wrote:
>>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>>>> Setup details:
>>>>>>>>> ============
>>>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>>>> AMD GPU:
>>>>>>>>>     Name:                    gfx90a
>>>>>>>>>     Marketing Name:          AMD Instinct MI210
>>>>>>>>>
>>>>>>>>> Queries:
>>>>>>>>> =======
>>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>>>>>      these changes. Is there anything else that you would suggest us to run to
>>>>>>>>>      shake out any other page size related issues w.r.t the kernel driver?
>>>>>>>> The ROCm team needs to answer that.
>>>>>>>>
>>>>>>> Is there any separate mailing list or list of people whom we can cc
>>>>>>> then?
>>>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>>>>
>>>>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>>>>>
>>>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>>>>>>>      size HW dependent? Should it be made PAGE_SIZE?
>>>>>>>> Yes and no.
>>>>>>>>
>>>>>>> If you could more elaborate on this please? I am assuming you would
>>>>>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>>>>>> that would be great!
>>>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>>>>
>>>>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>>>>>>
>>>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>>>>>
>>>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>>>>>>>      Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>>>>>      non-4K page sizes?
>>>>>>>> The problem is the HW can't do this.
>>>>>>>>
>>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>>>>>> what functionality will be unsupported due to this HW limitation then?
>>>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
>>>>> Right.  There are some 4K pages with the MMIO register BAR which are
>>>>> empty and registers can be remapped into them.  In this case we remap
>>>>> the HDP flush registers into one of those register pages.  This allows
>>>>> applications to flush the HDP write FIFO from either the CPU or
>>>>> another device.  This is needed to flush data written by the CPU or
>>>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
>>>>> it).  This is flushed internally as part of the shader dispatch
>>>>> packets,
>>>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
>>> There is an explicit PM4 packet to flush the HDP cache for userqs and
>>> for AQL the flush is handled via one of the flags in the dispatch
>>> packet.  The MMIO remap is needed for more fine grained use cases
>>> where you might have the CPU or another device operating in a gang
>>> like scenario with the GPU.
>>
>> Thank you, Alex.
>>
>> We were encountering an issue while running the RCCL unit tests. With 2
>> GPUs, all tests passed successfully; however, when running with more
>> than 2 GPUs, the tests began to fail at random points with the following
>> errors:
>>
>> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for
>> queue with doorbell_id: 80030008
>> [  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>> [  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
>> [  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for
>> queue with doorbell_id: 80030008
>> [  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>> [  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>>
>>
>> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
>>
>> One question I have is: we only started observing this problem when the
>> number of GPUs increased. Could this be related to MMIO remapping not
>> being available?
> It could be.  E.g., if the CPU or a GPU writes data to VRAM on another
> GPU, you will need to flush the HDP to make sure that data hits VRAM
> before the GPU attached to the VRAM can see it.


Thanks Alex

I now suspect that the queue preemption issue may be related to the 
unavailability of MMIO remapping. I am not very familiar with this area.

Could you please point me to the relevant code path where the PM4 packet 
is issued to flush the HDP cache?

I am consistently able to reproduce this issue on my system when using 
more than three GPUs if patches 7/8 and 8/8 are not applied. In your 
opinion, is there anything that can be done to speed up the HDP flush or 
to avoid this situation altogether?



>
> Alex
>
>>
>>> Alex
>>>
>>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>>
>>>>> but there are certain cases where an application may want
>>>>> more control.  This is probably not a showstopper for most ROCm apps.
>>>> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>>>>
>>>> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>>>>
>>>> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
>>>>
>>>> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
>>>>
>>>>> That said, the region is only 4K so if you allow applications to map a
>>>>> larger region they would get access to GPU register pages which they
>>>>> shouldn't have access to.
>>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>>>
>>>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
>>>>
>>>> Christian.
>>>>
>>>>> Alex
>>>>>
>>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>>>>> supported with amd gpu kernel driver.
>>>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>>>>> That's a bummer :(
>>>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>>>
>>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>>>
>>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>>>
>>>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>>>>> Not that I know off any.
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>>>
>>>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>>>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>>
>>>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>>>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>>>>> due to 64K pagesize.
>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>
>>>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>>>
>>>>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>>>>> at least with these changes, we were able to see all the tests passing.
>>>>>>>
>>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>>>>
>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>> Thanks again for the quick response on the patch series.
>>>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-16 16:06                 ` Christian König
@ 2025-12-17  9:04                   ` Donet Tom
  0 siblings, 0 replies; 44+ messages in thread
From: Donet Tom @ 2025-12-17  9:04 UTC (permalink / raw)
  To: Christian König, Alex Deucher
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya


On 12/16/25 9:36 PM, Christian König wrote:
> On 12/16/25 11:08, Donet Tom wrote:
>>> The GPU page tables are 4k in size no matter what the CPU page size is and there is some special handling so that we can allocate them even under memory pressure. Background is that you sometimes need to split up higher order pages (1G, 2M) into lower order pages (2M, 4k) to be able to swap things to system memory for example and for that you need some an extra layer of page tables.
>>>
>>> The problem is now that those 4k pages are rounded up to your CPU page size, resulting in both wasting quite some memory as well as messing up the special handling to not run into OOM situations when swapping things to system memory....
>>>
>>> What we could potentially do is to switch to 64k pages on the GPU as well (the HW is flexible enough to be re-configurable), but that is tons of changes and probably not easily testable.
>>
>> If possible, could you share the steps to change the hardware page size? I can try testing it on our system.
> Just typing this down off the top of my head, so don't nail me on 100% correctness.
>
> Modern HW, e.g. gfx9/Vega and newer including all MI* products, has a maximum of 48 bits of address space.
>
> Those 48 bits are divided over multiple page directories (PDs) and a leaf page table (PT).
>
> IIRC the vm_block_size module parameter controls the size of the PDs. If you set that to 13 instead of the default 9 you should already get 64k PDs instead of 4k PDs. But take that with a grain of salt; I think we haven't tested that parameter in the last 10 years or so.
>
> Then each page directory entry on level 0 (PDE0) has a field called block fragment size (see AMDGPU_PDE_BFS for MI products). This controls how much memory each page table entry (PTE) finally points to.
>
> So putting it all together you should be able to have a configuration with two levels of PDs, each covering 13 bits of address space and 64k in size, plus a PT covering 18 bits of address space and 2M in size, where each PTE points to a 64k block.
>
> Here are the relevant bits from function amdgpu_vm_adjust_size():
> ...
>          tmp = roundup_pow_of_two(adev->vm_manager.max_pfn);
>          if (amdgpu_vm_block_size != -1)
>                  tmp >>= amdgpu_vm_block_size - 9;
>          tmp = DIV_ROUND_UP(fls64(tmp) - 1, 9) - 1;
>          adev->vm_manager.num_level = min_t(unsigned int, max_level, tmp);
>          switch (adev->vm_manager.num_level) {
>          case 3:
>                  adev->vm_manager.root_level = AMDGPU_VM_PDB2;
>                  break;
>          case 2:
>                  adev->vm_manager.root_level = AMDGPU_VM_PDB1;
>                  break;
>          case 1:
>                  adev->vm_manager.root_level = AMDGPU_VM_PDB0;
>                  break;
>          default:
>                  dev_err(adev->dev, "VMPT only supports 2~4+1 levels\n");
>          }
>          /* block size depends on vm size and hw setup*/
>          if (amdgpu_vm_block_size != -1)
>                  adev->vm_manager.block_size =
>                          min((unsigned)amdgpu_vm_block_size, max_bits
>                              - AMDGPU_GPU_PAGE_SHIFT
>                              - 9 * adev->vm_manager.num_level);
>          else if (adev->vm_manager.num_level > 1)
>                  adev->vm_manager.block_size = 9;
>          else
>                  adev->vm_manager.block_size = amdgpu_vm_get_block_size(tmp);
>
>          if (amdgpu_vm_fragment_size == -1)
>                  adev->vm_manager.fragment_size = fragment_size_default;
>          else
>                  adev->vm_manager.fragment_size = amdgpu_vm_fragment_size;


Thanks Christian

I will try it.
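
Concretely, the first thing to try seems to be booting with the module 
parameter mentioned above (untested, and per the quoted 
amdgpu_vm_adjust_size() code the value may still get clamped depending 
on vm_size):

    amdgpu.vm_block_size=13

and then look at the PDE0 block-fragment-size side on top of that.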


> ...
>
> But again, that is probably tons of work since the AMDGPU_PAGE_SIZE macro needs to change as well and I'm not sure if the FW doesn't internally assume that we have 4k pages somewhere.
>
> Regards,
> Christian.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-15 16:11             ` Christian König
  2025-12-16 10:08               ` Donet Tom
@ 2025-12-17  9:46               ` Donet Tom
  2025-12-17 10:10                 ` Christian König
  1 sibling, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-17  9:46 UTC (permalink / raw)
  To: Christian König, Alex Deucher
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya


On 12/15/25 9:41 PM, Christian König wrote:
> On 12/15/25 11:11, Donet Tom wrote:
>> On 12/15/25 3:17 PM, Christian König wrote:
>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>>> <christian.koenig@amd.com> wrote:
>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>>> Setup details:
>>>>>>>> ============
>>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>>> AMD GPU:
>>>>>>>>     Name:                    gfx90a
>>>>>>>>     Marketing Name:          AMD Instinct MI210
>>>>>>>>
>>>>>>>> Queries:
>>>>>>>> =======
>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
>>>>>>>>      these changes. Is there anything else that you would suggest us to run to
>>>>>>>>      shake out any other page size related issues w.r.t the kernel driver?
>>>>>>> The ROCm team needs to answer that.
>>>>>>>
>>>>>> Is there any separate mailing list or list of people whom we can cc
>>>>>> then?
>>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
>>>>>
>>>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
>>>>>
>>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
>>>>>>>>      size HW dependent? Should it be made PAGE_SIZE?
>>>>>>> Yes and no.
>>>>>>>
>>>>>> If you could more elaborate on this please? I am assuming you would
>>>>>> anyway respond with more context / details on Patch-1 itself. If yes,
>>>>>> that would be great!
>>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
>>>>>
>>>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
>>>>>
>>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
>>>>>
>>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
>>>>>>>>      Do we need to lift this restriction and add MMIO remap support for systems with
>>>>>>>>      non-4K page sizes?
>>>>>>> The problem is the HW can't do this.
>>>>>>>
>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
>>>>>> what functionality will be unsupported due to this HW limitation then?
>>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
>>>> Right.  There are some 4K pages with the MMIO register BAR which are
>>>> empty and registers can be remapped into them.  In this case we remap
>>>> the HDP flush registers into one of those register pages.  This allows
>>>> applications to flush the HDP write FIFO from either the CPU or
>>>> another device.  This is needed to flush data written by the CPU or
>>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
>>>> it).  This is flushed internally as part of the shader dispatch
>>>> packets,
>>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
>>>
>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>
>>>> but there are certain cases where an application may want
>>>> more control.  This is probably not a showstopper for most ROCm apps.
>>> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
>>>
>>> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
>>>
>>> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
>>>
>>> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
>>>
>>>> That said, the region is only 4K so if you allow applications to map a
>>>> larger region they would get access to GPU register pages which they
>>>> shouldn't have access to.
>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>>
>>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
>>
>> Sorry, Cristian — I may be misunderstanding this point, so I would appreciate some clarification.
>>
>> If the CPU page size is 64K and the GPU page size is 4K, then from the GPU side the page table entries are created and mapped at 4K granularity, while on the CPU side the pages remain 64K. To map a single CPU page to the GPU, we therefore need to create multiple GPU page table entries for that CPU page.
> The GPU page tables are 4k in size no matter what the CPU page size is and there is some special handling so that we can allocate them even under memory pressure. Background is that you sometimes need to split up higher order pages (1G, 2M) into lower order pages (2M, 4k) to be able to swap things to system memory for example and for that you need some an extra layer of page tables.
>
> The problem is now that those 4k pages are rounded up to your CPU page size, resulting in both wasting quite some memory as well as messing up the special handling to not run into OOM situations when swapping things to system memory....


Thank you, Christian, for the clarification.

When you say swapping to system memory, does that mean SVM migration to 
DRAM?

 From my understanding of the code, SVM ranges are tracked in system 
page-size PFNs, i.e. at 64 KB granularity on our system. With a 64 KB 
base page size, buffer objects (BOs) are allocated in 64 KB-aligned 
chunks, both in VRAM and GTT, while the GPU page-table mappings are 
still created using 4 KB pages.
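
To make the fan-out concrete (a minimal sketch, not the exact code in 
patch 2/8; the helper name is made up):

    /* PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE == 16 with 64 KB base pages, so
     * every CPU pfn expands to 16 consecutive 4 KB GPU pfns.
     */
    static inline u64 cpu_pfn_to_gpu_pfn(u64 cpu_pfn)
    {
            return cpu_pfn * (PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE);
    }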

During SVM migration from VRAM to system memory, I observed that an 
entire 64 KB page is migrated. Similarly, when XNACK is enabled, if the 
GPU faults on a 4 KB range, my understanding is that the whole 
surrounding 64 KB page is migrated.

If my understanding is correct, allocating 4 KB of memory on a 64 KB 
page-size system results in a 64 KB BO allocation, meaning that around 
60 KB is effectively wasted. Are you referring to this kind of 
over-allocation potentially leading to OOM situations under memory pressure?

Since I am still getting familiar with the AMDGPU codebase, could you 
please point me to the locations where special handling is implemented 
to avoid OOM conditions during swapping or migration?


>
> What we could potentially do is to switch to 64k pages on the GPU as well (the HW is flexible enough to be re-configurable), but that is tons of changes and probably not easily testable.
>
> Regards,
> Christian.
>
>> We found that this was not being handled correctly in the SVM path and addressed it with the change in patch 2/8.
>>
>> Given this, if the memory is allocated and mapped in GPU page-size (4K) granularity on the GPU side, could you please clarify how memory waste occurs in this scenario?
>>
>> Thank you for your time and guidance.
>>
>>
>>> Christian.
>>>
>>>> Alex
>>>>
>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>
>>>>>>>>
>>>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>>>> supported with amd gpu kernel driver.
>>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>>>> That's a bummer :(
>>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>>
>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>>
>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>>
>>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>>>> Not that I know off any.
>>>>>
>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>>
>>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>
>>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>>>
>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>>>> due to 64K pagesize.
>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>
>>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>>
>>>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>>>> at least with these changes, we were able to see all the tests passing.
>>>>>>
>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>>>
>>>>>> Thanks a lot! That would be super helpful!
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>> Thanks again for the quick response on the patch series.
>>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-17  9:46               ` Donet Tom
@ 2025-12-17 10:10                 ` Christian König
  0 siblings, 0 replies; 44+ messages in thread
From: Christian König @ 2025-12-17 10:10 UTC (permalink / raw)
  To: Donet Tom, Alex Deucher
  Cc: Ritesh Harjani (IBM), amd-gfx, Felix Kuehling, Alex Deucher,
	Kent.Russell, Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya

On 12/17/25 10:46, Donet Tom wrote:
>>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>>>
>>>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
>>>
>>> Sorry, Cristian — I may be misunderstanding this point, so I would appreciate some clarification.
>>>
>>> If the CPU page size is 64K and the GPU page size is 4K, then from the GPU side the page table entries are created and mapped at 4K granularity, while on the CPU side the pages remain 64K. To map a single CPU page to the GPU, we therefore need to create multiple GPU page table entries for that CPU page.
>> The GPU page tables are 4k in size no matter what the CPU page size is and there is some special handling so that we can allocate them even under memory pressure. Background is that you sometimes need to split up higher order pages (1G, 2M) into lower order pages (2M, 4k) to be able to swap things to system memory for example and for that you need some an extra layer of page tables.
>>
>> The problem is now that those 4k pages are rounded up to your CPU page size, resulting in both wasting quite some memory as well as messing up the special handling to not run into OOM situations when swapping things to system memory....
> 
> 
> Thank you, Christian, for the clarification.
> 
> When you say swapping to system memory, does that mean SVM migration to DRAM?

Yes and no. It's mostly the normal BO based swapping of TTM. SVM is still an experimental and extremely rarely used feature.

> 
> From my understanding of the code, SVM pages are tracked in system page–size PFNs, which on our system is 64 KB. With a 64 KB base page size, buffer objects (BOs) are allocated in 64 KB–aligned chunks, both in VRAM and GTT, while the GPU page-table mappings are still created using 4 KB pages.
> 
> During SVM migration from VRAM to system memory, I observed that an entire 64 KB page is migrated. Similarly, when XNACK is enabled, if the GPU accesses a 4 KB page, my understanding is that the entire 64 KB page is migrated.
> 
> If my understanding is correct, allocating 4 KB memory on a 64 KB page–size system results in a 64 KB BO allocation, meaning that around 60 KB is effectively wasted. Are you referring to this kind of over-allocation potentially leading to OOM situations under memory pressure?

Correct, yes.

> Since I am still getting familiar with the AMDGPU codebase, could you please point me to the locations where special handling is implemented to avoid OOM conditions during swapping or migration?

See AMDGPU_VM_RESERVED_VRAM.

Regards,
Christian.

> 
> 
>>
>> What we could potentially do is to switch to 64k pages on the GPU as well (the HW is flexible enough to be re-configurable), but that is tons of changes and probably not easily testable.
>>
>> Regards,
>> Christian.
>>
>>> We found that this was not being handled correctly in the SVM path and addressed it with the change in patch 2/8.
>>>
>>> Given this, if the memory is allocated and mapped in GPU page-size (4K) granularity on the GPU side, could you please clarify how memory waste occurs in this scenario?
>>>
>>> Thank you for your time and guidance.
>>>
>>>
>>>> Christian.
>>>>
>>>>> Alex
>>>>>
>>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
>>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please note that the changes in this series are on a best effort basis from our
>>>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
>>>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
>>>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
>>>>>>>>> supported with amd gpu kernel driver.
>>>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
>>>>>>> That's a bummer :(
>>>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
>>>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
>>>>>>
>>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
>>>>>>
>>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
>>>>>>
>>>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
>>>>>> Not that I know off any.
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
>>>>>>>>
>>>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
>>>>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>>
>>>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
>>>>>>
>>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
>>>>>>> So these patches helped us resolve most of the issues like SDMA hangs
>>>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
>>>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
>>>>>>> due to 64K pagesize.
>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>
>>>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>>>
>>>>>>> AFAIU, some of these patches may require re-work based on reviews, but
>>>>>>> at least with these changes, we were able to see all the tests passing.
>>>>>>>
>>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
>>>>>>>>
>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>> Thanks again for the quick response on the patch series.
>>>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> -ritesh


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-17  9:03                 ` Donet Tom
@ 2025-12-17 14:23                   ` Alex Deucher
  2025-12-17 21:31                     ` Yat Sin, David
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Deucher @ 2025-12-17 14:23 UTC (permalink / raw)
  To: Donet Tom, David Yat Sin
  Cc: Christian König, Ritesh Harjani (IBM), amd-gfx,
	Felix Kuehling, Alex Deucher, Kent.Russell,
	Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya

On Wed, Dec 17, 2025 at 4:03 AM Donet Tom <donettom@linux.ibm.com> wrote:
>
>
> On 12/16/25 7:32 PM, Alex Deucher wrote:
> > On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <donettom@linux.ibm.com> wrote:
> >>
> >> On 12/15/25 7:39 PM, Alex Deucher wrote:
> >>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> On 12/12/25 18:24, Alex Deucher wrote:
> >>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> >>>>>>> Christian König <christian.koenig@amd.com> writes:
> >>>>>>>>> Setup details:
> >>>>>>>>> ============
> >>>>>>>>> System details: Power10 LPAR using 64K pagesize.
> >>>>>>>>> AMD GPU:
> >>>>>>>>>     Name:                    gfx90a
> >>>>>>>>>     Marketing Name:          AMD Instinct MI210
> >>>>>>>>>
> >>>>>>>>> Queries:
> >>>>>>>>> =======
> >>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2] to test
> >>>>>>>>>      these changes. Is there anything else that you would suggest us to run to
> >>>>>>>>>      shake out any other page size related issues w.r.t the kernel driver?
> >>>>>>>> The ROCm team needs to answer that.
> >>>>>>>>
> >>>>>>> Is there any separate mailing list or list of people whom we can cc
> >>>>>>> then?
> >>>>>> With Felix on CC you already got the right person, but he's on vacation and will not be back before the end of the year.
> >>>>>>
> >>>>>> I can check on Monday if some people are still around which could answer a couple of questions, but in general don't expect a quick response.
> >>>>>>
> >>>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop ring buffer
> >>>>>>>>>      size HW dependent? Should it be made PAGE_SIZE?
> >>>>>>>> Yes and no.
> >>>>>>>>
> >>>>>>> If you could more elaborate on this please? I am assuming you would
> >>>>>>> anyway respond with more context / details on Patch-1 itself. If yes,
> >>>>>>> that would be great!
> >>>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of all the events and actions the CP should execute when shaders and cache flushes finish.
> >>>>>>
> >>>>>> The size depends on the HW generation and configuration of the GPU etc..., but don't ask me for details how that is calculated.
> >>>>>>
> >>>>>> The point is that the size is completely unrelated to the CPU, so using PAGE_SIZE is clearly incorrect.
> >>>>>>
> >>>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system page size > 4K.
> >>>>>>>>>      Do we need to lift this restriction and add MMIO remap support for systems with
> >>>>>>>>>      non-4K page sizes?
> >>>>>>>> The problem is the HW can't do this.
> >>>>>>>>
> >>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to understand
> >>>>>>> what functionality will be unsupported due to this HW limitation then?
> >>>>>> The problem is that the CPU must map some of the registers/resources of the GPU into the address space of the application and you run into security issues when you map more than 4k at a time.
> >>>>> Right.  There are some 4K pages with the MMIO register BAR which are
> >>>>> empty and registers can be remapped into them.  In this case we remap
> >>>>> the HDP flush registers into one of those register pages.  This allows
> >>>>> applications to flush the HDP write FIFO from either the CPU or
> >>>>> another device.  This is needed to flush data written by the CPU or
> >>>>> another device to the VRAM BAR out to VRAM (i.e., so the GPU can see
> >>>>> it).  This is flushed internally as part of the shader dispatch
> >>>>> packets,
> >>>> As far as I know this is only done for graphics shader submissions to the classic CS interface, but not for compute dispatches through ROCm queues.
> >>> There is an explicit PM4 packet to flush the HDP cache for userqs and
> >>> for AQL the flush is handled via one of the flags in the dispatch
> >>> packet.  The MMIO remap is needed for more fine grained use cases
> >>> where you might have the CPU or another device operating in a gang
> >>> like scenario with the GPU.
> >>
> >> Thank you, Alex.
> >>
> >> We were encountering an issue while running the RCCL unit tests. With 2
> >> GPUs, all tests passed successfully; however, when running with more
> >> than 2 GPUs, the tests began to fail at random points with the following
> >> errors:
> >>
> >> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for
> >> queue with doorbell_id: 80030008
> >> [  606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> >> [  606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
> >> [  610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for
> >> queue with doorbell_id: 80030008
> >> [  610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
> >> [  610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
> >>
> >>
> >> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
> >>
> >> One question I have is: we only started observing this problem when the
> >> number of GPUs increased. Could this be related to MMIO remapping not
> >> being available?
> > It could be.  E.g., if the CPU or a GPU writes data to VRAM on another
> > GPU, you will need to flush the HDP to make sure that data hits VRAM
> > before the GPU attached to the VRAM can see it.
>
>
> Thanks Alex
>
> I am now suspecting that the queue preemption issue may be related to
> the unavailability of MMIO remapping. I am not very familiar with this area.
>
> Could you please point me to the relevant code path where the PM4 packet
> is issued to flush the HDP cache?

+ David who is more familiar with the ROCm runtime.

PM4 has a packet called HDP_FLUSH which flushes the HDP.  For AQL,
it's handled by one of the flags I think.  Most things in ROCm use
AQL.

@David Yat Sin Can you point to how HDP flushes are handled in the ROCm runtime?

Alex

>
> I am consistently able to reproduce this issue on my system when using
> more than three GPUs if patches 7/8 and 8/8 are not applied. In your
> opinion, is there anything that can be done to speed up the HDP flush or
> to avoid this situation altogether?
>
>
>
> >
> > Alex
> >
> >>
> >>> Alex
> >>>
> >>>> That's the reason why ROCm needs the remapped MMIO register BAR.
> >>>>
> >>>>> but there are certain cases where an application may want
> >>>>> more control.  This is probably not a showstopper for most ROCm apps.
> >>>> Well the problem is that you absolutely need the HDP flush/invalidation for 100% correctness. It does work most of the time without it, but you then risk data corruption.
> >>>>
> >>>> Apart from making the flush/invalidate an IOCTL I think we could also just use a global dummy page in VRAM.
> >>>>
> >>>> If you make two 32bit writes which are apart from each other and then a read back a 32bit value from VRAM that should invalidate the HDP as well. It's less efficient than the MMIO BAR remap but still much better than going though an IOCTL.
> >>>>
> >>>> The only tricky part is that you need to get the HW barriers with the doorbell write right.....
> >>>>
> >>>>> That said, the region is only 4K so if you allow applications to map a
> >>>>> larger region they would get access to GPU register pages which they
> >>>>> shouldn't have access to.
> >>>> But don't we also have problems with the doorbell? E.g. the global aggregated one needs to be 4k as well, or is it ok to over allocate there?
> >>>>
> >>>> Thinking more about it there is also a major problem with page tables. Those are 4k by default on modern systems as well and while over allocating them to 64k is possible that not only wastes some VRAM but can also result in OOM situations because we can't allocate the necessary page tables to switch from 2MiB to 4k pages in some cases.
> >>>>
> >>>> Christian.
> >>>>
> >>>>> Alex
> >>>>>
> >>>>>>>>> [1] ROCr debug agent tests: https://github.com/ROCm/rocr_debug_agent
> >>>>>>>>> [2] RCCL tests: https://github.com/ROCm/rccl/tree/develop/test
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Please note that the changes in this series are on a best effort basis from our
> >>>>>>>>> end. Therefore, requesting the amd-gfx community (who have deeper knowledge of the
> >>>>>>>>> HW & SW stack) to kindly help with the review and provide feedback / comments on
> >>>>>>>>> these patches. The idea here is, to also have non-4K pagesize (e.g. 64K) well
> >>>>>>>>> supported with amd gpu kernel driver.
> >>>>>>>> Well this is generally nice to have, but there are unfortunately some HW limitations which makes ROCm pretty much unusable on non 4k page size systems.
> >>>>>>> That's a bummer :(
> >>>>>>> - Do we have some HW documentation around what are these limitations around non-4K pagesize? Any links to such please?
> >>>>>> You already mentioned MMIO remap which obviously has that problem, but if I'm not completely mistaken the PCIe doorbell BAR and some global seq counter resources will also cause problems here.
> >>>>>>
> >>>>>> This can all be worked around by delegating those MMIO accesses into the kernel, but that means tons of extra IOCTL overhead.
> >>>>>>
> >>>>>> Especially the cache flushes which are necessary to avoid corruption are really bad for performance in such an approach.
> >>>>>>
> >>>>>>> - Are there any latest AMD GPU versions which maybe lifts such restrictions?
> >>>>>> Not that I know off any.
> >>>>>>
> >>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.
> >>>>>>>>
> >>>>>>> - Maybe we should also document, what will work and what won't work due to these HW limitations.
> >>>>>> Well pretty much everything, I need to double check how ROCm does HDP flushing/invalidating when the MMIO remap isn't available.
> >>>>>>
> >>>>>> Could be that there is already a fallback path and that's the reason why this approach actually works at all.
> >>>>>>
> >>>>>>>> What we can do is to support graphics and MM, but that should already work out of the box.>
> >>>>>>> So these patches helped us resolve most of the issues like SDMA hangs
> >>>>>>> and GPU kernel page faults which we saw with rocr and rccl tests with
> >>>>>>> 64K pagesize. Meaning, we didn't see this working out of box perhaps
> >>>>>>> due to 64K pagesize.
> >>>>>> Yeah, but this is all for ROCm and not the graphics side.
> >>>>>>
> >>>>>> To be honest I'm not sure how ROCm even works when you have 64k pages at the moment. I would expect much more issue lurking in the kernel driver.
> >>>>>>
> >>>>>>> AFAIU, some of these patches may require re-work based on reviews, but
> >>>>>>> at least with these changes, we were able to see all the tests passing.
> >>>>>>>
> >>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds can be implemented for those issues.
> >>>>>>>>
> >>>>>>> Thanks a lot! That would be super helpful!
> >>>>>>>
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>> Thanks again for the quick response on the patch series.
> >>>>>> You are welcome, but since it's so near to the end of the year not all people are available any more.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Christian.
> >>>>>>
> >>>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-17 14:23                   ` Alex Deucher
@ 2025-12-17 21:31                     ` Yat Sin, David
  2026-01-02 18:53                       ` Donet Tom
  2026-01-06 12:58                       ` Donet Tom
  0 siblings, 2 replies; 44+ messages in thread
From: Yat Sin, David @ 2025-12-17 21:31 UTC (permalink / raw)
  To: Alex Deucher, Donet Tom
  Cc: Koenig, Christian, Ritesh Harjani (IBM),
	amd-gfx@lists.freedesktop.org, Kuehling, Felix,
	Deucher, Alexander, Russell, Kent, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

[AMD Official Use Only - AMD Internal Distribution Only]

HDP flush is done in ROCm using these 3 methods:

1. For AQL packets, this is done by setting the system-scope acquire and release fences in the packet header.
     For example, it is set here:
     https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_kernel.cpp#L878

     And the packet headers are defined here:
     https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L85


2. Via an SDMA packet. This is done before doing a memory copy:
     The function is called here:
        https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L484
     And the packet (POLL_REGMEM) is generated here:
        https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L1154


3. By writing to an MMIO remapped address (see the sketch after the links below):
            The address is stored in rocclr here:
        https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocdevice.cpp#L607

            And the flush is triggered by writing a 1, e.g. here:
        https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L3831
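
For reference, method 3 boils down to something like this minimal sketch 
(hdp_flush_ptr is a hypothetical name for the CPU mapping of the remapped 
HDP flush page that the runtime gets from the driver; per the code linked 
above, writing a 1 triggers the flush):

    #include <stdint.h>

    /* Sketch only: flush the HDP write FIFO through the remapped MMIO page. */
    static inline void hdp_flush(volatile uint32_t *hdp_flush_ptr)
    {
            *hdp_flush_ptr = 1;
    }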


Regards,
David


> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Wednesday, December 17, 2025 9:23 AM
> To: Donet Tom <donettom@linux.ibm.com>; Yat Sin, David
> <David.YatSin@amd.com>
> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Ritesh Harjani (IBM)
> <ritesh.list@gmail.com>; amd-gfx@lists.freedesktop.org; Kuehling, Felix
> <Felix.Kuehling@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Russell, Kent <Kent.Russell@amd.com>;
> Vaidyanathan Srinivasan <svaidy@linux.ibm.com>; Mukesh Kumar Chaurasiya
> <mkchauras@linux.ibm.com>
> Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page
> size systems
>
> On Wed, Dec 17, 2025 at 4:03 AM Donet Tom <donettom@linux.ibm.com> wrote:
> >
> >
> > On 12/16/25 7:32 PM, Alex Deucher wrote:
> > > On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <donettom@linux.ibm.com>
> wrote:
> > >>
> > >> On 12/15/25 7:39 PM, Alex Deucher wrote:
> > >>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
> > >>> <christian.koenig@amd.com> wrote:
> > >>>> On 12/12/25 18:24, Alex Deucher wrote:
> > >>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
> > >>>>> <christian.koenig@amd.com> wrote:
> > >>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
> > >>>>>>> Christian König <christian.koenig@amd.com> writes:
> > >>>>>>>>> Setup details:
> > >>>>>>>>> ============
> > >>>>>>>>> System details: Power10 LPAR using 64K pagesize.
> > >>>>>>>>> AMD GPU:
> > >>>>>>>>>     Name:                    gfx90a
> > >>>>>>>>>     Marketing Name:          AMD Instinct MI210
> > >>>>>>>>>
> > >>>>>>>>> Queries:
> > >>>>>>>>> =======
> > >>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2]
> to test
> > >>>>>>>>>      these changes. Is there anything else that you would suggest us
> to run to
> > >>>>>>>>>      shake out any other page size related issues w.r.t the kernel
> driver?
> > >>>>>>>> The ROCm team needs to answer that.
> > >>>>>>>>
> > >>>>>>> Is there any separate mailing list or list of people whom we
> > >>>>>>> can cc then?
> > >>>>>> With Felix on CC you already got the right person, but he's on vacation
> and will not be back before the end of the year.
> > >>>>>>
> > >>>>>> I can check on Monday if some people are still around which could
> answer a couple of questions, but in general don't expect a quick response.
> > >>>>>>
> > >>>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop
> ring buffer
> > >>>>>>>>>      size HW dependent? Should it be made PAGE_SIZE?
> > >>>>>>>> Yes and no.
> > >>>>>>>>
> > >>>>>>> If you could more elaborate on this please? I am assuming you
> > >>>>>>> would anyway respond with more context / details on Patch-1
> > >>>>>>> itself. If yes, that would be great!
> > >>>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of
> all the events and actions the CP should execute when shaders and cache flushes
> finish.
> > >>>>>>
> > >>>>>> The size depends on the HW generation and configuration of the GPU
> etc..., but don't ask me for details how that is calculated.
> > >>>>>>
> > >>>>>> The point is that the size is completely unrelated to the CPU, so using
> PAGE_SIZE is clearly incorrect.
> > >>>>>>
> > >>>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system
> page size > 4K.
> > >>>>>>>>>      Do we need to lift this restriction and add MMIO remap support
> for systems with
> > >>>>>>>>>      non-4K page sizes?
> > >>>>>>>> The problem is the HW can't do this.
> > >>>>>>>>
> > >>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to
> > >>>>>>> understand what functionality will be unsupported due to this HW
> limitation then?
> > >>>>>> The problem is that the CPU must map some of the registers/resources
> of the GPU into the address space of the application and you run into security
> issues when you map more than 4k at a time.
> > >>>>> Right.  There are some 4K pages with the MMIO register BAR which
> > >>>>> are empty and registers can be remapped into them.  In this case
> > >>>>> we remap the HDP flush registers into one of those register
> > >>>>> pages.  This allows applications to flush the HDP write FIFO
> > >>>>> from either the CPU or another device.  This is needed to flush
> > >>>>> data written by the CPU or another device to the VRAM BAR out to
> > >>>>> VRAM (i.e., so the GPU can see it).  This is flushed internally
> > >>>>> as part of the shader dispatch packets,
> > >>>> As far as I know this is only done for graphics shader submissions to the
> classic CS interface, but not for compute dispatches through ROCm queues.
> > >>> There is an explicit PM4 packet to flush the HDP cache for userqs
> > >>> and for AQL the flush is handled via one of the flags in the
> > >>> dispatch packet.  The MMIO remap is needed for more fine grained
> > >>> use cases where you might have the CPU or another device operating
> > >>> in a gang like scenario with the GPU.
> > >>
> > >> Thank you, Alex.
> > >>
> > >> We were encountering an issue while running the RCCL unit tests.
> > >> With 2 GPUs, all tests passed successfully; however, when running
> > >> with more than 2 GPUs, the tests began to fail at random points
> > >> with the following
> > >> errors:
> > >>
> > >> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed
> > >> for queue with doorbell_id: 80030008 [  606.696820] amdgpu
> > >> 0048:0f:00.0: amdgpu: Failed to evict process queues [  606.696826]
> > >> amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4 [
> > >> 610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed
> > >> for queue with doorbell_id: 80030008 [  610.696869] amdgpu
> > >> 0048:0f:00.0: amdgpu: Failed to evict process queues [  610.696942]
> > >> amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
> > >>
> > >>
> > >> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
> > >>
> > >> One question I have is: we only started observing this problem when
> > >> the number of GPUs increased. Could this be related to MMIO
> > >> remapping not being available?
> > > It could be.  E.g., if the CPU or a GPU writes data to VRAM on
> > > another GPU, you will need to flush the HDP to make sure that data
> > > hits VRAM before the GPU attached to the VRAM can see it.
> >
> >
> > Thanks Alex
> >
> > I am now suspecting that the queue preemption issue may be related to
> > the unavailability of MMIO remapping. I am not very familiar with this area.
> >
> > Could you please point me to the relevant code path where the PM4
> > packet is issued to flush the HDP cache?
>
> + David who is more familiar with the ROCm runtime.
>
> PM4 has a packet called HDP_FLUSH which flushes the HDP.  For AQL, it's
> handled by one of the flags I think.  Most things in ROCm use AQL.
>
> @David Yat Sin Can you point to how HDP flushes are handled in the ROCm
> runtime?
>
> Alex
>
> >
> > I am consistently able to reproduce this issue on my system when using
> > more than three GPUs if patches 7/8 and 8/8 are not applied. In your
> > opinion, is there anything that can be done to speed up the HDP flush
> > or to avoid this situation altogether?
> >
> >
> >
> > >
> > > Alex
> > >
> > >>
> > >>> Alex
> > >>>
> > >>>> That's the reason why ROCm needs the remapped MMIO register BAR.
> > >>>>
> > >>>>> but there are certain cases where an application may want more
> > >>>>> control.  This is probably not a showstopper for most ROCm apps.
> > >>>> Well the problem is that you absolutely need the HDP flush/invalidation for
> 100% correctness. It does work most of the time without it, but you then risk data
> corruption.
> > >>>>
> > >>>> Apart from making the flush/invalidate an IOCTL I think we could also just
> use a global dummy page in VRAM.
> > >>>>
> > >>>> If you make two 32bit writes which are apart from each other and then a
> read back a 32bit value from VRAM that should invalidate the HDP as well. It's less
> efficient than the MMIO BAR remap but still much better than going though an
> IOCTL.
> > >>>>
> > >>>> The only tricky part is that you need to get the HW barriers with the doorbell
> write right.....
> > >>>>
> > >>>>> That said, the region is only 4K so if you allow applications to
> > >>>>> map a larger region they would get access to GPU register pages
> > >>>>> which they shouldn't have access to.
> > >>>> But don't we also have problems with the doorbell? E.g. the global
> aggregated one needs to be 4k as well, or is it ok to over allocate there?
> > >>>>
> > >>>> Thinking more about it there is also a major problem with page tables.
> Those are 4k by default on modern systems as well and while over allocating them
> to 64k is possible that not only wastes some VRAM but can also result in OOM
> situations because we can't allocate the necessary page tables to switch from 2MiB
> to 4k pages in some cases.
> > >>>>
> > >>>> Christian.
> > >>>>
> > >>>>> Alex
> > >>>>>
> > >>>>>>>>> [1] ROCr debug agent tests:
> > >>>>>>>>> https://github.com/ROCm/rocr_debug_agent
> > >>>>>>>>> [2] RCCL tests:
> > >>>>>>>>> https://github.com/ROCm/rccl/tree/develop/test
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Please note that the changes in this series are on a best
> > >>>>>>>>> effort basis from our end. Therefore, requesting the amd-gfx
> > >>>>>>>>> community (who have deeper knowledge of the HW & SW stack)
> > >>>>>>>>> to kindly help with the review and provide feedback /
> > >>>>>>>>> comments on these patches. The idea here is, to also have non-4K
> pagesize (e.g. 64K) well supported with amd gpu kernel driver.
> > >>>>>>>> Well this is generally nice to have, but there are unfortunately some
> HW limitations which makes ROCm pretty much unusable on non 4k page size
> systems.
> > >>>>>>> That's a bummer :(
> > >>>>>>> - Do we have some HW documentation around what are these
> limitations around non-4K pagesize? Any links to such please?
> > >>>>>> You already mentioned MMIO remap which obviously has that problem,
> but if I'm not completely mistaken the PCIe doorbell BAR and some global seq
> counter resources will also cause problems here.
> > >>>>>>
> > >>>>>> This can all be worked around by delegating those MMIO accesses into
> the kernel, but that means tons of extra IOCTL overhead.
> > >>>>>>
> > >>>>>> Especially the cache flushes which are necessary to avoid corruption
> are really bad for performance in such an approach.
> > >>>>>>
> > >>>>>>> - Are there any latest AMD GPU versions which maybe lifts such
> restrictions?
> > >>>>>> Not that I know off any.
> > >>>>>>
> > >>>>>>>> What we can do is to support graphics and MM, but that should
> already work out of the box.
> > >>>>>>>>
> > >>>>>>> - Maybe we should also document, what will work and what won't work
> due to these HW limitations.
> > >>>>>> Well pretty much everything, I need to double check how ROCm does
> HDP flushing/invalidating when the MMIO remap isn't available.
> > >>>>>>
> > >>>>>> Could be that there is already a fallback path and that's the reason why
> this approach actually works at all.
> > >>>>>>
> > >>>>>>>> What we can do is to support graphics and MM, but that should
> > >>>>>>>> already work out of the box.>
> > >>>>>>> So these patches helped us resolve most of the issues like
> > >>>>>>> SDMA hangs and GPU kernel page faults which we saw with rocr
> > >>>>>>> and rccl tests with 64K pagesize. Meaning, we didn't see this
> > >>>>>>> working out of box perhaps due to 64K pagesize.
> > >>>>>> Yeah, but this is all for ROCm and not the graphics side.
> > >>>>>>
> > >>>>>> To be honest I'm not sure how ROCm even works when you have 64k
> pages at the moment. I would expect much more issue lurking in the kernel driver.
> > >>>>>>
> > >>>>>>> AFAIU, some of these patches may require re-work based on
> > >>>>>>> reviews, but at least with these changes, we were able to see all the
> tests passing.
> > >>>>>>>
> > >>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds
> can be implemented for those issues.
> > >>>>>>>>
> > >>>>>>> Thanks a lot! That would be super helpful!
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Christian.
> > >>>>>>>>
> > >>>>>>> Thanks again for the quick response on the patch series.
> > >>>>>> You are welcome, but since it's so near to the end of the year not all
> people are available any more.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Christian.
> > >>>>>>
> > >>>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size
  2025-12-12  9:04   ` Christian König
  2025-12-12 12:29     ` Donet Tom
@ 2025-12-19 10:27     ` Donet Tom
  2026-01-06 13:01       ` Donet Tom
  1 sibling, 1 reply; 44+ messages in thread
From: Donet Tom @ 2025-12-19 10:27 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/12/25 2:34 PM, Christian König wrote:
> On 12/12/25 07:40, Donet Tom wrote:
>> The ctl_stack_size and wg_data_size values are used to compute the total
>> context save/restore buffer size and the control stack size. These buffers
>> are programmed into the GPU and are used to store the queue state during
>> context save and restore.
>>
>> Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
>> PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary
>> memory waste because the GPU internally calculates and uses buffer sizes
>> aligned to a fixed 4K GPU page size.
>>
>> Since the control stack and context save/restore buffers are consumed by
>> the GPU, their sizes should be aligned to the GPU page size (4K), not the
>> CPU page size. This patch updates the alignment of ctl_stack_size and
>> wg_data_size to prevent over-allocation on systems with larger CPU page
>> sizes.
> As far as I know the problem is that the debugger needs to consume that stuff on the CPU side as well.


Thank you for your help.

As mentioned earlier, we were observing queue preemption failures and 
GPU hang issues. To address this we introduced patches 7/8 and 8/8, and 
with those applied the issues have not been seen anymore.

While debugging the GPU hang issue, I made some additional observations.

On my system, I booted a kernel with a 4 KB system page size and 
modified both the ROCR runtime and the GPU driver to set the control 
stack size to 64 KB. Even on a 4 KB page-size system, using a 64 KB 
control stack size reliably reproduces the queue preemption failure when 
running RCCL unit tests on 8 GPUs. This suggests that the issue is not 
related to the system page size, but rather to the control stack size 
being exactly 64 KB.

When the control stack size is instead set to 64 KB ± 4 KB (i.e. 60 KB or 
68 KB), the tests pass on both the 4 KB and 64 KB system page-size 
configurations.

For gfxv9, is there any documented hardware limitation on the control 
stack size? Specifically, is it valid to use a control stack size of 
exactly 64 KB?


>
> I need to double check that, but I think the alignment is correct as it is.


The control stack is part of the context save-restore buffer, and we 
configure it on the GPU as shown below:

m->cp_hqd_ctx_save_base_addr_lo = lower_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_base_addr_hi = upper_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_size = q->ctx_save_restore_area_size;
m->cp_hqd_cntl_stack_size = q->ctl_stack_size;
m->cp_hqd_cntl_stack_offset = q->ctl_stack_size;
m->cp_hqd_wg_state_offset = q->ctl_stack_size;

The control stack occupies the region from cp_hqd_cntl_stack_offset down 
to offset 0 within the context save/restore area, and the remaining space 
is used for WG state. This buffer is fully managed by the GPU during 
preemption and restore operations.

The control stack size is calculated from the hardware configuration (CU 
count and wave count). For example, on gfxv9 the size is typically around 
32 KB. If we align this size to the system page size (e.g. 64 KB), two 
issues arise:

1. Unnecessary memory overhead.
2. Potential queue preemption issues.

On the CPU side, we copy the control stack contents to other buffers for 
processing. Since the control stack size is derived from hardware 
configuration, aligning it to the GPU page size seems more appropriate. 
Aligning to the system page size would waste memory without adding 
value. Using GPU page size alignment ensures consistency with hardware 
and avoids unnecessary overhead.
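
To make the over-allocation concrete, here is a minimal user-space sketch 
(not driver code; the raw input sizes are made-up, roughly gfx9-like values) 
of what the GPU-page alignment in patch 7/8 changes:

#include <stdio.h>

/*
 * Minimal user-space sketch, not driver code: illustrates how aligning
 * ctl_stack_size and wg_data_size to the 4K GPU page (patch 7/8) instead
 * of a 64K CPU page changes the programmed sizes. The raw input values
 * are made up, purely for illustration.
 */
#define GPU_PAGE_SIZE	4096UL
#define CPU_PAGE_SIZE	65536UL			/* 64K system page size */
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long raw_ctl_stack = 30000;	/* hypothetical wave-count based size */
	unsigned long raw_wg_data   = 940000;	/* hypothetical CU-count based size   */

	/* current behaviour: each component rounded up to the 64K CPU page */
	unsigned long ctl_old  = ALIGN(raw_ctl_stack, CPU_PAGE_SIZE);
	unsigned long wg_old   = ALIGN(raw_wg_data, CPU_PAGE_SIZE);
	unsigned long cwsr_old = ctl_old + wg_old;

	/* proposed behaviour: round to the 4K GPU page, align only the total
	 * CWSR buffer to the CPU page size for the CPU-side mapping */
	unsigned long ctl_new  = ALIGN(raw_ctl_stack, GPU_PAGE_SIZE);
	unsigned long wg_new   = ALIGN(raw_wg_data, GPU_PAGE_SIZE);
	unsigned long cwsr_new = ALIGN(ctl_new + wg_new, CPU_PAGE_SIZE);

	printf("ctl_stack %lu -> %lu, wg_data %lu -> %lu, cwsr %lu -> %lu\n",
	       ctl_old, ctl_new, wg_old, wg_new, cwsr_old, cwsr_new);
	return 0;
}

Note that with a 64 KB system page the current code also ends up programming 
cp_hqd_cntl_stack_size as exactly 64 KB, which is precisely the size that 
reproduced the preemption failure in my experiment above.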

Would you agree that aligning the control stack size to the GPU page 
size is the right approach? Or do you see any concerns with this method?


>
> Regards,
> Christian.
>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> index dc857450fa16..00ab941c3e86 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> @@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>>   		    min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
>>   		    : cu_num * 32;
>>   
>> -	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
>> +	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
>> +				AMDGPU_GPU_PAGE_SIZE);
>>   	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
>>   	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
>> -			       PAGE_SIZE);
>> +			       AMDGPU_GPU_PAGE_SIZE);
>>   
>>   	if ((gfxv / 10000 * 10000) == 100000) {
>>   		/* HW design limits control stack size to 0x7000.
>> @@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>>   
>>   	props->ctl_stack_size = ctl_stack_size;
>>   	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
>> -	props->cwsr_size = ctl_stack_size + wg_data_size;
>> +	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
>>   
>>   	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
>>   		props->eop_buffer_size = 0x8000;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-17 21:31                     ` Yat Sin, David
@ 2026-01-02 18:53                       ` Donet Tom
  2026-01-06 12:58                       ` Donet Tom
  1 sibling, 0 replies; 44+ messages in thread
From: Donet Tom @ 2026-01-02 18:53 UTC (permalink / raw)
  To: Yat Sin, David, Alex Deucher
  Cc: Koenig, Christian, Ritesh Harjani (IBM),
	amd-gfx@lists.freedesktop.org, Kuehling, Felix,
	Deucher, Alexander, Russell, Kent, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/18/25 3:01 AM, Yat Sin, David wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> HDP flush is done in ROCm using these 3 methods:
>
> 1. For AQL packets, this is done by setting the system-scope acquire and release fences in the packet header.
>       For example, it is set here:
>       https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_kernel.cpp#L878
>
>       And when the headers are defined here:
>       https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L85
>
>
> 2. Via a SDMA packet. This is done before doing a memory copy:
>       The function is called here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L484
>       And the packet (POLL_REGMEM) is generated here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L1154
>
>
> 3. By writing to a MMIO remapped address:
>              The address is stored in rocclr here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocdevice.cpp#L607
>
>              And the flush is triggered by writing a 1, e.g here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L3831


Thank you David.


>
>
> Regards,
> David
>
>
>> -----Original Message-----
>> From: Alex Deucher <alexdeucher@gmail.com>
>> Sent: Wednesday, December 17, 2025 9:23 AM
>> To: Donet Tom <donettom@linux.ibm.com>; Yat Sin, David
>> <David.YatSin@amd.com>
>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Ritesh Harjani (IBM)
>> <ritesh.list@gmail.com>; amd-gfx@lists.freedesktop.org; Kuehling, Felix
>> <Felix.Kuehling@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; Russell, Kent <Kent.Russell@amd.com>;
>> Vaidyanathan Srinivasan <svaidy@linux.ibm.com>; Mukesh Kumar Chaurasiya
>> <mkchauras@linux.ibm.com>
>> Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page
>> size systems
>>
>> On Wed, Dec 17, 2025 at 4:03 AM Donet Tom <donettom@linux.ibm.com> wrote:
>>>
>>> On 12/16/25 7:32 PM, Alex Deucher wrote:
>>>> On Tue, Dec 16, 2025 at 8:55 AM Donet Tom <donettom@linux.ibm.com>
>> wrote:
>>>>> On 12/15/25 7:39 PM, Alex Deucher wrote:
>>>>>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>>>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>>>>>> Christian König <christian.koenig@amd.com> writes:
>>>>>>>>>>>> Setup details:
>>>>>>>>>>>> ============
>>>>>>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>>>>>>> AMD GPU:
>>>>>>>>>>>>      Name:                    gfx90a
>>>>>>>>>>>>      Marketing Name:          AMD Instinct MI210
>>>>>>>>>>>>
>>>>>>>>>>>> Queries:
>>>>>>>>>>>> =======
>>>>>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2]
>> to test
>>>>>>>>>>>>       these changes. Is there anything else that you would suggest us
>> to run to
>>>>>>>>>>>>       shake out any other page size related issues w.r.t the kernel
>> driver?
>>>>>>>>>>> The ROCm team needs to answer that.
>>>>>>>>>>>
>>>>>>>>>> Is there any separate mailing list or list of people whom we
>>>>>>>>>> can cc then?
>>>>>>>>> With Felix on CC you already got the right person, but he's on vacation
>> and will not be back before the end of the year.
>>>>>>>>> I can check on Monday if some people are still around which could
>> answer a couple of questions, but in general don't expect a quick response.
>>>>>>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop
>> ring buffer
>>>>>>>>>>>>       size HW dependent? Should it be made PAGE_SIZE?
>>>>>>>>>>> Yes and no.
>>>>>>>>>>>
>>>>>>>>>> If you could more elaborate on this please? I am assuming you
>>>>>>>>>> would anyway respond with more context / details on Patch-1
>>>>>>>>>> itself. If yes, that would be great!
>>>>>>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of
>> all the events and actions the CP should execute when shaders and cache flushes
>> finish.
>>>>>>>>> The size depends on the HW generation and configuration of the GPU
>> etc..., but don't ask me for details how that is calculated.
>>>>>>>>> The point is that the size is completely unrelated to the CPU, so using
>> PAGE_SIZE is clearly incorrect.
>>>>>>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system
>> page size > 4K.
>>>>>>>>>>>>       Do we need to lift this restriction and add MMIO remap support
>> for systems with
>>>>>>>>>>>>       non-4K page sizes?
>>>>>>>>>>> The problem is the HW can't do this.
>>>>>>>>>>>
>>>>>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to
>>>>>>>>>> understand what functionality will be unsupported due to this HW
>> limitation then?
>>>>>>>>> The problem is that the CPU must map some of the registers/resources
>> of the GPU into the address space of the application and you run into security
>> issues when you map more than 4k at a time.
>>>>>>>> Right.  There are some 4K pages with the MMIO register BAR which
>>>>>>>> are empty and registers can be remapped into them.  In this case
>>>>>>>> we remap the HDP flush registers into one of those register
>>>>>>>> pages.  This allows applications to flush the HDP write FIFO
>>>>>>>> from either the CPU or another device.  This is needed to flush
>>>>>>>> data written by the CPU or another device to the VRAM BAR out to
>>>>>>>> VRAM (i.e., so the GPU can see it).  This is flushed internally
>>>>>>>> as part of the shader dispatch packets,
>>>>>>> As far as I know this is only done for graphics shader submissions to the
>> classic CS interface, but not for compute dispatches through ROCm queues.
>>>>>> There is an explicit PM4 packet to flush the HDP cache for userqs
>>>>>> and for AQL the flush is handled via one of the flags in the
>>>>>> dispatch packet.  The MMIO remap is needed for more fine grained
>>>>>> use cases where you might have the CPU or another device operating
>>>>>> in a gang like scenario with the GPU.
>>>>> Thank you, Alex.
>>>>>
>>>>> We were encountering an issue while running the RCCL unit tests.
>>>>> With 2 GPUs, all tests passed successfully; however, when running
>>>>> with more than 2 GPUs, the tests began to fail at random points
>>>>> with the following
>>>>> errors:
>>>>>
>>>>> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed
>>>>> for queue with doorbell_id: 80030008 [  606.696820] amdgpu
>>>>> 0048:0f:00.0: amdgpu: Failed to evict process queues [  606.696826]
>>>>> amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4 [
>>>>> 610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed
>>>>> for queue with doorbell_id: 80030008 [  610.696869] amdgpu
>>>>> 0048:0f:00.0: amdgpu: Failed to evict process queues [  610.696942]
>>>>> amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>>>>>
>>>>>
>>>>> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
>>>>>
>>>>> One question I have is: we only started observing this problem when
>>>>> the number of GPUs increased. Could this be related to MMIO
>>>>> remapping not being available?
>>>> It could be.  E.g., if the CPU or a GPU writes data to VRAM on
>>>> another GPU, you will need to flush the HDP to make sure that data
>>>> hits VRAM before the GPU attached to the VRAM can see it.
>>>
>>> Thanks Alex
>>>
>>> I am now suspecting that the queue preemption issue may be related to
>>> the unavailability of MMIO remapping. I am not very familiar with this area.
>>>
>>> Could you please point me to the relevant code path where the PM4
>>> packet is issued to flush the HDP cache?
>> + David who is more familiar with the ROCm runtime.
>>
>> PM4 has a packet called HDP_FLUSH which flushes the HDP.  For AQL, it's
>> handled by one of the flags I think.  Most things in ROCm use AQL.
>>
>> @David Yat Sin Can you point to how HDP flushes are handled in the ROCm
>> runtime?
>>
>> Alex
>>
>>> I am consistently able to reproduce this issue on my system when using
>>> more than three GPUs if patches 7/8 and 8/8 are not applied. In your
>>> opinion, is there anything that can be done to speed up the HDP flush
>>> or to avoid this situation altogether?
>>>
>>>
>>>
>>>> Alex
>>>>
>>>>>> Alex
>>>>>>
>>>>>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>>>>>
>>>>>>>> but there are certain cases where an application may want more
>>>>>>>> control.  This is probably not a showstopper for most ROCm apps.
>>>>>>> Well the problem is that you absolutely need the HDP flush/invalidation for
>> 100% correctness. It does work most of the time without it, but you then risk data
>> corruption.
>>>>>>> Apart from making the flush/invalidate an IOCTL I think we could also just
>> use a global dummy page in VRAM.
>>>>>>> If you make two 32bit writes which are apart from each other and then a
>> read back a 32bit value from VRAM that should invalidate the HDP as well. It's less
>> efficient than the MMIO BAR remap but still much better than going though an
>> IOCTL.
>>>>>>> The only tricky part is that you need to get the HW barriers with the doorbell
>> write right.....
>>>>>>>> That said, the region is only 4K so if you allow applications to
>>>>>>>> map a larger region they would get access to GPU register pages
>>>>>>>> which they shouldn't have access to.
>>>>>>> But don't we also have problems with the doorbell? E.g. the global
>> aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>>>>>> Thinking more about it there is also a major problem with page tables.
>> Those are 4k by default on modern systems as well and while over allocating them
>> to 64k is possible that not only wastes some VRAM but can also result in OOM
>> situations because we can't allocate the necessary page tables to switch from 2MiB
>> to 4k pages in some cases.
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>>>>> [1] ROCr debug agent tests:
>>>>>>>>>>>> https://github.com/ROCm/rocr_debug_agent
>>>>>>>>>>>> [2] RCCL tests:
>>>>>>>>>>>> https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please note that the changes in this series are on a best
>>>>>>>>>>>> effort basis from our end. Therefore, requesting the amd-gfx
>>>>>>>>>>>> community (who have deeper knowledge of the HW & SW stack)
>>>>>>>>>>>> to kindly help with the review and provide feedback /
>>>>>>>>>>>> comments on these patches. The idea here is, to also have non-4K
>> pagesize (e.g. 64K) well supported with amd gpu kernel driver.
>>>>>>>>>>> Well this is generally nice to have, but there are unfortunately some
>> HW limitations which makes ROCm pretty much unusable on non 4k page size
>> systems.
>>>>>>>>>> That's a bummer :(
>>>>>>>>>> - Do we have some HW documentation around what are these
>> limitations around non-4K pagesize? Any links to such please?
>>>>>>>>> You already mentioned MMIO remap which obviously has that problem,
>> but if I'm not completely mistaken the PCIe doorbell BAR and some global seq
>> counter resources will also cause problems here.
>>>>>>>>> This can all be worked around by delegating those MMIO accesses into
>> the kernel, but that means tons of extra IOCTL overhead.
>>>>>>>>> Especially the cache flushes which are necessary to avoid corruption
>> are really bad for performance in such an approach.
>>>>>>>>>> - Are there any latest AMD GPU versions which maybe lifts such
>> restrictions?
>>>>>>>>> Not that I know off any.
>>>>>>>>>
>>>>>>>>>>> What we can do is to support graphics and MM, but that should
>> already work out of the box.
>>>>>>>>>> - Maybe we should also document, what will work and what won't work
>> due to these HW limitations.
>>>>>>>>> Well pretty much everything, I need to double check how ROCm does
>> HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>>>>> Could be that there is already a fallback path and that's the reason why
>> this approach actually works at all.
>>>>>>>>>>> What we can do is to support graphics and MM, but that should
>>>>>>>>>>> already work out of the box.>
>>>>>>>>>> So these patches helped us resolve most of the issues like
>>>>>>>>>> SDMA hangs and GPU kernel page faults which we saw with rocr
>>>>>>>>>> and rccl tests with 64K pagesize. Meaning, we didn't see this
>>>>>>>>>> working out of box perhaps due to 64K pagesize.
>>>>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>>>>
>>>>>>>>> To be honest I'm not sure how ROCm even works when you have 64k
>> pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>>>>>>> AFAIU, some of these patches may require re-work based on
>>>>>>>>>> reviews, but at least with these changes, we were able to see all the
>> tests passing.
>>>>>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds
>> can be implemented for those issues.
>>>>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>> Thanks again for the quick response on the patch series.
>>>>>>>>> You are welcome, but since it's so near to the end of the year not all
>> people are available any more.
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2025-12-12  8:53   ` Christian König
  2025-12-12 12:14     ` Donet Tom
@ 2026-01-06 12:55     ` Donet Tom
  2026-01-08 12:31       ` Christian König
  1 sibling, 1 reply; 44+ messages in thread
From: Donet Tom @ 2026-01-06 12:55 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher,
	Philip Yang
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/12/25 2:23 PM, Christian König wrote:
> On 12/12/25 07:40, Donet Tom wrote:
>> The SDMA engine has a hardware limitation of 4 MB maximum transfer
>> size per operation.
> That is not correct. This is only true on ancient HW.
>
> What problems are you seeing here?
>
>> AMDGPU_GTT_MAX_TRANSFER_SIZE was hardcoded to
>> 512 pages, which worked correctly on systems with 4K pages but fails
>> on systems with larger page sizes.
>>
>> This patch divides the max transfer size / AMDGPU_GPU_PAGES_IN_CPU_PAGE
>> to match with non-4K page size systems.
> That is actually a bad idea. The value was meant to match the PMD size.


Hi Christian, Felix, Alex and Philip,

Instead of hardcoding the AMDGPU_GTT_MAX_TRANSFER_SIZE value to 512, what do 
you think about something like the change below? With this, 
AMDGPU_GTT_MAX_TRANSFER_SIZE always corresponds to the PMD size, independent 
of the architecture and page size.


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 0be2728aa872..c594ed7dff18 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -37,7 +37,7 @@
  #define AMDGPU_PL_MMIO_REMAP   (TTM_PL_PRIV + 5)
  #define __AMDGPU_PL_NUM        (TTM_PL_PRIV + 6)
  
-#define AMDGPU_GTT_MAX_TRANSFER_SIZE   512
+#define AMDGPU_GTT_MAX_TRANSFER_SIZE   (1 << (PMD_SHIFT - PAGE_SHIFT))
 #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
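
For reference, a quick illustration of the arithmetic (the shift values below 
are examples, e.g. x86-64 with 4K pages; they are architecture dependent and 
not taken from the driver):

#include <stdio.h>

/*
 * Illustrative only: 1 << (PMD_SHIFT - PAGE_SHIFT) is the number of base
 * pages covered by one PMD entry, so the transfer window always spans one
 * PMD worth of address space regardless of the base page size.
 */
int main(void)
{
	unsigned long pmd_shift = 21, page_shift = 12;	/* e.g. x86-64, 4K pages */
	unsigned long pages_per_pmd = 1UL << (pmd_shift - page_shift);

	printf("%lu pages per PMD, window spans %lu KiB\n",
	       pages_per_pmd, (pages_per_pmd << page_shift) >> 10);
	return 0;
}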

Could you please share your thoughts on the above? Does it look OK to you?

If this looks good, here is what we were thinking:

Patches 1-4 are needed for the initial non-4K page size fixes in the AMD GPU 
driver, and they are already in reasonable shape (Philip has reviewed 1-3), 
so we thought it would be good to split the series into two parts. I will 
send a v2 of Part-1 with patches 1-4 (also addressing Philip's review 
comments on patches 1 and 2), and for the remaining patches 5-8 (Part-2) we 
can continue the discussion until the open questions are sorted out. That 
would also let us get the initial fixes in Part-1 ready before the 6.20 
merge window.

Thoughts?


>
> Regards,
> Christian.
>
>> Signed-off-by: Donet Tom<donettom@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM)<ritesh.list@gmail.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> index 0be2728aa872..9d038feb25b0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> @@ -37,7 +37,7 @@
>>   #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
>>   #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
>>   
>> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
>> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(512 / AMDGPU_GPU_PAGES_IN_CPU_PAGE)
>>   #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
>>   
>>   extern const struct attribute_group amdgpu_vram_mgr_attr_group;

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
  2025-12-17 21:31                     ` Yat Sin, David
  2026-01-02 18:53                       ` Donet Tom
@ 2026-01-06 12:58                       ` Donet Tom
  1 sibling, 0 replies; 44+ messages in thread
From: Donet Tom @ 2026-01-06 12:58 UTC (permalink / raw)
  To: Yat Sin, David, Alex Deucher
  Cc: Koenig, Christian, Ritesh Harjani (IBM),
	amd-gfx@lists.freedesktop.org, Kuehling, Felix,
	Deucher, Alexander, Russell, Kent, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/18/25 3:01 AM, Yat Sin, David wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> HDP flush is done in ROCm using these 3 methods:
>
> 1. For AQL packets, this is done by setting the system-scope acquire and release fences in the packet header.
>       For example, it is set here:
>       https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_kernel.cpp#L878
>
>       And when the headers are defined here:
>       https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L85
>
>
> 2. Via a SDMA packet. This is done before doing a memory copy:
>       The function is called here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L484
>       And the packet (POLL_REGMEM) is generated here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L1154
>
>
> 3. By writing to a MMIO remapped address:
>              The address is stored in rocclr here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocdevice.cpp#L607
>
>              And the flush is triggered by writing a 1, e.g here:
>          https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L3831


Hi David, Alex and Christian

Thank you for pointing me to the code where the HDP flush is handled.

From my understanding of the code, when MMIO remap is supported the HDP flush 
is performed through the remapped register. When MMIO remap is not supported, 
a workaround is used:
"write all arguments → sfence → write the last byte → mfence → read the last byte"

This sequence ensures that the HDP flush is completed before the GPU reads the data.
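
For my own notes, here is a minimal sketch of that fallback as I read it. 
This is not the actual rocclr code; the function name is made up, and 
portable GCC fences stand in for the x86 sfence/mfence named above:

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch of the fallback described above: after the caller has
 * written all payload bytes into dst[0..len-2], order those writes, write
 * the last byte, issue a full barrier, and read the byte back so the data
 * is pushed out towards VRAM before the GPU is told to look at it.
 */
static inline void hdp_flush_fallback(volatile uint8_t *dst, size_t len,
				      uint8_t last)
{
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* "sfence": order payload writes */
	dst[len - 1] = last;				/* write the last byte            */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* "mfence": full barrier         */
	(void)dst[len - 1];				/* read the last byte back        */
}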

With this understanding, I think patch 5/8 (amdkfd/kfd_chardev: Add error message for non-4K page size failures) may not be needed, as it prints an error when MMIO remap is not supported, which is an expected case.

Should we drop patch 5/8, or do you think it would be better to change this to a pr_warn or pr_info message when MMIO remap is not supported?
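
If a message is kept at all, something along these lines (purely illustrative, 
not an actual patch against kfd_chardev.c) seems more appropriate than an 
error:

	if (PAGE_SIZE != AMDGPU_GPU_PAGE_SIZE)
		pr_warn_once("MMIO remap unavailable with %lu KiB system pages, HDP flush falls back to the runtime workaround\n",
			     PAGE_SIZE >> 10);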


>
>
> Regards,
> David
>
>
>> -----Original Message-----
>> From: Alex Deucher<alexdeucher@gmail.com>
>> Sent: Wednesday, December 17, 2025 9:23 AM
>> To: Donet Tom<donettom@linux.ibm.com>; Yat Sin, David
>> <David.YatSin@amd.com>
>> Cc: Koenig, Christian<Christian.Koenig@amd.com>; Ritesh Harjani (IBM)
>> <ritesh.list@gmail.com>;amd-gfx@lists.freedesktop.org; Kuehling, Felix
>> <Felix.Kuehling@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; Russell, Kent<Kent.Russell@amd.com>;
>> Vaidyanathan Srinivasan<svaidy@linux.ibm.com>; Mukesh Kumar Chaurasiya
>> <mkchauras@linux.ibm.com>
>> Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page
>> size systems
>>
>> On Wed, Dec 17, 2025 at 4:03 AM Donet Tom<donettom@linux.ibm.com> wrote:
>>>
>>> On 12/16/25 7:32 PM, Alex Deucher wrote:
>>>> On Tue, Dec 16, 2025 at 8:55 AM Donet Tom<donettom@linux.ibm.com>
>> wrote:
>>>>> On 12/15/25 7:39 PM, Alex Deucher wrote:
>>>>>> On Mon, Dec 15, 2025 at 4:47 AM Christian König
>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>>>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König
>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>>>>>> Christian König<christian.koenig@amd.com> writes:
>>>>>>>>>>>> Setup details:
>>>>>>>>>>>> ============
>>>>>>>>>>>> System details: Power10 LPAR using 64K pagesize.
>>>>>>>>>>>> AMD GPU:
>>>>>>>>>>>>      Name:                    gfx90a
>>>>>>>>>>>>      Marketing Name:          AMD Instinct MI210
>>>>>>>>>>>>
>>>>>>>>>>>> Queries:
>>>>>>>>>>>> =======
>>>>>>>>>>>> 1. We currently ran rocr-debug agent tests [1]  and rccl unit tests [2]
>> to test
>>>>>>>>>>>>       these changes. Is there anything else that you would suggest us
>> to run to
>>>>>>>>>>>>       shake out any other page size related issues w.r.t the kernel
>> driver?
>>>>>>>>>>> The ROCm team needs to answer that.
>>>>>>>>>>>
>>>>>>>>>> Is there any separate mailing list or list of people whom we
>>>>>>>>>> can cc then?
>>>>>>>>> With Felix on CC you already got the right person, but he's on vacation
>> and will not be back before the end of the year.
>>>>>>>>> I can check on Monday if some people are still around which could
>> answer a couple of questions, but in general don't expect a quick response.
>>>>>>>>>>>> 2. Patch 1/8: We have a querry regarding eop buffer size Is this eop
>> ring buffer
>>>>>>>>>>>>       size HW dependent? Should it be made PAGE_SIZE?
>>>>>>>>>>> Yes and no.
>>>>>>>>>>>
>>>>>>>>>> If you could more elaborate on this please? I am assuming you
>>>>>>>>>> would anyway respond with more context / details on Patch-1
>>>>>>>>>> itself. If yes, that would be great!
>>>>>>>>> Well, in general the EOP (End of Pipe) buffer contains in a ring buffer of
>> all the events and actions the CP should execute when shaders and cache flushes
>> finish.
>>>>>>>>> The size depends on the HW generation and configuration of the GPU
>> etc..., but don't ask me for details how that is calculated.
>>>>>>>>> The point is that the size is completely unrelated to the CPU, so using
>> PAGE_SIZE is clearly incorrect.
>>>>>>>>>>>> 3. Patch 5/8: also have a query w.r.t the error paths when system
>> page size > 4K.
>>>>>>>>>>>>       Do we need to lift this restriction and add MMIO remap support
>> for systems with
>>>>>>>>>>>>       non-4K page sizes?
>>>>>>>>>>> The problem is the HW can't do this.
>>>>>>>>>>>
>>>>>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to
>>>>>>>>>> understand what functionality will be unsupported due to this HW
>> limitation then?
>>>>>>>>> The problem is that the CPU must map some of the registers/resources
>> of the GPU into the address space of the application and you run into security
>> issues when you map more than 4k at a time.
>>>>>>>> Right.  There are some 4K pages with the MMIO register BAR which
>>>>>>>> are empty and registers can be remapped into them.  In this case
>>>>>>>> we remap the HDP flush registers into one of those register
>>>>>>>> pages.  This allows applications to flush the HDP write FIFO
>>>>>>>> from either the CPU or another device.  This is needed to flush
>>>>>>>> data written by the CPU or another device to the VRAM BAR out to
>>>>>>>> VRAM (i.e., so the GPU can see it).  This is flushed internally
>>>>>>>> as part of the shader dispatch packets,
>>>>>>> As far as I know this is only done for graphics shader submissions to the
>> classic CS interface, but not for compute dispatches through ROCm queues.
>>>>>> There is an explicit PM4 packet to flush the HDP cache for userqs
>>>>>> and for AQL the flush is handled via one of the flags in the
>>>>>> dispatch packet.  The MMIO remap is needed for more fine grained
>>>>>> use cases where you might have the CPU or another device operating
>>>>>> in a gang like scenario with the GPU.
>>>>> Thank you, Alex.
>>>>>
>>>>> We were encountering an issue while running the RCCL unit tests.
>>>>> With 2 GPUs, all tests passed successfully; however, when running
>>>>> with more than 2 GPUs, the tests began to fail at random points
>>>>> with the following
>>>>> errors:
>>>>>
>>>>> [  598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed
>>>>> for queue with doorbell_id: 80030008 [  606.696820] amdgpu
>>>>> 0048:0f:00.0: amdgpu: Failed to evict process queues [  606.696826]
>>>>> amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4 [
>>>>> 610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed
>>>>> for queue with doorbell_id: 80030008 [  610.696869] amdgpu
>>>>> 0048:0f:00.0: amdgpu: Failed to evict process queues [  610.696942]
>>>>> amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>>>>>
>>>>>
>>>>> After applying patches 7/8 and 8/8, we are no longer seeing this issue.
>>>>>
>>>>> One question I have is: we only started observing this problem when
>>>>> the number of GPUs increased. Could this be related to MMIO
>>>>> remapping not being available?
>>>> It could be.  E.g., if the CPU or a GPU writes data to VRAM on
>>>> another GPU, you will need to flush the HDP to make sure that data
>>>> hits VRAM before the GPU attached to the VRAM can see it.
>>>
>>> Thanks Alex
>>>
>>> I am now suspecting that the queue preemption issue may be related to
>>> the unavailability of MMIO remapping. I am not very familiar with this area.
>>>
>>> Could you please point me to the relevant code path where the PM4
>>> packet is issued to flush the HDP cache?
>> + David who is more familiar with the ROCm runtime.
>>
>> PM4 has a packet called HDP_FLUSH which flushes the HDP.  For AQL, it's
>> handled by one of the flags I think.  Most things in ROCm use AQL.
>>
>> @David Yat Sin Can you point to how HDP flushes are handled in the ROCm
>> runtime?
>>
>> Alex
>>
>>> I am consistently able to reproduce this issue on my system when using
>>> more than three GPUs if patches 7/8 and 8/8 are not applied. In your
>>> opinion, is there anything that can be done to speed up the HDP flush
>>> or to avoid this situation altogether?
>>>
>>>
>>>
>>>> Alex
>>>>
>>>>>> Alex
>>>>>>
>>>>>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>>>>>
>>>>>>>> but there are certain cases where an application may want more
>>>>>>>> control.  This is probably not a showstopper for most ROCm apps.
>>>>>>> Well the problem is that you absolutely need the HDP flush/invalidation for
>> 100% correctness. It does work most of the time without it, but you then risk data
>> corruption.
>>>>>>> Apart from making the flush/invalidate an IOCTL I think we could also just
>> use a global dummy page in VRAM.
>>>>>>> If you make two 32bit writes which are apart from each other and then a
>> read back a 32bit value from VRAM that should invalidate the HDP as well. It's less
>> efficient than the MMIO BAR remap but still much better than going though an
>> IOCTL.
>>>>>>> The only tricky part is that you need to get the HW barriers with the doorbell
>> write right.....
>>>>>>>> That said, the region is only 4K so if you allow applications to
>>>>>>>> map a larger region they would get access to GPU register pages
>>>>>>>> which they shouldn't have access to.
>>>>>>> But don't we also have problems with the doorbell? E.g. the global
>> aggregated one needs to be 4k as well, or is it ok to over allocate there?
>>>>>>> Thinking more about it there is also a major problem with page tables.
>> Those are 4k by default on modern systems as well and while over allocating them
>> to 64k is possible that not only wastes some VRAM but can also result in OOM
>> situations because we can't allocate the necessary page tables to switch from 2MiB
>> to 4k pages in some cases.
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>>>>> [1] ROCr debug agent tests:
>>>>>>>>>>>> https://github.com/ROCm/rocr_debug_agent
>>>>>>>>>>>> [2] RCCL tests:
>>>>>>>>>>>> https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please note that the changes in this series are on a best
>>>>>>>>>>>> effort basis from our end. Therefore, requesting the amd-gfx
>>>>>>>>>>>> community (who have deeper knowledge of the HW & SW stack)
>>>>>>>>>>>> to kindly help with the review and provide feedback /
>>>>>>>>>>>> comments on these patches. The idea here is, to also have non-4K
>> pagesize (e.g. 64K) well supported with amd gpu kernel driver.
>>>>>>>>>>> Well this is generally nice to have, but there are unfortunately some
>> HW limitations which makes ROCm pretty much unusable on non 4k page size
>> systems.
>>>>>>>>>> That's a bummer :(
>>>>>>>>>> - Do we have some HW documentation around what are these
>> limitations around non-4K pagesize? Any links to such please?
>>>>>>>>> You already mentioned MMIO remap which obviously has that problem,
>> but if I'm not completely mistaken the PCIe doorbell BAR and some global seq
>> counter resources will also cause problems here.
>>>>>>>>> This can all be worked around by delegating those MMIO accesses into
>> the kernel, but that means tons of extra IOCTL overhead.
>>>>>>>>> Especially the cache flushes which are necessary to avoid corruption
>> are really bad for performance in such an approach.
>>>>>>>>>> - Are there any latest AMD GPU versions which maybe lifts such
>> restrictions?
>>>>>>>>> Not that I know off any.
>>>>>>>>>
>>>>>>>>>>> What we can do is to support graphics and MM, but that should
>> already work out of the box.
>>>>>>>>>> - Maybe we should also document, what will work and what won't work
>> due to these HW limitations.
>>>>>>>>> Well pretty much everything, I need to double check how ROCm does
>> HDP flushing/invalidating when the MMIO remap isn't available.
>>>>>>>>> Could be that there is already a fallback path and that's the reason why
>> this approach actually works at all.
>>>>>>>>>>> What we can do is to support graphics and MM, but that should
>>>>>>>>>>> already work out of the box.>
>>>>>>>>>> So these patches helped us resolve most of the issues like
>>>>>>>>>> SDMA hangs and GPU kernel page faults which we saw with rocr
>>>>>>>>>> and rccl tests with 64K pagesize. Meaning, we didn't see this
>>>>>>>>>> working out of box perhaps due to 64K pagesize.
>>>>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>>>>
>>>>>>>>> To be honest I'm not sure how ROCm even works when you have 64k
>> pages at the moment. I would expect much more issue lurking in the kernel driver.
>>>>>>>>>> AFAIU, some of these patches may require re-work based on
>>>>>>>>>> reviews, but at least with these changes, we were able to see all the
>> tests passing.
>>>>>>>>>>> I need to talk with Alex and the ROCm team about it if workarounds
>> can be implemented for those issues.
>>>>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>> Thanks again for the quick response on the patch series.
>>>>>>>>> You are welcome, but since it's so near to the end of the year not all
>> people are available any more.
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> -ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size
  2025-12-19 10:27     ` Donet Tom
@ 2026-01-06 13:01       ` Donet Tom
  0 siblings, 0 replies; 44+ messages in thread
From: Donet Tom @ 2026-01-06 13:01 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 12/19/25 3:57 PM, Donet Tom wrote:
>
> On 12/12/25 2:34 PM, Christian König wrote:
>> On 12/12/25 07:40, Donet Tom wrote:
>>> The ctl_stack_size and wg_data_size values are used to compute the 
>>> total
>>> context save/restore buffer size and the control stack size. These 
>>> buffers
>>> are programmed into the GPU and are used to store the queue state 
>>> during
>>> context save and restore.
>>>
>>> Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
>>> PAGE_SIZE. On systems with a non-4K CPU page size, this causes 
>>> unnecessary
>>> memory waste because the GPU internally calculates and uses buffer 
>>> sizes
>>> aligned to a fixed 4K GPU page size.
>>>
>>> Since the control stack and context save/restore buffers are 
>>> consumed by
>>> the GPU, their sizes should be aligned to the GPU page size (4K), 
>>> not the
>>> CPU page size. This patch updates the alignment of ctl_stack_size and
>>> wg_data_size to prevent over-allocation on systems with larger CPU page
>>> sizes.
>> As far as I know the problem is that the debugger needs to consume 
>> that stuff on the CPU side as well.
>
>
> Thank you for your help.
>
> As mentioned earlier, we were observing some queue preemption and GPU 
> hang issues. To address this, we introduced a patch, and after 
> applying the 7/8 and 8/8 patches, those issues have not been seen anymore
>
> While debugging the GPU hang issue, I made some additional observations.
>
> On my system, I booted a kernel with a 4 KB system page size and 
> modified both the ROCR runtime and the GPU driver to set the control 
> stack size to 64 KB. Even on a 4 KB page-size system, using a 64 KB 
> control stack size reliably reproduces the queue preemption failure 
> when running RCCL unit tests on 8 GPUs. This suggests that the issue 
> is not related to the system page size, but rather to the control 
> stack size being exactly 64 KB.
>
> When the control stack size is set to 64 KB ± 4 KB, the tests pass on 
> both 4 KB and 64 KB system page-size configurations.
>
> For gfxv9, is there any documented hardware limitation on the control 
> stack size? Specifically, is it valid to use a control stack size of 
> exactly 64 KB?


I have one more question based on my understanding of the code. The 
control stack size depends on the number of CUs and waves. For GFXv9, 
what is the maximum possible control stack size? Can it reach 64K?

For GFX10, I’ve seen that the control stack size must be less than or 
equal to 0x7000. Is there a similar limitation for GFXv9?

I’m asking because, with both 4K and 64K page sizes, I’m seeing queue 
preemption failures on GFXv9 when the control stack size is set to 64K.


>
>
>>
>> I need to double check that, but I think the alignment is correct as 
>> it is.
>
>
> The control stack is part of the context save-restore buffer, and we 
> configure it on the GPU as shown below:
>
> m->cp_hqd_ctx_save_base_addr_lo = 
> lower_32_bits(q->ctx_save_restore_area_address);
> m->cp_hqd_ctx_save_base_addr_hi = 
> upper_32_bits(q->ctx_save_restore_area_address);
> m->cp_hqd_ctx_save_size = q->ctx_save_restore_area_size;
> m->cp_hqd_cntl_stack_size = q->ctl_stack_size;
> m->cp_hqd_cntl_stack_offset = q->ctl_stack_size;
> m->cp_hqd_wg_state_offset = q->ctl_stack_size;
>
> The control stack occupies the region from cp_hqd_cntl_stack_offset 
> down to 0 within the ctx save restore area, and the remaining space is 
> used for WG state. This buffer is fully managed by the GPU during 
> preemption and restore operations.
> The control stack size is calculated based on hardware configuration 
> (CU count and wave count). For example, on gfxv9, the size is 
> typically around 32 KB. If we align this size to the system page size 
> (e.g., 64 KB), two issues arise:
>
> 1. Unnecessary memory overhead.
> 2. Potential queue preemption issues.
>
> On the CPU side, we copy the control stack contents to other buffers 
> for processing. Since the control stack size is derived from hardware 
> configuration, aligning it to the GPU page size seems more 
> appropriate. Aligning to the system page size would waste memory 
> without adding value. Using GPU page size alignment ensures 
> consistency with hardware and avoids unnecessary overhead.
>
> Would you agree that aligning the control stack size to the GPU page 
> size is the right approach? Or do you see any concerns with this method?
>
>
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>> index dc857450fa16..00ab941c3e86 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>> @@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct 
>>> kfd_topology_device *dev)
>>>               min(cu_num * 40, props->array_count / 
>>> props->simd_arrays_per_engine * 512)
>>>               : cu_num * 32;
>>>   -    wg_data_size = ALIGN(cu_num * 
>>> WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
>>> +    wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, 
>>> props),
>>> +                AMDGPU_GPU_PAGE_SIZE);
>>>       ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
>>>       ctl_stack_size = 
>>> ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
>>> -                   PAGE_SIZE);
>>> +                   AMDGPU_GPU_PAGE_SIZE);
>>>         if ((gfxv / 10000 * 10000) == 100000) {
>>>           /* HW design limits control stack size to 0x7000.
>>> @@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct 
>>> kfd_topology_device *dev)
>>>         props->ctl_stack_size = ctl_stack_size;
>>>       props->debug_memory_size = ALIGN(wave_num * 
>>> DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
>>> -    props->cwsr_size = ctl_stack_size + wg_data_size;
>>> +    props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, 
>>> PAGE_SIZE);
>>>         if (gfxv == 80002)    /* GFX_VERSION_TONGA */
>>>           props->eop_buffer_size = 0x8000;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2026-01-06 12:55     ` Donet Tom
@ 2026-01-08 12:31       ` Christian König
  2026-01-09 10:22         ` Pierre-Eric Pelloux-Prayer
  2026-01-09 12:57         ` Donet Tom
  0 siblings, 2 replies; 44+ messages in thread
From: Christian König @ 2026-01-08 12:31 UTC (permalink / raw)
  To: Donet Tom, amd-gfx, Felix Kuehling, Alex Deucher, Philip Yang,
	Pelloux-Prayer, Pierre-Eric
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya

On 1/6/26 13:55, Donet Tom wrote:
> 
> On 12/12/25 2:23 PM, Christian König wrote:
>> On 12/12/25 07:40, Donet Tom wrote:
>>> The SDMA engine has a hardware limitation of 4 MB maximum transfer
>>> size per operation.
>> That is not correct. This is only true on ancient HW.
>>
>> What problems are you seeing here?
>>
>>> AMDGPU_GTT_MAX_TRANSFER_SIZE was hardcoded to
>>> 512 pages, which worked correctly on systems with 4K pages but fails
>>> on systems with larger page sizes.
>>>
>>> This patch divides the max transfer size / AMDGPU_GPU_PAGES_IN_CPU_PAGE
>>> to match with non-4K page size systems.
>> That is actually a bad idea. The value was meant to match the PMD size.
> 
> 
> Hi Christian, Felix, Alex and philip
> 
> Instead of hardcoding the AMDGPU_GTT_MAX_TRANSFER_SIZE value to 512,
> what do you think about doing something like the change below?
> This should work across all architectures and page sizes, so
> AMDGPU_GTT_MAX_TRANSFER_SIZE will always correspond to the PMD
> size on all architectures and with all page sizes.
> 
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> index 0be2728aa872..c594ed7dff18 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> @@ -37,7 +37,7 @@
>  #define AMDGPU_PL_MMIO_REMAP   (TTM_PL_PRIV + 5)
>  #define __AMDGPU_PL_NUM        (TTM_PL_PRIV + 6)
>  
> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE   512
> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE   1 << (PMD_SHIFT - PAGE_SHIFT)
>  #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS  
> 
> Could you please provide your thoughts on above? Is it looking ok to you?

It's at least reasonable. My only concern is that we have patches in the pipeline to remove that define and make it independent of the PMD size.

@Pierre-Eric how far along are we with that?

> 
> If this looks good - here is what we were thinking:
> 
> Patches 1-4 are required to fix initial non-4k pagesize support to AMD GPU.
> And since these patches are looking in good shape (since Philip has already
> reviewed [1-3])- We thought it will be good to split the patch series into two.
> I will send a v2 of Part-1 with patches [1-4] (will also address the review comments
> in v2 for Patch-1 & 2 from Philip) and for the rest of the patches [5-8] Part-2, we
> can continue the discussion till other things are sorted. That will also allow us to
> get these initial fixes in Part-1 ready before the 6.20 merge window. 
> 
> Thoughts?

Sounds reasonable to me.

Regards,
Christian.

> 
> 
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>> index 0be2728aa872..9d038feb25b0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>> @@ -37,7 +37,7 @@
>>>  #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
>>>  #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
>>>  
>>> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
>>> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(512 / AMDGPU_GPU_PAGES_IN_CPU_PAGE)
>>>  #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
>>>  
>>>  extern const struct attribute_group amdgpu_vram_mgr_attr_group;


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2026-01-08 12:31       ` Christian König
@ 2026-01-09 10:22         ` Pierre-Eric Pelloux-Prayer
  2026-01-09 12:57         ` Donet Tom
  1 sibling, 0 replies; 44+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2026-01-09 10:22 UTC (permalink / raw)
  To: Christian König, Donet Tom, amd-gfx, Felix Kuehling,
	Alex Deucher, Philip Yang, Pelloux-Prayer, Pierre-Eric
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


Hi,

Le 08/01/2026 à 13:31, Christian König a écrit :
> On 1/6/26 13:55, Donet Tom wrote:
>>
>> On 12/12/25 2:23 PM, Christian König wrote:
>>> On 12/12/25 07:40, Donet Tom wrote:
>>>> The SDMA engine has a hardware limitation of 4 MB maximum transfer
>>>> size per operation.
>>> That is not correct. This is only true on ancient HW.
>>>
>>> What problems are you seeing here?
>>>
>>>> AMDGPU_GTT_MAX_TRANSFER_SIZE was hardcoded to
>>>> 512 pages, which worked correctly on systems with 4K pages but fails
>>>> on systems with larger page sizes.
>>>>
>>>> This patch divides the max transfer size / AMDGPU_GPU_PAGES_IN_CPU_PAGE
>>>> to match with non-4K page size systems.
>>> That is actually a bad idea. The value was meant to match the PMD size.
>>
>>
>> Hi Christian, Felix, Alex and philip
>>
>> Instead of hardcoding the AMDGPU_GTT_MAX_TRANSFER_SIZE value to 512,
>> what do you think about doing something like the change below?
>> This should work across all architectures and page sizes, so
>> AMDGPU_GTT_MAX_TRANSFER_SIZE will always correspond to the PMD
>> size on all architectures and with all page sizes.
>>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> index 0be2728aa872..c594ed7dff18 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> @@ -37,7 +37,7 @@
>>   #define AMDGPU_PL_MMIO_REMAP   (TTM_PL_PRIV + 5)
>>   #define __AMDGPU_PL_NUM        (TTM_PL_PRIV + 6)
>>   
>> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE   512
>> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE   1 << (PMD_SHIFT - PAGE_SHIFT)
>>   #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS
>>
>> Could you please provide your thoughts on above? Is it looking ok to you?
> 
> It's at least reasonable. My only concern is that we have patches in the pipeline to remove that define and make it independent of the PMD size.
> 
> @Pierre-Eric how far along are we with that?


My patchset is dropping AMDGPU_GTT_NUM_TRANSFER_WINDOWS and doubling 
AMDGPU_GTT_MAX_TRANSFER_SIZE so it's not negatively affected by Tom's 
patches.

Regards,
Pierre-Eric



> 
>>
>> If this looks good - here is what we were thinking:
>>
>> Patches 1-4 are required to fix initial non-4k pagesize support to AMD GPU.
>> And since these patches are looking in good shape (since Philip has already
>> reviewed [1-3])- We thought it will be good to split the patch series into two.
>> I will send a v2 of Part-1 with patches [1-4] (will also address the review comments
>> in v2 for Patch-1 & 2 from Philip) and for the rest of the patches [5-8] Part-2, we
>> can continue the discussion till other things are sorted. That will also allow us to
>> get these initial fixes in Part-1 ready before the 6.20 merge window.
>>
>> Thoughts?
> 
> Sounds reasonable to me.
> 
> Regards,
> Christian.
> 
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>>> index 0be2728aa872..9d038feb25b0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>>> @@ -37,7 +37,7 @@
>>>>   #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
>>>>   #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
>>>>   
>>>> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
>>>> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(512 / AMDGPU_GPU_PAGES_IN_CPU_PAGE)
>>>>   #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
>>>>   
>>>>   extern const struct attribute_group amdgpu_vram_mgr_attr_group;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size
  2026-01-08 12:31       ` Christian König
  2026-01-09 10:22         ` Pierre-Eric Pelloux-Prayer
@ 2026-01-09 12:57         ` Donet Tom
  1 sibling, 0 replies; 44+ messages in thread
From: Donet Tom @ 2026-01-09 12:57 UTC (permalink / raw)
  To: Christian König, amd-gfx, Felix Kuehling, Alex Deucher,
	Philip Yang, Pelloux-Prayer, Pierre-Eric
  Cc: Kent.Russell, Ritesh Harjani, Vaidyanathan Srinivasan,
	Mukesh Kumar Chaurasiya


On 1/8/26 6:01 PM, Christian König wrote:
> On 1/6/26 13:55, Donet Tom wrote:
>> On 12/12/25 2:23 PM, Christian König wrote:
>>> On 12/12/25 07:40, Donet Tom wrote:
>>>> The SDMA engine has a hardware limitation of 4 MB maximum transfer
>>>> size per operation.
>>> That is not correct. This is only true on ancient HW.
>>>
>>> What problems are you seeing here?
>>>
>>>> AMDGPU_GTT_MAX_TRANSFER_SIZE was hardcoded to
>>>> 512 pages, which worked correctly on systems with 4K pages but fails
>>>> on systems with larger page sizes.
>>>>
>>>> This patch divides the max transfer size / AMDGPU_GPU_PAGES_IN_CPU_PAGE
>>>> to match with non-4K page size systems.
>>> That is actually a bad idea. The value was meant to match the PMD size.
>>
>> Hi Christian, Felix, Alex and philip
>>
>> Instead of hardcoding the AMDGPU_GTT_MAX_TRANSFER_SIZE value to 512,
>> what do you think about doing something like the change below?
>> This should work across all architectures and page sizes, so
>> AMDGPU_GTT_MAX_TRANSFER_SIZE will always correspond to the PMD
>> size on all architectures and with all page sizes.
>>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> index 0be2728aa872..c594ed7dff18 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> @@ -37,7 +37,7 @@
>>   #define AMDGPU_PL_MMIO_REMAP   (TTM_PL_PRIV + 5)
>>   #define __AMDGPU_PL_NUM        (TTM_PL_PRIV + 6)
>>   
>> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE   512
>> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE   1 << (PMD_SHIFT - PAGE_SHIFT)
>>   #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS
>>
>> Could you please provide your thoughts on above? Is it looking ok to you?
> It's at least reasonable. My only concern is that we have patches in the pipeline to remove that define and make it independent of the PMD size.
>
> @Pierre-Eric how far along are we with that?
>
>> If this looks good - here is what we were thinking:
>>
>> Patches 1-4 are required to fix initial non-4k pagesize support to AMD GPU.
>> And since these patches are looking in good shape (since Philip has already
>> reviewed [1-3])- We thought it will be good to split the patch series into two.
>> I will send a v2 of Part-1 with patches [1-4] (will also address the review comments
>> in v2 for Patch-1 & 2 from Philip) and for the rest of the patches [5-8] Part-2, we
>> can continue the discussion till other things are sorted. That will also allow us to
>> get these initial fixes in Part-1 ready before the 6.20 merge window.
>>
>> Thoughts?
> Sounds reasonable to me.


Thanks Christian.


>
> Regards,
> Christian.
>
>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>>> index 0be2728aa872..9d038feb25b0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>>> @@ -37,7 +37,7 @@
>>>>   #define AMDGPU_PL_MMIO_REMAP	(TTM_PL_PRIV + 5)
>>>>   #define __AMDGPU_PL_NUM	(TTM_PL_PRIV + 6)
>>>>   
>>>> -#define AMDGPU_GTT_MAX_TRANSFER_SIZE	512
>>>> +#define AMDGPU_GTT_MAX_TRANSFER_SIZE	(512 / AMDGPU_GPU_PAGES_IN_CPU_PAGE)
>>>>   #define AMDGPU_GTT_NUM_TRANSFER_WINDOWS	2
>>>>   
>>>>   extern const struct attribute_group amdgpu_vram_mgr_attr_group;

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2026-01-09 12:58 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
2025-12-15 20:25   ` Philip Yang
2025-12-16 10:12     ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes Donet Tom
2025-12-15 20:44   ` Philip Yang
2025-12-16 10:09     ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map() Donet Tom
2025-12-15 21:03   ` Philip Yang
2025-12-12  6:40 ` [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
2025-12-12  8:53   ` Christian König
2025-12-12 12:14     ` Donet Tom
2026-01-06 12:55     ` Donet Tom
2026-01-08 12:31       ` Christian König
2026-01-09 10:22         ` Pierre-Eric Pelloux-Prayer
2026-01-09 12:57         ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 5/8] amdkfd/kfd_chardev: Add error message for non-4k pagesize failures Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 6/8] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size Donet Tom
2025-12-12  9:04   ` Christian König
2025-12-12 12:29     ` Donet Tom
2025-12-19 10:27     ` Donet Tom
2026-01-06 13:01       ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 8/8] amdgpu: Fix MQD and control stack alignment for non-4K CPU page size systems Donet Tom
2025-12-12  9:01 ` [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K " Christian König
2025-12-12 10:45   ` Ritesh Harjani
2025-12-12 13:01     ` Christian König
2025-12-12 17:24       ` Alex Deucher
2025-12-15  9:47         ` Christian König
2025-12-15 10:11           ` Donet Tom
2025-12-15 16:11             ` Christian König
2025-12-16 10:08               ` Donet Tom
2025-12-16 16:06                 ` Christian König
2025-12-17  9:04                   ` Donet Tom
2025-12-17  9:46               ` Donet Tom
2025-12-17 10:10                 ` Christian König
2025-12-15 14:09           ` Alex Deucher
2025-12-16 13:54             ` Donet Tom
2025-12-16 14:02               ` Alex Deucher
2025-12-17  9:03                 ` Donet Tom
2025-12-17 14:23                   ` Alex Deucher
2025-12-17 21:31                     ` Yat Sin, David
2026-01-02 18:53                       ` Donet Tom
2026-01-06 12:58                       ` Donet Tom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox