AMD-GFX Archive on lore.kernel.org
From: Donet Tom <donettom@linux.ibm.com>
To: "Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org,
	"Felix Kuehling" <Felix.Kuehling@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>
Cc: Kent.Russell@amd.com, Ritesh Harjani <ritesh.list@gmail.com>,
	Vaidyanathan Srinivasan <svaidy@linux.ibm.com>,
	Mukesh Kumar Chaurasiya <mkchauras@linux.ibm.com>
Subject: Re: [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size
Date: Fri, 19 Dec 2025 15:57:47 +0530
Message-ID: <7e367a8b-6dda-4eec-98ff-aa0bb6550a77@linux.ibm.com>
In-Reply-To: <2a213294-bf56-4ead-9e1f-cc8c3d4003a0@amd.com>


On 12/12/25 2:34 PM, Christian König wrote:
> On 12/12/25 07:40, Donet Tom wrote:
>> The ctl_stack_size and wg_data_size values are used to compute the total
>> context save/restore buffer size and the control stack size. These buffers
>> are programmed into the GPU and are used to store the queue state during
>> context save and restore.
>>
>> Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
>> PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary
>> memory waste because the GPU internally calculates and uses buffer sizes
>> aligned to a fixed 4K GPU page size.
>>
>> Since the control stack and context save/restore buffers are consumed by
>> the GPU, their sizes should be aligned to the GPU page size (4K), not the
>> CPU page size. This patch updates the alignment of ctl_stack_size and
>> wg_data_size to prevent over-allocation on systems with larger CPU page
>> sizes.
> As far as I know the problem is that the debugger needs to consume that stuff on the CPU side as well.


Thank you for your help.

As mentioned earlier, we were observing queue preemption and GPU hang 
issues. After applying patches 7/8 and 8/8, those issues have not been 
seen anymore.

While debugging the GPU hang issue, I made some additional observations.

On my system, I booted a kernel with a 4 KB system page size and 
modified both the ROCR runtime and the GPU driver to set the control 
stack size to 64 KB. Even on a 4 KB page-size system, using a 64 KB 
control stack size reliably reproduces the queue preemption failure when 
running RCCL unit tests on 8 GPUs. This suggests that the issue is not 
related to the system page size, but rather to the control stack size 
being exactly 64 KB.
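
For reference, the driver-side half of that experiment amounted to forcing
the computed size, along these lines (a sketch of the kind of override, not
the exact change used):

	/* repro hack in kfd_queue_ctx_save_restore_size(): force 64 KB */
	ctl_stack_size = 64 * 1024;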

When the control stack size is set to 64 KB ± 4 KB (i.e. 60 KB or 68 KB), 
the tests pass on both the 4 KB and 64 KB system page-size configurations.

For gfxv9, is there any documented hardware limitation on the control 
stack size? Specifically, is it valid to use a control stack size of 
exactly 64 KB?


>
> I need to double check that, but I think the alignment is correct as it is.


The control stack is part of the context save-restore buffer, and we 
configure it on the GPU as shown below:

m->cp_hqd_ctx_save_base_addr_lo = lower_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_base_addr_hi = upper_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_size = q->ctx_save_restore_area_size;
m->cp_hqd_cntl_stack_size = q->ctl_stack_size;
m->cp_hqd_cntl_stack_offset = q->ctl_stack_size;
m->cp_hqd_wg_state_offset = q->ctl_stack_size;

The control stack occupies the region from cp_hqd_cntl_stack_offset down 
to 0 within the context save/restore area, and the remaining space is used 
for WG state. This buffer is fully managed by the GPU during preemption 
and restore operations, as sketched below.
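
Roughly, the layout looks like this (a sketch drawn from the description
above, not copied from the driver):

/*
 * q->ctx_save_restore_area_address
 * +--------------------------------+  offset 0
 * | control stack                  |  written by the CP within
 * |                                |  [0, cp_hqd_cntl_stack_offset)
 * +--------------------------------+  offset = ctl_stack_size
 * | WG state (wave context data)   |
 * +--------------------------------+  offset = cp_hqd_ctx_save_size
 */
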
The control stack size is calculated based on hardware configuration (CU 
count and wave count). For example, on gfxv9, the size is typically 
around 32 KB. If we align this size to the system page size (e.g., 
64 KB), two issues arise (rough numbers sketched after the list below):

1. Unnecessary memory overhead.
2. Potential queue preemption issues.
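
To put rough numbers on this, assuming the ~32 KB hardware-derived control
stack size mentioned above and ignoring the small save-area header
(illustrative arithmetic only, not driver code):

	ctl_stack_size = 32 * 1024;
	ALIGN(ctl_stack_size, AMDGPU_GPU_PAGE_SIZE);  /* -> 0x8000  (32 KB) */
	ALIGN(ctl_stack_size, PAGE_SIZE);             /* -> 0x10000 (64 KB) on a
	                                                 64 KB page-size kernel */

So on a 64 KB page-size kernel, PAGE_SIZE alignment wastes roughly 32 KB
per queue and lands the control stack on exactly 64 KB, the size that
reproduced the preemption failure in the experiment above.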

On the CPU side, we copy the control stack contents to other buffers for 
processing. Since the control stack size is derived from hardware 
configuration, aligning it to the GPU page size seems more appropriate. 
Aligning to the system page size would waste memory without adding 
value. Using GPU page size alignment ensures consistency with hardware 
and avoids unnecessary overhead.

Would you agree that aligning the control stack size to the GPU page 
size is the right approach? Or do you see any concerns with this method?


>
> Regards,
> Christian.
>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> index dc857450fa16..00ab941c3e86 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> @@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>>   		    min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
>>   		    : cu_num * 32;
>>   
>> -	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
>> +	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
>> +				AMDGPU_GPU_PAGE_SIZE);
>>   	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
>>   	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
>> -			       PAGE_SIZE);
>> +			       AMDGPU_GPU_PAGE_SIZE);
>>   
>>   	if ((gfxv / 10000 * 10000) == 100000) {
>>   		/* HW design limits control stack size to 0x7000.
>> @@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
>>   
>>   	props->ctl_stack_size = ctl_stack_size;
>>   	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
>> -	props->cwsr_size = ctl_stack_size + wg_data_size;
>> +	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
>>   
>>   	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
>>   		props->eop_buffer_size = 0x8000;

Thread overview: 44+ messages

2025-12-12  6:40 [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 1/8] drm/amdkfd: Relax size checking during queue buffer get Donet Tom
2025-12-15 20:25   ` Philip Yang
2025-12-16 10:12     ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 2/8] amdkfd/kfd_svm: Fix SVM map/unmap address conversion for non-4k page sizes Donet Tom
2025-12-15 20:44   ` Philip Yang
2025-12-16 10:09     ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 3/8] amdkfd/kfd_migrate: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map() Donet Tom
2025-12-15 21:03   ` Philip Yang
2025-12-12  6:40 ` [RFC PATCH v1 4/8] amdgpu/amdgpu_ttm: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
2025-12-12  8:53   ` Christian König
2025-12-12 12:14     ` Donet Tom
2026-01-06 12:55     ` Donet Tom
2026-01-08 12:31       ` Christian König
2026-01-09 10:22         ` Pierre-Eric Pelloux-Prayer
2026-01-09 12:57         ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 5/8] amdkfd/kfd_chardev: Add error message for non-4k pagesize failures Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 6/8] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 7/8] amdgpu: Align ctl_stack_size and wg_data_size to GPU page size instead of CPU page size Donet Tom
2025-12-12  9:04   ` Christian König
2025-12-12 12:29     ` Donet Tom
2025-12-19 10:27     ` Donet Tom [this message]
2026-01-06 13:01       ` Donet Tom
2025-12-12  6:40 ` [RFC PATCH v1 8/8] amdgpu: Fix MQD and control stack alignment for non-4K CPU page size systems Donet Tom
2025-12-12  9:01 ` [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K " Christian König
2025-12-12 10:45   ` Ritesh Harjani
2025-12-12 13:01     ` Christian König
2025-12-12 17:24       ` Alex Deucher
2025-12-15  9:47         ` Christian König
2025-12-15 10:11           ` Donet Tom
2025-12-15 16:11             ` Christian König
2025-12-16 10:08               ` Donet Tom
2025-12-16 16:06                 ` Christian König
2025-12-17  9:04                   ` Donet Tom
2025-12-17  9:46               ` Donet Tom
2025-12-17 10:10                 ` Christian König
2025-12-15 14:09           ` Alex Deucher
2025-12-16 13:54             ` Donet Tom
2025-12-16 14:02               ` Alex Deucher
2025-12-17  9:03                 ` Donet Tom
2025-12-17 14:23                   ` Alex Deucher
2025-12-17 21:31                     ` Yat Sin, David
2026-01-02 18:53                       ` Donet Tom
2026-01-06 12:58                       ` Donet Tom
