* [PATCH] drm/amdkfd: relax checks for over allocation of save area
@ 2025-11-06 20:27 Jonathan Kim
2025-11-07 14:36 ` Alex Deucher
0 siblings, 1 reply; 8+ messages in thread
From: Jonathan Kim @ 2025-11-06 20:27 UTC (permalink / raw)
To: amd-gfx; +Cc: Felix.Kuehling, Lancelot.six, Philip.Yang, Jonathan Kim
Over allocation of save area is not fatal, only under allocation is.
ROCm has various components that independently claim authority over save
area size.
Unless KFD decides to claim single authority, relax size checks with a
warning on over allocation.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index a65c67cf56ff..448043bc2937 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
goto out_err_unreserve;
}
- if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) {
- pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
+ if (properties->ctx_save_restore_area_size < topo_dev->node_props.cwsr_size) {
+ pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size 0x%x\n",
properties->ctx_save_restore_area_size,
topo_dev->node_props.cwsr_size);
err = -EINVAL;
goto out_err_unreserve;
}
- total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
- * NUM_XCC(pdd->dev->xcc_mask);
+ if (properties->ctx_save_restore_area_size > topo_dev->node_props.cwsr_size)
+ pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended node cwsr size 0x%x\n",
+ properties->ctx_save_restore_area_size,
+ topo_dev->node_props.cwsr_size);
+
+ total_cwsr_size = (properties->ctx_save_restore_area_size +
+ topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address,
@@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct kfd_process_device *pdd, struct queue_prope
topo_dev = kfd_topology_device_by_id(pdd->dev->id);
if (!topo_dev)
return -EINVAL;
- total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
- * NUM_XCC(pdd->dev->xcc_mask);
+ total_cwsr_size = (properties->ctx_save_restore_area_size +
+ topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
kfd_queue_buffer_svm_put(pdd, properties->ctx_save_restore_area_address, total_cwsr_size);
--
2.34.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
2025-11-06 20:27 [PATCH] drm/amdkfd: relax checks for over allocation of save area Jonathan Kim
@ 2025-11-07 14:36 ` Alex Deucher
2025-11-07 15:12 ` Kim, Jonathan
0 siblings, 1 reply; 8+ messages in thread
From: Alex Deucher @ 2025-11-07 14:36 UTC (permalink / raw)
To: Jonathan Kim; +Cc: amd-gfx, Felix.Kuehling, Lancelot.six, Philip.Yang
On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <jonathan.kim@amd.com> wrote:
>
> Over allocation of save area is not fatal, only under allocation is.
> ROCm has various components that independently claim authority over save
> area size.
>
> Unless KFD decides to claim single authority, relax size checks with a
> warning on over allocation.
Do we want any sort of upper limit?
>
> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> index a65c67cf56ff..448043bc2937 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> @@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
> goto out_err_unreserve;
> }
>
> - if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) {
> - pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
> + if (properties->ctx_save_restore_area_size < topo_dev->node_props.cwsr_size) {
> + pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size 0x%x\n",
> properties->ctx_save_restore_area_size,
> topo_dev->node_props.cwsr_size);
> err = -EINVAL;
> goto out_err_unreserve;
> }
>
> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
> - * NUM_XCC(pdd->dev->xcc_mask);
> + if (properties->ctx_save_restore_area_size > topo_dev->node_props.cwsr_size)
> + pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended node cwsr size 0x%x\n",
> + properties->ctx_save_restore_area_size,
> + topo_dev->node_props.cwsr_size);
We can probably drop the warning here.
Alex
> +
> + total_cwsr_size = (properties->ctx_save_restore_area_size +
> + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>
> err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address,
> @@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct kfd_process_device *pdd, struct queue_prope
> topo_dev = kfd_topology_device_by_id(pdd->dev->id);
> if (!topo_dev)
> return -EINVAL;
> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
> - * NUM_XCC(pdd->dev->xcc_mask);
> + total_cwsr_size = (properties->ctx_save_restore_area_size +
> + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>
> kfd_queue_buffer_svm_put(pdd, properties->ctx_save_restore_area_address, total_cwsr_size);
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH] drm/amdkfd: relax checks for over allocation of save area
2025-11-07 14:36 ` Alex Deucher
@ 2025-11-07 15:12 ` Kim, Jonathan
2025-11-07 15:36 ` Lancelot SIX
0 siblings, 1 reply; 8+ messages in thread
From: Kim, Jonathan @ 2025-11-07 15:12 UTC (permalink / raw)
To: Alex Deucher
Cc: amd-gfx@lists.freedesktop.org, Kuehling, Felix, Six, Lancelot,
Yang, Philip
[Public]
> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Friday, November 7, 2025 9:36 AM
> To: Kim, Jonathan <Jonathan.Kim@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kuehling, Felix <Felix.Kuehling@amd.com>;
> Six, Lancelot <Lancelot.Six@amd.com>; Yang, Philip <Philip.Yang@amd.com>
> Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <jonathan.kim@amd.com> wrote:
> >
> > Over allocation of save area is not fatal, only under allocation is.
> > ROCm has various components that independently claim authority over save
> > area size.
> >
> > Unless KFD decides to claim single authority, relax size checks with a
> > warning on over allocation.
>
> Do we want any sort of upper limit?
Mmm. I'd expect early failure on unreasonable user request for over allocation at the allocation stage prior to acquire on queue create stage and acquire should be bound to what was registered/allocated.
So I'm not sure what an upper bound acquire limit check could serve.
>
> >
> > Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
> > ---
> > drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
> > 1 file changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > index a65c67cf56ff..448043bc2937 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> > @@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct
> kfd_process_device *pdd, struct queue_prope
> > goto out_err_unreserve;
> > }
> >
> > - if (properties->ctx_save_restore_area_size != topo_dev-
> >node_props.cwsr_size) {
> > - pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
> > + if (properties->ctx_save_restore_area_size < topo_dev-
> >node_props.cwsr_size) {
> > + pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size
> 0x%x\n",
> > properties->ctx_save_restore_area_size,
> > topo_dev->node_props.cwsr_size);
> > err = -EINVAL;
> > goto out_err_unreserve;
> > }
> >
> > - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >node_props.debug_memory_size)
> > - * NUM_XCC(pdd->dev->xcc_mask);
> > + if (properties->ctx_save_restore_area_size > topo_dev-
> >node_props.cwsr_size)
> > + pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended
> node cwsr size 0x%x\n",
> > + properties->ctx_save_restore_area_size,
> > + topo_dev->node_props.cwsr_size);
>
> We can probably drop the warning here.
Acked.
Thanks.
Jon
>
> Alex
>
> > +
> > + total_cwsr_size = (properties->ctx_save_restore_area_size +
> > + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd-
> >dev->xcc_mask);
> > total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
> >
> > err = kfd_queue_buffer_get(vm, (void *)properties-
> >ctx_save_restore_area_address,
> > @@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct
> kfd_process_device *pdd, struct queue_prope
> > topo_dev = kfd_topology_device_by_id(pdd->dev->id);
> > if (!topo_dev)
> > return -EINVAL;
> > - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >node_props.debug_memory_size)
> > - * NUM_XCC(pdd->dev->xcc_mask);
> > + total_cwsr_size = (properties->ctx_save_restore_area_size +
> > + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd-
> >dev->xcc_mask);
> > total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
> >
> > kfd_queue_buffer_svm_put(pdd, properties-
> >ctx_save_restore_area_address, total_cwsr_size);
> > --
> > 2.34.1
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
2025-11-07 15:12 ` Kim, Jonathan
@ 2025-11-07 15:36 ` Lancelot SIX
2025-11-10 19:22 ` Alex Deucher
0 siblings, 1 reply; 8+ messages in thread
From: Lancelot SIX @ 2025-11-07 15:36 UTC (permalink / raw)
To: Kim, Jonathan, Alex Deucher
Cc: amd-gfx@lists.freedesktop.org, Kuehling, Felix, Yang, Philip
On 07/11/2025 15:12, Kim, Jonathan wrote:
> [Public]
>
>> -----Original Message-----
>> From: Alex Deucher <alexdeucher@gmail.com>
>> Sent: Friday, November 7, 2025 9:36 AM
>> To: Kim, Jonathan <Jonathan.Kim@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Kuehling, Felix <Felix.Kuehling@amd.com>;
>> Six, Lancelot <Lancelot.Six@amd.com>; Yang, Philip <Philip.Yang@amd.com>
>> Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
>>
>> Caution: This message originated from an External Source. Use proper caution
>> when opening attachments, clicking links, or responding.
>>
>>
>> On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <jonathan.kim@amd.com> wrote:
>>>
>>> Over allocation of save area is not fatal, only under allocation is.
>>> ROCm has various components that independently claim authority over save
>>> area size.
>>>
>>> Unless KFD decides to claim single authority, relax size checks with a
>>> warning on over allocation.
>>
>> Do we want any sort of upper limit?
>
> Mmm. I'd expect early failure on unreasonable user request for over allocation at the allocation stage prior to acquire on queue create stage and acquire should be bound to what was registered/allocated.
> So I'm not sure what an upper bound acquire limit check could serve.
>
One thing to consider is that at the end of the CWSR area, there is
space dedicated for the debugger (used to implement displaced-stepping
buffers). Initially, when KFD did not check the CWSR area size, it was
just a userspace matter to increase that size if the debugger needed
extensions. With this new change, we can't extend this buffer without a
KFD update.
Having a "check that the buffer the thunk provided is big enough" policy
would still allow us to extend this allocation if necessary.
Best,
Lancelot.
>>
>>>
>>> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
>>> 1 file changed, 11 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>> index a65c67cf56ff..448043bc2937 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>> @@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct
>> kfd_process_device *pdd, struct queue_prope
>>> goto out_err_unreserve;
>>> }
>>>
>>> - if (properties->ctx_save_restore_area_size != topo_dev-
>>> node_props.cwsr_size) {
>>> - pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
>>> + if (properties->ctx_save_restore_area_size < topo_dev-
>>> node_props.cwsr_size) {
>>> + pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size
>> 0x%x\n",
>>> properties->ctx_save_restore_area_size,
>>> topo_dev->node_props.cwsr_size);
>>> err = -EINVAL;
>>> goto out_err_unreserve;
>>> }
>>>
>>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
>>> node_props.debug_memory_size)
>>> - * NUM_XCC(pdd->dev->xcc_mask);
>>> + if (properties->ctx_save_restore_area_size > topo_dev-
>>> node_props.cwsr_size)
>>> + pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended
>> node cwsr size 0x%x\n",
>>> + properties->ctx_save_restore_area_size,
>>> + topo_dev->node_props.cwsr_size);
>>
>> We can probably drop the warning here.
>
> Acked.
>
> Thanks.
>
> Jon
>
>>
>> Alex
>>
>>> +
>>> + total_cwsr_size = (properties->ctx_save_restore_area_size +
>>> + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd-
>>> dev->xcc_mask);
>>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>>>
>>> err = kfd_queue_buffer_get(vm, (void *)properties-
>>> ctx_save_restore_area_address,
>>> @@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct
>> kfd_process_device *pdd, struct queue_prope
>>> topo_dev = kfd_topology_device_by_id(pdd->dev->id);
>>> if (!topo_dev)
>>> return -EINVAL;
>>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
>>> node_props.debug_memory_size)
>>> - * NUM_XCC(pdd->dev->xcc_mask);
>>> + total_cwsr_size = (properties->ctx_save_restore_area_size +
>>> + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd-
>>> dev->xcc_mask);
>>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>>>
>>> kfd_queue_buffer_svm_put(pdd, properties-
>>> ctx_save_restore_area_address, total_cwsr_size);
>>> --
>>> 2.34.1
>>>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] drm/amdkfd: relax checks for over allocation of save area
@ 2025-11-08 0:57 Jonathan Kim
2025-11-10 14:50 ` Kim, Jonathan
0 siblings, 1 reply; 8+ messages in thread
From: Jonathan Kim @ 2025-11-08 0:57 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, Felix.Kuehling, Lancelot.six, Philip.Yang,
Jonathan Kim
Over allocation of save area is not fatal, only under allocation is.
ROCm has various components that independently claim authority over save
area size.
Unless KFD decides to claim single authority, relax size checks.
v2: remove warning
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index a65c67cf56ff..f1e7583650c4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -297,16 +297,16 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
goto out_err_unreserve;
}
- if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) {
- pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
+ if (properties->ctx_save_restore_area_size < topo_dev->node_props.cwsr_size) {
+ pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size 0x%x\n",
properties->ctx_save_restore_area_size,
topo_dev->node_props.cwsr_size);
err = -EINVAL;
goto out_err_unreserve;
}
- total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
- * NUM_XCC(pdd->dev->xcc_mask);
+ total_cwsr_size = (properties->ctx_save_restore_area_size +
+ topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address,
@@ -352,8 +352,8 @@ int kfd_queue_release_buffers(struct kfd_process_device *pdd, struct queue_prope
topo_dev = kfd_topology_device_by_id(pdd->dev->id);
if (!topo_dev)
return -EINVAL;
- total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
- * NUM_XCC(pdd->dev->xcc_mask);
+ total_cwsr_size = (properties->ctx_save_restore_area_size +
+ topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
kfd_queue_buffer_svm_put(pdd, properties->ctx_save_restore_area_address, total_cwsr_size);
--
2.34.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* RE: [PATCH] drm/amdkfd: relax checks for over allocation of save area
2025-11-08 0:57 Jonathan Kim
@ 2025-11-10 14:50 ` Kim, Jonathan
2025-11-10 16:49 ` Philip Yang
0 siblings, 1 reply; 8+ messages in thread
From: Kim, Jonathan @ 2025-11-10 14:50 UTC (permalink / raw)
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander, Kuehling, Felix, Six, Lancelot, Yang, Philip,
Alex Deucher
[Public]
Ping
> -----Original Message-----
> From: Kim, Jonathan <Jonathan.Kim@amd.com>
> Sent: Friday, November 7, 2025 7:57 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix
> <Felix.Kuehling@amd.com>; Six, Lancelot <Lancelot.Six@amd.com>; Yang, Philip
> <Philip.Yang@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
> Subject: [PATCH] drm/amdkfd: relax checks for over allocation of save area
>
> Over allocation of save area is not fatal, only under allocation is.
> ROCm has various components that independently claim authority over save
> area size.
>
> Unless KFD decides to claim single authority, relax size checks.
>
> v2: remove warning
>
> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> index a65c67cf56ff..f1e7583650c4 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> @@ -297,16 +297,16 @@ int kfd_queue_acquire_buffers(struct
> kfd_process_device *pdd, struct queue_prope
> goto out_err_unreserve;
> }
>
> - if (properties->ctx_save_restore_area_size != topo_dev-
> >node_props.cwsr_size) {
> - pr_debug("queue cwsr size 0x%x not equal to node cwsr size
> 0x%x\n",
> + if (properties->ctx_save_restore_area_size < topo_dev-
> >node_props.cwsr_size) {
> + pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size
> 0x%x\n",
> properties->ctx_save_restore_area_size,
> topo_dev->node_props.cwsr_size);
> err = -EINVAL;
> goto out_err_unreserve;
> }
>
> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >node_props.debug_memory_size)
> - * NUM_XCC(pdd->dev->xcc_mask);
> + total_cwsr_size = (properties->ctx_save_restore_area_size +
> + topo_dev->node_props.debug_memory_size) *
> NUM_XCC(pdd->dev->xcc_mask);
> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>
> err = kfd_queue_buffer_get(vm, (void *)properties-
> >ctx_save_restore_area_address,
> @@ -352,8 +352,8 @@ int kfd_queue_release_buffers(struct kfd_process_device
> *pdd, struct queue_prope
> topo_dev = kfd_topology_device_by_id(pdd->dev->id);
> if (!topo_dev)
> return -EINVAL;
> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >node_props.debug_memory_size)
> - * NUM_XCC(pdd->dev->xcc_mask);
> + total_cwsr_size = (properties->ctx_save_restore_area_size +
> + topo_dev->node_props.debug_memory_size) *
> NUM_XCC(pdd->dev->xcc_mask);
> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>
> kfd_queue_buffer_svm_put(pdd, properties-
> >ctx_save_restore_area_address, total_cwsr_size);
> --
> 2.34.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
2025-11-10 14:50 ` Kim, Jonathan
@ 2025-11-10 16:49 ` Philip Yang
0 siblings, 0 replies; 8+ messages in thread
From: Philip Yang @ 2025-11-10 16:49 UTC (permalink / raw)
To: Kim, Jonathan, amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander, Kuehling, Felix, Six, Lancelot, Yang, Philip,
Alex Deucher
The size up-limit check is not needed because user buffer validation
will fail if the size is larger than the cwsr_bo or cwsr svm range size
allocated.
Reviewed-by: Philip.Yang<Philip.Yang@amd.com>
On 2025-11-10 09:50, Kim, Jonathan wrote:
> [Public]
>
> Ping
>
>> -----Original Message-----
>> From: Kim, Jonathan <Jonathan.Kim@amd.com>
>> Sent: Friday, November 7, 2025 7:57 PM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix
>> <Felix.Kuehling@amd.com>; Six, Lancelot <Lancelot.Six@amd.com>; Yang, Philip
>> <Philip.Yang@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>
>> Subject: [PATCH] drm/amdkfd: relax checks for over allocation of save area
>>
>> Over allocation of save area is not fatal, only under allocation is.
>> ROCm has various components that independently claim authority over save
>> area size.
>>
>> Unless KFD decides to claim single authority, relax size checks.
>>
>> v2: remove warning
>>
>> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 12 ++++++------
>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> index a65c67cf56ff..f1e7583650c4 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> @@ -297,16 +297,16 @@ int kfd_queue_acquire_buffers(struct
>> kfd_process_device *pdd, struct queue_prope
>> goto out_err_unreserve;
>> }
>>
>> - if (properties->ctx_save_restore_area_size != topo_dev-
>>> node_props.cwsr_size) {
>> - pr_debug("queue cwsr size 0x%x not equal to node cwsr size
>> 0x%x\n",
>> + if (properties->ctx_save_restore_area_size < topo_dev-
>>> node_props.cwsr_size) {
>> + pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size
>> 0x%x\n",
>> properties->ctx_save_restore_area_size,
>> topo_dev->node_props.cwsr_size);
>> err = -EINVAL;
>> goto out_err_unreserve;
>> }
>>
>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
>>> node_props.debug_memory_size)
>> - * NUM_XCC(pdd->dev->xcc_mask);
>> + total_cwsr_size = (properties->ctx_save_restore_area_size +
>> + topo_dev->node_props.debug_memory_size) *
>> NUM_XCC(pdd->dev->xcc_mask);
>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>>
>> err = kfd_queue_buffer_get(vm, (void *)properties-
>>> ctx_save_restore_area_address,
>> @@ -352,8 +352,8 @@ int kfd_queue_release_buffers(struct kfd_process_device
>> *pdd, struct queue_prope
>> topo_dev = kfd_topology_device_by_id(pdd->dev->id);
>> if (!topo_dev)
>> return -EINVAL;
>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
>>> node_props.debug_memory_size)
>> - * NUM_XCC(pdd->dev->xcc_mask);
>> + total_cwsr_size = (properties->ctx_save_restore_area_size +
>> + topo_dev->node_props.debug_memory_size) *
>> NUM_XCC(pdd->dev->xcc_mask);
>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
>>
>> kfd_queue_buffer_svm_put(pdd, properties-
>>> ctx_save_restore_area_address, total_cwsr_size);
>> --
>> 2.34.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
2025-11-07 15:36 ` Lancelot SIX
@ 2025-11-10 19:22 ` Alex Deucher
0 siblings, 0 replies; 8+ messages in thread
From: Alex Deucher @ 2025-11-10 19:22 UTC (permalink / raw)
To: Lancelot SIX
Cc: Kim, Jonathan, amd-gfx@lists.freedesktop.org, Kuehling, Felix,
Yang, Philip
On Fri, Nov 7, 2025 at 10:37 AM Lancelot SIX <Lancelot.Six@amd.com> wrote:
>
>
>
> On 07/11/2025 15:12, Kim, Jonathan wrote:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Alex Deucher <alexdeucher@gmail.com>
> >> Sent: Friday, November 7, 2025 9:36 AM
> >> To: Kim, Jonathan <Jonathan.Kim@amd.com>
> >> Cc: amd-gfx@lists.freedesktop.org; Kuehling, Felix <Felix.Kuehling@amd.com>;
> >> Six, Lancelot <Lancelot.Six@amd.com>; Yang, Philip <Philip.Yang@amd.com>
> >> Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
> >>
> >> Caution: This message originated from an External Source. Use proper caution
> >> when opening attachments, clicking links, or responding.
> >>
> >>
> >> On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <jonathan.kim@amd.com> wrote:
> >>>
> >>> Over allocation of save area is not fatal, only under allocation is.
> >>> ROCm has various components that independently claim authority over save
> >>> area size.
> >>>
> >>> Unless KFD decides to claim single authority, relax size checks with a
> >>> warning on over allocation.
> >>
> >> Do we want any sort of upper limit?
> >
> > Mmm. I'd expect early failure on unreasonable user request for over allocation at the allocation stage prior to acquire on queue create stage and acquire should be bound to what was registered/allocated.
> > So I'm not sure what an upper bound acquire limit check could serve.
> >
>
> One thing to consider is that at the end of the CWSR area, there is
> space dedicated for the debugger (used to implement displaced-stepping
> buffers). Initially, when KFD did not check the CWSR area size, it was
> just a userspace matter to increase that size if the debugger needed
> extensions. With this new change, we can't extend this buffer without a
> KFD update.
>
> Having a "check that the buffer the thunk provided is big enough" policy
> would still allow us to extend this allocation if necessary.
>
Fair enough. We can probably drop the warning as well in that case.
Alex
> Best,
> Lancelot.
>
> >>
> >>>
> >>> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
> >>> ---
> >>> drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
> >>> 1 file changed, 11 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >>> index a65c67cf56ff..448043bc2937 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >>> @@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct
> >> kfd_process_device *pdd, struct queue_prope
> >>> goto out_err_unreserve;
> >>> }
> >>>
> >>> - if (properties->ctx_save_restore_area_size != topo_dev-
> >>> node_props.cwsr_size) {
> >>> - pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
> >>> + if (properties->ctx_save_restore_area_size < topo_dev-
> >>> node_props.cwsr_size) {
> >>> + pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size
> >> 0x%x\n",
> >>> properties->ctx_save_restore_area_size,
> >>> topo_dev->node_props.cwsr_size);
> >>> err = -EINVAL;
> >>> goto out_err_unreserve;
> >>> }
> >>>
> >>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >>> node_props.debug_memory_size)
> >>> - * NUM_XCC(pdd->dev->xcc_mask);
> >>> + if (properties->ctx_save_restore_area_size > topo_dev-
> >>> node_props.cwsr_size)
> >>> + pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended
> >> node cwsr size 0x%x\n",
> >>> + properties->ctx_save_restore_area_size,
> >>> + topo_dev->node_props.cwsr_size);
> >>
> >> We can probably drop the warning here.
> >
> > Acked.
> >
> > Thanks.
> >
> > Jon
> >
> >>
> >> Alex
> >>
> >>> +
> >>> + total_cwsr_size = (properties->ctx_save_restore_area_size +
> >>> + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd-
> >>> dev->xcc_mask);
> >>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
> >>>
> >>> err = kfd_queue_buffer_get(vm, (void *)properties-
> >>> ctx_save_restore_area_address,
> >>> @@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct
> >> kfd_process_device *pdd, struct queue_prope
> >>> topo_dev = kfd_topology_device_by_id(pdd->dev->id);
> >>> if (!topo_dev)
> >>> return -EINVAL;
> >>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >>> node_props.debug_memory_size)
> >>> - * NUM_XCC(pdd->dev->xcc_mask);
> >>> + total_cwsr_size = (properties->ctx_save_restore_area_size +
> >>> + topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd-
> >>> dev->xcc_mask);
> >>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
> >>>
> >>> kfd_queue_buffer_svm_put(pdd, properties-
> >>> ctx_save_restore_area_address, total_cwsr_size);
> >>> --
> >>> 2.34.1
> >>>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-11-10 19:22 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-06 20:27 [PATCH] drm/amdkfd: relax checks for over allocation of save area Jonathan Kim
2025-11-07 14:36 ` Alex Deucher
2025-11-07 15:12 ` Kim, Jonathan
2025-11-07 15:36 ` Lancelot SIX
2025-11-10 19:22 ` Alex Deucher
-- strict thread matches above, loose matches on Subject: below --
2025-11-08 0:57 Jonathan Kim
2025-11-10 14:50 ` Kim, Jonathan
2025-11-10 16:49 ` Philip Yang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox