* [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning
@ 2025-10-15 20:11 Philip Yang
2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang
2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang
0 siblings, 2 replies; 13+ messages in thread
From: Philip Yang @ 2025-10-15 20:11 UTC (permalink / raw)
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang
Only show warning message if process mm is still alive when queue
buffer is freed to evcit the queues.
If kfd_lookup_process_by_mm return NULL, means the process is already
exited and mm is gone, it is fine to free queue buffer.
Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory residency")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 4d4a47313f5b..d1b2f8525f80 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
bool unmap_parent;
uint32_t i;
- if (atomic_read(&prange->queue_refcount)) {
+ p = kfd_lookup_process_by_mm(mm);
+
+ if (p && atomic_read(&prange->queue_refcount)) {
int r;
pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n",
@@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
pr_debug("failed %d to quiesce KFD queues\n", r);
}
- p = kfd_lookup_process_by_mm(mm);
if (!p)
return;
svms = &p->svms;
--
2.49.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released
2025-10-15 20:11 [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Philip Yang
@ 2025-10-15 20:11 ` Philip Yang
2025-10-15 20:40 ` Chen, Xiaogang
2025-10-17 22:43 ` Felix Kuehling
2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang
1 sibling, 2 replies; 13+ messages in thread
From: Philip Yang @ 2025-10-15 20:11 UTC (permalink / raw)
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang, Felix Kuehling
In mmu notifier release callback, stop user queues to be safe because
the SVM memory is going to unmap from CPU.
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 0341f570f3d1..e2a0ae0394b8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn)
static void kfd_process_notifier_release_internal(struct kfd_process *p)
{
- int i;
+ int i, r;
cancel_delayed_work_sync(&p->eviction_work);
cancel_delayed_work_sync(&p->restore_work);
+ WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
+ r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM);
+ if (r)
+ pr_debug("failed %d to quiesce KFD queues\n", r);
+
for (i = 0; i < p->n_pdds; i++) {
struct kfd_process_device *pdd = p->pdds[i];
--
2.49.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released
2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang
@ 2025-10-15 20:40 ` Chen, Xiaogang
2025-10-15 21:33 ` Philip Yang
2025-10-17 22:43 ` Felix Kuehling
1 sibling, 1 reply; 13+ messages in thread
From: Chen, Xiaogang @ 2025-10-15 20:40 UTC (permalink / raw)
To: Philip Yang, amd-gfx; +Cc: Felix.Kuehling
[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]
On 10/15/2025 3:11 PM, Philip Yang wrote:
> In mmu notifier release callback, stop user queues to be safe because
> the SVM memory is going to unmap from CPU.
>
> Suggested-by: Felix Kuehling<felix.kuehling@amd.com>
> Signed-off-by: Philip Yang<Philip.Yang@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 0341f570f3d1..e2a0ae0394b8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn)
>
> static void kfd_process_notifier_release_internal(struct kfd_process *p)
> {
> - int i;
> + int i, r;
>
> cancel_delayed_work_sync(&p->eviction_work);
> cancel_delayed_work_sync(&p->restore_work);
>
> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
Use warning message or debug message? I saw this WARN are used several
places. If the queues from kfd process p are still running when come
here we need to stop them. It is not error. debug message is more
suitable I think.
> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM);
The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as it
is general kfd process release. Maybe need another enum value.
Regards
Xiaogagn
> + if (r)
> + pr_debug("failed %d to quiesce KFD queues\n", r);
> +
> for (i = 0; i < p->n_pdds; i++) {
> struct kfd_process_device *pdd = p->pdds[i];
>
[-- Attachment #2: Type: text/html, Size: 2592 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning
2025-10-15 20:11 [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Philip Yang
2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang
@ 2025-10-15 21:01 ` Chen, Xiaogang
2025-10-15 21:45 ` Philip Yang
1 sibling, 1 reply; 13+ messages in thread
From: Chen, Xiaogang @ 2025-10-15 21:01 UTC (permalink / raw)
To: Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 10/15/2025 3:11 PM, Philip Yang wrote:
> Only show warning message if process mm is still alive when queue
> buffer is freed to evcit the queues.
>
> If kfd_lookup_process_by_mm return NULL, means the process is already
> exited and mm is gone, it is fine to free queue buffer.
But another question is why a prange is still alive, its kfd process is
gone?
When unmap a prange the queues that use it should have been stopped. If
not, there is problem somewhere. This warning message need be sent no
matter kfd process exists or not.
I think a real problem here is kfd process need be alive as long as any
of its resource is still alive. In this case since prange is still alive
its kfd process should not be released(p should not be null). If not we
need wait all pranges from this process got released, then release this
kfd process.
Regards
Xiaogang
>
> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory residency")
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 4d4a47313f5b..d1b2f8525f80 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
> bool unmap_parent;
> uint32_t i;
>
> - if (atomic_read(&prange->queue_refcount)) {
> + p = kfd_lookup_process_by_mm(mm);
> +
> + if (p && atomic_read(&prange->queue_refcount)) {
> int r;
>
> pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n",
> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
> pr_debug("failed %d to quiesce KFD queues\n", r);
> }
>
> - p = kfd_lookup_process_by_mm(mm);
> if (!p)
> return;
> svms = &p->svms;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released
2025-10-15 20:40 ` Chen, Xiaogang
@ 2025-10-15 21:33 ` Philip Yang
2025-10-16 14:46 ` Chen, Xiaogang
0 siblings, 1 reply; 13+ messages in thread
From: Philip Yang @ 2025-10-15 21:33 UTC (permalink / raw)
To: Chen, Xiaogang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 2025-10-15 16:40, Chen, Xiaogang wrote:
>
>
> On 10/15/2025 3:11 PM, Philip Yang wrote:
>> In mmu notifier release callback, stop user queues to be safe because
>> the SVM memory is going to unmap from CPU.
>>
>> Suggested-by: Felix Kuehling<felix.kuehling@amd.com>
>> Signed-off-by: Philip Yang<Philip.Yang@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 0341f570f3d1..e2a0ae0394b8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn)
>>
>> static void kfd_process_notifier_release_internal(struct kfd_process *p)
>> {
>> - int i;
>> + int i, r;
>>
>> cancel_delayed_work_sync(&p->eviction_work);
>> cancel_delayed_work_sync(&p->restore_work);
>>
>> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
>
> Use warning message or debug message? I saw this WARN are used several
> places. If the queues from kfd process p are still running when come
> here we need to stop them. It is not error. debug message is more
> suitable I think.
>
The module parameter debug_evictions can be set to true, use WARN to
dump call back trace to help understand why queue is evicted, by default
debug_evictions is false.
>> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM);
>
> The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as it
> is general kfd process release. Maybe need another enum value.
>
Define new profiling event requires rocprofiler API change,
KFD_QUEUE_EVICTION_TRIGGER_SVM seems the closest event from mmu notifier.
Regards,
Philip
> Regards
>
> Xiaogagn
>
>> + if (r)
>> + pr_debug("failed %d to quiesce KFD queues\n", r);
>> +
>> for (i = 0; i < p->n_pdds; i++) {
>> struct kfd_process_device *pdd = p->pdds[i];
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning
2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang
@ 2025-10-15 21:45 ` Philip Yang
2025-10-15 22:46 ` Chen, Xiaogang
0 siblings, 1 reply; 13+ messages in thread
From: Philip Yang @ 2025-10-15 21:45 UTC (permalink / raw)
To: Chen, Xiaogang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 2025-10-15 17:01, Chen, Xiaogang wrote:
>
> On 10/15/2025 3:11 PM, Philip Yang wrote:
>> Only show warning message if process mm is still alive when queue
>> buffer is freed to evcit the queues.
>>
>> If kfd_lookup_process_by_mm return NULL, means the process is already
>> exited and mm is gone, it is fine to free queue buffer.
>
> But another question is why a prange is still alive, its kfd process
> is gone?
It is application process exited, kfd process structure still exist and
available. The issue is race condition:
do_exit
exit_mmap
a. mmu mm release notifier, schedule kfd release wq to destroy
queue
unmap_vmas
b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...)
the step b is executed to unmap CWSR svm range, before step a kfd
release wq destroy queue.
>
> When unmap a prange the queues that use it should have been stopped.
> If not, there is problem somewhere. This warning message need be sent
> no matter kfd process exists or not.
>
> I think a real problem here is kfd process need be alive as long as
> any of its resource is still alive. In this case since prange is still
> alive its kfd process should not be released(p should not be null). If
> not we need wait all pranges from this process got released, then
> release this kfd process.
kfd process structure is freed in kfd_process_wq_release after
svm_range_list_fini.
Regards,
Philip
>
> Regards
>
> Xiaogang
>
>>
>> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory
>> residency")
>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index 4d4a47313f5b..d1b2f8525f80 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm,
>> struct svm_range *prange,
>> bool unmap_parent;
>> uint32_t i;
>> - if (atomic_read(&prange->queue_refcount)) {
>> + p = kfd_lookup_process_by_mm(mm);
>> +
>> + if (p && atomic_read(&prange->queue_refcount)) {
>> int r;
>> pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n",
>> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm,
>> struct svm_range *prange,
>> pr_debug("failed %d to quiesce KFD queues\n", r);
>> }
>> - p = kfd_lookup_process_by_mm(mm);
>> if (!p)
>> return;
>> svms = &p->svms;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning
2025-10-15 21:45 ` Philip Yang
@ 2025-10-15 22:46 ` Chen, Xiaogang
2025-10-16 18:01 ` Philip Yang
2025-10-17 22:39 ` Felix Kuehling
0 siblings, 2 replies; 13+ messages in thread
From: Chen, Xiaogang @ 2025-10-15 22:46 UTC (permalink / raw)
To: Philip Yang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 10/15/2025 4:45 PM, Philip Yang wrote:
>
> On 2025-10-15 17:01, Chen, Xiaogang wrote:
>>
>> On 10/15/2025 3:11 PM, Philip Yang wrote:
>>> Only show warning message if process mm is still alive when queue
>>> buffer is freed to evcit the queues.
>>>
>>> If kfd_lookup_process_by_mm return NULL, means the process is already
>>> exited and mm is gone, it is fine to free queue buffer.
>>
>> But another question is why a prange is still alive, its kfd process
>> is gone?
> It is application process exited, kfd process structure still exist
> and available. The issue is race condition:
>
> do_exit
> exit_mmap
> a. mmu mm release notifier, schedule kfd release wq to
> destroy queue
> unmap_vmas
> b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...)
>
> the step b is executed to unmap CWSR svm range, before step a kfd
> release wq destroy queue.
>
>
>>
>> When unmap a prange the queues that use it should have been stopped.
>> If not, there is problem somewhere. This warning message need be sent
>> no matter kfd process exists or not.
>>
>> I think a real problem here is kfd process need be alive as long as
>> any of its resource is still alive. In this case since prange is
>> still alive its kfd process should not be released(p should not be
>> null). If not we need wait all pranges from this process got
>> released, then release this kfd process.
>
> kfd process structure is freed in kfd_process_wq_release after
> svm_range_list_fini.
I wanted to say: delay remove kfd process p from kfd_processes_table
until all resources of p got released. So when any p's resources is
getting released p is available. That needs change kfd process release
logic.
Regards
Xiaogang
>
> Regards,
>
> Philip
>
>>
>> Regards
>>
>> Xiaogang
>>
>>>
>>> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory
>>> residency")
>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 4d4a47313f5b..d1b2f8525f80 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm,
>>> struct svm_range *prange,
>>> bool unmap_parent;
>>> uint32_t i;
>>> - if (atomic_read(&prange->queue_refcount)) {
>>> + p = kfd_lookup_process_by_mm(mm);
>>> +
>>> + if (p && atomic_read(&prange->queue_refcount)) {
>>> int r;
>>> pr_warn("Freeing queue vital buffer 0x%lx, queue
>>> evicted\n",
>>> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm,
>>> struct svm_range *prange,
>>> pr_debug("failed %d to quiesce KFD queues\n", r);
>>> }
>>> - p = kfd_lookup_process_by_mm(mm);
>>> if (!p)
>>> return;
>>> svms = &p->svms;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released
2025-10-15 21:33 ` Philip Yang
@ 2025-10-16 14:46 ` Chen, Xiaogang
2025-10-17 22:34 ` Felix Kuehling
0 siblings, 1 reply; 13+ messages in thread
From: Chen, Xiaogang @ 2025-10-16 14:46 UTC (permalink / raw)
To: Philip Yang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 10/15/2025 4:33 PM, Philip Yang wrote:
>
> On 2025-10-15 16:40, Chen, Xiaogang wrote:
>>
>>
>> On 10/15/2025 3:11 PM, Philip Yang wrote:
>>> In mmu notifier release callback, stop user queues to be safe because
>>> the SVM memory is going to unmap from CPU.
>>>
>>> Suggested-by: Felix Kuehling<felix.kuehling@amd.com>
>>> Signed-off-by: Philip Yang<Philip.Yang@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++-
>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 0341f570f3d1..e2a0ae0394b8 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct
>>> mmu_notifier *mn)
>>> static void kfd_process_notifier_release_internal(struct
>>> kfd_process *p)
>>> {
>>> - int i;
>>> + int i, r;
>>> cancel_delayed_work_sync(&p->eviction_work);
>>> cancel_delayed_work_sync(&p->restore_work);
>>> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
>>
>> Use warning message or debug message? I saw this WARN are used
>> several places. If the queues from kfd process p are still running
>> when come here we need to stop them. It is not error. debug message
>> is more suitable I think.
>>
> The module parameter debug_evictions can be set to true, use WARN to
> dump call back trace to help understand why queue is evicted, by
> default debug_evictions is false.
I agree stopping kfd process's queues during kfd process release. Just
wonder if change WARN to debug message form. We can use dump_stack() to
dump stack anyway, but it is not relevant to this patch.
>>> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM);
>>
>> The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as
>> it is general kfd process release. Maybe need another enum value.
>>
> Define new profiling event requires rocprofiler API change,
> KFD_QUEUE_EVICTION_TRIGGER_SVM seems the closest event from mmu notifier.
That is awkward. We may add a emu value at end that rocprofile would not
know for now.
Regards
Xiaogang
>
> Regards,
>
> Philip
>
>> Regards
>>
>> Xiaogagn
>>
>>> + if (r)
>>> + pr_debug("failed %d to quiesce KFD queues\n", r);
>>> +
>>> for (i = 0; i < p->n_pdds; i++) {
>>> struct kfd_process_device *pdd = p->pdds[i];
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning
2025-10-15 22:46 ` Chen, Xiaogang
@ 2025-10-16 18:01 ` Philip Yang
2025-10-17 22:39 ` Felix Kuehling
1 sibling, 0 replies; 13+ messages in thread
From: Philip Yang @ 2025-10-16 18:01 UTC (permalink / raw)
To: Chen, Xiaogang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling
On 2025-10-15 18:46, Chen, Xiaogang wrote:
>
> On 10/15/2025 4:45 PM, Philip Yang wrote:
>>
>> On 2025-10-15 17:01, Chen, Xiaogang wrote:
>>>
>>> On 10/15/2025 3:11 PM, Philip Yang wrote:
>>>> Only show warning message if process mm is still alive when queue
>>>> buffer is freed to evcit the queues.
>>>>
>>>> If kfd_lookup_process_by_mm return NULL, means the process is already
>>>> exited and mm is gone, it is fine to free queue buffer.
>>>
>>> But another question is why a prange is still alive, its kfd process
>>> is gone?
>> It is application process exited, kfd process structure still exist
>> and available. The issue is race condition:
>>
>> do_exit
>> exit_mmap
>> a. mmu mm release notifier, schedule kfd release wq to
>> destroy queue
>> unmap_vmas
>> b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...)
>>
>> the step b is executed to unmap CWSR svm range, before step a kfd
>> release wq destroy queue.
>>
>>
>>>
>>> When unmap a prange the queues that use it should have been stopped.
>>> If not, there is problem somewhere. This warning message need be
>>> sent no matter kfd process exists or not.
>>>
>>> I think a real problem here is kfd process need be alive as long as
>>> any of its resource is still alive. In this case since prange is
>>> still alive its kfd process should not be released(p should not be
>>> null). If not we need wait all pranges from this process got
>>> released, then release this kfd process.
>>
>> kfd process structure is freed in kfd_process_wq_release after
>> svm_range_list_fini.
>
> I wanted to say: delay remove kfd process p from kfd_processes_table
> until all resources of p got released. So when any p's resources is
> getting released p is available. That needs change kfd process release
> logic.
prange->queue_refcount will be 0 after queue is destroyed (not evicted),
we should warn user space and evict queues if prange is freed with
prange->queue_refcount not zero. This patch is to fix the race that
generate false warning after process exited to free prange. I don't
think that keep kfd process in kfd_processes_table after mmu release
notifier will solve this race issue.
Regards,
Philip
>
>
> Regards
>
> Xiaogang
>
>
>
>
>>
>> Regards,
>>
>> Philip
>>
>>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>>>
>>>> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory
>>>> residency")
>>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> index 4d4a47313f5b..d1b2f8525f80 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct
>>>> *mm, struct svm_range *prange,
>>>> bool unmap_parent;
>>>> uint32_t i;
>>>> - if (atomic_read(&prange->queue_refcount)) {
>>>> + p = kfd_lookup_process_by_mm(mm);
>>>> +
>>>> + if (p && atomic_read(&prange->queue_refcount)) {
>>>> int r;
>>>> pr_warn("Freeing queue vital buffer 0x%lx, queue
>>>> evicted\n",
>>>> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct
>>>> *mm, struct svm_range *prange,
>>>> pr_debug("failed %d to quiesce KFD queues\n", r);
>>>> }
>>>> - p = kfd_lookup_process_by_mm(mm);
>>>> if (!p)
>>>> return;
>>>> svms = &p->svms;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released
2025-10-16 14:46 ` Chen, Xiaogang
@ 2025-10-17 22:34 ` Felix Kuehling
0 siblings, 0 replies; 13+ messages in thread
From: Felix Kuehling @ 2025-10-17 22:34 UTC (permalink / raw)
To: Chen, Xiaogang, Philip Yang, Philip Yang, amd-gfx
On 2025-10-16 10:46, Chen, Xiaogang wrote:
>
> On 10/15/2025 4:33 PM, Philip Yang wrote:
>>
>> On 2025-10-15 16:40, Chen, Xiaogang wrote:
>>>
>>>
>>> On 10/15/2025 3:11 PM, Philip Yang wrote:
>>>> In mmu notifier release callback, stop user queues to be safe because
>>>> the SVM memory is going to unmap from CPU.
>>>>
>>>> Suggested-by: Felix Kuehling<felix.kuehling@amd.com>
>>>> Signed-off-by: Philip Yang<Philip.Yang@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> index 0341f570f3d1..e2a0ae0394b8 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> @@ -1221,11 +1221,16 @@ static void
>>>> kfd_process_free_notifier(struct mmu_notifier *mn)
>>>> static void kfd_process_notifier_release_internal(struct
>>>> kfd_process *p)
>>>> {
>>>> - int i;
>>>> + int i, r;
>>>> cancel_delayed_work_sync(&p->eviction_work);
>>>> cancel_delayed_work_sync(&p->restore_work);
>>>> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
>>>
>>> Use warning message or debug message? I saw this WARN are used
>>> several places. If the queues from kfd process p are still running
>>> when come here we need to stop them. It is not error. debug message
>>> is more suitable I think.
>>>
>> The module parameter debug_evictions can be set to true, use WARN to
>> dump call back trace to help understand why queue is evicted, by
>> default debug_evictions is false.
> I agree stopping kfd process's queues during kfd process release.
> Just wonder if change WARN to debug message form. We can use
> dump_stack() to dump stack anyway, but it is not relevant to this patch.
>
>>>> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM);
>>>
>>> The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as
>>> it is general kfd process release. Maybe need another enum value.
>>>
>> Define new profiling event requires rocprofiler API change,
>> KFD_QUEUE_EVICTION_TRIGGER_SVM seems the closest event from mmu
>> notifier.
>
> That is awkward. We may add a emu value at end that rocprofile would
> not know for now.
roc-profile collects events in the running process context. This only
happens on process termination. roc-profile will never see these events.
So it really doesn't matter.
Regards,
Felix
>
> Regards
>
> Xiaogang
>
>
>>
>> Regards,
>>
>> Philip
>>
>>> Regards
>>>
>>> Xiaogagn
>>>
>>>> + if (r)
>>>> + pr_debug("failed %d to quiesce KFD queues\n", r);
>>>> +
>>>> for (i = 0; i < p->n_pdds; i++) {
>>>> struct kfd_process_device *pdd = p->pdds[i];
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning
2025-10-15 22:46 ` Chen, Xiaogang
2025-10-16 18:01 ` Philip Yang
@ 2025-10-17 22:39 ` Felix Kuehling
1 sibling, 0 replies; 13+ messages in thread
From: Felix Kuehling @ 2025-10-17 22:39 UTC (permalink / raw)
To: Chen, Xiaogang, Philip Yang, Philip Yang, amd-gfx
On 2025-10-15 18:46, Chen, Xiaogang wrote:
>
> On 10/15/2025 4:45 PM, Philip Yang wrote:
>>
>> On 2025-10-15 17:01, Chen, Xiaogang wrote:
>>>
>>> On 10/15/2025 3:11 PM, Philip Yang wrote:
>>>> Only show warning message if process mm is still alive when queue
>>>> buffer is freed to evcit the queues.
>>>>
>>>> If kfd_lookup_process_by_mm return NULL, means the process is already
>>>> exited and mm is gone, it is fine to free queue buffer.
>>>
>>> But another question is why a prange is still alive, its kfd process
>>> is gone?
>> It is application process exited, kfd process structure still exist
>> and available. The issue is race condition:
>>
>> do_exit
>> exit_mmap
>> a. mmu mm release notifier, schedule kfd release wq to
>> destroy queue
>> unmap_vmas
>> b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...)
>>
>> the step b is executed to unmap CWSR svm range, before step a kfd
>> release wq destroy queue.
>>
>>
>>>
>>> When unmap a prange the queues that use it should have been stopped.
>>> If not, there is problem somewhere. This warning message need be
>>> sent no matter kfd process exists or not.
>>>
>>> I think a real problem here is kfd process need be alive as long as
>>> any of its resource is still alive. In this case since prange is
>>> still alive its kfd process should not be released(p should not be
>>> null). If not we need wait all pranges from this process got
>>> released, then release this kfd process.
>>
>> kfd process structure is freed in kfd_process_wq_release after
>> svm_range_list_fini.
>
> I wanted to say: delay remove kfd process p from kfd_processes_table
> until all resources of p got released. So when any p's resources is
> getting released p is available. That needs change kfd process release
> logic.
That would complicate the cleanup a lot, because now other threads can
still look up the kfd_process and use or modify it concurrently while
the cleanup is happening. We remove the process from the
kfd_processes_table first to ensure that it is safe to clean up all the
process resources.
Regards,
Felix
>
>
> Regards
>
> Xiaogang
>
>
>
>
>>
>> Regards,
>>
>> Philip
>>
>>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>>>
>>>> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory
>>>> residency")
>>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> index 4d4a47313f5b..d1b2f8525f80 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct
>>>> *mm, struct svm_range *prange,
>>>> bool unmap_parent;
>>>> uint32_t i;
>>>> - if (atomic_read(&prange->queue_refcount)) {
>>>> + p = kfd_lookup_process_by_mm(mm);
>>>> +
>>>> + if (p && atomic_read(&prange->queue_refcount)) {
>>>> int r;
>>>> pr_warn("Freeing queue vital buffer 0x%lx, queue
>>>> evicted\n",
>>>> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct
>>>> *mm, struct svm_range *prange,
>>>> pr_debug("failed %d to quiesce KFD queues\n", r);
>>>> }
>>>> - p = kfd_lookup_process_by_mm(mm);
>>>> if (!p)
>>>> return;
>>>> svms = &p->svms;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released
2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang
2025-10-15 20:40 ` Chen, Xiaogang
@ 2025-10-17 22:43 ` Felix Kuehling
2025-10-20 14:26 ` Philip Yang
1 sibling, 1 reply; 13+ messages in thread
From: Felix Kuehling @ 2025-10-17 22:43 UTC (permalink / raw)
To: Philip Yang, amd-gfx
On 2025-10-15 16:11, Philip Yang wrote:
> In mmu notifier release callback, stop user queues to be safe because
> the SVM memory is going to unmap from CPU.
>
> Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 0341f570f3d1..e2a0ae0394b8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn)
>
> static void kfd_process_notifier_release_internal(struct kfd_process *p)
> {
> - int i;
> + int i, r;
>
> cancel_delayed_work_sync(&p->eviction_work);
> cancel_delayed_work_sync(&p->restore_work);
>
> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM);
Is there a reason why we can't just call
kfd_process_dequeue_from_all_devices here, and remove that call from
kfd_process_wq_release? We don't need to call this an eviction. The
queues get removed on process termination anyway. We're just doing it a
bit earlier now.
Regards,
Felix
> + if (r)
> + pr_debug("failed %d to quiesce KFD queues\n", r);
> +
> for (i = 0; i < p->n_pdds; i++) {
> struct kfd_process_device *pdd = p->pdds[i];
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released
2025-10-17 22:43 ` Felix Kuehling
@ 2025-10-20 14:26 ` Philip Yang
0 siblings, 0 replies; 13+ messages in thread
From: Philip Yang @ 2025-10-20 14:26 UTC (permalink / raw)
To: Felix Kuehling, Philip Yang, amd-gfx
On 2025-10-17 18:43, Felix Kuehling wrote:
>
> On 2025-10-15 16:11, Philip Yang wrote:
>> In mmu notifier release callback, stop user queues to be safe because
>> the SVM memory is going to unmap from CPU.
>>
>> Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 0341f570f3d1..e2a0ae0394b8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct
>> mmu_notifier *mn)
>> static void kfd_process_notifier_release_internal(struct
>> kfd_process *p)
>> {
>> - int i;
>> + int i, r;
>> cancel_delayed_work_sync(&p->eviction_work);
>> cancel_delayed_work_sync(&p->restore_work);
>> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
>> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM);
>
> Is there a reason why we can't just call
> kfd_process_dequeue_from_all_devices here, and remove that call from
> kfd_process_wq_release? We don't need to call this an eviction. The
> queues get removed on process termination anyway. We're just doing it
> a bit earlier now.
MMU release notifier callback don't hold mmap lock, it is safe to call
kfd_process_dequeue_from_all_devices here, will send new version for review.
Regards,
Philip
>
> Regards,
> Felix
>
>
>> + if (r)
>> + pr_debug("failed %d to quiesce KFD queues\n", r);
>> +
>> for (i = 0; i < p->n_pdds; i++) {
>> struct kfd_process_device *pdd = p->pdds[i];
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-10-20 14:26 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-15 20:11 [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Philip Yang
2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang
2025-10-15 20:40 ` Chen, Xiaogang
2025-10-15 21:33 ` Philip Yang
2025-10-16 14:46 ` Chen, Xiaogang
2025-10-17 22:34 ` Felix Kuehling
2025-10-17 22:43 ` Felix Kuehling
2025-10-20 14:26 ` Philip Yang
2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang
2025-10-15 21:45 ` Philip Yang
2025-10-15 22:46 ` Chen, Xiaogang
2025-10-16 18:01 ` Philip Yang
2025-10-17 22:39 ` Felix Kuehling
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox