* [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning
@ 2025-10-15 20:11 Philip Yang
2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang
2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang
0 siblings, 2 replies; 13+ messages in thread
From: Philip Yang @ 2025-10-15 20:11 UTC (permalink / raw)
To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang
Only show warning message if process mm is still alive when queue
buffer is freed to evcit the queues.
If kfd_lookup_process_by_mm return NULL, means the process is already
exited and mm is gone, it is fine to free queue buffer.
Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory residency")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 4d4a47313f5b..d1b2f8525f80 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
bool unmap_parent;
uint32_t i;
- if (atomic_read(&prange->queue_refcount)) {
+ p = kfd_lookup_process_by_mm(mm);
+
+ if (p && atomic_read(&prange->queue_refcount)) {
int r;
pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n",
@@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
pr_debug("failed %d to quiesce KFD queues\n", r);
}
- p = kfd_lookup_process_by_mm(mm);
if (!p)
return;
svms = &p->svms;
--
2.49.0
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released 2025-10-15 20:11 [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Philip Yang @ 2025-10-15 20:11 ` Philip Yang 2025-10-15 20:40 ` Chen, Xiaogang 2025-10-17 22:43 ` Felix Kuehling 2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang 1 sibling, 2 replies; 13+ messages in thread From: Philip Yang @ 2025-10-15 20:11 UTC (permalink / raw) To: amd-gfx; +Cc: Felix.Kuehling, Philip Yang, Felix Kuehling In mmu notifier release callback, stop user queues to be safe because the SVM memory is going to unmap from CPU. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 0341f570f3d1..e2a0ae0394b8 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn) static void kfd_process_notifier_release_internal(struct kfd_process *p) { - int i; + int i, r; cancel_delayed_work_sync(&p->eviction_work); cancel_delayed_work_sync(&p->restore_work); + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid); + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM); + if (r) + pr_debug("failed %d to quiesce KFD queues\n", r); + for (i = 0; i < p->n_pdds; i++) { struct kfd_process_device *pdd = p->pdds[i]; -- 2.49.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released 2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang @ 2025-10-15 20:40 ` Chen, Xiaogang 2025-10-15 21:33 ` Philip Yang 2025-10-17 22:43 ` Felix Kuehling 1 sibling, 1 reply; 13+ messages in thread From: Chen, Xiaogang @ 2025-10-15 20:40 UTC (permalink / raw) To: Philip Yang, amd-gfx; +Cc: Felix.Kuehling [-- Attachment #1: Type: text/plain, Size: 1618 bytes --] On 10/15/2025 3:11 PM, Philip Yang wrote: > In mmu notifier release callback, stop user queues to be safe because > the SVM memory is going to unmap from CPU. > > Suggested-by: Felix Kuehling<felix.kuehling@amd.com> > Signed-off-by: Philip Yang<Philip.Yang@amd.com> > --- > drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > index 0341f570f3d1..e2a0ae0394b8 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn) > > static void kfd_process_notifier_release_internal(struct kfd_process *p) > { > - int i; > + int i, r; > > cancel_delayed_work_sync(&p->eviction_work); > cancel_delayed_work_sync(&p->restore_work); > > + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid); Use warning message or debug message? I saw this WARN are used several places. If the queues from kfd process p are still running when come here we need to stop them. It is not error. debug message is more suitable I think. > + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM); The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as it is general kfd process release. Maybe need another enum value. Regards Xiaogagn > + if (r) > + pr_debug("failed %d to quiesce KFD queues\n", r); > + > for (i = 0; i < p->n_pdds; i++) { > struct kfd_process_device *pdd = p->pdds[i]; > [-- Attachment #2: Type: text/html, Size: 2592 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released 2025-10-15 20:40 ` Chen, Xiaogang @ 2025-10-15 21:33 ` Philip Yang 2025-10-16 14:46 ` Chen, Xiaogang 0 siblings, 1 reply; 13+ messages in thread From: Philip Yang @ 2025-10-15 21:33 UTC (permalink / raw) To: Chen, Xiaogang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling On 2025-10-15 16:40, Chen, Xiaogang wrote: > > > On 10/15/2025 3:11 PM, Philip Yang wrote: >> In mmu notifier release callback, stop user queues to be safe because >> the SVM memory is going to unmap from CPU. >> >> Suggested-by: Felix Kuehling<felix.kuehling@amd.com> >> Signed-off-by: Philip Yang<Philip.Yang@amd.com> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> index 0341f570f3d1..e2a0ae0394b8 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn) >> >> static void kfd_process_notifier_release_internal(struct kfd_process *p) >> { >> - int i; >> + int i, r; >> >> cancel_delayed_work_sync(&p->eviction_work); >> cancel_delayed_work_sync(&p->restore_work); >> >> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid); > > Use warning message or debug message? I saw this WARN are used several > places. If the queues from kfd process p are still running when come > here we need to stop them. It is not error. debug message is more > suitable I think. > The module parameter debug_evictions can be set to true, use WARN to dump call back trace to help understand why queue is evicted, by default debug_evictions is false. >> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM); > > The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as it > is general kfd process release. Maybe need another enum value. > Define new profiling event requires rocprofiler API change, KFD_QUEUE_EVICTION_TRIGGER_SVM seems the closest event from mmu notifier. Regards, Philip > Regards > > Xiaogagn > >> + if (r) >> + pr_debug("failed %d to quiesce KFD queues\n", r); >> + >> for (i = 0; i < p->n_pdds; i++) { >> struct kfd_process_device *pdd = p->pdds[i]; >> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released 2025-10-15 21:33 ` Philip Yang @ 2025-10-16 14:46 ` Chen, Xiaogang 2025-10-17 22:34 ` Felix Kuehling 0 siblings, 1 reply; 13+ messages in thread From: Chen, Xiaogang @ 2025-10-16 14:46 UTC (permalink / raw) To: Philip Yang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling On 10/15/2025 4:33 PM, Philip Yang wrote: > > On 2025-10-15 16:40, Chen, Xiaogang wrote: >> >> >> On 10/15/2025 3:11 PM, Philip Yang wrote: >>> In mmu notifier release callback, stop user queues to be safe because >>> the SVM memory is going to unmap from CPU. >>> >>> Suggested-by: Felix Kuehling<felix.kuehling@amd.com> >>> Signed-off-by: Philip Yang<Philip.Yang@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++- >>> 1 file changed, 6 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>> index 0341f570f3d1..e2a0ae0394b8 100644 >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct >>> mmu_notifier *mn) >>> static void kfd_process_notifier_release_internal(struct >>> kfd_process *p) >>> { >>> - int i; >>> + int i, r; >>> cancel_delayed_work_sync(&p->eviction_work); >>> cancel_delayed_work_sync(&p->restore_work); >>> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid); >> >> Use warning message or debug message? I saw this WARN are used >> several places. If the queues from kfd process p are still running >> when come here we need to stop them. It is not error. debug message >> is more suitable I think. >> > The module parameter debug_evictions can be set to true, use WARN to > dump call back trace to help understand why queue is evicted, by > default debug_evictions is false. I agree stopping kfd process's queues during kfd process release. Just wonder if change WARN to debug message form. We can use dump_stack() to dump stack anyway, but it is not relevant to this patch. >>> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM); >> >> The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as >> it is general kfd process release. Maybe need another enum value. >> > Define new profiling event requires rocprofiler API change, > KFD_QUEUE_EVICTION_TRIGGER_SVM seems the closest event from mmu notifier. That is awkward. We may add a emu value at end that rocprofile would not know for now. Regards Xiaogang > > Regards, > > Philip > >> Regards >> >> Xiaogagn >> >>> + if (r) >>> + pr_debug("failed %d to quiesce KFD queues\n", r); >>> + >>> for (i = 0; i < p->n_pdds; i++) { >>> struct kfd_process_device *pdd = p->pdds[i]; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released 2025-10-16 14:46 ` Chen, Xiaogang @ 2025-10-17 22:34 ` Felix Kuehling 0 siblings, 0 replies; 13+ messages in thread From: Felix Kuehling @ 2025-10-17 22:34 UTC (permalink / raw) To: Chen, Xiaogang, Philip Yang, Philip Yang, amd-gfx On 2025-10-16 10:46, Chen, Xiaogang wrote: > > On 10/15/2025 4:33 PM, Philip Yang wrote: >> >> On 2025-10-15 16:40, Chen, Xiaogang wrote: >>> >>> >>> On 10/15/2025 3:11 PM, Philip Yang wrote: >>>> In mmu notifier release callback, stop user queues to be safe because >>>> the SVM memory is going to unmap from CPU. >>>> >>>> Suggested-by: Felix Kuehling<felix.kuehling@amd.com> >>>> Signed-off-by: Philip Yang<Philip.Yang@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>>> index 0341f570f3d1..e2a0ae0394b8 100644 >>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >>>> @@ -1221,11 +1221,16 @@ static void >>>> kfd_process_free_notifier(struct mmu_notifier *mn) >>>> static void kfd_process_notifier_release_internal(struct >>>> kfd_process *p) >>>> { >>>> - int i; >>>> + int i, r; >>>> cancel_delayed_work_sync(&p->eviction_work); >>>> cancel_delayed_work_sync(&p->restore_work); >>>> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid); >>> >>> Use warning message or debug message? I saw this WARN are used >>> several places. If the queues from kfd process p are still running >>> when come here we need to stop them. It is not error. debug message >>> is more suitable I think. >>> >> The module parameter debug_evictions can be set to true, use WARN to >> dump call back trace to help understand why queue is evicted, by >> default debug_evictions is false. > I agree stopping kfd process's queues during kfd process release. > Just wonder if change WARN to debug message form. We can use > dump_stack() to dump stack anyway, but it is not relevant to this patch. > >>>> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM); >>> >>> The evict reason KFD_QUEUE_EVICTION_TRIGGER_SVM is not good here as >>> it is general kfd process release. Maybe need another enum value. >>> >> Define new profiling event requires rocprofiler API change, >> KFD_QUEUE_EVICTION_TRIGGER_SVM seems the closest event from mmu >> notifier. > > That is awkward. We may add a emu value at end that rocprofile would > not know for now. roc-profile collects events in the running process context. This only happens on process termination. roc-profile will never see these events. So it really doesn't matter. Regards, Felix > > Regards > > Xiaogang > > >> >> Regards, >> >> Philip >> >>> Regards >>> >>> Xiaogagn >>> >>>> + if (r) >>>> + pr_debug("failed %d to quiesce KFD queues\n", r); >>>> + >>>> for (i = 0; i < p->n_pdds; i++) { >>>> struct kfd_process_device *pdd = p->pdds[i]; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released 2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang 2025-10-15 20:40 ` Chen, Xiaogang @ 2025-10-17 22:43 ` Felix Kuehling 2025-10-20 14:26 ` Philip Yang 1 sibling, 1 reply; 13+ messages in thread From: Felix Kuehling @ 2025-10-17 22:43 UTC (permalink / raw) To: Philip Yang, amd-gfx On 2025-10-15 16:11, Philip Yang wrote: > In mmu notifier release callback, stop user queues to be safe because > the SVM memory is going to unmap from CPU. > > Suggested-by: Felix Kuehling <felix.kuehling@amd.com> > Signed-off-by: Philip Yang <Philip.Yang@amd.com> > --- > drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > index 0341f570f3d1..e2a0ae0394b8 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct mmu_notifier *mn) > > static void kfd_process_notifier_release_internal(struct kfd_process *p) > { > - int i; > + int i, r; > > cancel_delayed_work_sync(&p->eviction_work); > cancel_delayed_work_sync(&p->restore_work); > > + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid); > + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM); Is there a reason why we can't just call kfd_process_dequeue_from_all_devices here, and remove that call from kfd_process_wq_release? We don't need to call this an eviction. The queues get removed on process termination anyway. We're just doing it a bit earlier now. Regards, Felix > + if (r) > + pr_debug("failed %d to quiesce KFD queues\n", r); > + > for (i = 0; i < p->n_pdds; i++) { > struct kfd_process_device *pdd = p->pdds[i]; > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released 2025-10-17 22:43 ` Felix Kuehling @ 2025-10-20 14:26 ` Philip Yang 0 siblings, 0 replies; 13+ messages in thread From: Philip Yang @ 2025-10-20 14:26 UTC (permalink / raw) To: Felix Kuehling, Philip Yang, amd-gfx On 2025-10-17 18:43, Felix Kuehling wrote: > > On 2025-10-15 16:11, Philip Yang wrote: >> In mmu notifier release callback, stop user queues to be safe because >> the SVM memory is going to unmap from CPU. >> >> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> >> Signed-off-by: Philip Yang <Philip.Yang@amd.com> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> index 0341f570f3d1..e2a0ae0394b8 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> @@ -1221,11 +1221,16 @@ static void kfd_process_free_notifier(struct >> mmu_notifier *mn) >> static void kfd_process_notifier_release_internal(struct >> kfd_process *p) >> { >> - int i; >> + int i, r; >> cancel_delayed_work_sync(&p->eviction_work); >> cancel_delayed_work_sync(&p->restore_work); >> + WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid); >> + r = kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SVM); > > Is there a reason why we can't just call > kfd_process_dequeue_from_all_devices here, and remove that call from > kfd_process_wq_release? We don't need to call this an eviction. The > queues get removed on process termination anyway. We're just doing it > a bit earlier now. MMU release notifier callback don't hold mmap lock, it is safe to call kfd_process_dequeue_from_all_devices here, will send new version for review. Regards, Philip > > Regards, > Felix > > >> + if (r) >> + pr_debug("failed %d to quiesce KFD queues\n", r); >> + >> for (i = 0; i < p->n_pdds; i++) { >> struct kfd_process_device *pdd = p->pdds[i]; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning 2025-10-15 20:11 [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Philip Yang 2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang @ 2025-10-15 21:01 ` Chen, Xiaogang 2025-10-15 21:45 ` Philip Yang 1 sibling, 1 reply; 13+ messages in thread From: Chen, Xiaogang @ 2025-10-15 21:01 UTC (permalink / raw) To: Philip Yang, amd-gfx; +Cc: Felix.Kuehling On 10/15/2025 3:11 PM, Philip Yang wrote: > Only show warning message if process mm is still alive when queue > buffer is freed to evcit the queues. > > If kfd_lookup_process_by_mm return NULL, means the process is already > exited and mm is gone, it is fine to free queue buffer. But another question is why a prange is still alive, its kfd process is gone? When unmap a prange the queues that use it should have been stopped. If not, there is problem somewhere. This warning message need be sent no matter kfd process exists or not. I think a real problem here is kfd process need be alive as long as any of its resource is still alive. In this case since prange is still alive its kfd process should not be released(p should not be null). If not we need wait all pranges from this process got released, then release this kfd process. Regards Xiaogang > > Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory residency") > Signed-off-by: Philip Yang <Philip.Yang@amd.com> > --- > drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > index 4d4a47313f5b..d1b2f8525f80 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange, > bool unmap_parent; > uint32_t i; > > - if (atomic_read(&prange->queue_refcount)) { > + p = kfd_lookup_process_by_mm(mm); > + > + if (p && atomic_read(&prange->queue_refcount)) { > int r; > > pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n", > @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange, > pr_debug("failed %d to quiesce KFD queues\n", r); > } > > - p = kfd_lookup_process_by_mm(mm); > if (!p) > return; > svms = &p->svms; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning 2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang @ 2025-10-15 21:45 ` Philip Yang 2025-10-15 22:46 ` Chen, Xiaogang 0 siblings, 1 reply; 13+ messages in thread From: Philip Yang @ 2025-10-15 21:45 UTC (permalink / raw) To: Chen, Xiaogang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling On 2025-10-15 17:01, Chen, Xiaogang wrote: > > On 10/15/2025 3:11 PM, Philip Yang wrote: >> Only show warning message if process mm is still alive when queue >> buffer is freed to evcit the queues. >> >> If kfd_lookup_process_by_mm return NULL, means the process is already >> exited and mm is gone, it is fine to free queue buffer. > > But another question is why a prange is still alive, its kfd process > is gone? It is application process exited, kfd process structure still exist and available. The issue is race condition: do_exit exit_mmap a. mmu mm release notifier, schedule kfd release wq to destroy queue unmap_vmas b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...) the step b is executed to unmap CWSR svm range, before step a kfd release wq destroy queue. > > When unmap a prange the queues that use it should have been stopped. > If not, there is problem somewhere. This warning message need be sent > no matter kfd process exists or not. > > I think a real problem here is kfd process need be alive as long as > any of its resource is still alive. In this case since prange is still > alive its kfd process should not be released(p should not be null). If > not we need wait all pranges from this process got released, then > release this kfd process. kfd process structure is freed in kfd_process_wq_release after svm_range_list_fini. Regards, Philip > > Regards > > Xiaogang > >> >> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory >> residency") >> Signed-off-by: Philip Yang <Philip.Yang@amd.com> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> index 4d4a47313f5b..d1b2f8525f80 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, >> struct svm_range *prange, >> bool unmap_parent; >> uint32_t i; >> - if (atomic_read(&prange->queue_refcount)) { >> + p = kfd_lookup_process_by_mm(mm); >> + >> + if (p && atomic_read(&prange->queue_refcount)) { >> int r; >> pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n", >> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, >> struct svm_range *prange, >> pr_debug("failed %d to quiesce KFD queues\n", r); >> } >> - p = kfd_lookup_process_by_mm(mm); >> if (!p) >> return; >> svms = &p->svms; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning 2025-10-15 21:45 ` Philip Yang @ 2025-10-15 22:46 ` Chen, Xiaogang 2025-10-16 18:01 ` Philip Yang 2025-10-17 22:39 ` Felix Kuehling 0 siblings, 2 replies; 13+ messages in thread From: Chen, Xiaogang @ 2025-10-15 22:46 UTC (permalink / raw) To: Philip Yang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling On 10/15/2025 4:45 PM, Philip Yang wrote: > > On 2025-10-15 17:01, Chen, Xiaogang wrote: >> >> On 10/15/2025 3:11 PM, Philip Yang wrote: >>> Only show warning message if process mm is still alive when queue >>> buffer is freed to evcit the queues. >>> >>> If kfd_lookup_process_by_mm return NULL, means the process is already >>> exited and mm is gone, it is fine to free queue buffer. >> >> But another question is why a prange is still alive, its kfd process >> is gone? > It is application process exited, kfd process structure still exist > and available. The issue is race condition: > > do_exit > exit_mmap > a. mmu mm release notifier, schedule kfd release wq to > destroy queue > unmap_vmas > b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...) > > the step b is executed to unmap CWSR svm range, before step a kfd > release wq destroy queue. > > >> >> When unmap a prange the queues that use it should have been stopped. >> If not, there is problem somewhere. This warning message need be sent >> no matter kfd process exists or not. >> >> I think a real problem here is kfd process need be alive as long as >> any of its resource is still alive. In this case since prange is >> still alive its kfd process should not be released(p should not be >> null). If not we need wait all pranges from this process got >> released, then release this kfd process. > > kfd process structure is freed in kfd_process_wq_release after > svm_range_list_fini. I wanted to say: delay remove kfd process p from kfd_processes_table until all resources of p got released. So when any p's resources is getting released p is available. That needs change kfd process release logic. Regards Xiaogang > > Regards, > > Philip > >> >> Regards >> >> Xiaogang >> >>> >>> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory >>> residency") >>> Signed-off-by: Philip Yang <Philip.Yang@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>> index 4d4a47313f5b..d1b2f8525f80 100644 >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, >>> struct svm_range *prange, >>> bool unmap_parent; >>> uint32_t i; >>> - if (atomic_read(&prange->queue_refcount)) { >>> + p = kfd_lookup_process_by_mm(mm); >>> + >>> + if (p && atomic_read(&prange->queue_refcount)) { >>> int r; >>> pr_warn("Freeing queue vital buffer 0x%lx, queue >>> evicted\n", >>> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, >>> struct svm_range *prange, >>> pr_debug("failed %d to quiesce KFD queues\n", r); >>> } >>> - p = kfd_lookup_process_by_mm(mm); >>> if (!p) >>> return; >>> svms = &p->svms; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning 2025-10-15 22:46 ` Chen, Xiaogang @ 2025-10-16 18:01 ` Philip Yang 2025-10-17 22:39 ` Felix Kuehling 1 sibling, 0 replies; 13+ messages in thread From: Philip Yang @ 2025-10-16 18:01 UTC (permalink / raw) To: Chen, Xiaogang, Philip Yang, amd-gfx; +Cc: Felix.Kuehling On 2025-10-15 18:46, Chen, Xiaogang wrote: > > On 10/15/2025 4:45 PM, Philip Yang wrote: >> >> On 2025-10-15 17:01, Chen, Xiaogang wrote: >>> >>> On 10/15/2025 3:11 PM, Philip Yang wrote: >>>> Only show warning message if process mm is still alive when queue >>>> buffer is freed to evcit the queues. >>>> >>>> If kfd_lookup_process_by_mm return NULL, means the process is already >>>> exited and mm is gone, it is fine to free queue buffer. >>> >>> But another question is why a prange is still alive, its kfd process >>> is gone? >> It is application process exited, kfd process structure still exist >> and available. The issue is race condition: >> >> do_exit >> exit_mmap >> a. mmu mm release notifier, schedule kfd release wq to >> destroy queue >> unmap_vmas >> b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...) >> >> the step b is executed to unmap CWSR svm range, before step a kfd >> release wq destroy queue. >> >> >>> >>> When unmap a prange the queues that use it should have been stopped. >>> If not, there is problem somewhere. This warning message need be >>> sent no matter kfd process exists or not. >>> >>> I think a real problem here is kfd process need be alive as long as >>> any of its resource is still alive. In this case since prange is >>> still alive its kfd process should not be released(p should not be >>> null). If not we need wait all pranges from this process got >>> released, then release this kfd process. >> >> kfd process structure is freed in kfd_process_wq_release after >> svm_range_list_fini. > > I wanted to say: delay remove kfd process p from kfd_processes_table > until all resources of p got released. So when any p's resources is > getting released p is available. That needs change kfd process release > logic. prange->queue_refcount will be 0 after queue is destroyed (not evicted), we should warn user space and evict queues if prange is freed with prange->queue_refcount not zero. This patch is to fix the race that generate false warning after process exited to free prange. I don't think that keep kfd process in kfd_processes_table after mmu release notifier will solve this race issue. Regards, Philip > > > Regards > > Xiaogang > > > > >> >> Regards, >> >> Philip >> >>> >>> Regards >>> >>> Xiaogang >>> >>>> >>>> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory >>>> residency") >>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> index 4d4a47313f5b..d1b2f8525f80 100644 >>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct >>>> *mm, struct svm_range *prange, >>>> bool unmap_parent; >>>> uint32_t i; >>>> - if (atomic_read(&prange->queue_refcount)) { >>>> + p = kfd_lookup_process_by_mm(mm); >>>> + >>>> + if (p && atomic_read(&prange->queue_refcount)) { >>>> int r; >>>> pr_warn("Freeing queue vital buffer 0x%lx, queue >>>> evicted\n", >>>> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct >>>> *mm, struct svm_range *prange, >>>> pr_debug("failed %d to quiesce KFD queues\n", r); >>>> } >>>> - p = kfd_lookup_process_by_mm(mm); >>>> if (!p) >>>> return; >>>> svms = &p->svms; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning 2025-10-15 22:46 ` Chen, Xiaogang 2025-10-16 18:01 ` Philip Yang @ 2025-10-17 22:39 ` Felix Kuehling 1 sibling, 0 replies; 13+ messages in thread From: Felix Kuehling @ 2025-10-17 22:39 UTC (permalink / raw) To: Chen, Xiaogang, Philip Yang, Philip Yang, amd-gfx On 2025-10-15 18:46, Chen, Xiaogang wrote: > > On 10/15/2025 4:45 PM, Philip Yang wrote: >> >> On 2025-10-15 17:01, Chen, Xiaogang wrote: >>> >>> On 10/15/2025 3:11 PM, Philip Yang wrote: >>>> Only show warning message if process mm is still alive when queue >>>> buffer is freed to evcit the queues. >>>> >>>> If kfd_lookup_process_by_mm return NULL, means the process is already >>>> exited and mm is gone, it is fine to free queue buffer. >>> >>> But another question is why a prange is still alive, its kfd process >>> is gone? >> It is application process exited, kfd process structure still exist >> and available. The issue is race condition: >> >> do_exit >> exit_mmap >> a. mmu mm release notifier, schedule kfd release wq to >> destroy queue >> unmap_vmas >> b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...) >> >> the step b is executed to unmap CWSR svm range, before step a kfd >> release wq destroy queue. >> >> >>> >>> When unmap a prange the queues that use it should have been stopped. >>> If not, there is problem somewhere. This warning message need be >>> sent no matter kfd process exists or not. >>> >>> I think a real problem here is kfd process need be alive as long as >>> any of its resource is still alive. In this case since prange is >>> still alive its kfd process should not be released(p should not be >>> null). If not we need wait all pranges from this process got >>> released, then release this kfd process. >> >> kfd process structure is freed in kfd_process_wq_release after >> svm_range_list_fini. > > I wanted to say: delay remove kfd process p from kfd_processes_table > until all resources of p got released. So when any p's resources is > getting released p is available. That needs change kfd process release > logic. That would complicate the cleanup a lot, because now other threads can still look up the kfd_process and use or modify it concurrently while the cleanup is happening. We remove the process from the kfd_processes_table first to ensure that it is safe to clean up all the process resources. Regards, Felix > > > Regards > > Xiaogang > > > > >> >> Regards, >> >> Philip >> >>> >>> Regards >>> >>> Xiaogang >>> >>>> >>>> Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory >>>> residency") >>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> index 4d4a47313f5b..d1b2f8525f80 100644 >>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >>>> @@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct >>>> *mm, struct svm_range *prange, >>>> bool unmap_parent; >>>> uint32_t i; >>>> - if (atomic_read(&prange->queue_refcount)) { >>>> + p = kfd_lookup_process_by_mm(mm); >>>> + >>>> + if (p && atomic_read(&prange->queue_refcount)) { >>>> int r; >>>> pr_warn("Freeing queue vital buffer 0x%lx, queue >>>> evicted\n", >>>> @@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct >>>> *mm, struct svm_range *prange, >>>> pr_debug("failed %d to quiesce KFD queues\n", r); >>>> } >>>> - p = kfd_lookup_process_by_mm(mm); >>>> if (!p) >>>> return; >>>> svms = &p->svms; ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-10-20 14:26 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-15 20:11 [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Philip Yang 2025-10-15 20:11 ` [PATCH v2 2/2] drm/amdkfd: Stop user queues when process mm released Philip Yang 2025-10-15 20:40 ` Chen, Xiaogang 2025-10-15 21:33 ` Philip Yang 2025-10-16 14:46 ` Chen, Xiaogang 2025-10-17 22:34 ` Felix Kuehling 2025-10-17 22:43 ` Felix Kuehling 2025-10-20 14:26 ` Philip Yang 2025-10-15 21:01 ` [PATCH v2 1/2] drm/amdkfd: Fix false positive queue buffer free warning Chen, Xiaogang 2025-10-15 21:45 ` Philip Yang 2025-10-15 22:46 ` Chen, Xiaogang 2025-10-16 18:01 ` Philip Yang 2025-10-17 22:39 ` Felix Kuehling
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox