* [PATCH] drm/amdgpu: Fix a NULL pointer of fence @ 2022-07-07 9:50 xinhui pan 2022-07-07 9:54 ` Christian König 2022-07-07 18:23 ` Mike Lothian 0 siblings, 2 replies; 9+ messages in thread From: xinhui pan @ 2022-07-07 9:50 UTC (permalink / raw) To: amd-gfx; +Cc: alexander.deucher, Felix.Kuehling, xinhui pan, christian.koenig Fence is accessed by dma_resv_add_fence() now. Use amdgpu_amdkfd_remove_eviction_fence instead. Signed-off-by: xinhui pan <xinhui.pan@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 0036c9e405af..1e25c400ce4f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev, if (!process_info) return; - /* Release eviction fence from PD */ amdgpu_bo_reserve(pd, false); - amdgpu_bo_fence(pd, NULL, false); + amdgpu_amdkfd_remove_eviction_fence(pd, + process_info->eviction_fence); amdgpu_bo_unreserve(pd); /* Update process info */ -- 2.34.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-07 9:50 [PATCH] drm/amdgpu: Fix a NULL pointer of fence xinhui pan @ 2022-07-07 9:54 ` Christian König 2022-07-07 15:47 ` Felix Kuehling 2022-07-07 18:23 ` Mike Lothian 1 sibling, 1 reply; 9+ messages in thread From: Christian König @ 2022-07-07 9:54 UTC (permalink / raw) To: xinhui pan, amd-gfx; +Cc: alexander.deucher, Felix.Kuehling, christian.koenig Am 07.07.22 um 11:50 schrieb xinhui pan: > Fence is accessed by dma_resv_add_fence() now. > Use amdgpu_amdkfd_remove_eviction_fence instead. > > Signed-off-by: xinhui pan <xinhui.pan@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > index 0036c9e405af..1e25c400ce4f 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev, > > if (!process_info) > return; > - > /* Release eviction fence from PD */ > amdgpu_bo_reserve(pd, false); > - amdgpu_bo_fence(pd, NULL, false); > + amdgpu_amdkfd_remove_eviction_fence(pd, > + process_info->eviction_fence); Good catch as well, but Felix needs to take a look at this. Regards, Christian. > amdgpu_bo_unreserve(pd); > > /* Update process info */ ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-07 9:54 ` Christian König @ 2022-07-07 15:47 ` Felix Kuehling 2022-07-08 1:08 ` Pan, Xinhui 0 siblings, 1 reply; 9+ messages in thread From: Felix Kuehling @ 2022-07-07 15:47 UTC (permalink / raw) To: Christian König, xinhui pan, amd-gfx Cc: alexander.deucher, christian.koenig Am 2022-07-07 um 05:54 schrieb Christian König: > Am 07.07.22 um 11:50 schrieb xinhui pan: >> Fence is accessed by dma_resv_add_fence() now. >> Use amdgpu_amdkfd_remove_eviction_fence instead. >> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> index 0036c9e405af..1e25c400ce4f 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct >> amdgpu_device *adev, >> if (!process_info) >> return; >> - >> /* Release eviction fence from PD */ >> amdgpu_bo_reserve(pd, false); >> - amdgpu_bo_fence(pd, NULL, false); >> + amdgpu_amdkfd_remove_eviction_fence(pd, >> + process_info->eviction_fence); > > Good catch as well, but Felix needs to take a look at this. This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with echo Y > /sys/module/amdgpu/parameters/debug_evictions Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. Regards, Felix > > Regards, > Christian. > >> amdgpu_bo_unreserve(pd); >> /* Update process info */ > ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-07 15:47 ` Felix Kuehling @ 2022-07-08 1:08 ` Pan, Xinhui 2022-07-08 9:03 ` Christian König 0 siblings, 1 reply; 9+ messages in thread From: Pan, Xinhui @ 2022-07-08 1:08 UTC (permalink / raw) To: Kuehling, Felix, Christian König, amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander, Koenig, Christian [AMD Official Use Only - General] Felix, Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉 From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken. And I also try the debug_evictions parameter. No unexpected eviction shows anyway. I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it. So I would make a new patch to remove this duplicated ef removal. -----Original Message----- From: Kuehling, Felix <Felix.Kuehling@amd.com> Sent: Thursday, July 7, 2022 11:47 PM To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence Am 2022-07-07 um 05:54 schrieb Christian König: > Am 07.07.22 um 11:50 schrieb xinhui pan: >> Fence is accessed by dma_resv_add_fence() now. >> Use amdgpu_amdkfd_remove_eviction_fence instead. >> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> index 0036c9e405af..1e25c400ce4f 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct >> amdgpu_device *adev, >> if (!process_info) >> return; >> - >> /* Release eviction fence from PD */ >> amdgpu_bo_reserve(pd, false); >> - amdgpu_bo_fence(pd, NULL, false); >> + amdgpu_amdkfd_remove_eviction_fence(pd, >> + process_info->eviction_fence); > > Good catch as well, but Felix needs to take a look at this. This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with echo Y > /sys/module/amdgpu/parameters/debug_evictions Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. Regards, Felix > > Regards, > Christian. > >> amdgpu_bo_unreserve(pd); >> /* Update process info */ > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-08 1:08 ` Pan, Xinhui @ 2022-07-08 9:03 ` Christian König 2022-07-18 14:58 ` Mike Lothian 0 siblings, 1 reply; 9+ messages in thread From: Christian König @ 2022-07-08 9:03 UTC (permalink / raw) To: Pan, Xinhui, Kuehling, Felix, amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander, Koenig, Christian Hi guys, well the practice to remove all fences by adding a NULL exclusive fence was pretty much illegal in the first place because this also removes kernel internal fences which can lead to freeing up memory which is still accessed. I've just didn't noticed that this was used by the KFD code as well otherwise I would have pushed to clean that up much earlier. Regards, Christian. Am 08.07.22 um 03:08 schrieb Pan, Xinhui: > [AMD Official Use Only - General] > > Felix, > Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉 > From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken. > > And I also try the debug_evictions parameter. No unexpected eviction shows anyway. > I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it. > > So I would make a new patch to remove this duplicated ef removal. > > -----Original Message----- > From: Kuehling, Felix <Felix.Kuehling@amd.com> > Sent: Thursday, July 7, 2022 11:47 PM > To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> > Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence > > Am 2022-07-07 um 05:54 schrieb Christian König: >> Am 07.07.22 um 11:50 schrieb xinhui pan: >>> Fence is accessed by dma_resv_add_fence() now. >>> Use amdgpu_amdkfd_remove_eviction_fence instead. >>> >>> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> index 0036c9e405af..1e25c400ce4f 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct >>> amdgpu_device *adev, >>> if (!process_info) >>> return; >>> - >>> /* Release eviction fence from PD */ >>> amdgpu_bo_reserve(pd, false); >>> - amdgpu_bo_fence(pd, NULL, false); >>> + amdgpu_amdkfd_remove_eviction_fence(pd, >>> + process_info->eviction_fence); >> Good catch as well, but Felix needs to take a look at this. > This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. > > You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with > > echo Y > /sys/module/amdgpu/parameters/debug_evictions > > Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. > > Regards, > Felix > > >> Regards, >> Christian. >> >>> amdgpu_bo_unreserve(pd); >>> /* Update process info */ ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-08 9:03 ` Christian König @ 2022-07-18 14:58 ` Mike Lothian 2022-07-18 15:29 ` Felix Kuehling 0 siblings, 1 reply; 9+ messages in thread From: Mike Lothian @ 2022-07-18 14:58 UTC (permalink / raw) To: Christian König Cc: Deucher, Alexander, Kuehling, Felix, Pan, Xinhui, Koenig, Christian, amd-gfx@lists.freedesktop.org Is this likely to land before 5.19 final? It's been nearly 2 weeks since I said if fixed an issue I was seeing https://gitlab.freedesktop.org/drm/amd/-/issues/2074 On Fri, 8 Jul 2022 at 10:05, Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > Hi guys, > > well the practice to remove all fences by adding a NULL exclusive fence > was pretty much illegal in the first place because this also removes > kernel internal fences which can lead to freeing up memory which is > still accessed. > > I've just didn't noticed that this was used by the KFD code as well > otherwise I would have pushed to clean that up much earlier. > > Regards, > Christian. > > Am 08.07.22 um 03:08 schrieb Pan, Xinhui: > > [AMD Official Use Only - General] > > > > Felix, > > Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉 > > From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken. > > > > And I also try the debug_evictions parameter. No unexpected eviction shows anyway. > > I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it. > > > > So I would make a new patch to remove this duplicated ef removal. > > > > -----Original Message----- > > From: Kuehling, Felix <Felix.Kuehling@amd.com> > > Sent: Thursday, July 7, 2022 11:47 PM > > To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org > > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> > > Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence > > > > Am 2022-07-07 um 05:54 schrieb Christian König: > >> Am 07.07.22 um 11:50 schrieb xinhui pan: > >>> Fence is accessed by dma_resv_add_fence() now. > >>> Use amdgpu_amdkfd_remove_eviction_fence instead. > >>> > >>> Signed-off-by: xinhui pan <xinhui.pan@amd.com> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>> index 0036c9e405af..1e25c400ce4f 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct > >>> amdgpu_device *adev, > >>> if (!process_info) > >>> return; > >>> - > >>> /* Release eviction fence from PD */ > >>> amdgpu_bo_reserve(pd, false); > >>> - amdgpu_bo_fence(pd, NULL, false); > >>> + amdgpu_amdkfd_remove_eviction_fence(pd, > >>> + process_info->eviction_fence); > >> Good catch as well, but Felix needs to take a look at this. > > This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. > > > > You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with > > > > echo Y > /sys/module/amdgpu/parameters/debug_evictions > > > > Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. > > > > Regards, > > Felix > > > > > >> Regards, > >> Christian. > >> > >>> amdgpu_bo_unreserve(pd); > >>> /* Update process info */ > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-18 14:58 ` Mike Lothian @ 2022-07-18 15:29 ` Felix Kuehling 2022-07-18 15:46 ` Deucher, Alexander 0 siblings, 1 reply; 9+ messages in thread From: Felix Kuehling @ 2022-07-18 15:29 UTC (permalink / raw) To: Mike Lothian, Christian König Cc: Deucher, Alexander, Pan, Xinhui, Koenig, Christian, amd-gfx@lists.freedesktop.org Xinhui submitted this patch instead, which should address the same issue: "drm/amdgpu: Remove one duplicated ef removal" Alex, can you pick up that patch for drm-fixes for 5.19, if it's not too late? Thanks, Felix On 2022-07-18 10:58, Mike Lothian wrote: > Is this likely to land before 5.19 final? It's been nearly 2 weeks > since I said if fixed an issue I was seeing > https://gitlab.freedesktop.org/drm/amd/-/issues/2074 > > On Fri, 8 Jul 2022 at 10:05, Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> Hi guys, >> >> well the practice to remove all fences by adding a NULL exclusive fence >> was pretty much illegal in the first place because this also removes >> kernel internal fences which can lead to freeing up memory which is >> still accessed. >> >> I've just didn't noticed that this was used by the KFD code as well >> otherwise I would have pushed to clean that up much earlier. >> >> Regards, >> Christian. >> >> Am 08.07.22 um 03:08 schrieb Pan, Xinhui: >>> [AMD Official Use Only - General] >>> >>> Felix, >>> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉 >>> From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken. >>> >>> And I also try the debug_evictions parameter. No unexpected eviction shows anyway. >>> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it. >>> >>> So I would make a new patch to remove this duplicated ef removal. >>> >>> -----Original Message----- >>> From: Kuehling, Felix <Felix.Kuehling@amd.com> >>> Sent: Thursday, July 7, 2022 11:47 PM >>> To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org >>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> >>> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence >>> >>> Am 2022-07-07 um 05:54 schrieb Christian König: >>>> Am 07.07.22 um 11:50 schrieb xinhui pan: >>>>> Fence is accessed by dma_resv_add_fence() now. >>>>> Use amdgpu_amdkfd_remove_eviction_fence instead. >>>>> >>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- >>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> index 0036c9e405af..1e25c400ce4f 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct >>>>> amdgpu_device *adev, >>>>> if (!process_info) >>>>> return; >>>>> - >>>>> /* Release eviction fence from PD */ >>>>> amdgpu_bo_reserve(pd, false); >>>>> - amdgpu_bo_fence(pd, NULL, false); >>>>> + amdgpu_amdkfd_remove_eviction_fence(pd, >>>>> + process_info->eviction_fence); >>>> Good catch as well, but Felix needs to take a look at this. >>> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. >>> >>> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with >>> >>> echo Y > /sys/module/amdgpu/parameters/debug_evictions >>> >>> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. >>> >>> Regards, >>> Felix >>> >>> >>>> Regards, >>>> Christian. >>>> >>>>> amdgpu_bo_unreserve(pd); >>>>> /* Update process info */ ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-18 15:29 ` Felix Kuehling @ 2022-07-18 15:46 ` Deucher, Alexander 0 siblings, 0 replies; 9+ messages in thread From: Deucher, Alexander @ 2022-07-18 15:46 UTC (permalink / raw) To: Kuehling, Felix, Mike Lothian, Christian König Cc: Pan, Xinhui, Koenig, Christian, amd-gfx@lists.freedesktop.org [-- Attachment #1: Type: text/plain, Size: 4779 bytes --] [Public] Will do. Alex ________________________________ From: Kuehling, Felix <Felix.Kuehling@amd.com> Sent: Monday, July 18, 2022 11:29 AM To: Mike Lothian <mike@fireburn.co.uk>; Christian König <ckoenig.leichtzumerken@gmail.com> Cc: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence Xinhui submitted this patch instead, which should address the same issue: "drm/amdgpu: Remove one duplicated ef removal" Alex, can you pick up that patch for drm-fixes for 5.19, if it's not too late? Thanks, Felix On 2022-07-18 10:58, Mike Lothian wrote: > Is this likely to land before 5.19 final? It's been nearly 2 weeks > since I said if fixed an issue I was seeing > https://gitlab.freedesktop.org/drm/amd/-/issues/2074 > > On Fri, 8 Jul 2022 at 10:05, Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> Hi guys, >> >> well the practice to remove all fences by adding a NULL exclusive fence >> was pretty much illegal in the first place because this also removes >> kernel internal fences which can lead to freeing up memory which is >> still accessed. >> >> I've just didn't noticed that this was used by the KFD code as well >> otherwise I would have pushed to clean that up much earlier. >> >> Regards, >> Christian. >> >> Am 08.07.22 um 03:08 schrieb Pan, Xinhui: >>> [AMD Official Use Only - General] >>> >>> Felix, >>> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉 >>> From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken. >>> >>> And I also try the debug_evictions parameter. No unexpected eviction shows anyway. >>> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it. >>> >>> So I would make a new patch to remove this duplicated ef removal. >>> >>> -----Original Message----- >>> From: Kuehling, Felix <Felix.Kuehling@amd.com> >>> Sent: Thursday, July 7, 2022 11:47 PM >>> To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org >>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com> >>> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence >>> >>> Am 2022-07-07 um 05:54 schrieb Christian König: >>>> Am 07.07.22 um 11:50 schrieb xinhui pan: >>>>> Fence is accessed by dma_resv_add_fence() now. >>>>> Use amdgpu_amdkfd_remove_eviction_fence instead. >>>>> >>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- >>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> index 0036c9e405af..1e25c400ce4f 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct >>>>> amdgpu_device *adev, >>>>> if (!process_info) >>>>> return; >>>>> - >>>>> /* Release eviction fence from PD */ >>>>> amdgpu_bo_reserve(pd, false); >>>>> - amdgpu_bo_fence(pd, NULL, false); >>>>> + amdgpu_amdkfd_remove_eviction_fence(pd, >>>>> + process_info->eviction_fence); >>>> Good catch as well, but Felix needs to take a look at this. >>> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. >>> >>> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with >>> >>> echo Y > /sys/module/amdgpu/parameters/debug_evictions >>> >>> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. >>> >>> Regards, >>> Felix >>> >>> >>>> Regards, >>>> Christian. >>>> >>>>> amdgpu_bo_unreserve(pd); >>>>> /* Update process info */ [-- Attachment #2: Type: text/html, Size: 7926 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence 2022-07-07 9:50 [PATCH] drm/amdgpu: Fix a NULL pointer of fence xinhui pan 2022-07-07 9:54 ` Christian König @ 2022-07-07 18:23 ` Mike Lothian 1 sibling, 0 replies; 9+ messages in thread From: Mike Lothian @ 2022-07-07 18:23 UTC (permalink / raw) To: xinhui pan; +Cc: alexander.deucher, Felix.Kuehling, christian.koenig, amd-gfx Hi This appears to fix https://gitlab.freedesktop.org/drm/amd/-/issues/2074 Feel free to add my tested by Thanks Mike On Thu, 7 Jul 2022 at 10:51, xinhui pan <xinhui.pan@amd.com> wrote: > > Fence is accessed by dma_resv_add_fence() now. > Use amdgpu_amdkfd_remove_eviction_fence instead. > > Signed-off-by: xinhui pan <xinhui.pan@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > index 0036c9e405af..1e25c400ce4f 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev, > > if (!process_info) > return; > - > /* Release eviction fence from PD */ > amdgpu_bo_reserve(pd, false); > - amdgpu_bo_fence(pd, NULL, false); > + amdgpu_amdkfd_remove_eviction_fence(pd, > + process_info->eviction_fence); > amdgpu_bo_unreserve(pd); > > /* Update process info */ > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2022-07-18 15:46 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-07-07 9:50 [PATCH] drm/amdgpu: Fix a NULL pointer of fence xinhui pan 2022-07-07 9:54 ` Christian König 2022-07-07 15:47 ` Felix Kuehling 2022-07-08 1:08 ` Pan, Xinhui 2022-07-08 9:03 ` Christian König 2022-07-18 14:58 ` Mike Lothian 2022-07-18 15:29 ` Felix Kuehling 2022-07-18 15:46 ` Deucher, Alexander 2022-07-07 18:23 ` Mike Lothian
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.