* [RFC PATCH 0/1] drm/amdgpu: Fix NULL-deref in amdgpu_device_fini_sw() @ 2022-09-30 21:41 ` Zhang Boyang 0 siblings, 0 replies; 6+ messages in thread From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw) To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx Cc: Steven J Abner, David C . Rankin Hi, There are several reports of "Fatal error during GPU init" will cause NULL-deref in amdgpu_device_fini_sw(). Although the NULL-deref is result instead of reason, this NULL-deref will confuse user. https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/ https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/ This is probably because "adev" is not fully initialized when amdgpu_device_init() failed. Thus subsequent amdgpu_device_fini_sw() will try to release "adev->reset_domain" and cause NULL-deref. This patch fixes this problem by guarding the code with an "if". However, I'm new to this module and I didn't fully understand the code, so please review my code carefully. Best Regards, Zhang Boyang ^ permalink raw reply [flat|nested] 6+ messages in thread
* [RFC PATCH 0/1] drm/amdgpu: Fix NULL-deref in amdgpu_device_fini_sw() @ 2022-09-30 21:41 ` Zhang Boyang 0 siblings, 0 replies; 6+ messages in thread From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw) To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx Cc: David C . Rankin, Steven J Abner Hi, There are several reports of "Fatal error during GPU init" will cause NULL-deref in amdgpu_device_fini_sw(). Although the NULL-deref is result instead of reason, this NULL-deref will confuse user. https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/ https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/ This is probably because "adev" is not fully initialized when amdgpu_device_init() failed. Thus subsequent amdgpu_device_fini_sw() will try to release "adev->reset_domain" and cause NULL-deref. This patch fixes this problem by guarding the code with an "if". However, I'm new to this module and I didn't fully understand the code, so please review my code carefully. Best Regards, Zhang Boyang ^ permalink raw reply [flat|nested] 6+ messages in thread
* [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw() 2022-09-30 21:41 ` Zhang Boyang @ 2022-09-30 21:41 ` Zhang Boyang -1 siblings, 0 replies; 6+ messages in thread From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw) To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx Cc: Steven J Abner, David C . Rankin, Zhang Boyang After amdgpu_device_init() failed, adev->reset_domain may be NULL. Thus subsequent call to amdgpu_device_fini_sw() may result in null-ptr-deref. This patch fixes the problem by adding a NULL pointer check around the code of releasing adev->reset_domain in amdgpu_device_fini_sw(). Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be refcounted.") Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com> Link: https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/ Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/ --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index be7aff2d4a57..204daad06b32 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev) if (adev->mman.discovery_bin) amdgpu_discovery_fini(adev); - amdgpu_reset_put_reset_domain(adev->reset_domain); - adev->reset_domain = NULL; + if (adev->reset_domain) { + amdgpu_reset_put_reset_domain(adev->reset_domain); + adev->reset_domain = NULL; + } kfree(adev->pci_state); -- 2.30.2 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw() @ 2022-09-30 21:41 ` Zhang Boyang 0 siblings, 0 replies; 6+ messages in thread From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw) To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx Cc: David C . Rankin, Steven J Abner, Zhang Boyang After amdgpu_device_init() failed, adev->reset_domain may be NULL. Thus subsequent call to amdgpu_device_fini_sw() may result in null-ptr-deref. This patch fixes the problem by adding a NULL pointer check around the code of releasing adev->reset_domain in amdgpu_device_fini_sw(). Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be refcounted.") Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com> Link: https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/ Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/ --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index be7aff2d4a57..204daad06b32 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev) if (adev->mman.discovery_bin) amdgpu_discovery_fini(adev); - amdgpu_reset_put_reset_domain(adev->reset_domain); - adev->reset_domain = NULL; + if (adev->reset_domain) { + amdgpu_reset_put_reset_domain(adev->reset_domain); + adev->reset_domain = NULL; + } kfree(adev->pci_state); -- 2.30.2 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw() 2022-09-30 21:41 ` Zhang Boyang @ 2022-09-30 22:38 ` Steven J Abner -1 siblings, 0 replies; 6+ messages in thread From: Steven J Abner @ 2022-09-30 22:38 UTC (permalink / raw) To: Zhang Boyang Cc: Andrey Grodzovsky, David C . Rankin, Christian König, amd-gfx, linux-kernel I had done more delving into this also, thankfully was not forgotten. Additional info to solve blackout, was going to contact AMD: The problem as far as I could trace occurs in amdgpu_psp.c in function psp_cmd_submit_buf(). Normal counter 'timeout' seems to use a max of about 400 with usleep_range(5, 80). Under normal operation 'psp->fence_buf' will equal 'index' and drop from loop. I could not track what fills the kernel buffer objects 'virtual' pointer's reference (psp->fence_buf). I assume that it's to a firmware file that on read is returning an error, and getting stuck in a loop lock. The error condition I found occurs when 'psp->fence_buf' does not equal 'index', breaking from loop with 'timeout' == 0. Blackout seemed to be about 80% of reboots, but found that in reconfiguration of kernel, on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of reboots. I have now avoided all blackouts with addition of CONFIG_SATA_PMP=y (at least a week free?). This is pure guess but, maybe the ARM PSP processor is dependent on libata procedures, one of which causes a lock up some of the time. Steve On Fri, Sep 30, 2022 at 21:41, Zhang Boyang <zhangboyang.id@gmail.com> wrote: > After amdgpu_device_init() failed, adev->reset_domain may be NULL. > Thus > subsequent call to amdgpu_device_fini_sw() may result in > null-ptr-deref. > > This patch fixes the problem by adding a NULL pointer check around the > code of releasing adev->reset_domain in amdgpu_device_fini_sw(). > > Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be > refcounted.") > > Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com> > Link: > https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/ > Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/ > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index be7aff2d4a57..204daad06b32 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct > amdgpu_device *adev) > if (adev->mman.discovery_bin) > amdgpu_discovery_fini(adev); > > - amdgpu_reset_put_reset_domain(adev->reset_domain); > - adev->reset_domain = NULL; > + if (adev->reset_domain) { > + amdgpu_reset_put_reset_domain(adev->reset_domain); > + adev->reset_domain = NULL; > + } > > kfree(adev->pci_state); > > -- > 2.30.2 > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw() @ 2022-09-30 22:38 ` Steven J Abner 0 siblings, 0 replies; 6+ messages in thread From: Steven J Abner @ 2022-09-30 22:38 UTC (permalink / raw) To: Zhang Boyang Cc: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx, David C . Rankin I had done more delving into this also, thankfully was not forgotten. Additional info to solve blackout, was going to contact AMD: The problem as far as I could trace occurs in amdgpu_psp.c in function psp_cmd_submit_buf(). Normal counter 'timeout' seems to use a max of about 400 with usleep_range(5, 80). Under normal operation 'psp->fence_buf' will equal 'index' and drop from loop. I could not track what fills the kernel buffer objects 'virtual' pointer's reference (psp->fence_buf). I assume that it's to a firmware file that on read is returning an error, and getting stuck in a loop lock. The error condition I found occurs when 'psp->fence_buf' does not equal 'index', breaking from loop with 'timeout' == 0. Blackout seemed to be about 80% of reboots, but found that in reconfiguration of kernel, on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of reboots. I have now avoided all blackouts with addition of CONFIG_SATA_PMP=y (at least a week free?). This is pure guess but, maybe the ARM PSP processor is dependent on libata procedures, one of which causes a lock up some of the time. Steve On Fri, Sep 30, 2022 at 21:41, Zhang Boyang <zhangboyang.id@gmail.com> wrote: > After amdgpu_device_init() failed, adev->reset_domain may be NULL. > Thus > subsequent call to amdgpu_device_fini_sw() may result in > null-ptr-deref. > > This patch fixes the problem by adding a NULL pointer check around the > code of releasing adev->reset_domain in amdgpu_device_fini_sw(). > > Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be > refcounted.") > > Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com> > Link: > https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/ > Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/ > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index be7aff2d4a57..204daad06b32 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct > amdgpu_device *adev) > if (adev->mman.discovery_bin) > amdgpu_discovery_fini(adev); > > - amdgpu_reset_put_reset_domain(adev->reset_domain); > - adev->reset_domain = NULL; > + if (adev->reset_domain) { > + amdgpu_reset_put_reset_domain(adev->reset_domain); > + adev->reset_domain = NULL; > + } > > kfree(adev->pci_state); > > -- > 2.30.2 > ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2022-10-03 13:08 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-09-30 21:41 [RFC PATCH 0/1] drm/amdgpu: Fix NULL-deref in amdgpu_device_fini_sw() Zhang Boyang 2022-09-30 21:41 ` Zhang Boyang 2022-09-30 21:41 ` [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref " Zhang Boyang 2022-09-30 21:41 ` Zhang Boyang 2022-09-30 22:38 ` Steven J Abner 2022-09-30 22:38 ` Steven J Abner
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.