* [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
@ 2026-01-21 18:24 Alex Deucher
2026-01-22 5:07 ` Lazar, Lijo
2026-01-22 15:01 ` Timur Kristóf
0 siblings, 2 replies; 8+ messages in thread
From: Alex Deucher @ 2026-01-21 18:24 UTC (permalink / raw)
To: amd-gfx; +Cc: Jon Doron, stable, Alex Deucher
From: Jon Doron <jond@wiz.io>
On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
ih2 interrupt ring buffers are not initialized. This is by design, as
these secondary IH rings are only available on discrete GPUs. See
vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
AMD_IS_APU is set.
However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
get the timestamp of the last interrupt entry. When retry faults are
enabled on APUs (noretry=0), this function is called from the SVM page
fault recovery path, resulting in a NULL pointer dereference when
amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
The crash manifests as:
BUG: kernel NULL pointer dereference, address: 0000000000000004
RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
Call Trace:
amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
amdgpu_ih_process+0x84/0x100 [amdgpu]
This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
noretry=1 to noretry=0, enabling retry fault handling and thus
exercising the buggy code path.
Fix this by adding a check for ih1.ring_size before attempting to use
it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
Rework retry fault removal"). This is needed if the hardware doesn't
support secondary HW IH rings.
v2: additional updates (Alex)
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
Cc: stable@vger.kernel.org
Signed-off-by: Jon Doron <jond@wiz.io>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 8e65fec9f534e..243d75917458a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
 
 	if (adev->irq.retry_cam_enabled)
 		return;
+	else if (adev->irq.ih1.ring_size)
+		ih = &adev->irq.ih1;
+	else if (adev->irq.ih_soft.enabled)
+		ih = &adev->irq.ih_soft;
+	else
+		return;
 
-	ih = &adev->irq.ih1;
 	/* Get the WPTR of the last entry in IH ring */
 	last_wptr = amdgpu_ih_get_wptr(adev, ih);
 	/* Order wptr with ring data. */
--
2.52.0
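
For readers following the thread, the ring selection introduced by the hunk above can be sketched as a small standalone C model. This is only an illustration under stated assumptions: the model_* structures and model_pick_fault_ring() are hypothetical stand-ins, not driver code. Only the field names retry_cam_enabled, ih1.ring_size and ih_soft.enabled are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the driver structures. */
struct model_ih_ring {
	uint32_t ring_size;	/* 0 when the ring was never initialized */
	bool enabled;		/* true once the ring is usable */
};

struct model_irq {
	bool retry_cam_enabled;
	struct model_ih_ring ih1;
	struct model_ih_ring ih_soft;
};

/*
 * Follow the selection order of the patch: return the ring whose last
 * timestamp should be used, or NULL when the caller should return early.
 */
static struct model_ih_ring *model_pick_fault_ring(struct model_irq *irq)
{
	assert(irq != NULL);

	if (irq->retry_cam_enabled)
		return NULL;		/* early return, as in the patch */
	if (irq->ih1.ring_size)
		return &irq->ih1;	/* ih1 was initialized, use it */
	if (irq->ih_soft.enabled)
		return &irq->ih_soft;	/* fall back to the soft ring */
	return NULL;			/* no usable ring, nothing to do */
}
```

In this model, a device with ih1 left uninitialized (ring_size == 0) and ih_soft enabled gets the soft ring, and a device with neither gets NULL, which corresponds to the new early return instead of the dereference of an uninitialized ih1.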
* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
2026-01-21 18:24 [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove Alex Deucher
@ 2026-01-22 5:07 ` Lazar, Lijo
2026-01-22 14:02 ` Alex Deucher
2026-01-22 15:00 ` Timur Kristóf
2026-01-22 15:01 ` Timur Kristóf
1 sibling, 2 replies; 8+ messages in thread
From: Lazar, Lijo @ 2026-01-22 5:07 UTC (permalink / raw)
To: Alex Deucher, amd-gfx; +Cc: Jon Doron, stable
On 21-Jan-26 11:54 PM, Alex Deucher wrote:
> From: Jon Doron <jond@wiz.io>
>
> On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
> ih2 interrupt ring buffers are not initialized. This is by design, as
> these secondary IH rings are only available on discrete GPUs. See
> vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
> AMD_IS_APU is set.
>
> However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
> get the timestamp of the last interrupt entry. When retry faults are
> enabled on APUs (noretry=0), this function is called from the SVM page
> fault recovery path, resulting in a NULL pointer dereference when
> amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
>
> The crash manifests as:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000004
> RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
> Call Trace:
> amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
> svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
> amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
> gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
> amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
> amdgpu_ih_process+0x84/0x100 [amdgpu]
>
> This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
> IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
> noretry=1 to noretry=0, enabling retry fault handling and thus
> exercising the buggy code path.
>
> Fix this by adding a check for ih1.ring_size before attempting to use
> it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
> Rework retry fault removal"). This is needed if the hardware doesn't
> support secondary HW IH rings.
>
> v2: additional updates (Alex)
>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
> Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jon Doron <jond@wiz.io>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 8e65fec9f534e..243d75917458a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>
> if (adev->irq.retry_cam_enabled)
> return;
> + else if (adev->irq.ih1.ring_size)
> + ih = &adev->irq.ih1;
> + else if (adev->irq.ih_soft.enabled)
> + ih = &adev->irq.ih_soft;
Faults are delegated to soft ring when retry_cam is enabled -
https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c#L541
That matches the original logic in dd299441654f ("drm/amdgpu: Rework
retry fault removal").
To match the logic in the above commit exactly, I think it should use
the soft ring only when retry cam is enabled. Presently, it's returning
without doing anything.
Thanks,
Lijo
> + else
> + return;
>
> - ih = &adev->irq.ih1;
> /* Get the WPTR of the last entry in IH ring */
> last_wptr = amdgpu_ih_get_wptr(adev, ih);
> /* Order wptr with ring data. */
* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
2026-01-22 5:07 ` Lazar, Lijo
@ 2026-01-22 14:02 ` Alex Deucher
2026-01-22 15:00 ` Timur Kristóf
1 sibling, 0 replies; 8+ messages in thread
From: Alex Deucher @ 2026-01-22 14:02 UTC (permalink / raw)
To: Lazar, Lijo, philip yang; +Cc: Alex Deucher, amd-gfx, Jon Doron
On Thu, Jan 22, 2026 at 12:07 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>
>
>
> On 21-Jan-26 11:54 PM, Alex Deucher wrote:
> > From: Jon Doron <jond@wiz.io>
> >
> > On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
> > ih2 interrupt ring buffers are not initialized. This is by design, as
> > these secondary IH rings are only available on discrete GPUs. See
> > vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
> > AMD_IS_APU is set.
> >
> > However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
> > get the timestamp of the last interrupt entry. When retry faults are
> > enabled on APUs (noretry=0), this function is called from the SVM page
> > fault recovery path, resulting in a NULL pointer dereference when
> > amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
> >
> > The crash manifests as:
> >
> > BUG: kernel NULL pointer dereference, address: 0000000000000004
> > RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
> > Call Trace:
> > amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
> > svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
> > amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
> > gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
> > amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
> > amdgpu_ih_process+0x84/0x100 [amdgpu]
> >
> > This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
> > IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
> > noretry=1 to noretry=0, enabling retry fault handling and thus
> > exercising the buggy code path.
> >
> > Fix this by adding a check for ih1.ring_size before attempting to use
> > it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
> > Rework retry fault removal"). This is needed if the hardware doesn't
> > support secondary HW IH rings.
> >
> > v2: additional updates (Alex)
> >
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
> > Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jon Doron <jond@wiz.io>
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index 8e65fec9f534e..243d75917458a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
> >
> > if (adev->irq.retry_cam_enabled)
> > return;
> > + else if (adev->irq.ih1.ring_size)
> > + ih = &adev->irq.ih1;
> > + else if (adev->irq.ih_soft.enabled)
> > + ih = &adev->irq.ih_soft;
>
> Faults are delegated to soft ring when retry_cam is enabled -
> https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c#L541
>
> That matches with the original logic in dd299441654f ("drm/amdgpu: Rework
> retry fault removal").
>
> To match exactly with the logic in above commit, I think it should use
> soft ring only when retry cam is enabled. Presently, it's returning
> without doing anything.
+ Philip
That logic was changed in:
commit e61801f162ddcf8874c820639483ec4849b0fb0b
Author: Philip Yang <Philip.Yang@amd.com>
Date: Mon Aug 28 14:05:55 2023 -0400
drm/amdkfd: Don't use sw fault filter if retry cam enabled
If retry cam enabled, we don't use sw retry fault filter and add fault
into sw filter ring, so we shouldn't remove fault from sw filter.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
So I retained that logic.
Alex
>
> Thanks,
> Lijo
>
> > + else
> > + return;
> >
> > - ih = &adev->irq.ih1;
> > /* Get the WPTR of the last entry in IH ring */
> > last_wptr = amdgpu_ih_get_wptr(adev, ih);
> > /* Order wptr with ring data. */
>
* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
2026-01-22 5:07 ` Lazar, Lijo
2026-01-22 14:02 ` Alex Deucher
@ 2026-01-22 15:00 ` Timur Kristóf
2026-01-22 22:47 ` Philip Yang
2026-01-23 7:17 ` Lazar, Lijo
1 sibling, 2 replies; 8+ messages in thread
From: Timur Kristóf @ 2026-01-22 15:00 UTC (permalink / raw)
To: Alex Deucher, amd-gfx; +Cc: Jon Doron, stable, Lazar, Lijo
On Thursday, January 22, 2026 6:07:27 AM Central European Standard Time Lazar,
Lijo wrote:
> On 21-Jan-26 11:54 PM, Alex Deucher wrote:
> > From: Jon Doron <jond@wiz.io>
> >
> > On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
> > ih2 interrupt ring buffers are not initialized. This is by design, as
> > these secondary IH rings are only available on discrete GPUs. See
> > vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
> > AMD_IS_APU is set.
> >
> > However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
> > get the timestamp of the last interrupt entry. When retry faults are
> > enabled on APUs (noretry=0), this function is called from the SVM page
> > fault recovery path, resulting in a NULL pointer dereference when
> > amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
> >
> > The crash manifests as:
> > BUG: kernel NULL pointer dereference, address: 0000000000000004
> > RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
> >
> > Call Trace:
> > amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
> > svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
> > amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
> > gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
> > amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
> > amdgpu_ih_process+0x84/0x100 [amdgpu]
> >
> > This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
> > IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
> > noretry=1 to noretry=0, enabling retry fault handling and thus
> > exercising the buggy code path.
> >
> > Fix this by adding a check for ih1.ring_size before attempting to use
> > it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
> > Rework retry fault removal"). This is needed if the hardware doesn't
> > support secondary HW IH rings.
> >
> > v2: additional updates (Alex)
> >
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
> > Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jon Doron <jond@wiz.io>
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > ---
> >
> > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index 8e65fec9f534e..243d75917458a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
> >
> > if (adev->irq.retry_cam_enabled)
> > return;
> > + else if (adev->irq.ih1.ring_size)
> > + ih = &adev->irq.ih1;
> > + else if (adev->irq.ih_soft.enabled)
> > + ih = &adev->irq.ih_soft;
>
> Faults are delegated to soft ring when retry_cam is enabled -
> https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c#L541
Hi,
As far as I know the retry CAM is not available on APUs.
Please correct me if I'm wrong.
Thanks,
Timur
>
> That matches with the original logic in dd299441654f ("drm/amdgpu: Rework
> retry fault removal").
>
> To match exactly with the logic in above commit, I think it should use
> soft ring only when retry cam is enabled. Presently, it's returning
> without doing anything.
>
> Thanks,
> Lijo
>
> > + else
> > + return;
> >
> > - ih = &adev->irq.ih1;
> >
> > /* Get the WPTR of the last entry in IH ring */
> > last_wptr = amdgpu_ih_get_wptr(adev, ih);
> > /* Order wptr with ring data. */
* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
2026-01-21 18:24 [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove Alex Deucher
2026-01-22 5:07 ` Lazar, Lijo
@ 2026-01-22 15:01 ` Timur Kristóf
1 sibling, 0 replies; 8+ messages in thread
From: Timur Kristóf @ 2026-01-22 15:01 UTC (permalink / raw)
To: amd-gfx; +Cc: Jon Doron, stable, Alex Deucher, Alex Deucher
On Wednesday, January 21, 2026 7:24:47 PM Central European Standard Time Alex
Deucher wrote:
> From: Jon Doron <jond@wiz.io>
>
> On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
> ih2 interrupt ring buffers are not initialized. This is by design, as
> these secondary IH rings are only available on discrete GPUs. See
> vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
> AMD_IS_APU is set.
>
> However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
> get the timestamp of the last interrupt entry. When retry faults are
> enabled on APUs (noretry=0), this function is called from the SVM page
> fault recovery path, resulting in a NULL pointer dereference when
> amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
>
> The crash manifests as:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000004
> RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
> Call Trace:
> amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
> svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
> amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
> gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
> amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
> amdgpu_ih_process+0x84/0x100 [amdgpu]
>
> This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
> IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
> noretry=1 to noretry=0, enabling retry fault handling and thus
> exercising the buggy code path.
>
> Fix this by adding a check for ih1.ring_size before attempting to use
> it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
> Rework retry fault removal"). This is needed if the hardware doesn't
> support secondary HW IH rings.
>
> v2: additional updates (Alex)
>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
> Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jon Doron <jond@wiz.io>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Thank you for taking care of this!
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 8e65fec9f534e..243d75917458a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>
> if (adev->irq.retry_cam_enabled)
> return;
> + else if (adev->irq.ih1.ring_size)
> + ih = &adev->irq.ih1;
> + else if (adev->irq.ih_soft.enabled)
> + ih = &adev->irq.ih_soft;
> + else
> + return;
>
> - ih = &adev->irq.ih1;
> /* Get the WPTR of the last entry in IH ring */
> last_wptr = amdgpu_ih_get_wptr(adev, ih);
> /* Order wptr with ring data. */
* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
2026-01-22 15:00 ` Timur Kristóf
@ 2026-01-22 22:47 ` Philip Yang
2026-01-23 7:17 ` Lazar, Lijo
1 sibling, 0 replies; 8+ messages in thread
From: Philip Yang @ 2026-01-22 22:47 UTC (permalink / raw)
To: Timur Kristóf, Alex Deucher, amd-gfx; +Cc: Jon Doron, stable, Lazar, Lijo
On 2026-01-22 10:00, Timur Kristóf wrote:
> On Thursday, January 22, 2026 6:07:27 AM Central European Standard Time Lazar,
> Lijo wrote:
>> On 21-Jan-26 11:54 PM, Alex Deucher wrote:
>>> From: Jon Doron <jond@wiz.io>
>>>
>>> On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
>>> ih2 interrupt ring buffers are not initialized. This is by design, as
>>> these secondary IH rings are only available on discrete GPUs. See
>>> vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
>>> AMD_IS_APU is set.
>>>
>>> However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
>>> get the timestamp of the last interrupt entry. When retry faults are
>>> enabled on APUs (noretry=0), this function is called from the SVM page
>>> fault recovery path, resulting in a NULL pointer dereference when
>>> amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
>>>
>>> The crash manifests as:
>>> BUG: kernel NULL pointer dereference, address: 0000000000000004
>>> RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
>>>
>>> Call Trace:
>>> amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
>>> svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
>>> amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
>>> gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
>>> amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
>>> amdgpu_ih_process+0x84/0x100 [amdgpu]
>>>
>>> This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
>>> IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
>>> noretry=1 to noretry=0, enabling retry fault handling and thus
>>> exercising the buggy code path.
>>>
>>> Fix this by adding a check for ih1.ring_size before attempting to use
>>> it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
>>> Rework retry fault removal"). This is needed if the hardware doesn't
>>> support secondary HW IH rings.
>>>
>>> v2: additional updates (Alex)
>>>
>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
>>> Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Jon Doron <jond@wiz.io>
>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>> ---
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> index 8e65fec9f534e..243d75917458a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>>>
>>> if (adev->irq.retry_cam_enabled)
>>> return;
>>> + else if (adev->irq.ih1.ring_size)
>>> + ih = &adev->irq.ih1;
>>> + else if (adev->irq.ih_soft.enabled)
>>> + ih = &adev->irq.ih_soft;
>> Faults are delegated to soft ring when retry_cam is enabled -
>> https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c#L541
> Hi,
>
> As far as I know the retry CAM is not available on APUs.
> Please correct me if I'm wrong.
Yes, that is correct. Only on ASICs without the CAM (the retry page fault
hw filter) do we use the sw retry fault filter to drop duplicate faults,
taken either from the dedicated retry fault ring IH1 or delegated from the
IH ring to the ih_soft ring.
Without an IH1 ring, retry faults on the IH ring may cause a ring overflow
and drop other important interrupts, so we should not enable XNACK on it.
With that said, the fix looks good to me.
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
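
As a rough illustration of the software retry fault filter described above, here is a minimal standalone C model. It is a sketch under stated assumptions and not the driver's implementation: every model_* name, the fixed-size table, and the use of 0 as "no expiry set" are hypothetical, and timestamp wraparound is ignored. The idea it shows is that a repeated fault on the same key is dropped while the first one is being handled, and that removal records the timestamp of the last ring entry so faults already queued are still dropped while later ones pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_FILTER_SLOTS 8	/* hypothetical table size */

struct model_fault {
	uint64_t key;		/* fault address combined with an address-space id */
	uint64_t timestamp;	/* timestamp of the fault that created the entry */
	uint64_t expiry;	/* 0: still being handled; else last ring timestamp */
	bool valid;
};

struct model_filter {
	struct model_fault slots[MODEL_FILTER_SLOTS];
	unsigned int next;	/* slot to overwrite next (round robin) */
};

static void model_filter_init(struct model_filter *f)
{
	assert(f != NULL);
	memset(f, 0, sizeof(*f));
}

/* Return true when the fault is a duplicate that should be dropped. */
static bool model_filter_fault(struct model_filter *f, uint64_t key, uint64_t ts)
{
	assert(f != NULL);

	for (unsigned int i = 0; i < MODEL_FILTER_SLOTS; i++) {
		struct model_fault *e = &f->slots[i];

		if (!e->valid || e->key != key)
			continue;
		if (e->expiry != 0 && ts > e->expiry) {
			/* Raised after the removal point: a fresh fault. */
			e->timestamp = ts;
			e->expiry = 0;
			return false;
		}
		return true;	/* duplicate, or stale entry queued earlier */
	}

	/* First time this key is seen: remember it and let it through. */
	f->slots[f->next] = (struct model_fault){
		.key = key, .timestamp = ts, .expiry = 0, .valid = true,
	};
	f->next = (f->next + 1) % MODEL_FILTER_SLOTS;
	return false;
}

/* Mark a handled fault; last_ring_ts is the timestamp of the newest ring entry. */
static void model_filter_remove(struct model_filter *f, uint64_t key,
				uint64_t last_ring_ts)
{
	assert(f != NULL);

	for (unsigned int i = 0; i < MODEL_FILTER_SLOTS; i++) {
		struct model_fault *e = &f->slots[i];

		if (e->valid && e->key == key) {
			e->expiry = last_ring_ts;
			return;
		}
	}
}
```

The last_ring_ts argument is the part that needs a valid ring in this model: it has to be read from whichever ring actually carries the retry faults, which is why a removal step cannot simply assume that IH1 exists.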
>
> Thanks,
> Timur
>
>> That matches with the original logic in dd299441654f ("drm/amdgpu: Rework
>> retry fault removal").
>>
>> To match exactly with the logic in above commit, I think it should use
>> soft ring only when retry cam is enabled. Presently, it's returning
>> without doing anything.
>>
>> Thanks,
>> Lijo
>>
>>> + else
>>> + return;
>>>
>>> - ih = &adev->irq.ih1;
>>>
>>> /* Get the WPTR of the last entry in IH ring */
>>> last_wptr = amdgpu_ih_get_wptr(adev, ih);
>>> /* Order wptr with ring data. */
>
>
>
* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
2026-01-22 15:00 ` Timur Kristóf
2026-01-22 22:47 ` Philip Yang
@ 2026-01-23 7:17 ` Lazar, Lijo
2026-01-23 8:06 ` Lazar, Lijo
1 sibling, 1 reply; 8+ messages in thread
From: Lazar, Lijo @ 2026-01-23 7:17 UTC (permalink / raw)
To: Timur Kristóf, Alex Deucher, amd-gfx; +Cc: Jon Doron, stable
On 22-Jan-26 8:30 PM, Timur Kristóf wrote:
> On Thursday, January 22, 2026 6:07:27 AM Central European Standard Time Lazar,
> Lijo wrote:
>> On 21-Jan-26 11:54 PM, Alex Deucher wrote:
>>> From: Jon Doron <jond@wiz.io>
>>>
>>> On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
>>> ih2 interrupt ring buffers are not initialized. This is by design, as
>>> these secondary IH rings are only available on discrete GPUs. See
>>> vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
>>> AMD_IS_APU is set.
>>>
>>> However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
>>> get the timestamp of the last interrupt entry. When retry faults are
>>> enabled on APUs (noretry=0), this function is called from the SVM page
>>> fault recovery path, resulting in a NULL pointer dereference when
>>> amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
>>>
>>> The crash manifests as:
>>> BUG: kernel NULL pointer dereference, address: 0000000000000004
>>> RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
>>>
>>> Call Trace:
>>> amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
>>> svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
>>> amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
>>> gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
>>> amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
>>> amdgpu_ih_process+0x84/0x100 [amdgpu]
>>>
>>> This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
>>> IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
>>> noretry=1 to noretry=0, enabling retry fault handling and thus
>>> exercising the buggy code path.
>>>
>>> Fix this by adding a check for ih1.ring_size before attempting to use
>>> it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
>>> Rework retry fault removal"). This is needed if the hardware doesn't
>>> support secondary HW IH rings.
>>>
>>> v2: additional updates (Alex)
>>>
>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
>>> Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Jon Doron <jond@wiz.io>
>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>> ---
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> index 8e65fec9f534e..243d75917458a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>>>
>>> if (adev->irq.retry_cam_enabled)
>>> return;
>>> + else if (adev->irq.ih1.ring_size)
>>> + ih = &adev->irq.ih1;
>>> + else if (adev->irq.ih_soft.enabled)
>>> + ih = &adev->irq.ih_soft;
>>
>> Faults are delegated to soft ring when retry_cam is enabled -
>> https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c#L541
>
> Hi,
>
> As far as I know the retry CAM is not available on APUs.
> Please correct me if I'm wrong.
>
Retry CAM filter is available on APUs as well.
Thanks,
Lijo
> Thanks,
> Timur
>
>>
>> That matches with the original logic in dd299441654f ("drm/amdgpu: Rework
>> retry fault removal").
>>
>> To match exactly with the logic in above commit, I think it should use
>> soft ring only when retry cam is enabled. Presently, it's returning
>> without doing anything.
>>
>> Thanks,
>> Lijo
>>
>>> + else
>>> + return;
>>>
>>> - ih = &adev->irq.ih1;
>>>
>>> /* Get the WPTR of the last entry in IH ring */
>>> last_wptr = amdgpu_ih_get_wptr(adev, ih);
>>> /* Order wptr with ring data. */
>
>
>
>
* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove
2026-01-23 7:17 ` Lazar, Lijo
@ 2026-01-23 8:06 ` Lazar, Lijo
0 siblings, 0 replies; 8+ messages in thread
From: Lazar, Lijo @ 2026-01-23 8:06 UTC (permalink / raw)
To: Timur Kristóf, Alex Deucher, amd-gfx; +Cc: Jon Doron, stable
On 23-Jan-26 12:47 PM, Lazar, Lijo wrote:
>
>
> On 22-Jan-26 8:30 PM, Timur Kristóf wrote:
>> On Thursday, January 22, 2026 6:07:27 AM Central European Standard
>> Time Lazar,
>> Lijo wrote:
>>> On 21-Jan-26 11:54 PM, Alex Deucher wrote:
>>>> From: Jon Doron <jond@wiz.io>
>>>>
>>>> On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
>>>> ih2 interrupt ring buffers are not initialized. This is by design, as
>>>> these secondary IH rings are only available on discrete GPUs. See
>>>> vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
>>>> AMD_IS_APU is set.
>>>>
>>>> However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
>>>> get the timestamp of the last interrupt entry. When retry faults are
>>>> enabled on APUs (noretry=0), this function is called from the SVM page
>>>> fault recovery path, resulting in a NULL pointer dereference when
>>>> amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
>>>>
>>>> The crash manifests as:
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000004
>>>> RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
>>>> Call Trace:
>>>> amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
>>>> svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
>>>> amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
>>>> gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
>>>> amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
>>>> amdgpu_ih_process+0x84/0x100 [amdgpu]
>>>>
>>>> This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
>>>> IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
>>>> noretry=1 to noretry=0, enabling retry fault handling and thus
>>>> exercising the buggy code path.
>>>>
>>>> Fix this by adding a check for ih1.ring_size before attempting to use
>>>> it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
>>>> Rework retry fault removal"). This is needed if the hardware doesn't
>>>> support secondary HW IH rings.
>>>>
>>>> v2: additional updates (Alex)
>>>>
>>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
>>>> Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Jon Doron <jond@wiz.io>
>>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>> ---
>>>>
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> index 8e65fec9f534e..243d75917458a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>>>>
>>>> if (adev->irq.retry_cam_enabled)
>>>> return;
>>>> + else if (adev->irq.ih1.ring_size)
>>>> + ih = &adev->irq.ih1;
>>>> + else if (adev->irq.ih_soft.enabled)
>>>> + ih = &adev->irq.ih_soft;
>>>
>>> Faults are delegated to soft ring when retry_cam is enabled -
>>> https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c#L541
>>
>> Hi,
>>
>> As far as I know the retry CAM is not available on APUs.
>> Please correct me if I'm wrong.
>>
>
> Retry CAM filter is available on APUs as well.
>
Correction - this is no longer true for newer ones.
Thanks,
Lijo
> Thanks,
> Lijo
>
>> Thanks,
>> Timur
>>
>>>
>>> That matches with the original logic in dd299441654f ("drm/amdgpu: Rework
>>> retry fault removal").
>>>
>>> To match exactly with the logic in above commit, I think it should use
>>> soft ring only when retry cam is enabled. Presently, it's returning
>>> without doing anything.
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> + else
>>>> + return;
>>>>
>>>> - ih = &adev->irq.ih1;
>>>>
>>>> /* Get the WPTR of the last entry in IH ring */
>>>> last_wptr = amdgpu_ih_get_wptr(adev, ih);
>>>> /* Order wptr with ring data. */
>>
>>
>>
>>
>