public inbox for linux-kernel@vger.kernel.org
* [PATCH] drm/amdgpu: fix zero-size GDS range init on RDNA4
From: arjan @ 2026-04-20 21:57 UTC
  To: amd-gfx
  Cc: Arjan van de Ven, Alex Deucher, Christian König, dri-devel,
	linux-kernel

From: Arjan van de Ven <arjan@linux.intel.com>

RDNA4 (GFX 12) hardware removes the GDS, GWS, and OA on-chip memory
resources. The gfx_v12_0 initialisation code correctly leaves
adev->gds.gds_size, adev->gds.gws_size, and adev->gds.oa_size at
zero to reflect this.

amdgpu_ttm_init() unconditionally calls amdgpu_ttm_init_on_chip() for
each of these resources regardless of size. When the size is zero,
amdgpu_ttm_init_on_chip() forwards the call to ttm_range_man_init(),
which calls drm_mm_init(mm, 0, 0). drm_mm_init() immediately fires
DRM_MM_BUG_ON(start + size <= start) -- trivially true when size is
zero -- crashing the kernel during modprobe of amdgpu on an RX 9070 XT.
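To see why the assertion trips, the condition can be evaluated on its own. The following is a userspace model, not kernel code. `mm_range_is_invalid()` is a hypothetical helper that mirrors the `start + size <= start` expression quoted above, using unsigned 64-bit arithmetic. A zero size makes the sum equal to `start`, so the check is true. The same expression also catches a range whose end wraps past `UINT64_MAX`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the condition checked by
 * DRM_MM_BUG_ON(start + size <= start) in drm_mm_init().
 * True when the range [start, start + size) is empty, or when
 * start + size wraps around (unsigned overflow is well defined in C).
 */
static bool mm_range_is_invalid(uint64_t start, uint64_t size)
{
	return start + size <= start;
}
```

For example, `mm_range_is_invalid(0, 0)` (the zero-sized GDS/GWS/OA case) is true, while `mm_range_is_invalid(0, 4096)` is false.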

Guard against this by returning 0 early from
amdgpu_ttm_init_on_chip() when size_in_page is zero. This skips TTM
resource manager registration for hardware resources that are absent,
without affecting any other GPU type.

Link: https://lore.kernel.org/all/bug-221376-2300@https.bugzilla.kernel.org%2F/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221376
Oops-Analysis: http://oops.fenrus.org/reports/bugzilla.korg/221376/report.html
Assisted-by: GitHub Copilot:Claude Sonnet 4.6 linux-kernel-oops-x86.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index afaaab6496def..8075ac735321e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -75,6 +75,9 @@ static int amdgpu_ttm_init_on_chip(struct amdgpu_device *adev,
 				    unsigned int type,
 				    uint64_t size_in_page)
 {
+	if (!size_in_page)
+		return 0;
+
 	return ttm_range_man_init(&adev->mman.bdev, type,
 				  false, size_in_page);
 }
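The effect of the guard can be modelled outside the kernel. This sketch illustrates the idea and is not the amdgpu code. `mock_range_man_init()` stands in for ttm_range_man_init() and uses assert() in place of DRM_MM_BUG_ON. `init_on_chip()` mirrors the patched amdgpu_ttm_init_on_chip(). Both function names and the `registered` counter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical count of range managers that were registered. */
static unsigned int registered;

/*
 * Stand-in for ttm_range_man_init(). With start == 0,
 * DRM_MM_BUG_ON(start + size <= start) reduces to size == 0;
 * assert() models that check here.
 */
static int mock_range_man_init(unsigned int type, uint64_t size_in_page)
{
	(void)type;
	assert(size_in_page != 0);
	registered++;
	return 0;
}

/* Mirrors the patched amdgpu_ttm_init_on_chip(): skip absent resources. */
static int init_on_chip(unsigned int type, uint64_t size_in_page)
{
	if (!size_in_page)
		return 0;

	return mock_range_man_init(type, size_in_page);
}
```

In this model, a call to `init_on_chip()` with a size of zero returns 0 without touching `registered`. A non-zero size still reaches the mock range manager as before.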

* Re: [PATCH] drm/amdgpu: fix zero-size GDS range init on RDNA4
From: Christian König @ 2026-04-21  6:42 UTC
  To: arjan, amd-gfx; +Cc: Alex Deucher, dri-devel, linux-kernel

On 4/20/26 23:57, arjan@linux.intel.com wrote:
> 
> RDNA4 (GFX 12) hardware removes the GDS, GWS, and OA on-chip memory
> resources. The gfx_v12_0 initialisation code correctly leaves
> adev->gds.gds_size, adev->gds.gws_size, and adev->gds.oa_size at
> zero to reflect this.
> 
> amdgpu_ttm_init() unconditionally calls amdgpu_ttm_init_on_chip() for
> each of these resources regardless of size. When the size is zero,
> amdgpu_ttm_init_on_chip() forwards the call to ttm_range_man_init(),
> which calls drm_mm_init(mm, 0, 0). drm_mm_init() immediately fires
> DRM_MM_BUG_ON(start + size <= start) -- trivially true when size is
> zero -- crashing the kernel during modprobe of amdgpu on an RX 9070 XT.

Mhm, in general not a bad idea, but we have tons of GFX 12 systems in our test machines and nothing is crashing there.

We are clearly missing something here. Is that on an upstream kernel or something backported?

Regards,
Christian.

* Re: [PATCH] drm/amdgpu: fix zero-size GDS range init on RDNA4
From: Arjan van de Ven @ 2026-04-21 11:54 UTC
  To: Christian König, amd-gfx; +Cc: Alex Deucher, dri-devel, linux-kernel

On 4/20/2026 11:42 PM, Christian König wrote:
> On 4/20/26 23:57, arjan@linux.intel.com wrote:
>>
>> RDNA4 (GFX 12) hardware removes the GDS, GWS, and OA on-chip memory
>> resources. The gfx_v12_0 initialisation code correctly leaves
>> adev->gds.gds_size, adev->gds.gws_size, and adev->gds.oa_size at
>> zero to reflect this.
>>
>> amdgpu_ttm_init() unconditionally calls amdgpu_ttm_init_on_chip() for
>> each of these resources regardless of size. When the size is zero,
>> amdgpu_ttm_init_on_chip() forwards the call to ttm_range_man_init(),
>> which calls drm_mm_init(mm, 0, 0). drm_mm_init() immediately fires
>> DRM_MM_BUG_ON(start + size <= start) -- trivially true when size is
>> zero -- crashing the kernel during modprobe of amdgpu on an RX 9070 XT.
> 
> Mhm, in general not a bad idea, but we have tons of GFX 12 systems in our test machines and nothing is crashing there.
> 
> We are clearly missing something here. Is that on an upstream kernel or something backported?
> 

The reported oops and logs say 6.18.22, so it does not sound like anything crazy was backported.

(https://bugzilla.kernel.org/show_bug.cgi?id=221376)

* Re: [PATCH] drm/amdgpu: fix zero-size GDS range init on RDNA4
From: Alex Deucher @ 2026-04-21 13:42 UTC
  To: Christian König
  Cc: arjan, amd-gfx, Alex Deucher, dri-devel, linux-kernel

On Tue, Apr 21, 2026 at 2:59 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 4/20/26 23:57, arjan@linux.intel.com wrote:
> >
> > RDNA4 (GFX 12) hardware removes the GDS, GWS, and OA on-chip memory
> > resources. The gfx_v12_0 initialisation code correctly leaves
> > adev->gds.gds_size, adev->gds.gws_size, and adev->gds.oa_size at
> > zero to reflect this.
> >
> > amdgpu_ttm_init() unconditionally calls amdgpu_ttm_init_on_chip() for
> > each of these resources regardless of size. When the size is zero,
> > amdgpu_ttm_init_on_chip() forwards the call to ttm_range_man_init(),
> > which calls drm_mm_init(mm, 0, 0). drm_mm_init() immediately fires
> > DRM_MM_BUG_ON(start + size <= start) -- trivially true when size is
> > zero -- crashing the kernel during modprobe of amdgpu on an RX 9070 XT.
>
> Mhm, in general not a bad idea, but we have tons of GFX 12 systems in our test machines and nothing is crashing there.
>
> We are clearly missing something here. Is that on an upstream kernel or something backported?

Looks like that check only asserts if CONFIG_DRM_DEBUG_MM is set in
the user's kernel config.  I guess no one uses that option.  These
chips have been on the market for over a year and no one has reported
it until now.  Applied with a note about this in the commit message.
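The config gating described above can be illustrated with a userspace model of a conditionally compiled assertion. The macro name `MM_BUG_ON`, the `DEBUG_MM_SKETCH` switch, and the helper functions are hypothetical. The non-debug branch assumes the same approach as the kernel's BUILD_BUG_ON_INVALID(): the expression sits inside `sizeof`, so it is type-checked but never evaluated at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter: how many times the range check actually ran. */
static unsigned int checks_evaluated;

static bool range_check(uint64_t start, uint64_t size)
{
	checks_evaluated++;
	return start + size <= start;
}

#ifdef DEBUG_MM_SKETCH
/* Debug build: evaluate the expression and abort if it is true. */
#define MM_BUG_ON(expr) assert(!(expr))
#else
/* Default build: the expression is only an operand of sizeof, never evaluated. */
#define MM_BUG_ON(expr) ((void)sizeof((long)(expr)))
#endif

/* Hypothetical stand-in for drm_mm_init(): returns 0 on success. */
static int mm_init_sketch(uint64_t start, uint64_t size)
{
	MM_BUG_ON(range_check(start, size));
	return 0;
}
```

Built without `-DDEBUG_MM_SKETCH`, `mm_init_sketch(0, 0)` returns 0 and `checks_evaluated` stays at 0. Built with it, the same call aborts. This is consistent with a zero-sized range going unnoticed on kernels built without CONFIG_DRM_DEBUG_MM.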

Thanks!

Alex