* [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3
@ 2026-03-18 10:30 Srinivasan Shanmugam
2026-03-18 10:33 ` Christian König
2026-03-18 10:57 ` Lazar, Lijo
0 siblings, 2 replies; 8+ messages in thread
From: Srinivasan Shanmugam @ 2026-03-18 10:30 UTC (permalink / raw)
To: Christian König, Alex Deucher
Cc: amd-gfx, Srinivasan Shanmugam, Pierre-Eric Pelloux-Prayer
When a GPU fault or timeout happens, the driver creates a devcoredump
to collect debug information.
During this, amdgpu_devcoredump_format() calls
amdgpu_discovery_dump() to print IP discovery data.
amdgpu_discovery_dump() uses:
adev->discovery.ip_top
and then accesses:
ip_top->die_kset
amdgpu_discovery_dump() uses adev->discovery.ip_top. However,
ip_top may be NULL if the discovery topology was never initialized.
The current code does not check for this before using ip_top. As a
result, when ip_top is NULL, the coredump worker crashes while taking
the spinlock for ip_top->die_kset.
Fix this by checking for a missing ip_top before walking the discovery
topology. If it is unavailable, print a short message in the dump and
return safely.
- If ip_top is NULL, print a message and skip the dump
- Also add the same check in the cleanup path
This makes the coredump and cleanup paths safe even when the
discovery topology is not available.
KASAN trace:
[ 522.228252] [IGT] amd_deadlock: starting subtest amdgpu-deadlock-sdma
[ 522.240681] [IGT] amd_deadlock: starting dynamic subtest amdgpu-deadlock-sdma
...
[ 522.952317] Write of size 4 at addr 0000000000000050 by task kworker/u129:5/5434
[ 522.937526] BUG: KASAN: null-ptr-deref in _raw_spin_lock+0x66/0xc0
[ 522.967659] Workqueue: events_unbound amdgpu_devcoredump_deferred_work [amdgpu]
...
[ 522.969445] Call Trace:
[ 522.969508] _raw_spin_lock+0x66/0xc0
[ 522.969518] ? __pfx__raw_spin_lock+0x10/0x10
[ 522.969534] amdgpu_discovery_dump+0x61/0x530 [amdgpu]
[ 522.971346] ? pick_next_task_fair+0x3f6/0x1c60
[ 522.971363] amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu]
[ 522.973188] ? __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu]
[ 522.975012] ? psi_task_switch+0x2b5/0x9b0
[ 522.975027] ? __pfx___drm_printfn_coredump+0x10/0x10 [drm]
[ 522.975198] ? __pfx___drm_puts_coredump+0x10/0x10 [drm]
[ 522.975366] ? __schedule+0x113c/0x38d0
[ 522.975381] amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu]
v2: Updated commit message - Clarified that ip_top is not freed, it can
just be NULL if discovery was not initialized. (Christian/Lijo)
v3: Removed the extra drm_warn() for sysfs init failure as sysfs already
reports errors. (Christian)
Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in devcoredump")
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index f7f37d93d0ce..6be1f971a31a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct amdgpu_device *adev)
struct list_head *el, *tmp;
struct kset *die_kset;
+ if (!ip_top)
+ return;
+
die_kset = &ip_top->die_kset;
spin_lock(&die_kset->list_lock);
list_for_each_prev_safe(el, tmp, &die_kset->list) {
@@ -1419,9 +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, struct drm_printer *p)
struct ip_hw_instance *ip_inst;
int i = 0, j;
+ drm_printf(p, "\nHW IP Discovery\n");
+
+ if (!ip_top) {
+ drm_printf(p, "ip discovery topology unavailable\n");
+ return;
+ }
+
die_kset = &ip_top->die_kset;
- drm_printf(p, "\nHW IP Discovery\n");
spin_lock(&die_kset->list_lock);
list_for_each(el_die, &die_kset->list) {
drm_printf(p, "die %d\n", i++);
--
2.34.1
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 2026-03-18 10:30 [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 Srinivasan Shanmugam @ 2026-03-18 10:33 ` Christian König 2026-03-18 10:57 ` Lazar, Lijo 1 sibling, 0 replies; 8+ messages in thread From: Christian König @ 2026-03-18 10:33 UTC (permalink / raw) To: Srinivasan Shanmugam, Alex Deucher; +Cc: amd-gfx, Pierre-Eric Pelloux-Prayer On 3/18/26 11:30, Srinivasan Shanmugam wrote: > When a GPU fault or timeout happens, the driver creates a devcoredump > to collect debug information. > > During this, amdgpu_devcoredump_format() calls > amdgpu_discovery_dump() to print IP discovery data. > > amdgpu_discovery_dump() uses: > adev->discovery.ip_top > > and then accesses: > ip_top->die_kset > > amdgpu_discovery_dump() uses adev->discovery.ip_top. However, > ip_top may be NULL if the discovery topology was never initialized. > > The current code does not check for this before using ip_top. As a > result, when ip_top is NULL, the coredump worker crashes while taking > the spinlock for ip_top->die_kset. > > Fix this by checking for a missing ip_top before walking the discovery > topology. If it is unavailable, print a short message in the dump and > return safely. > > - If ip_top is NULL, print a message and skip the dump > - Also add the same check in the cleanup path > > This makes the coredump and cleanup paths safe even when the > discovery topology is not available. > > KASAN trace: > [ 522.228252] [IGT] amd_deadlock: starting subtest amdgpu-deadlock-sdma > [ 522.240681] [IGT] amd_deadlock: starting dynamic subtest amdgpu-deadlock-sdma > > ... > > [ 522.952317] Write of size 4 at addr 0000000000000050 by task kworker/u129:5/5434 > [ 522.937526] BUG: KASAN: null-ptr-deref in _raw_spin_lock+0x66/0xc0 > [ 522.967659] Workqueue: events_unbound amdgpu_devcoredump_deferred_work [amdgpu] > > ... > > [ 522.969445] Call Trace: > [ 522.969508] _raw_spin_lock+0x66/0xc0 > [ 522.969518] ? __pfx__raw_spin_lock+0x10/0x10 > [ 522.969534] amdgpu_discovery_dump+0x61/0x530 [amdgpu] > [ 522.971346] ? pick_next_task_fair+0x3f6/0x1c60 > [ 522.971363] amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] > [ 522.973188] ? __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] > [ 522.975012] ? psi_task_switch+0x2b5/0x9b0 > [ 522.975027] ? __pfx___drm_printfn_coredump+0x10/0x10 [drm] > [ 522.975198] ? __pfx___drm_puts_coredump+0x10/0x10 [drm] > [ 522.975366] ? __schedule+0x113c/0x38d0 > [ 522.975381] amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] > > v2: Updated commit message - Clarified that ip_top is not freed, it can > just be NULL if discovery was not initialized. (Christian/Lijo) > > v3: Removed the extra drm_warn() for sysfs init failure as sysfs already > reports errors. (Christian) > > Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in devcoredump") > Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> > Cc: Christian König <christian.koenig@amd.com> > Cc: Alex Deucher <alexander.deucher@amd.com> > Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > index f7f37d93d0ce..6be1f971a31a 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct amdgpu_device *adev) > struct list_head *el, *tmp; > struct kset *die_kset; > > + if (!ip_top) > + return; > + > die_kset = &ip_top->die_kset; > spin_lock(&die_kset->list_lock); > list_for_each_prev_safe(el, tmp, &die_kset->list) { > @@ -1419,9 +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, struct drm_printer *p) > struct ip_hw_instance *ip_inst; > int i = 0, j; > > + drm_printf(p, "\nHW IP Discovery\n"); > + > + if (!ip_top) { > + drm_printf(p, "ip discovery topology unavailable\n"); > + return; > + } > + > die_kset = &ip_top->die_kset; > > - drm_printf(p, "\nHW IP Discovery\n"); > spin_lock(&die_kset->list_lock); > list_for_each(el_die, &die_kset->list) { > drm_printf(p, "die %d\n", i++); ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 2026-03-18 10:30 [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 Srinivasan Shanmugam 2026-03-18 10:33 ` Christian König @ 2026-03-18 10:57 ` Lazar, Lijo 2026-03-18 11:11 ` SHANMUGAM, SRINIVASAN 1 sibling, 1 reply; 8+ messages in thread From: Lazar, Lijo @ 2026-03-18 10:57 UTC (permalink / raw) To: Srinivasan Shanmugam, Christian König, Alex Deucher Cc: amd-gfx, Pierre-Eric Pelloux-Prayer On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote: > When a GPU fault or timeout happens, the driver creates a devcoredump > to collect debug information. > > During this, amdgpu_devcoredump_format() calls > amdgpu_discovery_dump() to print IP discovery data. > > amdgpu_discovery_dump() uses: > adev->discovery.ip_top > > and then accesses: > ip_top->die_kset > > amdgpu_discovery_dump() uses adev->discovery.ip_top. However, > ip_top may be NULL if the discovery topology was never initialized. > > The current code does not check for this before using ip_top. As a > result, when ip_top is NULL, the coredump worker crashes while taking > the spinlock for ip_top->die_kset. > > Fix this by checking for a missing ip_top before walking the discovery > topology. If it is unavailable, print a short message in the dump and > return safely. > > - If ip_top is NULL, print a message and skip the dump > - Also add the same check in the cleanup path > > This makes the coredump and cleanup paths safe even when the > discovery topology is not available. > > KASAN trace: > [ 522.228252] [IGT] amd_deadlock: starting subtest amdgpu-deadlock-sdma > [ 522.240681] [IGT] amd_deadlock: starting dynamic subtest amdgpu-deadlock-sdma > > ... > > [ 522.952317] Write of size 4 at addr 0000000000000050 by task kworker/u129:5/5434 > [ 522.937526] BUG: KASAN: null-ptr-deref in _raw_spin_lock+0x66/0xc0 > [ 522.967659] Workqueue: events_unbound amdgpu_devcoredump_deferred_work [amdgpu] > > ... > > [ 522.969445] Call Trace: > [ 522.969508] _raw_spin_lock+0x66/0xc0 > [ 522.969518] ? __pfx__raw_spin_lock+0x10/0x10 > [ 522.969534] amdgpu_discovery_dump+0x61/0x530 [amdgpu] > [ 522.971346] ? pick_next_task_fair+0x3f6/0x1c60 > [ 522.971363] amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] > [ 522.973188] ? __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] > [ 522.975012] ? psi_task_switch+0x2b5/0x9b0 > [ 522.975027] ? __pfx___drm_printfn_coredump+0x10/0x10 [drm] > [ 522.975198] ? __pfx___drm_puts_coredump+0x10/0x10 [drm] > [ 522.975366] ? __schedule+0x113c/0x38d0 > [ 522.975381] amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] > > v2: Updated commit message - Clarified that ip_top is not freed, it can > just be NULL if discovery was not initialized. (Christian/Lijo) > > v3: Removed the extra drm_warn() for sysfs init failure as sysfs already > reports errors. (Christian) > > Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in devcoredump") > Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> > Cc: Christian König <christian.koenig@amd.com> > Cc: Alex Deucher <alexander.deucher@amd.com> > Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > index f7f37d93d0ce..6be1f971a31a 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct amdgpu_device *adev) > struct list_head *el, *tmp; > struct kset *die_kset; > > + if (!ip_top) > + return; > + > die_kset = &ip_top->die_kset; > spin_lock(&die_kset->list_lock); > list_for_each_prev_safe(el, tmp, &die_kset->list) { > @@ -1419,9 +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, struct drm_printer *p) > struct ip_hw_instance *ip_inst; > int i = 0, j; > > + drm_printf(p, "\nHW IP Discovery\n"); > + > + if (!ip_top) { > + drm_printf(p, "ip discovery topology unavailable\n"); Is this type of printing really required or just skipping the whole section good enough? Thanks, Lijo > + return; > + } > + > die_kset = &ip_top->die_kset; > > - drm_printf(p, "\nHW IP Discovery\n"); > spin_lock(&die_kset->list_lock); > list_for_each(el_die, &die_kset->list) { > drm_printf(p, "die %d\n", i++); ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 2026-03-18 10:57 ` Lazar, Lijo @ 2026-03-18 11:11 ` SHANMUGAM, SRINIVASAN 2026-03-19 4:15 ` Lazar, Lijo 0 siblings, 1 reply; 8+ messages in thread From: SHANMUGAM, SRINIVASAN @ 2026-03-18 11:11 UTC (permalink / raw) To: Lazar, Lijo, Koenig, Christian, Deucher, Alexander Cc: amd-gfx@lists.freedesktop.org, Pelloux-Prayer, Pierre-Eric [AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@amd.com> > Sent: Wednesday, March 18, 2026 4:28 PM > To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; > Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric <Pierre- > eric.Pelloux-prayer@amd.com> > Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery > topology coredump path v3 > > > > On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote: > > When a GPU fault or timeout happens, the driver creates a devcoredump > > to collect debug information. > > > > During this, amdgpu_devcoredump_format() calls > > amdgpu_discovery_dump() to print IP discovery data. > > > > amdgpu_discovery_dump() uses: > > adev->discovery.ip_top > > > > and then accesses: > > ip_top->die_kset > > > > amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top > > may be NULL if the discovery topology was never initialized. > > > > The current code does not check for this before using ip_top. As a > > result, when ip_top is NULL, the coredump worker crashes while taking > > the spinlock for ip_top->die_kset. > > > > Fix this by checking for a missing ip_top before walking the discovery > > topology. If it is unavailable, print a short message in the dump and > > return safely. > > > > - If ip_top is NULL, print a message and skip the dump > > - Also add the same check in the cleanup path > > > > This makes the coredump and cleanup paths safe even when the discovery > > topology is not available. > > > > KASAN trace: > > [ 522.228252] [IGT] amd_deadlock: starting subtest > > amdgpu-deadlock-sdma [ 522.240681] [IGT] amd_deadlock: starting > > dynamic subtest amdgpu-deadlock-sdma > > > > ... > > > > [ 522.952317] Write of size 4 at addr 0000000000000050 by task > > kworker/u129:5/5434 [ 522.937526] BUG: KASAN: null-ptr-deref in > > _raw_spin_lock+0x66/0xc0 [ 522.967659] Workqueue: events_unbound > > amdgpu_devcoredump_deferred_work [amdgpu] > > > > ... > > > > [ 522.969445] Call Trace: > > [ 522.969508] _raw_spin_lock+0x66/0xc0 [ 522.969518] ? > > __pfx__raw_spin_lock+0x10/0x10 [ 522.969534] > > amdgpu_discovery_dump+0x61/0x530 [amdgpu] [ 522.971346] ? > > pick_next_task_fair+0x3f6/0x1c60 [ 522.971363] > > amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [ 522.973188] ? > > __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [ 522.975012] ? > > psi_task_switch+0x2b5/0x9b0 [ 522.975027] ? > > __pfx___drm_printfn_coredump+0x10/0x10 [drm] [ 522.975198] ? > > __pfx___drm_puts_coredump+0x10/0x10 [drm] [ 522.975366] ? > > __schedule+0x113c/0x38d0 [ 522.975381] > > amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] > > > > v2: Updated commit message - Clarified that ip_top is not freed, it can > > just be NULL if discovery was not initialized. (Christian/Lijo) > > > > v3: Removed the extra drm_warn() for sysfs init failure as sysfs already > > reports errors. (Christian) > > > > Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in > > devcoredump") > > Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> > > Cc: Christian König <christian.koenig@amd.com> > > Cc: Alex Deucher <alexander.deucher@amd.com> > > Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- > > 1 file changed, 10 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > > index f7f37d93d0ce..6be1f971a31a 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > > @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct > amdgpu_device *adev) > > struct list_head *el, *tmp; > > struct kset *die_kset; > > > > + if (!ip_top) > > + return; > > + > > die_kset = &ip_top->die_kset; > > spin_lock(&die_kset->list_lock); > > list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9 > > +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, > struct drm_printer *p) > > struct ip_hw_instance *ip_inst; > > int i = 0, j; > > > > + drm_printf(p, "\nHW IP Discovery\n"); > > + > > + if (!ip_top) { > > + drm_printf(p, "ip discovery topology unavailable\n"); > > Is this type of printing really required or just skipping the whole section good > enough? Silently skipping the rest may look like incomplete or missing data in the coredump. Adding a one-line message makes it clear that the topology was not available, rather than leaving an empty section. Best regards, Srini > > Thanks, > Lijo > > > + return; > > + } > > + > > die_kset = &ip_top->die_kset; > > > > - drm_printf(p, "\nHW IP Discovery\n"); > > spin_lock(&die_kset->list_lock); > > list_for_each(el_die, &die_kset->list) { > > drm_printf(p, "die %d\n", i++); ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 2026-03-18 11:11 ` SHANMUGAM, SRINIVASAN @ 2026-03-19 4:15 ` Lazar, Lijo 2026-03-19 4:47 ` SHANMUGAM, SRINIVASAN 0 siblings, 1 reply; 8+ messages in thread From: Lazar, Lijo @ 2026-03-19 4:15 UTC (permalink / raw) To: SHANMUGAM, SRINIVASAN, Koenig, Christian, Deucher, Alexander Cc: amd-gfx@lists.freedesktop.org, Pelloux-Prayer, Pierre-Eric On 18-Mar-26 4:41 PM, SHANMUGAM, SRINIVASAN wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > >> -----Original Message----- >> From: Lazar, Lijo <Lijo.Lazar@amd.com> >> Sent: Wednesday, March 18, 2026 4:28 PM >> To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; >> Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander >> <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric <Pierre- >> eric.Pelloux-prayer@amd.com> >> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery >> topology coredump path v3 >> >> >> >> On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote: >>> When a GPU fault or timeout happens, the driver creates a devcoredump >>> to collect debug information. >>> >>> During this, amdgpu_devcoredump_format() calls >>> amdgpu_discovery_dump() to print IP discovery data. >>> >>> amdgpu_discovery_dump() uses: >>> adev->discovery.ip_top >>> >>> and then accesses: >>> ip_top->die_kset >>> >>> amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top >>> may be NULL if the discovery topology was never initialized. >>> >>> The current code does not check for this before using ip_top. As a >>> result, when ip_top is NULL, the coredump worker crashes while taking >>> the spinlock for ip_top->die_kset. >>> >>> Fix this by checking for a missing ip_top before walking the discovery >>> topology. If it is unavailable, print a short message in the dump and >>> return safely. >>> >>> - If ip_top is NULL, print a message and skip the dump >>> - Also add the same check in the cleanup path >>> >>> This makes the coredump and cleanup paths safe even when the discovery >>> topology is not available. >>> >>> KASAN trace: >>> [ 522.228252] [IGT] amd_deadlock: starting subtest >>> amdgpu-deadlock-sdma [ 522.240681] [IGT] amd_deadlock: starting >>> dynamic subtest amdgpu-deadlock-sdma >>> >>> ... >>> >>> [ 522.952317] Write of size 4 at addr 0000000000000050 by task >>> kworker/u129:5/5434 [ 522.937526] BUG: KASAN: null-ptr-deref in >>> _raw_spin_lock+0x66/0xc0 [ 522.967659] Workqueue: events_unbound >>> amdgpu_devcoredump_deferred_work [amdgpu] >>> >>> ... >>> >>> [ 522.969445] Call Trace: >>> [ 522.969508] _raw_spin_lock+0x66/0xc0 [ 522.969518] ? >>> __pfx__raw_spin_lock+0x10/0x10 [ 522.969534] >>> amdgpu_discovery_dump+0x61/0x530 [amdgpu] [ 522.971346] ? >>> pick_next_task_fair+0x3f6/0x1c60 [ 522.971363] >>> amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [ 522.973188] ? >>> __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [ 522.975012] ? >>> psi_task_switch+0x2b5/0x9b0 [ 522.975027] ? >>> __pfx___drm_printfn_coredump+0x10/0x10 [drm] [ 522.975198] ? >>> __pfx___drm_puts_coredump+0x10/0x10 [drm] [ 522.975366] ? >>> __schedule+0x113c/0x38d0 [ 522.975381] >>> amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] >>> >>> v2: Updated commit message - Clarified that ip_top is not freed, it can >>> just be NULL if discovery was not initialized. (Christian/Lijo) >>> >>> v3: Removed the extra drm_warn() for sysfs init failure as sysfs already >>> reports errors. (Christian) >>> >>> Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in >>> devcoredump") >>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> >>> Cc: Christian König <christian.koenig@amd.com> >>> Cc: Alex Deucher <alexander.deucher@amd.com> >>> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- >>> 1 file changed, 10 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >>> index f7f37d93d0ce..6be1f971a31a 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >>> @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct >> amdgpu_device *adev) >>> struct list_head *el, *tmp; >>> struct kset *die_kset; >>> >>> + if (!ip_top) >>> + return; >>> + >>> die_kset = &ip_top->die_kset; >>> spin_lock(&die_kset->list_lock); >>> list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9 >>> +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, >> struct drm_printer *p) >>> struct ip_hw_instance *ip_inst; >>> int i = 0, j; >>> >>> + drm_printf(p, "\nHW IP Discovery\n"); >>> + >>> + if (!ip_top) { >>> + drm_printf(p, "ip discovery topology unavailable\n"); >> >> Is this type of printing really required or just skipping the whole section good >> enough? > > > Silently skipping the rest may look like incomplete or missing data in the > coredump. > > Adding a one-line message makes it clear that the topology was not > available, rather than leaving an empty section. > Here is my take - Discovery is the basic requirement for SOCs which make use of that mechanism and it is always expected to be present for those, otherwise driver load will fail. For those which don't make use of discovery, the section will not be present. There is no special message required for that. There is no harm to keep that, it only adds extra parsing. Thanks, Lijo > Best regards, > Srini > >> >> Thanks, >> Lijo >> >>> + return; >>> + } >>> + >>> die_kset = &ip_top->die_kset; >>> >>> - drm_printf(p, "\nHW IP Discovery\n"); >>> spin_lock(&die_kset->list_lock); >>> list_for_each(el_die, &die_kset->list) { >>> drm_printf(p, "die %d\n", i++); > ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 2026-03-19 4:15 ` Lazar, Lijo @ 2026-03-19 4:47 ` SHANMUGAM, SRINIVASAN 2026-03-19 5:00 ` Lazar, Lijo 0 siblings, 1 reply; 8+ messages in thread From: SHANMUGAM, SRINIVASAN @ 2026-03-19 4:47 UTC (permalink / raw) To: Lazar, Lijo, Koenig, Christian, Deucher, Alexander Cc: amd-gfx@lists.freedesktop.org, Pelloux-Prayer, Pierre-Eric [AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@amd.com> > Sent: Thursday, March 19, 2026 9:46 AM > To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; > Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric <Pierre- > eric.Pelloux-prayer@amd.com> > Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery > topology coredump path v3 > > > > On 18-Mar-26 4:41 PM, SHANMUGAM, SRINIVASAN wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > >> -----Original Message----- > >> From: Lazar, Lijo <Lijo.Lazar@amd.com> > >> Sent: Wednesday, March 18, 2026 4:28 PM > >> To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; > Koenig, > >> Christian <Christian.Koenig@amd.com>; Deucher, Alexander > >> <Alexander.Deucher@amd.com> > >> Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric > >> <Pierre- eric.Pelloux-prayer@amd.com> > >> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in > >> discovery topology coredump path v3 > >> > >> > >> > >> On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote: > >>> When a GPU fault or timeout happens, the driver creates a > >>> devcoredump to collect debug information. > >>> > >>> During this, amdgpu_devcoredump_format() calls > >>> amdgpu_discovery_dump() to print IP discovery data. > >>> > >>> amdgpu_discovery_dump() uses: > >>> adev->discovery.ip_top > >>> > >>> and then accesses: > >>> ip_top->die_kset > >>> > >>> amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top > >>> may be NULL if the discovery topology was never initialized. > >>> > >>> The current code does not check for this before using ip_top. As a > >>> result, when ip_top is NULL, the coredump worker crashes while > >>> taking the spinlock for ip_top->die_kset. > >>> > >>> Fix this by checking for a missing ip_top before walking the > >>> discovery topology. If it is unavailable, print a short message in > >>> the dump and return safely. > >>> > >>> - If ip_top is NULL, print a message and skip the dump > >>> - Also add the same check in the cleanup path > >>> > >>> This makes the coredump and cleanup paths safe even when the > >>> discovery topology is not available. > >>> > >>> KASAN trace: > >>> [ 522.228252] [IGT] amd_deadlock: starting subtest > >>> amdgpu-deadlock-sdma [ 522.240681] [IGT] amd_deadlock: starting > >>> dynamic subtest amdgpu-deadlock-sdma > >>> > >>> ... > >>> > >>> [ 522.952317] Write of size 4 at addr 0000000000000050 by task > >>> kworker/u129:5/5434 [ 522.937526] BUG: KASAN: null-ptr-deref in > >>> _raw_spin_lock+0x66/0xc0 [ 522.967659] Workqueue: events_unbound > >>> amdgpu_devcoredump_deferred_work [amdgpu] > >>> > >>> ... > >>> > >>> [ 522.969445] Call Trace: > >>> [ 522.969508] _raw_spin_lock+0x66/0xc0 [ 522.969518] ? > >>> __pfx__raw_spin_lock+0x10/0x10 [ 522.969534] > >>> amdgpu_discovery_dump+0x61/0x530 [amdgpu] [ 522.971346] ? > >>> pick_next_task_fair+0x3f6/0x1c60 [ 522.971363] > >>> amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [ 522.973188] ? > >>> __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [ 522.975012] ? > >>> psi_task_switch+0x2b5/0x9b0 [ 522.975027] ? > >>> __pfx___drm_printfn_coredump+0x10/0x10 [drm] [ 522.975198] ? > >>> __pfx___drm_puts_coredump+0x10/0x10 [drm] [ 522.975366] ? > >>> __schedule+0x113c/0x38d0 [ 522.975381] > >>> amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] > >>> > >>> v2: Updated commit message - Clarified that ip_top is not freed, it can > >>> just be NULL if discovery was not initialized. > >>> (Christian/Lijo) > >>> > >>> v3: Removed the extra drm_warn() for sysfs init failure as sysfs already > >>> reports errors. (Christian) > >>> > >>> Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in > >>> devcoredump") > >>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> > >>> Cc: Christian König <christian.koenig@amd.com> > >>> Cc: Alex Deucher <alexander.deucher@amd.com> > >>> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- > >>> 1 file changed, 10 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> index f7f37d93d0ce..6be1f971a31a 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct > >> amdgpu_device *adev) > >>> struct list_head *el, *tmp; > >>> struct kset *die_kset; > >>> > >>> + if (!ip_top) > >>> + return; > >>> + > >>> die_kset = &ip_top->die_kset; > >>> spin_lock(&die_kset->list_lock); > >>> list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9 > >>> +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, > >> struct drm_printer *p) > >>> struct ip_hw_instance *ip_inst; > >>> int i = 0, j; > >>> > >>> + drm_printf(p, "\nHW IP Discovery\n"); > >>> + > >>> + if (!ip_top) { > >>> + drm_printf(p, "ip discovery topology unavailable\n"); > >> > >> Is this type of printing really required or just skipping the whole > >> section good enough? > > > > > > Silently skipping the rest may look like incomplete or missing data in > > the coredump. > > > > Adding a one-line message makes it clear that the topology was not > > available, rather than leaving an empty section. > > > > Here is my take - Discovery is the basic requirement for SOCs which make use of > that mechanism and it is always expected to be present for those, otherwise driver > load will fail. > > For those which don't make use of discovery, the section will not be present. There > is no special message required for that. There is no harm to keep that, it only adds > extra parsing. It's ok for normal cases. However, this runs in the *coredump/debug path during error conditions* - where the discovery topology may not be initialized. instead of leaving the section empty, which can be confusing during debugging. Best, Srini ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 2026-03-19 4:47 ` SHANMUGAM, SRINIVASAN @ 2026-03-19 5:00 ` Lazar, Lijo 2026-03-19 7:38 ` Christian König 0 siblings, 1 reply; 8+ messages in thread From: Lazar, Lijo @ 2026-03-19 5:00 UTC (permalink / raw) To: SHANMUGAM, SRINIVASAN, Koenig, Christian, Deucher, Alexander Cc: amd-gfx@lists.freedesktop.org, Pelloux-Prayer, Pierre-Eric [-- Attachment #1: Type: text/plain, Size: 7241 bytes --] [Public] To run coredump logic, driver needs to be loaded first. For driver load to work on all new SOCs, it needs discovery. Other SOCs which depend on pci id mechanism don't need discovery and they don't have discovery section either. Thanks, Lijo ________________________________ From: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com> Sent: Thursday, March 19, 2026 10:17:09 AM To: Lazar, Lijo <Lijo.Lazar@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> Cc: amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Pelloux-Prayer, Pierre-Eric <Pierre-eric.Pelloux-prayer@amd.com> Subject: RE: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 [AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@amd.com> > Sent: Thursday, March 19, 2026 9:46 AM > To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; > Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric <Pierre- > eric.Pelloux-prayer@amd.com> > Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery > topology coredump path v3 > > > > On 18-Mar-26 4:41 PM, SHANMUGAM, SRINIVASAN wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > >> -----Original Message----- > >> From: Lazar, Lijo <Lijo.Lazar@amd.com> > >> Sent: Wednesday, March 18, 2026 4:28 PM > >> To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; > Koenig, > >> Christian <Christian.Koenig@amd.com>; Deucher, Alexander > >> <Alexander.Deucher@amd.com> > >> Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric > >> <Pierre- eric.Pelloux-prayer@amd.com> > >> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in > >> discovery topology coredump path v3 > >> > >> > >> > >> On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote: > >>> When a GPU fault or timeout happens, the driver creates a > >>> devcoredump to collect debug information. > >>> > >>> During this, amdgpu_devcoredump_format() calls > >>> amdgpu_discovery_dump() to print IP discovery data. > >>> > >>> amdgpu_discovery_dump() uses: > >>> adev->discovery.ip_top > >>> > >>> and then accesses: > >>> ip_top->die_kset > >>> > >>> amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top > >>> may be NULL if the discovery topology was never initialized. > >>> > >>> The current code does not check for this before using ip_top. As a > >>> result, when ip_top is NULL, the coredump worker crashes while > >>> taking the spinlock for ip_top->die_kset. > >>> > >>> Fix this by checking for a missing ip_top before walking the > >>> discovery topology. If it is unavailable, print a short message in > >>> the dump and return safely. > >>> > >>> - If ip_top is NULL, print a message and skip the dump > >>> - Also add the same check in the cleanup path > >>> > >>> This makes the coredump and cleanup paths safe even when the > >>> discovery topology is not available. > >>> > >>> KASAN trace: > >>> [ 522.228252] [IGT] amd_deadlock: starting subtest > >>> amdgpu-deadlock-sdma [ 522.240681] [IGT] amd_deadlock: starting > >>> dynamic subtest amdgpu-deadlock-sdma > >>> > >>> ... > >>> > >>> [ 522.952317] Write of size 4 at addr 0000000000000050 by task > >>> kworker/u129:5/5434 [ 522.937526] BUG: KASAN: null-ptr-deref in > >>> _raw_spin_lock+0x66/0xc0 [ 522.967659] Workqueue: events_unbound > >>> amdgpu_devcoredump_deferred_work [amdgpu] > >>> > >>> ... > >>> > >>> [ 522.969445] Call Trace: > >>> [ 522.969508] _raw_spin_lock+0x66/0xc0 [ 522.969518] ? > >>> __pfx__raw_spin_lock+0x10/0x10 [ 522.969534] > >>> amdgpu_discovery_dump+0x61/0x530 [amdgpu] [ 522.971346] ? > >>> pick_next_task_fair+0x3f6/0x1c60 [ 522.971363] > >>> amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [ 522.973188] ? > >>> __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [ 522.975012] ? > >>> psi_task_switch+0x2b5/0x9b0 [ 522.975027] ? > >>> __pfx___drm_printfn_coredump+0x10/0x10 [drm] [ 522.975198] ? > >>> __pfx___drm_puts_coredump+0x10/0x10 [drm] [ 522.975366] ? > >>> __schedule+0x113c/0x38d0 [ 522.975381] > >>> amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] > >>> > >>> v2: Updated commit message - Clarified that ip_top is not freed, it can > >>> just be NULL if discovery was not initialized. > >>> (Christian/Lijo) > >>> > >>> v3: Removed the extra drm_warn() for sysfs init failure as sysfs already > >>> reports errors. (Christian) > >>> > >>> Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in > >>> devcoredump") > >>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> > >>> Cc: Christian König <christian.koenig@amd.com> > >>> Cc: Alex Deucher <alexander.deucher@amd.com> > >>> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- > >>> 1 file changed, 10 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> index f7f37d93d0ce..6be1f971a31a 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c > >>> @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct > >> amdgpu_device *adev) > >>> struct list_head *el, *tmp; > >>> struct kset *die_kset; > >>> > >>> + if (!ip_top) > >>> + return; > >>> + > >>> die_kset = &ip_top->die_kset; > >>> spin_lock(&die_kset->list_lock); > >>> list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9 > >>> +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, > >> struct drm_printer *p) > >>> struct ip_hw_instance *ip_inst; > >>> int i = 0, j; > >>> > >>> + drm_printf(p, "\nHW IP Discovery\n"); > >>> + > >>> + if (!ip_top) { > >>> + drm_printf(p, "ip discovery topology unavailable\n"); > >> > >> Is this type of printing really required or just skipping the whole > >> section good enough? > > > > > > Silently skipping the rest may look like incomplete or missing data in > > the coredump. > > > > Adding a one-line message makes it clear that the topology was not > > available, rather than leaving an empty section. > > > > Here is my take - Discovery is the basic requirement for SOCs which make use of > that mechanism and it is always expected to be present for those, otherwise driver > load will fail. > > For those which don't make use of discovery, the section will not be present. There > is no special message required for that. There is no harm to keep that, it only adds > extra parsing. It's ok for normal cases. However, this runs in the *coredump/debug path during error conditions* - where the discovery topology may not be initialized. instead of leaving the section empty, which can be confusing during debugging. Best, Srini [-- Attachment #2: Type: text/html, Size: 12147 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 2026-03-19 5:00 ` Lazar, Lijo @ 2026-03-19 7:38 ` Christian König 0 siblings, 0 replies; 8+ messages in thread From: Christian König @ 2026-03-19 7:38 UTC (permalink / raw) To: Lazar, Lijo, SHANMUGAM, SRINIVASAN, Deucher, Alexander Cc: amd-gfx@lists.freedesktop.org, Pelloux-Prayer, Pierre-Eric I kind of agree with Lijo. We clearly should have a NULL pointer check, but when the information isn't available it's perfectly fine to not print the section for it. That's what we do when other information isn't available as well. So just completely skipping the "drm_printf(p, "\nHW IP Discovery\n");" would work for me as well. Regards, Christian. On 3/19/26 06:00, Lazar, Lijo wrote: > [Public] > > > To run coredump logic, driver needs to be loaded first. For driver load to work on all new SOCs, it needs discovery. > > Other SOCs which depend on pci id mechanism don't need discovery and they don't have discovery section either. > > Thanks, > Lijo > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > *From:* SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com> > *Sent:* Thursday, March 19, 2026 10:17:09 AM > *To:* Lazar, Lijo <Lijo.Lazar@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com> > *Cc:* amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Pelloux-Prayer, Pierre-Eric <Pierre-eric.Pelloux-prayer@amd.com> > *Subject:* RE: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 > > [AMD Official Use Only - AMD Internal Distribution Only] > >> -----Original Message----- >> From: Lazar, Lijo <Lijo.Lazar@amd.com> >> Sent: Thursday, March 19, 2026 9:46 AM >> To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; >> Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander >> <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric <Pierre- >> eric.Pelloux-prayer@amd.com> >> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery >> topology coredump path v3 >> >> >> >> On 18-Mar-26 4:41 PM, SHANMUGAM, SRINIVASAN wrote: >> > [AMD Official Use Only - AMD Internal Distribution Only] >> > >> >> -----Original Message----- >> >> From: Lazar, Lijo <Lijo.Lazar@amd.com> >> >> Sent: Wednesday, March 18, 2026 4:28 PM >> >> To: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>; >> Koenig, >> >> Christian <Christian.Koenig@amd.com>; Deucher, Alexander >> >> <Alexander.Deucher@amd.com> >> >> Cc: amd-gfx@lists.freedesktop.org; Pelloux-Prayer, Pierre-Eric >> >> <Pierre- eric.Pelloux-prayer@amd.com> >> >> Subject: Re: [PATCH v3] drm/amdgpu: Avoid NULL dereference in >> >> discovery topology coredump path v3 >> >> >> >> >> >> >> >> On 18-Mar-26 4:00 PM, Srinivasan Shanmugam wrote: >> >>> When a GPU fault or timeout happens, the driver creates a >> >>> devcoredump to collect debug information. >> >>> >> >>> During this, amdgpu_devcoredump_format() calls >> >>> amdgpu_discovery_dump() to print IP discovery data. >> >>> >> >>> amdgpu_discovery_dump() uses: >> >>> adev->discovery.ip_top >> >>> >> >>> and then accesses: >> >>> ip_top->die_kset >> >>> >> >>> amdgpu_discovery_dump() uses adev->discovery.ip_top. However, ip_top >> >>> may be NULL if the discovery topology was never initialized. >> >>> >> >>> The current code does not check for this before using ip_top. As a >> >>> result, when ip_top is NULL, the coredump worker crashes while >> >>> taking the spinlock for ip_top->die_kset. >> >>> >> >>> Fix this by checking for a missing ip_top before walking the >> >>> discovery topology. If it is unavailable, print a short message in >> >>> the dump and return safely. >> >>> >> >>> - If ip_top is NULL, print a message and skip the dump >> >>> - Also add the same check in the cleanup path >> >>> >> >>> This makes the coredump and cleanup paths safe even when the >> >>> discovery topology is not available. >> >>> >> >>> KASAN trace: >> >>> [ 522.228252] [IGT] amd_deadlock: starting subtest >> >>> amdgpu-deadlock-sdma [ 522.240681] [IGT] amd_deadlock: starting >> >>> dynamic subtest amdgpu-deadlock-sdma >> >>> >> >>> ... >> >>> >> >>> [ 522.952317] Write of size 4 at addr 0000000000000050 by task >> >>> kworker/u129:5/5434 [ 522.937526] BUG: KASAN: null-ptr-deref in >> >>> _raw_spin_lock+0x66/0xc0 [ 522.967659] Workqueue: events_unbound >> >>> amdgpu_devcoredump_deferred_work [amdgpu] >> >>> >> >>> ... >> >>> >> >>> [ 522.969445] Call Trace: >> >>> [ 522.969508] _raw_spin_lock+0x66/0xc0 [ 522.969518] ? >> >>> __pfx__raw_spin_lock+0x10/0x10 [ 522.969534] >> >>> amdgpu_discovery_dump+0x61/0x530 [amdgpu] [ 522.971346] ? >> >>> pick_next_task_fair+0x3f6/0x1c60 [ 522.971363] >> >>> amdgpu_devcoredump_format+0x84f/0x26f0 [amdgpu] [ 522.973188] ? >> >>> __pfx_amdgpu_devcoredump_format+0x10/0x10 [amdgpu] [ 522.975012] ? >> >>> psi_task_switch+0x2b5/0x9b0 [ 522.975027] ? >> >>> __pfx___drm_printfn_coredump+0x10/0x10 [drm] [ 522.975198] ? >> >>> __pfx___drm_puts_coredump+0x10/0x10 [drm] [ 522.975366] ? >> >>> __schedule+0x113c/0x38d0 [ 522.975381] >> >>> amdgpu_devcoredump_deferred_work+0x4c/0x1f0 [amdgpu] >> >>> >> >>> v2: Updated commit message - Clarified that ip_top is not freed, it can >> >>> just be NULL if discovery was not initialized. >> >>> (Christian/Lijo) >> >>> >> >>> v3: Removed the extra drm_warn() for sysfs init failure as sysfs already >> >>> reports errors. (Christian) >> >>> >> >>> Fixes: 7083eb8982fb ("drm/amdgpu: include ip discovery data in >> >>> devcoredump") >> >>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> >> >>> Cc: Christian König <christian.koenig@amd.com> >> >>> Cc: Alex Deucher <alexander.deucher@amd.com> >> >>> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> >> >>> --- >> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 11 ++++++++++- >> >>> 1 file changed, 10 insertions(+), 1 deletion(-) >> >>> >> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >> >>> index f7f37d93d0ce..6be1f971a31a 100644 >> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c >> >>> @@ -1395,6 +1395,9 @@ static void amdgpu_discovery_sysfs_fini(struct >> >> amdgpu_device *adev) >> >>> struct list_head *el, *tmp; >> >>> struct kset *die_kset; >> >>> >> >>> + if (!ip_top) >> >>> + return; >> >>> + >> >>> die_kset = &ip_top->die_kset; >> >>> spin_lock(&die_kset->list_lock); >> >>> list_for_each_prev_safe(el, tmp, &die_kset->list) { @@ -1419,9 >> >>> +1422,15 @@ void amdgpu_discovery_dump(struct amdgpu_device *adev, >> >> struct drm_printer *p) >> >>> struct ip_hw_instance *ip_inst; >> >>> int i = 0, j; >> >>> >> >>> + drm_printf(p, "\nHW IP Discovery\n"); >> >>> + >> >>> + if (!ip_top) { >> >>> + drm_printf(p, "ip discovery topology unavailable\n"); >> >> >> >> Is this type of printing really required or just skipping the whole >> >> section good enough? >> > >> > >> > Silently skipping the rest may look like incomplete or missing data in >> > the coredump. >> > >> > Adding a one-line message makes it clear that the topology was not >> > available, rather than leaving an empty section. >> > >> >> Here is my take - Discovery is the basic requirement for SOCs which make use of >> that mechanism and it is always expected to be present for those, otherwise driver >> load will fail. >> >> For those which don't make use of discovery, the section will not be present. There >> is no special message required for that. There is no harm to keep that, it only adds >> extra parsing. > > It's ok for normal cases. However, this runs in the *coredump/debug path > during error conditions* - where the discovery topology may not be > initialized. instead of leaving the section empty, which can be confusing > during debugging. > > Best, > Srini > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2026-03-19 7:38 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-03-18 10:30 [PATCH v3] drm/amdgpu: Avoid NULL dereference in discovery topology coredump path v3 Srinivasan Shanmugam 2026-03-18 10:33 ` Christian König 2026-03-18 10:57 ` Lazar, Lijo 2026-03-18 11:11 ` SHANMUGAM, SRINIVASAN 2026-03-19 4:15 ` Lazar, Lijo 2026-03-19 4:47 ` SHANMUGAM, SRINIVASAN 2026-03-19 5:00 ` Lazar, Lijo 2026-03-19 7:38 ` Christian König
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox