From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Eric Huang <jinhuieric.huang@amd.com>,
Kent Russell <kent.russell@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
Sasha Levin <sashal@kernel.org>,
Felix.Kuehling@amd.com, christian.koenig@amd.com,
airlied@gmail.com, simona@ffwll.ch,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] drm/amdkfd: fix kernel crash on releasing NULL sysfs entry
Date: Mon, 20 Apr 2026 09:18:45 -0400 [thread overview]
Message-ID: <20260420132314.1023554-131-sashal@kernel.org> (raw)
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>
From: Eric Huang <jinhuieric.huang@amd.com>
[ Upstream commit 4ea64d482fc2cc85009fce5abdf4780ece00c31c ]
there is an abnormal case that When a process re-opens kfd
with different mm_struct(execve() called by user), the
allocated p->kobj will be freed, but missed setting it to NULL,
that will cause sysfs/kernel crash with NULL pointers in p->kobj
on kfd_process_remove_sysfs() when releasing process, and the
similar error on kfd_procfs_del_queue() as well.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now I have all the information needed for a complete analysis. Let me
compile the full report.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: Subject Line
- **Subsystem**: drm/amdkfd
- **Action verb**: "fix" - explicitly a bug fix
- **Summary**: Fixes a kernel crash when releasing a NULL (dangling)
sysfs entry
### Step 1.2: Tags
- **Signed-off-by**: Eric Huang <jinhuieric.huang@amd.com> (author)
- **Reviewed-by**: Kent Russell <kent.russell@amd.com> (AMD KFD team
member)
- **Signed-off-by**: Alex Deucher <alexander.deucher@amd.com> (AMD DRM
maintainer)
- No Fixes: tag, no Cc: stable, no Reported-by, no Link - absence
expected
### Step 1.3: Commit Body
The commit message describes: when a process re-opens KFD with a
different `mm_struct` (after `execve()`), the allocated `p->kobj` is
freed via `kobject_put()` but not set to NULL. Later,
`kfd_process_remove_sysfs()` checks `if (!p->kobj)` - but since the
pointer is dangling (not NULL), the check passes and causes a kernel
crash. The same issue affects `kfd_procfs_del_queue()`.
**Failure mode**: kernel crash (NULL pointer dereference / use-after-
free on stale kobj pointer)
### Step 1.4: Hidden Bug Fix?
No hiding here - the subject and body explicitly say "fix kernel crash."
---
## PHASE 2: DIFF ANALYSIS
### Step 2.1: Inventory
- **Files**: 1 file changed: `drivers/gpu/drm/amd/amdkfd/kfd_process.c`
- **Lines**: +2, -1 (net 1 line added)
- **Functions modified**: `kfd_procfs_del_queue()`,
`kfd_create_process_sysfs()`
- **Scope**: Single-file, surgical fix
### Step 2.2: Code Flow Changes
**Hunk 1** (`kfd_procfs_del_queue`):
- Before: only checks `if (!q)` then proceeds to `kobject_del(&q->kobj)`
and `kobject_put(&q->kobj)`
- After: checks `if (!q || !q->process->kobj)` - if the process's kobj
was freed, skip queue cleanup since parent sysfs is gone
**Hunk 2** (`kfd_create_process_sysfs`):
- Before: on `kobject_init_and_add()` failure, calls
`kobject_put(process->kobj)` and returns, leaving `process->kobj` as a
dangling pointer
- After: adds `process->kobj = NULL` after `kobject_put()`, preventing
dangling pointer
### Step 2.3: Bug Mechanism
This is a **dangling pointer / use-after-free** bug. After
`kobject_put()` frees the kobj, the pointer is not NULLed, so
`kfd_process_remove_sysfs()` later passes the `!p->kobj` guard and
dereferences the freed pointer.
### Step 2.4: Fix Quality
- Obviously correct: setting pointer to NULL after free is a textbook
fix
- Minimal: 2 lines changed
- No regression risk: the NULL check prevents use of an already-freed
object
- No locking changes, no API changes
---
## PHASE 3: GIT HISTORY
### Step 3.1: Blame
- `kfd_procfs_del_queue()` was introduced by commit 6d220a7e7971ec
(Amber Lin, 2020-01-30) - old function
- `kfd_create_process_sysfs()` was introduced by commit 4cd255b9694647
(Zhu Lingshan, 2025-04-22) - the commit that extracted sysfs creation
into a separate function
- The error path that's missing `= NULL` was introduced in
4cd255b9694647
### Step 3.2: Fixes target
No explicit Fixes: tag. The bug was introduced by 4cd255b9694647
("amdkfd: Introduce kfd_create_process_sysfs as a separate function").
This commit first appeared in v7.0-rc1 - confirmed NOT in v6.19,
v6.16-rc1, v6.15-rc1, or v6.14.
### Step 3.3: File History
Many intermediate changes between the buggy commit and the fix (treewide
refactoring, unrelated amdkfd changes). No other fix for this specific
issue.
### Step 3.4: Author
Eric Huang is an AMD developer with multiple amdkfd commits (pasid
debugfs, smi events, TLB flush, etc.) - a regular contributor to the
subsystem.
### Step 3.5: Dependencies
The fix modifies `kfd_create_process_sysfs()` which only exists since
4cd255b9694647 (v7.0-rc1). The fix is self-contained and needs no other
patches.
---
## PHASE 4: MAILING LIST
### Step 4.1: Patch Discussion
b4 dig found the original submission at:
`https://patch.msgid.link/20260327180036.131096-1-
jinhuieric.huang@amd.com`
The mbox shows:
- v1 submission on 2026-03-27 by Eric Huang
- Kent Russell replied with `Reviewed-by` on the same day
- No NAKs, no concerns raised
- No stable nomination by reviewers, but this is expected for commits in
the autosel pipeline
### Step 4.2: Reviewers
Patch was sent to `amd-gfx@lists.freedesktop.org`, reviewed by Kent
Russell (AMD KFD team), committed by Alex Deucher (AMD DRM maintainer).
Appropriate review chain.
### Step 4.3: Bug Report
No external bug report link. The author discovered this through internal
testing of the execve() code path.
### Step 4.4: Related Patches
Single standalone patch (v1 only, no series).
### Step 4.5: Stable Discussion
No prior stable discussion found.
---
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1: Functions Modified
- `kfd_procfs_del_queue()` - called to delete a queue's sysfs entry
- `kfd_create_process_sysfs()` - creates sysfs entries for a KFD process
### Step 5.2: Callers
- `kfd_procfs_del_queue()` called from `kfd_process_queue_manager.c` in
two places: during queue destruction and queue resource cleanup
- `kfd_create_process_sysfs()` called from `kfd_process.c` (initial
process creation) and `kfd_chardev.c` (secondary process context
creation via ioctl)
### Step 5.3-5.4: Call Chain
The bug path: user calls `execve()` → KFD detects mm change → re-opens
KFD → `kfd_create_process_sysfs()` fails → dangling `kobj` → process
cleanup → `kfd_process_remove_sysfs()` → crash via stale pointer
### Step 5.5: Similar Patterns
The existing `kfd_process_remove_sysfs()` already has a `if (!p->kobj)
return;` guard (line 1158), which is the correct pattern. The bug is
that the error path in `kfd_create_process_sysfs()` doesn't maintain the
invariant that freed kobj should be NULL.
---
## PHASE 6: STABLE TREE ANALYSIS
### Step 6.1: Buggy Code in Stable Trees
The buggy function `kfd_create_process_sysfs()` was introduced in commit
4cd255b9694647, which is in v7.0-rc1 and v7.0 only. It is NOT in v6.19,
v6.14, or any earlier stable tree. The fix is only relevant for the
**7.0.y** stable tree.
### Step 6.2: Backport Complications
The fix should apply cleanly to 7.0.y since the code context matches
directly.
### Step 6.3: Related Fixes Already in Stable
No related fixes found.
---
## PHASE 7: SUBSYSTEM CONTEXT
### Step 7.1: Subsystem
- **Subsystem**: drivers/gpu/drm/amd/amdkfd - AMD GPU Kernel Fusion
Driver
- **Criticality**: IMPORTANT - AMD GPU users running KFD compute
workloads (ROCm, OpenCL)
### Step 7.2: Activity
Very active subsystem with frequent changes.
---
## PHASE 8: IMPACT AND RISK
### Step 8.1: Who Is Affected
Users of AMD GPU KFD (ROCm compute users) who hit the error path in
`kfd_create_process_sysfs()` during process creation, specifically when
`execve()` causes a re-open with a different mm.
### Step 8.2: Trigger
When `kobject_init_and_add()` fails during KFD process sysfs creation
AND the process later gets cleaned up. This is an error path, but can be
triggered by resource exhaustion or sysfs conflicts.
### Step 8.3: Failure Mode
**CRITICAL**: Kernel crash (use of dangling pointer / NULL dereference
in sysfs teardown).
### Step 8.4: Risk-Benefit
- **Benefit**: HIGH - prevents kernel crash
- **Risk**: VERY LOW - 2-line fix, setting pointer to NULL after free
and adding a NULL check
- **Ratio**: Strongly favorable
---
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: Evidence
**FOR backporting:**
- Fixes a kernel crash (dangling pointer causes crash in sysfs teardown)
- Tiny fix: 2 lines changed in 1 file
- Obviously correct (NULL pointer after free, NULL check before use)
- Reviewed by subsystem developer (Kent Russell)
- Committed by subsystem maintainer (Alex Deucher)
- Standalone fix with no dependencies
- Buggy code exists in 7.0.y stable tree
**AGAINST backporting:**
- Error path only (not every user will hit this)
- Only affects 7.0.y (the buggy code is new in v7.0-rc1)
- No Fixes: tag (expected for autosel candidates)
### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** - trivial NULL-after-free
pattern, reviewed
2. Fixes a real bug? **YES** - kernel crash on sysfs teardown
3. Important issue? **YES** - kernel crash (CRITICAL severity)
4. Small and contained? **YES** - 2 lines, 1 file
5. No new features or APIs? **YES** - pure bug fix
6. Can apply to stable? **YES** - code exists in 7.0.y, should apply
cleanly
### Step 9.3: Exception Categories
Not an exception category - this is a standard bug fix.
---
## Verification
- [Phase 1] Parsed tags: Reviewed-by Kent Russell, Signed-off-by Alex
Deucher (maintainer), no Fixes/Cc-stable
- [Phase 2] Diff analysis: +2/-1 lines; adds `process->kobj = NULL`
after `kobject_put()` and `!q->process->kobj` check in
`kfd_procfs_del_queue()`
- [Phase 3] git blame: error path introduced by 4cd255b9694647 (Zhu
Lingshan, 2025-04-22)
- [Phase 3] git merge-base: 4cd255b9694647 is in v7.0-rc1 and v7.0, NOT
in v6.19 or earlier
- [Phase 3] git log between buggy and fix: no intermediate fix for same
issue found
- [Phase 4] b4 dig: found original at `https://patch.msgid.link/20260327
180036.131096-1-jinhuieric.huang@amd.com`
- [Phase 4] b4 dig -a: single v1 submission, no revisions
- [Phase 4] mbox review: Kent Russell gave Reviewed-by, no NAKs or
concerns
- [Phase 5] Grep callers: `kfd_procfs_del_queue()` called from
`kfd_process_queue_manager.c` (2 sites); `kfd_create_process_sysfs()`
called from both `kfd_process.c` and `kfd_chardev.c`
- [Phase 5] `kfd_process_remove_sysfs()` at line 1158 already has `if
(!p->kobj) return;` guard, confirming the invariant the fix maintains
- [Phase 6] Buggy code only in v7.0-rc1+, fix relevant for 7.0.y stable
only
- [Phase 8] Failure mode: kernel crash on dangling pointer dereference
during sysfs teardown, severity CRITICAL
The fix is small, surgical, obviously correct, and prevents a kernel
crash. It meets all stable kernel criteria.
**YES**
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 12e24fbf8c463..c6782904d2f1c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -679,7 +679,7 @@ static void kfd_procfs_add_sysfs_files(struct kfd_process *p)
void kfd_procfs_del_queue(struct queue *q)
{
- if (!q)
+ if (!q || !q->process->kobj)
return;
kobject_del(&q->kobj);
@@ -858,6 +858,7 @@ int kfd_create_process_sysfs(struct kfd_process *process)
if (ret) {
pr_warn("Creating procfs pid directory failed");
kobject_put(process->kobj);
+ process->kobj = NULL;
return ret;
}
--
2.53.0
next prev parent reply other threads:[~2026-04-20 13:26 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: fix DF NULL pointer issue for soc24 Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.18] drm/ttm: Avoid invoking the OOM killer when reading back swapped content Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 6.18] drm/vc4: Release runtime PM reference after binding V3D Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.19] drm/xe/vf: Wait for all fixups before using default LRCs Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: remove duplicate format modifier Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: unlock cancel_delayed_work_sync for hang_detect_work Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.1] drm/amd/display: Merge pipes for validate Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/xe: Fix bug in idledly unit conversion Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/xe: Skip adding PRL entry to NULL VMA Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix a memory leak in hang state error path Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Protect madv read in vc4_gem_object_mmap() with madv_lock Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Fix cursor pos at overlay plane edges on DCN4 Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.1] drm/msm/dpu: fix vblank IRQ registration before atomic_mode_set Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 6.18] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amd/display: bios_parser: fix GPIO I2C line off-by-one Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Handle IH v7_1 reg offset differences Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/vcn4.0.3: gate per-queue reset by PSP SOS program version Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/imx: parallel-display: add DRM_DISPLAY_HELPER for DRM_IMX_PARALLEL_DISPLAY Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix amdgpu_userq_evict Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amdgpu: validate fence_count in wait_fences ioctl Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.6] drm/amdgpu: fix shift-out-of-bounds when updating umc active mask Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: remove queue from doorbell xa during clean up Sasha Levin
2026-04-20 13:18 ` Sasha Levin [this message]
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/xe/guc: Add Wa_14025883347 for GuC DMA failure on reset Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: clear related counter after RAS eeprom reset Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Restore full update for tiling change to linear Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0] drm/amdgpu: fix array out of bounds accesses for mes sw_fini Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Exit IPS w/ DC helper for all dc_set_power_state cases Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: Check for multiplication overflow in checkpoint stack size Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/prime: Limit scatter list size with dedicated DMA device Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Clamp dc_cursor_position x_hotspot to prevent integer overflow Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: defer queue publication until create completes Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/userq: fix dma_fence refcount underflow in userq path Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: guard atom_context in devcoredump VBIOS dump Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amd/display: Avoid turning off the PHY when OTG is running for DVI Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Revert setting up Retry based Thrashing on GFX 12.1 Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amd/pm: Avoid overflow when sorting pp_feature list Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Fix number of opp Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/panel-edp: Change BOE NV140WUM-N64 timings Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amd/display: Fix HWSS v3 fast path determination Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] drm/mediatek: mtk_dsi: enable hs clock during pre-enable Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix memory leak of BO array in hang state Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Remove invalid DPSTREAMCLK mask usage Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add CMN N116BCL-EAK (C2) Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Add default reset method for soc_v1_0 Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: cleanup amdgpu_userq_get/put where not needed Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix some more bug in amdgpu_gem_va_ioctl Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] fbdev: omap2: fix inconsistent lock returns in omapfb_mmap Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm: gpu: msm: forbid mem reclaim from reset Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add AUO B116XAT04.1 (HW: 1A) Sasha Levin
2026-04-20 13:22 ` [PATCH AUTOSEL 7.0-6.6] drm/gem-dma: set VM_DONTDUMP for mmap Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260420132314.1023554-131-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=Felix.Kuehling@amd.com \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=jinhuieric.huang@amd.com \
--cc=kent.russell@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=patches@lists.linux.dev \
--cc=simona@ffwll.ch \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox