public inbox for dri-devel@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Eric Huang <jinhuieric.huang@amd.com>,
	Kent Russell <kent.russell@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	Felix.Kuehling@amd.com, christian.koenig@amd.com,
	airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] drm/amdkfd: fix kernel crash on releasing NULL sysfs entry
Date: Mon, 20 Apr 2026 09:18:45 -0400	[thread overview]
Message-ID: <20260420132314.1023554-131-sashal@kernel.org> (raw)
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Eric Huang <jinhuieric.huang@amd.com>

[ Upstream commit 4ea64d482fc2cc85009fce5abdf4780ece00c31c ]

there is an abnormal case that When a process re-opens kfd
with different mm_struct(execve() called by user), the
allocated p->kobj will be freed, but missed setting it to NULL,
that will cause sysfs/kernel crash with NULL pointers in p->kobj
on kfd_process_remove_sysfs() when releasing process, and the
similar error on kfd_procfs_del_queue() as well.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a complete analysis. Let me
compile the full report.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: Subject Line
- **Subsystem**: drm/amdkfd
- **Action verb**: "fix" - explicitly a bug fix
- **Summary**: Fixes a kernel crash when releasing a NULL (dangling)
  sysfs entry

### Step 1.2: Tags
- **Signed-off-by**: Eric Huang <jinhuieric.huang@amd.com> (author)
- **Reviewed-by**: Kent Russell <kent.russell@amd.com> (AMD KFD team
  member)
- **Signed-off-by**: Alex Deucher <alexander.deucher@amd.com> (AMD DRM
  maintainer)
- No Fixes: tag, no Cc: stable, no Reported-by, no Link - absence
  expected

### Step 1.3: Commit Body
The commit message describes: when a process re-opens KFD with a
different `mm_struct` (after `execve()`), the allocated `p->kobj` is
freed via `kobject_put()` but not set to NULL. Later,
`kfd_process_remove_sysfs()` checks `if (!p->kobj)` - but since the
pointer is dangling (not NULL), the check passes and causes a kernel
crash. The same issue affects `kfd_procfs_del_queue()`.

**Failure mode**: kernel crash (NULL pointer dereference / use-after-
free on stale kobj pointer)

### Step 1.4: Hidden Bug Fix?
No hiding here - the subject and body explicitly say "fix kernel crash."

---

## PHASE 2: DIFF ANALYSIS

### Step 2.1: Inventory
- **Files**: 1 file changed: `drivers/gpu/drm/amd/amdkfd/kfd_process.c`
- **Lines**: +2, -1 (net 1 line added)
- **Functions modified**: `kfd_procfs_del_queue()`,
  `kfd_create_process_sysfs()`
- **Scope**: Single-file, surgical fix

### Step 2.2: Code Flow Changes

**Hunk 1** (`kfd_procfs_del_queue`):
- Before: only checks `if (!q)` then proceeds to `kobject_del(&q->kobj)`
  and `kobject_put(&q->kobj)`
- After: checks `if (!q || !q->process->kobj)` - if the process's kobj
  was freed, skip queue cleanup since parent sysfs is gone

**Hunk 2** (`kfd_create_process_sysfs`):
- Before: on `kobject_init_and_add()` failure, calls
  `kobject_put(process->kobj)` and returns, leaving `process->kobj` as a
  dangling pointer
- After: adds `process->kobj = NULL` after `kobject_put()`, preventing
  dangling pointer

### Step 2.3: Bug Mechanism
This is a **dangling pointer / use-after-free** bug. After
`kobject_put()` frees the kobj, the pointer is not NULLed, so
`kfd_process_remove_sysfs()` later passes the `!p->kobj` guard and
dereferences the freed pointer.

### Step 2.4: Fix Quality
- Obviously correct: setting pointer to NULL after free is a textbook
  fix
- Minimal: 2 lines changed
- No regression risk: the NULL check prevents use of an already-freed
  object
- No locking changes, no API changes

---

## PHASE 3: GIT HISTORY

### Step 3.1: Blame
- `kfd_procfs_del_queue()` was introduced by commit 6d220a7e7971ec
  (Amber Lin, 2020-01-30) - old function
- `kfd_create_process_sysfs()` was introduced by commit 4cd255b9694647
  (Zhu Lingshan, 2025-04-22) - the commit that extracted sysfs creation
  into a separate function
- The error path that's missing `= NULL` was introduced in
  4cd255b9694647

### Step 3.2: Fixes target
No explicit Fixes: tag. The bug was introduced by 4cd255b9694647
("amdkfd: Introduce kfd_create_process_sysfs as a separate function").
This commit first appeared in v7.0-rc1 - confirmed NOT in v6.19,
v6.16-rc1, v6.15-rc1, or v6.14.

### Step 3.3: File History
Many intermediate changes between the buggy commit and the fix (treewide
refactoring, unrelated amdkfd changes). No other fix for this specific
issue.

### Step 3.4: Author
Eric Huang is an AMD developer with multiple amdkfd commits (pasid
debugfs, smi events, TLB flush, etc.) - a regular contributor to the
subsystem.

### Step 3.5: Dependencies
The fix modifies `kfd_create_process_sysfs()` which only exists since
4cd255b9694647 (v7.0-rc1). The fix is self-contained and needs no other
patches.

---

## PHASE 4: MAILING LIST

### Step 4.1: Patch Discussion
b4 dig found the original submission at:
`https://patch.msgid.link/20260327180036.131096-1-
jinhuieric.huang@amd.com`

The mbox shows:
- v1 submission on 2026-03-27 by Eric Huang
- Kent Russell replied with `Reviewed-by` on the same day
- No NAKs, no concerns raised
- No stable nomination by reviewers, but this is expected for commits in
  the autosel pipeline

### Step 4.2: Reviewers
Patch was sent to `amd-gfx@lists.freedesktop.org`, reviewed by Kent
Russell (AMD KFD team), committed by Alex Deucher (AMD DRM maintainer).
Appropriate review chain.

### Step 4.3: Bug Report
No external bug report link. The author discovered this through internal
testing of the execve() code path.

### Step 4.4: Related Patches
Single standalone patch (v1 only, no series).

### Step 4.5: Stable Discussion
No prior stable discussion found.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1: Functions Modified
- `kfd_procfs_del_queue()` - called to delete a queue's sysfs entry
- `kfd_create_process_sysfs()` - creates sysfs entries for a KFD process

### Step 5.2: Callers
- `kfd_procfs_del_queue()` called from `kfd_process_queue_manager.c` in
  two places: during queue destruction and queue resource cleanup
- `kfd_create_process_sysfs()` called from `kfd_process.c` (initial
  process creation) and `kfd_chardev.c` (secondary process context
  creation via ioctl)

### Step 5.3-5.4: Call Chain
The bug path: user calls `execve()` → KFD detects mm change → re-opens
KFD → `kfd_create_process_sysfs()` fails → dangling `kobj` → process
cleanup → `kfd_process_remove_sysfs()` → crash via stale pointer

### Step 5.5: Similar Patterns
The existing `kfd_process_remove_sysfs()` already has a `if (!p->kobj)
return;` guard (line 1158), which is the correct pattern. The bug is
that the error path in `kfd_create_process_sysfs()` doesn't maintain the
invariant that freed kobj should be NULL.

---

## PHASE 6: STABLE TREE ANALYSIS

### Step 6.1: Buggy Code in Stable Trees
The buggy function `kfd_create_process_sysfs()` was introduced in commit
4cd255b9694647, which is in v7.0-rc1 and v7.0 only. It is NOT in v6.19,
v6.14, or any earlier stable tree. The fix is only relevant for the
**7.0.y** stable tree.

### Step 6.2: Backport Complications
The fix should apply cleanly to 7.0.y since the code context matches
directly.

### Step 6.3: Related Fixes Already in Stable
No related fixes found.

---

## PHASE 7: SUBSYSTEM CONTEXT

### Step 7.1: Subsystem
- **Subsystem**: drivers/gpu/drm/amd/amdkfd - AMD GPU Kernel Fusion
  Driver
- **Criticality**: IMPORTANT - AMD GPU users running KFD compute
  workloads (ROCm, OpenCL)

### Step 7.2: Activity
Very active subsystem with frequent changes.

---

## PHASE 8: IMPACT AND RISK

### Step 8.1: Who Is Affected
Users of AMD GPU KFD (ROCm compute users) who hit the error path in
`kfd_create_process_sysfs()` during process creation, specifically when
`execve()` causes a re-open with a different mm.

### Step 8.2: Trigger
When `kobject_init_and_add()` fails during KFD process sysfs creation
AND the process later gets cleaned up. This is an error path, but can be
triggered by resource exhaustion or sysfs conflicts.

### Step 8.3: Failure Mode
**CRITICAL**: Kernel crash (use of dangling pointer / NULL dereference
in sysfs teardown).

### Step 8.4: Risk-Benefit
- **Benefit**: HIGH - prevents kernel crash
- **Risk**: VERY LOW - 2-line fix, setting pointer to NULL after free
  and adding a NULL check
- **Ratio**: Strongly favorable

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: Evidence
**FOR backporting:**
- Fixes a kernel crash (dangling pointer causes crash in sysfs teardown)
- Tiny fix: 2 lines changed in 1 file
- Obviously correct (NULL pointer after free, NULL check before use)
- Reviewed by subsystem developer (Kent Russell)
- Committed by subsystem maintainer (Alex Deucher)
- Standalone fix with no dependencies
- Buggy code exists in 7.0.y stable tree

**AGAINST backporting:**
- Error path only (not every user will hit this)
- Only affects 7.0.y (the buggy code is new in v7.0-rc1)
- No Fixes: tag (expected for autosel candidates)

### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** - trivial NULL-after-free
   pattern, reviewed
2. Fixes a real bug? **YES** - kernel crash on sysfs teardown
3. Important issue? **YES** - kernel crash (CRITICAL severity)
4. Small and contained? **YES** - 2 lines, 1 file
5. No new features or APIs? **YES** - pure bug fix
6. Can apply to stable? **YES** - code exists in 7.0.y, should apply
   cleanly

### Step 9.3: Exception Categories
Not an exception category - this is a standard bug fix.

---

## Verification

- [Phase 1] Parsed tags: Reviewed-by Kent Russell, Signed-off-by Alex
  Deucher (maintainer), no Fixes/Cc-stable
- [Phase 2] Diff analysis: +2/-1 lines; adds `process->kobj = NULL`
  after `kobject_put()` and `!q->process->kobj` check in
  `kfd_procfs_del_queue()`
- [Phase 3] git blame: error path introduced by 4cd255b9694647 (Zhu
  Lingshan, 2025-04-22)
- [Phase 3] git merge-base: 4cd255b9694647 is in v7.0-rc1 and v7.0, NOT
  in v6.19 or earlier
- [Phase 3] git log between buggy and fix: no intermediate fix for same
  issue found
- [Phase 4] b4 dig: found original at `https://patch.msgid.link/20260327
  180036.131096-1-jinhuieric.huang@amd.com`
- [Phase 4] b4 dig -a: single v1 submission, no revisions
- [Phase 4] mbox review: Kent Russell gave Reviewed-by, no NAKs or
  concerns
- [Phase 5] Grep callers: `kfd_procfs_del_queue()` called from
  `kfd_process_queue_manager.c` (2 sites); `kfd_create_process_sysfs()`
  called from both `kfd_process.c` and `kfd_chardev.c`
- [Phase 5] `kfd_process_remove_sysfs()` at line 1158 already has `if
  (!p->kobj) return;` guard, confirming the invariant the fix maintains
- [Phase 6] Buggy code only in v7.0-rc1+, fix relevant for 7.0.y stable
  only
- [Phase 8] Failure mode: kernel crash on dangling pointer dereference
  during sysfs teardown, severity CRITICAL

The fix is small, surgical, obviously correct, and prevents a kernel
crash. It meets all stable kernel criteria.

**YES**

 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 12e24fbf8c463..c6782904d2f1c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -679,7 +679,7 @@ static void kfd_procfs_add_sysfs_files(struct kfd_process *p)
 
 void kfd_procfs_del_queue(struct queue *q)
 {
-	if (!q)
+	if (!q || !q->process->kobj)
 		return;
 
 	kobject_del(&q->kobj);
@@ -858,6 +858,7 @@ int kfd_create_process_sysfs(struct kfd_process *process)
 	if (ret) {
 		pr_warn("Creating procfs pid directory failed");
 		kobject_put(process->kobj);
+		process->kobj = NULL;
 		return ret;
 	}
 
-- 
2.53.0


  parent reply	other threads:[~2026-04-20 13:26 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: fix DF NULL pointer issue for soc24 Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.18] drm/ttm: Avoid invoking the OOM killer when reading back swapped content Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 6.18] drm/vc4: Release runtime PM reference after binding V3D Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.19] drm/xe/vf: Wait for all fixups before using default LRCs Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: remove duplicate format modifier Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: unlock cancel_delayed_work_sync for hang_detect_work Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.1] drm/amd/display: Merge pipes for validate Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/xe: Fix bug in idledly unit conversion Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/xe: Skip adding PRL entry to NULL VMA Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix a memory leak in hang state error path Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Protect madv read in vc4_gem_object_mmap() with madv_lock Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Fix cursor pos at overlay plane edges on DCN4 Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.1] drm/msm/dpu: fix vblank IRQ registration before atomic_mode_set Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 6.18] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amd/display: bios_parser: fix GPIO I2C line off-by-one Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Handle IH v7_1 reg offset differences Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/vcn4.0.3: gate per-queue reset by PSP SOS program version Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/imx: parallel-display: add DRM_DISPLAY_HELPER for DRM_IMX_PARALLEL_DISPLAY Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix amdgpu_userq_evict Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amdgpu: validate fence_count in wait_fences ioctl Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.6] drm/amdgpu: fix shift-out-of-bounds when updating umc active mask Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: remove queue from doorbell xa during clean up Sasha Levin
2026-04-20 13:18 ` Sasha Levin [this message]
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/xe/guc: Add Wa_14025883347 for GuC DMA failure on reset Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: clear related counter after RAS eeprom reset Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Restore full update for tiling change to linear Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0] drm/amdgpu: fix array out of bounds accesses for mes sw_fini Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Exit IPS w/ DC helper for all dc_set_power_state cases Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: Check for multiplication overflow in checkpoint stack size Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/prime: Limit scatter list size with dedicated DMA device Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Clamp dc_cursor_position x_hotspot to prevent integer overflow Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: defer queue publication until create completes Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/userq: fix dma_fence refcount underflow in userq path Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: guard atom_context in devcoredump VBIOS dump Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amd/display: Avoid turning off the PHY when OTG is running for DVI Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Revert setting up Retry based Thrashing on GFX 12.1 Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amd/pm: Avoid overflow when sorting pp_feature list Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Fix number of opp Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/panel-edp: Change BOE NV140WUM-N64 timings Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amd/display: Fix HWSS v3 fast path determination Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] drm/mediatek: mtk_dsi: enable hs clock during pre-enable Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix memory leak of BO array in hang state Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Remove invalid DPSTREAMCLK mask usage Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add CMN N116BCL-EAK (C2) Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Add default reset method for soc_v1_0 Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: cleanup amdgpu_userq_get/put where not needed Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix some more bug in amdgpu_gem_va_ioctl Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] fbdev: omap2: fix inconsistent lock returns in omapfb_mmap Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm: gpu: msm: forbid mem reclaim from reset Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add AUO B116XAT04.1 (HW: 1A) Sasha Levin
2026-04-20 13:22 ` [PATCH AUTOSEL 7.0-6.6] drm/gem-dma: set VM_DONTDUMP for mmap Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260420132314.1023554-131-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Felix.Kuehling@amd.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=jinhuieric.huang@amd.com \
    --cc=kent.russell@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=simona@ffwll.ch \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox