From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Prike Liang" <Prike.Liang@amd.com>,
"Christian König" <christian.koenig@amd.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Sasha Levin" <sashal@kernel.org>,
airlied@gmail.com, simona@ffwll.ch,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl()
Date: Mon, 20 Apr 2026 09:19:34 -0400
Message-ID: <20260420132314.1023554-180-sashal@kernel.org>
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>
From: Prike Liang <Prike.Liang@amd.com>
[ Upstream commit a0f0b6d31a53a7607ed44f7623faafc628333258 ]
It requires freeing the syncobj and chain
allocation resource.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
The 7.0 tree doesn't have `(!adev->debug_vm || timeline_syncobj)` — it
has `!adev->debug_vm`. This confirms the diff was created against a
newer mainline. The fix's core logic still applies.
---
## Complete Analysis
### PHASE 1: COMMIT MESSAGE FORENSICS
**Step 1.1: Subject Line**
- Subsystem: `drm/amdgpu`
- Action verb: "fix"
- Summary: Fix syncobj and chain allocation resource leak in
amdgpu_gem_va_ioctl()
- Record: [drm/amdgpu] [fix] [syncobj/chain resource leak in gem VA
ioctl]
**Step 1.2: Tags**
- `Reviewed-by: Christian König <christian.koenig@amd.com>` — subsystem
co-maintainer
- `Signed-off-by: Alex Deucher <alexander.deucher@amd.com>` — AMD GPU
maintainer committed it
- `Signed-off-by: Prike Liang <Prike.Liang@amd.com>` — AMD engineer,
author
- No Fixes: tag, no Reported-by:, no Cc: stable — expected for manual
review candidates
- Record: Reviewed by Christian König (DRM/amdgpu co-maintainer).
Committed by Alex Deucher.
**Step 1.3: Commit Body**
- Describes: "requires freeing the syncobj and chain allocation
resource"
- Bug: syncobj refcount and chain memory are never released after use
- Failure mode: resource/memory leak on every ioctl call with timeline
syncobj
- Record: Clear resource leak. Every call to the ioctl with timeline
syncobj leaks memory.
**Step 1.4: Hidden Bug Fixes**
- This is NOT hidden — it explicitly says "fix...leak"
- Record: Explicit bug fix.
### PHASE 2: DIFF ANALYSIS
**Step 2.1: Inventory**
- Files: `drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c` only
- Changes: +5 lines added (3 in ioctl cleanup, 1 NULL assignment in
helper, 1 NULL assignment in ioctl)
- Functions modified: `amdgpu_gem_update_timeline_node()` and
`amdgpu_gem_va_ioctl()`
- Record: Single-file surgical fix, 5 meaningful lines added.
**Step 2.2: Code Flow Changes**
Hunk 1 — `amdgpu_gem_update_timeline_node()`:
- BEFORE: When `dma_fence_chain_alloc()` fails, calls
`drm_syncobj_put(*syncobj)` and returns -ENOMEM, leaving `*syncobj` as
a dangling pointer.
- AFTER: Also sets `*syncobj = NULL` to prevent dangling pointer.
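The effect of the Hunk 1 change can be modeled in userspace. The following is a minimal sketch under stated assumptions, not the driver code: `struct obj`, `struct chain`, `obj_put()`, `chain_alloc()`, `update_timeline_node()`, `caller()` and the `fail_chain_alloc` knob are hypothetical stand-ins for the syncobj, the fence chain and the helper. It shows why clearing the out-pointer matters once the caller has a shared cleanup path that puts any non-NULL object.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct obj { int refcount; };
struct chain { int unused; };

static int fail_chain_alloc; /* test knob: force the allocation to fail */

static void obj_put(struct obj *o)
{
	assert(o->refcount > 0); /* catches a double put */
	o->refcount--;
}

static struct chain *chain_alloc(void)
{
	return fail_chain_alloc ? NULL : calloc(1, sizeof(struct chain));
}

/* Same shape as the helper: take a reference, then allocate. */
static int update_timeline_node(struct obj *table, struct obj **out_obj,
				struct chain **out_chain)
{
	table->refcount++; /* models the lookup taking a reference */
	*out_obj = table;

	*out_chain = chain_alloc();
	if (!*out_chain) {
		obj_put(*out_obj);
		*out_obj = NULL; /* the caller's cleanup must not put it again */
		return -ENOMEM;
	}
	return 0;
}

/* Caller with a shared cleanup path that handles whatever is non-NULL. */
static int caller(struct obj *table)
{
	struct obj *o = NULL;
	struct chain *c = NULL;
	int r = update_timeline_node(table, &o, &c);

	free(c); /* free(NULL) is a no-op */
	if (o)
		obj_put(o);
	return r;
}
```

In this model, dropping the `*out_obj = NULL;` line makes the failure path put the object twice, which trips the assertion in `obj_put()`.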
Hunk 2 — `amdgpu_gem_va_ioctl()`:
- BEFORE: After `drm_syncobj_add_point()` consumes `timeline_chain`,
`timeline_chain` still points to consumed memory. The `error:` label
never frees `timeline_chain` or puts `timeline_syncobj`.
- AFTER: Sets `timeline_chain = NULL` after consumption. Adds
`dma_fence_chain_free(timeline_chain)` and
`drm_syncobj_put(timeline_syncobj)` to cleanup.
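The Hunk 2 pattern (NULL after consumption, then unconditional free and guarded put at the cleanup label) can be sketched the same way. This is an illustrative userspace model only: `struct node`, `struct timeline`, `node_alloc()`, `node_free()`, `timeline_add_point()`, `model_ioctl()` and the `live_nodes` counter are hypothetical names, and the real `drm_syncobj_add_point()` does considerably more than link a node.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the DRM API. */
struct node { struct node *prev; };
struct timeline { int refcount; struct node *head; };

static int live_nodes; /* allocation counter so a test can detect leaks */

static struct node *node_alloc(void)
{
	struct node *n = calloc(1, sizeof(*n));

	if (n)
		live_nodes++;
	return n;
}

static void node_free(struct node *n)
{
	if (n)
		live_nodes--;
	free(n); /* NULL-tolerant, like kfree() */
}

/* Consumes @n: the timeline owns the node afterwards. */
static void timeline_add_point(struct timeline *t, struct node *n)
{
	n->prev = t->head;
	t->head = n;
}

static int model_ioctl(struct timeline *t, int have_fence)
{
	struct timeline *ref = NULL;
	struct node *n = NULL;
	int r = 0;

	t->refcount++; /* models the lookup taking a reference */
	ref = t;

	n = node_alloc();
	if (!n) {
		r = -ENOMEM;
		goto error;
	}

	if (have_fence) {
		timeline_add_point(ref, n);
		n = NULL; /* ownership moved; cleanup must not free it */
	}

error:
	node_free(n); /* frees only a node that was never consumed */
	if (ref) {
		assert(ref->refcount > 0);
		ref->refcount--;
	}
	return r;
}
```

With `have_fence` set to 0 the node is freed at the label; with 1 it stays linked into the timeline and only the reference is dropped. In both cases the caller's reference count returns to its starting value.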
**Step 2.3: Bug Mechanism**
- Category: **Resource leak** (syncobj refcount leak + memory leak)
- `drm_syncobj_find()` increments refcount — never decremented by caller
- `dma_fence_chain_alloc()` allocates memory — never freed when not
consumed
- Record: Missing cleanup for refcounted object and allocated memory on
both success and error paths.
**Step 2.4: Fix Quality**
- Obviously correct: adds standard cleanup patterns (NULL-after-consume,
free/put at error label)
- Minimal and surgical: 5 meaningful lines
- Very low regression risk: `dma_fence_chain_free(NULL)` reduces to
`kfree(NULL)`, which is a no-op, and `drm_syncobj_put()` is guarded by
a NULL check
- Record: High quality, very low regression risk.
### PHASE 3: GIT HISTORY
**Step 3.1: Blame**
- `amdgpu_gem_update_timeline_node` — introduced by `70773bef4e091f`
(Arvind Yadav, Sep 2024)
- Timeline call moved before switch by `ad6c120f688803` (Feb 2025, "fix
the memleak caused by fence not released")
- Inline timeline handling in ioctl by `bd8150a1b3370` (Dec 2025, v4
refactor)
- Record: Buggy code introduced in 70773bef4e091f, worsened by
ad6c120f688803 which moved allocation before switch but didn't add
cleanup.
**Step 3.2: Fixes tag**
- No Fixes: tag present. Based on analysis, the bug was introduced in
`70773bef4e091f` and never had proper cleanup.
- Record: Bug exists since original timeline code introduction.
**Step 3.3: File History**
- 31 commits since `ad6c120f688803`. Active file with many recent
changes.
- The v4 refactor (`bd8150a1b3370`) and v7 refactor (`efdc66fe12b07`)
touched the same code but neither added cleanup.
- Record: Standalone fix, no prerequisites beyond code already in 7.0
tree.
**Step 3.4: Author**
- Prike Liang: AMD engineer, regular contributor to amdgpu driver with
multiple recent fixes.
- Record: Active AMD GPU developer, credible author.
**Step 3.5: Dependencies**
- None. The fix only adds cleanup to existing code paths. All referenced
functions exist in 7.0.
- Minor context conflict: mainline has `(!adev->debug_vm ||
timeline_syncobj)` vs 7.0's `!adev->debug_vm`, but the fix's added
lines don't depend on this condition.
- Record: Standalone fix, minor context adjustment needed.
### PHASE 4: MAILING LIST RESEARCH
**Step 4.1-4.5:**
- b4 dig could not find the original patch submission (lore.kernel.org
blocked by Anubis).
- The related commit `ad6c120f688803` explicitly described the memleak
problem with a full stack trace showing BUG in drm_sched_fence slab
during module unload — evidence the leak has real impact.
- Christian König (co-maintainer) reviewed the fix.
- Record: Could not access lore. However, the reviewer is the subsystem
co-maintainer, which is a strong endorsement.
### PHASE 5: CODE SEMANTIC ANALYSIS
**Step 5.1-5.4:**
- `amdgpu_gem_va_ioctl()` is a DRM ioctl handler directly callable from
userspace
- Called every time userspace maps/unmaps GPU virtual address space
- This is a HOT path for GPU applications (Mesa, AMDVLK, ROCm)
- Every call with a timeline syncobj leaks the syncobj refcount and
potentially the chain allocation
- Record: Ioctl path reachable from any GPU userspace application. Very
high call frequency.
### PHASE 6: STABLE TREE ANALYSIS
**Step 6.1:** The buggy code exists in 7.0 tree. Confirmed via blame:
`70773bef4e091f` (Sep 2024) and `ad6c120f688803` (Feb 2025) are both
present.
**Step 6.2:** Minor context conflict due to condition difference in line
979. Would need a trivial backport adjustment, or `git apply --3way`
could handle it.
**Step 6.3:** No related fix already in stable for this specific leak.
### PHASE 7: SUBSYSTEM CONTEXT
- Subsystem: `drivers/gpu/drm/amd/amdgpu` — GPU driver
- Criticality: IMPORTANT — AMD GPUs are extremely common in desktops,
servers, and workstations
- Active subsystem with frequent changes
- Record: [IMPORTANT] AMD GPU driver, widely used hardware.
### PHASE 8: IMPACT AND RISK
**Step 8.1:** Affected users: All users with AMD GPUs using
userqueue/timeline syncobj features (Mesa Vulkan, ROCm).
**Step 8.2:** Trigger: Any GPU application calling the VA ioctl with a
timeline syncobj. Repeated calls (normal GPU operation) cause cumulative
memory leak.
**Step 8.3:** Failure mode: Memory leak in hot ioctl path → eventual OOM
under sustained GPU workloads. Severity: **HIGH** (gradual resource
exhaustion).
**Step 8.4:**
- BENEFIT: High — prevents memory leak in frequently-called GPU ioctl
- RISK: Very low — 5-line fix adding standard NULL-and-free patterns,
reviewed by subsystem maintainer
- Record: High benefit, very low risk.
### PHASE 9: SYNTHESIS
**Evidence FOR backporting:**
- Fixes a real resource leak (syncobj refcount + chain memory) in a
userspace-facing ioctl
- Every call with timeline syncobj leaks resources — cumulative, leads
to OOM under sustained use
- Small, surgical fix: 5 meaningful lines in a single file
- Reviewed by Christian König (DRM/amdgpu co-maintainer)
- Obviously correct: standard cleanup patterns (NULL-after-consume,
free/put at cleanup label)
- Buggy code exists in the 7.0 tree (confirmed via blame)
- Very low regression risk: `kfree(NULL)` is safe, NULL checks guard all
puts
**Evidence AGAINST backporting:**
- Minor context conflict (condition text differs between mainline and
7.0) — trivially resolvable
- Timeline syncobj feature is relatively new (Sep 2024) — may not affect
all users yet
**Stable rules checklist:**
1. Obviously correct and tested? **YES** — standard cleanup pattern,
reviewed by co-maintainer
2. Fixes a real bug? **YES** — resource leak in ioctl path
3. Important issue? **YES** — memory leak in hot path → eventual OOM
4. Small and contained? **YES** — 5 lines, single file
5. No new features? **YES** — only adds missing cleanup
6. Can apply to stable? **YES** — with minor context adjustment
### Verification
- [Phase 1] Parsed tags: Reviewed-by Christian König (co-maintainer),
SOB Alex Deucher (maintainer)
- [Phase 2] Diff analysis: 5 lines added — NULL assignment in helper
error path, NULL assignment after chain consumption, 3-line cleanup in
error label (chain free + syncobj put)
- [Phase 3] git blame: timeline code introduced in 70773bef4e091f (Sep
2024), moved by ad6c120f688803 (Feb 2025), both in 7.0 tree
- [Phase 3] git show ad6c120f688803: confirmed this commit moved
timeline allocation before switch without adding cleanup — the root
cause
- [Phase 3] git show bd8150a1b3370: v4 refactor inlined timeline
handling, still no cleanup
- [Phase 3] git show efdc66fe12b07: v7 refactor, still no cleanup
- [Phase 5] amdgpu_gem_va_ioctl is DRM ioctl handler — directly callable
from userspace, hot path for GPU apps
- [Phase 5] Confirmed drm_syncobj_add_point() consumes chain
(dma_fence_chain_init + rcu_assign_pointer), so NULL-after-use is
correct
- [Phase 5] Confirmed dma_fence_chain_free(NULL) is safe (just
kfree(NULL))
- [Phase 6] Verified no drm_syncobj_put(timeline_syncobj) in current 7.0
file — bug confirmed present
- [Phase 6] Minor context conflict: 7.0 has `!adev->debug_vm`, mainline
has `(!adev->debug_vm || timeline_syncobj)` — needs trivial adjustment
- [Phase 8] Failure mode: cumulative memory/refcount leak → eventual
OOM, severity HIGH
- UNVERIFIED: Could not access lore.kernel.org for original patch
discussion (blocked by Anubis)
**YES**
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index c4839cf2dce37..3f95aca700264 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -107,6 +107,7 @@ amdgpu_gem_update_timeline_node(struct drm_file *filp,
*chain = dma_fence_chain_alloc();
if (!*chain) {
drm_syncobj_put(*syncobj);
+ *syncobj = NULL;
return -ENOMEM;
}
@@ -983,6 +984,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
timeline_chain,
fence,
args->vm_timeline_point);
+ timeline_chain = NULL;
}
}
dma_fence_put(fence);
@@ -990,6 +992,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
}
error:
+ dma_fence_chain_free(timeline_chain);
+ if (timeline_syncobj)
+ drm_syncobj_put(timeline_syncobj);
drm_exec_fini(&exec);
error_put_gobj:
drm_gem_object_put(gobj);
--
2.53.0