From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Sunil Khatri" <sunil.khatri@amd.com>,
"Christian König" <christian.koenig@amd.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Sasha Levin" <sashal@kernel.org>,
airlied@gmail.com, simona@ffwll.ch,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] drm/amdgpu/userq: remove queue from doorbell xarray
Date: Mon, 20 Apr 2026 09:08:38 -0400 [thread overview]
Message-ID: <20260420131539.986432-52-sashal@kernel.org> (raw)
In-Reply-To: <20260420131539.986432-1-sashal@kernel.org>
From: Sunil Khatri <sunil.khatri@amd.com>
[ Upstream commit f0e46fd06c3f7590b0f06aea3c877f441c2cbccc ]
In case of failure in xa_alloc, remove the queue during
clean up from the userq_doorbell_xa.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Here is my complete analysis:
---
## PHASE 1: COMMIT MESSAGE FORENSICS
**Step 1.1: Subject Line**
- Subsystem: `drm/amdgpu/userq`
- Action verb: "remove" (corrective: removing a stale entry from data
structure on error path)
- Summary: Remove queue from doorbell xarray during cleanup when
xa_alloc fails
**Step 1.2: Tags**
- `Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>` - author, regular
AMD GPU contributor
- `Reviewed-by: Christian König <christian.koenig@amd.com>` - AMD GPU
subsystem co-maintainer
- `Signed-off-by: Alex Deucher <alexander.deucher@amd.com>` - AMD GPU
subsystem maintainer
- No Fixes: tag (expected for manual review candidates)
- No Cc: stable tag
**Step 1.3: Commit Body Analysis**
The message says: "In case of failure in xa_alloc, remove the queue
during clean up from the userq_doorbell_xa." This clearly describes a
missing cleanup in an error path: the queue gets stored in
`userq_doorbell_xa` but is not removed if the subsequent `xa_alloc`
fails.
**Step 1.4: Hidden Bug Fix Detection**
This is not hidden - it's straightforwardly described as an error path
cleanup fix. However, the severity is greater than the commit message
suggests. It's actually a **use-after-free**: the queue is freed
(`kfree(queue)`) but its pointer remains in the doorbell xarray.
## PHASE 2: DIFF ANALYSIS
**Step 2.1: Inventory**
- Single file: `drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c`
- 1 line added: `xa_erase_irq(&adev->userq_doorbell_xa, index);`
- Function modified: `amdgpu_userq_create()`
- Scope: Single-file surgical fix, error path only
**Step 2.2: Code Flow Change**
In `amdgpu_userq_create()`:
1. Line 863: `xa_store_irq(&adev->userq_doorbell_xa, index, queue,
GFP_KERNEL)` - stores queue in global doorbell xarray
2. Line 870: `xa_alloc(&uq_mgr->userq_xa, &qid, queue, ...)` - attempts
to allocate a queue ID
3. If xa_alloc fails (line 872): error path calls
`amdgpu_userq_fence_driver_free()`, `mqd_destroy()`, `kfree(queue)` -
but did NOT remove the entry from `userq_doorbell_xa`
4. Fix adds `xa_erase_irq(&adev->userq_doorbell_xa, index)` before the
kfree
**Step 2.3: Bug Mechanism**
This is a **use-after-free** caused by a missing cleanup on error path.
After `kfree(queue)`, the freed pointer remains in `userq_doorbell_xa`.
Six functions in the same file iterate `userq_doorbell_xa` with
`xa_for_each()` and dereference the queue pointer:
- `amdgpu_userq_suspend()` (line 1445)
- `amdgpu_userq_resume()` (line 1471)
- `amdgpu_userq_stop_sched_for_enforce_isolation()` (line 1501)
- `amdgpu_userq_start_sched_for_enforce_isolation()` (line 1535)
- `amdgpu_userq_pre_reset()` (line 1589)
- `amdgpu_userq_post_reset()` (line 1617)
All would crash or corrupt memory by accessing `queue->userq_mgr`,
`queue->state`, etc.
**Step 2.4: Fix Quality**
The fix is obviously correct: it mirrors the cleanup in the normal
destroy path (`amdgpu_userq_cleanup()` at line 463 already calls
`xa_erase_irq(&adev->userq_doorbell_xa, queue->doorbell_index)`). The
fix uses `xa_erase_irq()` matching the `xa_store_irq()` used during
insertion. No regression risk.
## PHASE 3: GIT HISTORY
**Step 3.1: Blame**
The buggy error path (xa_alloc failure without doorbell cleanup) was
introduced by commit `f18719ef4bb7b` (Jesse.Zhang, October 2025) which
converted IDR to XArray. When the XArray conversion added
`xa_store_irq(&adev->userq_doorbell_xa, ...)` before `xa_alloc`, the
error cleanup for `xa_alloc` failure was not updated to remove the
doorbell entry.
**Step 3.2: Fixes Target**
No Fixes: tag present. The bug was introduced by f18719ef4bb7b.
**Step 3.3: Related Changes**
The companion commit `65b5c326ce410` ("refcount userqueues to avoid race
conditions") was cherry-picked with `Cc: stable@vger.kernel.org`,
showing this area is actively being stabilized for stable trees.
**Step 3.4: Author**
Sunil Khatri is a regular AMD GPU contributor with 11 commits to this
file alone. He also authored the refcount fix.
**Step 3.5: Dependencies**
This fix is standalone - it requires only that the doorbell xarray code
(from f18719ef4bb7b) exists, which it does in 7.0.
## PHASE 4: MAILING LIST RESEARCH
The exact commit "remove queue from doorbell xarray" was not found
independently on lore.kernel.org. However, the v2 series of the refcount
patchset (20260302180852) was a 2-patch series that included related
doorbell cleanup work. The patch was reviewed by Christian König (AMD
GPU co-maintainer) and committed through Alex Deucher (AMD GPU
maintainer).
## PHASE 5: CODE SEMANTIC ANALYSIS
**Step 5.1: Modified Function**
`amdgpu_userq_create()` - creates usermode queues for AMD GPU userspace
queue submission.
**Step 5.2: Callers**
`amdgpu_userq_create()` is called from `amdgpu_userq_ioctl()` via
`AMDGPU_USERQ_OP_CREATE`, which is a DRM IOCTL callable from userspace.
This is a user-triggerable code path.
**Step 5.3-5.4: Impact Surface**
The stale entry in `userq_doorbell_xa` will be accessed by any
subsequent suspend, resume, reset, or isolation enforcement operation on
the GPU device. These are triggered by:
- System suspend/resume (S3, S0ix) - common laptop operation
- GPU reset (triggered by hang detection)
- Enforce isolation mode changes
## PHASE 6: STABLE TREE ANALYSIS
**Step 6.1: Code in Stable Trees**
The file `amdgpu_userq.c` was first added in the v6.16/6.17 timeframe.
The `userq_doorbell_xa` was introduced by f18719ef4bb7b in the kernel
7.0 development cycle. This code exists in the 7.0 stable tree.
**Step 6.2: Backport Complexity**
This is a clean 1-line addition with no context conflicts expected.
## PHASE 7: SUBSYSTEM CONTEXT
- Subsystem: `drivers/gpu/drm/amd/amdgpu` - AMD GPU driver
- Criticality: IMPORTANT - widely used GPU driver on desktop/laptop
systems
- Actively developed subsystem with heavy recent churn
## PHASE 8: IMPACT AND RISK ASSESSMENT
**Step 8.1: Who is Affected**
Users with AMD GPUs using the usermode queue feature (GFX/Compute/DMA
queues).
**Step 8.2: Trigger Conditions**
The bug requires:
1. A userqueue creation attempt where `xa_store_irq` succeeds but
`xa_alloc` fails (e.g., hitting `AMDGPU_MAX_USERQ_COUNT` limit or
memory pressure)
2. Followed by any suspend, resume, reset, or isolation operation
The `xa_alloc` failure can be triggered by resource exhaustion or
concurrent queue creation under memory pressure.
**Step 8.3: Failure Mode Severity**
- **CRITICAL**: Use-after-free on freed queue pointer leads to kernel
crash (oops/panic), potential memory corruption, or security
vulnerability. The freed pointer is dereferenced to access
`queue->userq_mgr`, `queue->state`, etc.
**Step 8.4: Risk-Benefit Ratio**
- BENEFIT: Very high - prevents UAF crash on a common GPU driver
- RISK: Very low - single line addition, mirrors existing cleanup
pattern, obviously correct
- Ratio: Strongly favorable
## PHASE 9: FINAL SYNTHESIS
**Evidence FOR backporting:**
- Fixes a real use-after-free bug (stale pointer in xarray after kfree)
- Single-line fix, surgically targeted to one error path
- Reviewed by Christian König (AMD GPU co-maintainer)
- Obviously correct (mirrors cleanup in normal destroy path)
- Related refcount commit already tagged Cc: stable
- User-triggerable code path (IOCTL)
- Failure mode is kernel crash or memory corruption
**Evidence AGAINST backporting:**
- Relatively new code (userq feature added in v6.16-7.0 timeframe)
- Trigger requires specific failure condition (xa_alloc failure)
**Stable Rules Checklist:**
1. Obviously correct and tested? YES - mirrors existing cleanup,
reviewed by maintainer
2. Fixes a real bug? YES - use-after-free
3. Important issue? YES - potential crash, memory corruption, security
4. Small and contained? YES - 1 line added
5. No new features? YES - pure bug fix
6. Can apply to stable? YES - clean 1-line addition
## Verification
- [Phase 1] Parsed tags: Reviewed-by Christian König, Signed-off-by Alex
Deucher and Sunil Khatri
- [Phase 2] Diff analysis: 1 line added in xa_alloc error path of
amdgpu_userq_create(), adds missing xa_erase_irq() for doorbell xarray
- [Phase 3] git blame: buggy code introduced in f18719ef4bb7b (Oct
2025), XArray conversion commit
- [Phase 3] git show f18719ef4bb7b: confirmed it added xa_store_irq to
doorbell_xa without corresponding xa_erase_irq in xa_alloc failure
path
- [Phase 3] Related commit 65b5c326ce410 already has Cc: stable
(confirmed via git show)
- [Phase 4] b4 dig -c 65b5c326ce410: found patch at
https://patch.msgid.link/20260303120849.2583663-1-sunil.khatri@amd.com
- [Phase 4] Mailing list archive: found v2 series and review discussion,
Christian König reviewed
- [Phase 5] Grep for userq_doorbell_xa: found 6 xa_for_each iteration
sites that would dereference the freed pointer
- [Phase 5] amdgpu_userq_create called from amdgpu_userq_ioctl (user-
triggerable via DRM IOCTL)
- [Phase 6] amdgpu_userq.c first added around v6.16, doorbell_xa added
in v7.0 cycle
- [Phase 8] Failure mode: UAF -> kernel crash/memory corruption,
severity CRITICAL
This is a small, surgical, obviously correct fix for a real use-after-
free bug in the AMD GPU driver's error handling path. It was reviewed by
the subsystem co-maintainer and carries essentially zero regression
risk.
**YES**
drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
index 0a1b93259887a..819c4c26416c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
@@ -872,6 +872,7 @@ amdgpu_userq_create(struct drm_file *filp, union drm_amdgpu_userq *args)
if (r) {
drm_file_err(uq_mgr->file, "Failed to allocate a queue id\n");
amdgpu_userq_fence_driver_free(queue);
+ xa_erase_irq(&adev->userq_doorbell_xa, index);
uq_funcs->mqd_destroy(queue);
kfree(queue);
r = -ENOMEM;
--
2.53.0
next prev parent reply other threads:[~2026-04-20 13:17 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20260420131539.986432-1-sashal@kernel.org>
2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.18] media: rc: fix race between unregister and urb/irq callbacks Sasha Levin
2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-5.10] fbdev: viafb: check ioremap return value in viafb_lcd_get_mobile_state Sasha Levin
2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.19] drm/panel-edp: Add BOE NV153WUM-N42, CMN N153JCA-ELK, CSW MNF307QS3-2 Sasha Levin
2026-04-20 13:08 ` Sasha Levin [this message]
2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.12] nouveau: pci: quiesce GPU on shutdown Sasha Levin
2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Fix dcn401_optimize_bandwidth Sasha Levin
2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.18] drm/bridge: waveshare-dsi: Register and attach our DSI device at probe Sasha Levin
2026-04-20 13:09 ` [PATCH AUTOSEL 6.18] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size Sasha Levin
2026-04-20 13:09 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: fix DF NULL pointer issue for soc24 Sasha Levin
2026-04-20 13:09 ` [PATCH AUTOSEL 7.0-6.18] drm/ttm: Avoid invoking the OOM killer when reading back swapped content Sasha Levin
2026-04-20 13:09 ` [PATCH AUTOSEL 6.18] drm/vc4: Release runtime PM reference after binding V3D Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260420131539.986432-52-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=linux-kernel@vger.kernel.org \
--cc=patches@lists.linux.dev \
--cc=simona@ffwll.ch \
--cc=stable@vger.kernel.org \
--cc=sunil.khatri@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox