From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sasha Levin
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Sunil Khatri, Christian König, Alex Deucher, Sasha Levin, airlied@gmail.com, simona@ffwll.ch, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] drm/amdgpu/userq: remove queue from doorbell xa during clean up
Date: Mon, 20 Apr 2026 09:18:44 -0400
Message-ID: <20260420132314.1023554-130-sashal@kernel.org>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>
References: <20260420132314.1023554-1-sashal@kernel.org>
Precedence: bulk
X-Mailing-List: patches@lists.linux.dev
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-stable: review
X-Patchwork-Hint: Ignore
X-stable-base: Linux 7.0
Content-Transfer-Encoding: 8bit

From: Sunil Khatri

[ Upstream commit a978ed3d6454e4aa1a2ac74051ffd77b7d263e44 ]

If function amdgpu_userq_map_helper fails we do need to clean up and
remove the queue from the userq_doorbell_xa.

Signed-off-by: Sunil Khatri
Reviewed-by: Christian König
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed. Here's my complete analysis:

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: PARSE THE SUBJECT LINE
- **Subsystem**: `drm/amdgpu/userq` (AMD GPU userqueue subsystem)
- **Action verb**: "remove" (cleanup action - removing stale entry)
- **Summary**: Remove queue from doorbell XArray during error path cleanup when `amdgpu_userq_map_helper` fails.

Record: [drm/amdgpu/userq] [remove/clean up] [Fix missing doorbell XArray cleanup on map_helper failure path]

### Step 1.2: PARSE ALL COMMIT MESSAGE TAGS
- **Signed-off-by: Sunil Khatri** - Author of the fix
- **Reviewed-by: Christian König** - Prominent AMD GPU kernel developer and subsystem maintainer
- **Signed-off-by: Alex Deucher** - AMD GPU subsystem maintainer (commit path)
- No Fixes: tag, no Cc: stable, no Reported-by, no Link: tags

Record: Reviewed by Christian König (key AMD GPU developer). Signed off by Alex Deucher (subsystem maintainer). No Fixes tag - expected for manual review.

### Step 1.3: ANALYZE THE COMMIT BODY TEXT
The commit message is brief but clear: "If function amdgpu_userq_map_helper fails we do need to clean up and remove the queue from the userq_doorbell_xa." This explicitly states a missing cleanup on an error path. The failure mode is a stale entry left in the doorbell XArray after the queue memory has been freed.

Record: Bug = missing resource cleanup on error path. Symptom = stale/dangling pointer in `userq_doorbell_xa` after kfree.

### Step 1.4: DETECT HIDDEN BUG FIXES
This IS a bug fix, not hidden at all. The commit explicitly adds missing error path cleanup. Without this fix, a use-after-free occurs because the kfree'd queue remains in the doorbell XArray.

Record: Yes - this is a genuine error path resource cleanup fix (UAF prevention).

---

## PHASE 2: DIFF ANALYSIS - LINE BY LINE

### Step 2.1: INVENTORY THE CHANGES
- **Files changed**: 1 (`drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c`)
- **Lines added**: 1 line
- **Functions modified**: `amdgpu_userq_create`
- **Scope**: Single-file, single-line surgical fix in an error path.

Record: +1 line in 1 file, modifying error handling in `amdgpu_userq_create`. Minimal surgical fix.

### Step 2.2: UNDERSTAND THE CODE FLOW CHANGE
In `amdgpu_userq_create`:
1. Line 863: `xa_store_irq(&adev->userq_doorbell_xa, index, queue, ...)` stores the queue in the doorbell XArray
2. Line 870-871: `xa_alloc(&uq_mgr->userq_xa, &qid, queue, ...)` allocates a queue ID
3. Line 891: `amdgpu_userq_map_helper(queue)` tries to map the queue

**Before fix**: When `amdgpu_userq_map_helper` fails (line 892-899), the error path does: `xa_erase(userq_xa)`, `fence_driver_free`, `mqd_destroy`, `kfree(queue)` - but does NOT erase from `userq_doorbell_xa`.

**After fix**: Adds `xa_erase_irq(&adev->userq_doorbell_xa, index)` before the other cleanup calls, properly removing the stale entry.

Record: The fix adds the missing doorbell XArray cleanup so that after kfree(queue), no dangling pointer remains in userq_doorbell_xa.

### Step 2.3: IDENTIFY THE BUG MECHANISM
**Category**: Memory safety / Use-after-free

The queue is stored in `userq_doorbell_xa` at line 863. When `amdgpu_userq_map_helper` fails, the queue is kfree'd at line 897. But the doorbell XArray still holds the pointer to freed memory. This pointer is accessed in 6 different `xa_for_each(&adev->userq_doorbell_xa, ...)` loops:
- `amdgpu_userq_suspend` (line 1445): accesses `queue->userq_mgr`
- `amdgpu_userq_resume` (line 1471): accesses `queue->userq_mgr`
- `amdgpu_userq_stop_sched_for_enforce_isolation` (line 1501): accesses `queue->userq_mgr`, `queue->queue_type`
- `amdgpu_userq_start_sched_for_enforce_isolation` (line 1535): same
- `amdgpu_userq_pre_reset` (line 1589): accesses `queue->userq_mgr`, `queue->state`
- `amdgpu_userq_post_reset` (line 1617): accesses `queue->state`

Record: UAF - freed queue memory accessed via stale doorbell XArray entry during suspend/resume/reset/enforce-isolation operations.

### Step 2.4: ASSESS THE FIX QUALITY
- The fix is obviously correct: `xa_erase_irq` is the right API (matches the cleanup function at line 463)
- It's minimal: single line
- No regression risk: it only affects the error path
- The cleanup function `amdgpu_userq_cleanup` at line 463 does the same `xa_erase_irq` call

Record: Obviously correct, minimal, no regression risk. Uses same pattern as the normal cleanup path.

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: BLAME THE CHANGED LINES
- Line 863 (xa_store_irq): introduced by `f18719ef4bb7b0` (Jesse Zhang, Oct 2025) - "Convert amdgpu userqueue management from IDR to XArray"
- Lines 891-899 (map_helper error path): originally from `94976e7e5ede65` (Alex Deucher, Apr 2025), with refactoring by `dc21e39fd20c77` (Lijo Lazar, Nov 2025)

The bug was introduced by the IDR-to-XArray conversion (`f18719ef4bb7b0`). When replacing `idr_remove` with `xa_erase`, the author forgot to add `xa_erase_irq` for the new `userq_doorbell_xa` in the `amdgpu_userq_map_helper` error path.

Record: Bug introduced by f18719ef4bb7b0 (Oct 2025 XArray conversion). Present in 7.0 tree.

### Step 3.2: FOLLOW THE FIXES: TAG
No Fixes: tag present. The logical Fixes: target would be `f18719ef4bb7b0` which IS in this 7.0 tree.

Record: The buggy commit f18719ef4bb7b0 exists in the stable tree.

### Step 3.3: CHECK FILE HISTORY FOR RELATED CHANGES
Recent history shows heavy refactoring of this file, including the refcount commit (`65b5c326ce4103`, Mar 2026), XArray conversion, and multiple error handling fixes. The userqueue code is under active development.

Record: Actively developed file. Standalone fix - no series dependency in subject.

### Step 3.4: CHECK THE AUTHOR'S OTHER COMMITS
Sunil Khatri is a regular AMD GPU contributor with extensive commit history (30+ commits in `drivers/gpu/drm/amd/`). He is familiar with the codebase and has authored multiple cleanup/fix patches.

Record: Regular AMD GPU contributor with subsystem knowledge.

### Step 3.5: CHECK FOR DEPENDENT/PREREQUISITE COMMITS
**CRITICAL FINDING**: The diff context shows that in mainline, the `xa_alloc` error path (line 872-879 in stable) already contains `xa_erase_irq(&adev->userq_doorbell_xa, index)`. However, in the current stable tree, this line is MISSING from the `xa_alloc` error path. This means there is a prerequisite commit that fixed the `xa_alloc` error path, and this commit only fixes the `amdgpu_userq_map_helper` error path.

Record: Prerequisite exists - the xa_alloc error path fix must be applied first for this patch to apply cleanly. The patch context won't match the stable tree without it.

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

### Step 4.1-4.5: PATCH DISCUSSION
I was unable to find the exact lore discussion for this commit via b4 dig or web search. The commit is too new to have been indexed. However, the review chain is clear: Reviewed-by Christian König, Signed-off-by Alex Deucher - both are the primary AMD GPU kernel maintainers.

Record: Could not find lore URL. Reviewed by top AMD GPU maintainers.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1-5.2: FUNCTION AND CALLER ANALYSIS
- `amdgpu_userq_create` is called from `amdgpu_userq_ioctl` (line 1024), which is a DRM IOCTL handler - **reachable from userspace**.
- The buggy path (map_helper failure) is exercised when GPU hardware mapping fails, which can happen during resource contention, hardware errors, or device issues.
- The stale entry is then accessed by suspend/resume/reset paths which iterate `userq_doorbell_xa`.

Record: Bug is reachable from userspace IOCTL. UAF is triggered during subsequent suspend/resume/reset operations.

### Step 5.3-5.5: CALL CHAIN
Userspace → `amdgpu_userq_ioctl` → `amdgpu_userq_create` → `amdgpu_userq_map_helper` fails → stale doorbell_xa entry → any `xa_for_each(&adev->userq_doorbell_xa)` → UAF

Record: Clear call chain from userspace to bug trigger to UAF exploitation.

---

## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS

### Step 6.1: DOES THE BUGGY CODE EXIST IN STABLE?
Yes. The `userq_doorbell_xa` was introduced by `f18719ef4bb7b0` which is in the 7.0 tree. The `amdgpu_userq_map_helper` error path at line 891-899 exists and is missing the cleanup.

Record: Buggy code exists in 7.0 stable tree.

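To make the failure mode concrete, here is a minimal userspace sketch of the error path, not the driver code itself. Every name in it (`struct table`, `table_store`, `table_erase`, `table_count`, `map_helper`, `queue_create`) is a hypothetical stand-in: `struct table` loosely plays the role of an XArray, `doorbells` that of `userq_doorbell_xa`, and `queues` that of `userq_xa`. The `erase_doorbell` flag only exists so the pre-fix and post-fix behaviour can be compared side by side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS 8

/* Hypothetical stand-in for an XArray: a fixed table of pointers. */
struct table {
	bool used[SLOTS];
	void *slot[SLOTS];
};

static void table_store(struct table *t, size_t i, void *p)
{
	assert(i < SLOTS);
	t->used[i] = true;
	t->slot[i] = p;
}

static void table_erase(struct table *t, size_t i)
{
	assert(i < SLOTS);
	t->used[i] = false;
	t->slot[i] = NULL;
}

/* Count occupied slots; never reads the stored pointers. */
static size_t table_count(const struct table *t)
{
	size_t n = 0;

	for (size_t i = 0; i < SLOTS; i++)
		if (t->used[i])
			n++;
	return n;
}

struct queue {
	int state;
};

/* Simulated mapping step; 'fail' forces the error path. */
static int map_helper(struct queue *q, bool fail)
{
	(void)q;
	return fail ? -1 : 0;
}

/*
 * Models the shape of the create path: register the queue in two
 * tables, try to map it, and unwind on failure. With erase_doorbell
 * set to false, the doorbell table keeps an entry for freed memory.
 */
static int queue_create(struct table *doorbells, struct table *queues,
			size_t index, size_t qid,
			bool fail_map, bool erase_doorbell)
{
	struct queue *q = calloc(1, sizeof(*q));

	if (!q)
		return -1;

	table_store(doorbells, index, q);
	table_store(queues, qid, q);

	if (map_helper(q, fail_map)) {
		if (erase_doorbell)
			table_erase(doorbells, index); /* the added step */
		table_erase(queues, qid);
		free(q);
		return -1;
	}
	return 0;
}
```

With `erase_doorbell` false, a failed create leaves one occupied slot in `doorbells` that refers to freed memory, which is the situation any later walk over that table would stumble into. With it true, both tables are empty after the failure.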
### Step 6.2: BACKPORT COMPLICATIONS
The diff context doesn't match the stable tree exactly. The `xa_alloc` error path in mainline already has `xa_erase_irq`, but the stable tree doesn't. This means the patch needs either a prerequisite commit or manual rework to apply cleanly.

Record: Won't apply cleanly - needs prerequisite fix for xa_alloc error path or minor rework.

### Step 6.3: RELATED FIXES ALREADY IN STABLE
No related fix for this specific issue exists in the stable tree.

Record: No prior fix exists.

---

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

### Step 7.1: SUBSYSTEM CRITICALITY
- **Subsystem**: `drivers/gpu/drm/amd/amdgpu` - AMD GPU driver
- **Criticality**: IMPORTANT - widely used GPU driver on desktop/laptop systems
- Userqueue is a newer feature but actively used

Record: IMPORTANT subsystem - AMD GPU is widely deployed.

### Step 7.2: SUBSYSTEM ACTIVITY
Extremely active - 10+ changes per month to this specific file. The userqueue code is under heavy development.

Record: Very active, rapidly evolving code.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: AFFECTED USERS
Users of AMD GPUs with userqueue support enabled. This includes modern AMD Radeon hardware.

Record: Driver-specific but for widely-used AMD GPUs.

### Step 8.2: TRIGGER CONDITIONS
Triggered when `amdgpu_userq_map_helper` fails during queue creation (e.g., hardware map failure, resource exhaustion). The UAF manifests when subsequent suspend/resume/reset operations iterate the doorbell XArray.

Record: Triggered by queue creation failure followed by system operation (suspend/resume/reset). Not extremely rare.

### Step 8.3: FAILURE MODE SEVERITY
**Use-after-free** - the doorbell XArray holds a dangling pointer to freed memory. When the 6 `xa_for_each` loops iterate, they dereference `queue->userq_mgr`, `queue->state`, `queue->queue_type`. This can cause:
- Kernel oops/panic (most likely)
- Data corruption (if freed memory is reallocated)
- Potential security vulnerability (UAF with userspace-controlled trigger)

Record: UAF → CRITICAL (kernel crash, potential security issue).

### Step 8.4: RISK-BENEFIT RATIO
- **Benefit**: HIGH - prevents UAF in a userspace-reachable GPU code path
- **Risk**: VERY LOW - single line addition to error path, using established API pattern
- **Ratio**: Very favorable

Record: High benefit, very low risk.

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: COMPILE THE EVIDENCE
**FOR backporting:**
- Fixes a real UAF bug (stale pointer in doorbell XArray after kfree)
- Single-line fix - minimal scope
- Obviously correct - mirrors the cleanup function at line 463
- Reviewed by Christian König (AMD GPU subsystem expert)
- Signed off by Alex Deucher (AMD GPU maintainer)
- Bug is reachable from userspace IOCTL
- UAF can be triggered during common operations (suspend/resume/reset)
- Buggy code exists in stable tree (introduced by f18719ef4bb7b0)

**AGAINST backporting:**
- No Fixes: tag (expected for manual review candidates)
- Prerequisite commit needed for clean application (xa_alloc error path fix)
- Userqueue is a newer, rapidly evolving feature
- May need rework to apply to stable

### Step 9.2: STABLE RULES CHECKLIST
1. **Obviously correct and tested?** YES - single line, matches existing cleanup pattern, reviewed by maintainer
2. **Fixes a real bug?** YES - UAF in error path
3. **Important issue?** YES - UAF → potential crash or security issue
4. **Small and contained?** YES - 1 line in 1 file
5. **No new features or APIs?** Correct - pure bugfix
6. **Can apply to stable?** Needs prerequisite or minor rework

### Step 9.3: EXCEPTION CATEGORIES
Not an exception category - this is a standard bug fix.

### Step 9.4: DECISION
The fix addresses a genuine use-after-free vulnerability in the AMD GPU userqueue error handling path. The bug is reachable from userspace via DRM IOCTL, and the UAF can be triggered during suspend/resume/reset operations. The fix is a single-line, obviously correct addition that mirrors the existing cleanup function pattern. It was reviewed by the subsystem's top maintainers.

The only concern is that the patch may need a prerequisite commit (xa_alloc error path fix) to apply cleanly. However, this is a practical application concern, not a question of whether the fix should be backported.

---

## Verification
- [Phase 1] Parsed tags: Reviewed-by Christian König, Signed-off-by Alex Deucher (both AMD GPU maintainers)
- [Phase 2] Diff analysis: 1 line added to amdgpu_userq_map_helper error path in amdgpu_userq_create(); adds missing xa_erase_irq for userq_doorbell_xa
- [Phase 3] git blame: Buggy code introduced by f18719ef4bb7b0 (Jesse Zhang, Oct 2025, XArray conversion), confirmed present in stable tree
- [Phase 3] git show f18719ef4bb7b0: Confirmed the XArray conversion failed to add xa_erase_irq in both xa_alloc and map_helper error paths
- [Phase 3] Checked xa_alloc error path (line 872-879): CONFIRMED missing xa_erase_irq in stable tree, indicating a prerequisite commit exists
- [Phase 4] b4 dig: Failed to find match (commit likely too recent); web search also unsuccessful
- [Phase 4] Web search: Found related refcount commit discussion on lists.freedesktop.org
- [Phase 5] Grep for amdgpu_userq_create callers: Called from amdgpu_userq_ioctl (DRM IOCTL handler), reachable from userspace
- [Phase 5] Grep for xa_for_each userq_doorbell_xa: Found 6 iteration sites (suspend/resume/reset/enforce-isolation) - all access freed queue members
- [Phase 5] Read amdgpu_userq_cleanup (line 450-469): CONFIRMED correct cleanup uses xa_erase_irq, validating the fix pattern
- [Phase 6] Buggy code (f18719ef4bb7b0) confirmed in stable tree via git log
- [Phase 6] Context mismatch identified: mainline xa_alloc error path has xa_erase_irq but stable doesn't - prerequisite needed
- [Phase 8] Failure mode: UAF → kernel oops/potential security vulnerability, severity CRITICAL
- UNVERIFIED: Could not find the exact mainline commit hash or lore discussion thread

**YES**

 drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
index 819c4c26416c3..1849894d2837b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
@@ -892,6 +892,7 @@ amdgpu_userq_create(struct drm_file *filp, union drm_amdgpu_userq *args)
 	r = amdgpu_userq_map_helper(queue);
 	if (r) {
 		drm_file_err(uq_mgr->file, "Failed to map Queue\n");
+		xa_erase_irq(&adev->userq_doorbell_xa, index);
 		xa_erase(&uq_mgr->userq_xa, qid);
 		amdgpu_userq_fence_driver_free(queue);
 		uq_funcs->mqd_destroy(queue);
-- 
2.53.0