From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 65E8435975; Tue, 24 Jun 2025 04:11:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750738282; cv=none; b=H4GrHX5zK3zMrA5Lloi7FWm1kX8KeJlgvl6XxKxaObfUo37O7clLYvSBlq4fZ7QhxEzUK75HDpikb0T317NxRCyR14QVZFx9ACaCR2Gw1z4rl/rGSvCyYeXWIQDj+n4tOdn8aDHN6BOfT8P3TizuA+/xh595DcyYYNwNIPpM9Oo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750738282; c=relaxed/simple; bh=1MLYXwjtGkPFQmxXtkAF7zzqFRL+K8WYQiuPpyHDx6s=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-Type; b=N3fjLVfzHlcf1GlmFuQcaqDeL0AV41Sbsh5NsfzH0nQMZY4Hdw829WFbl3zPp521L8I8uScp4P4VmPKAsSbJ8PgXOLgle1+cg+g1erIL7bJBkvhLHnro6oVz4I5ed/Iyc7PIxC4bcQAYAkn5u1zRKn0H+JJgFIbB8bZb7sOEOWQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=uHS3kymz; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="uHS3kymz" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 8C6F0C4CEF0; Tue, 24 Jun 2025 04:11:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1750738281; bh=1MLYXwjtGkPFQmxXtkAF7zzqFRL+K8WYQiuPpyHDx6s=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=uHS3kymzCJNbWrRHwWWa5cmsO+xhMJBGy/By0DrvHOgv0opPJg3Rxp0W42+vrGPvO J2vE8GuJiXIvwkciFjuzoKk8x9pLen5465r/v1y7cA3iAW9OGIE0gKntjSM/PxUuIf geayqUzCpyZGUkZ+7WCNyexdWcp60E805/mBnTWigYJ5w8m1r6iGYfatPZnCQ9kFwa hotj2C3LHRu5E+XHA9T+H2ggwh4yMWmHs7LXygiRNfv8kCIbm8curdmeDfzUxlv8Lq Gje/D4DhpMp8E34tJhVyOnsh5V4MrF9jwL7Lpug28NyOr4raM+CRloCdGDh+892Ozt eOuP0WORZb+IA== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Rob Clark , Rob Clark , Sasha Levin , lumag@kernel.org, linux-arm-msm@vger.kernel.org, dri-devel@lists.freedesktop.org, freedreno@lists.freedesktop.org Subject: [PATCH AUTOSEL 6.15 02/20] drm/msm: Fix a fence leak in submit error path Date: Tue, 24 Jun 2025 00:11:01 -0400 Message-Id: <20250624041120.83191-2-sashal@kernel.org> X-Mailer: git-send-email 2.39.5 In-Reply-To: <20250624041120.83191-1-sashal@kernel.org> References: <20250624041120.83191-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.15.3 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From: Rob Clark [ Upstream commit 5d319f75ccf7f0927425a7545aa1a22b3eedc189 ] In error paths, we could unref the submit without calling drm_sched_entity_push_job(), so msm_job_free() will never get called. Since drm_sched_job_cleanup() will NULL out the s_fence, we can use that to detect this case. Signed-off-by: Rob Clark Patchwork: https://patchwork.freedesktop.org/patch/653584/ Signed-off-by: Rob Clark Signed-off-by: Sasha Levin --- Let me analyze based on what I know from the commit: **YES** This commit should be backported to stable kernel trees. Here's my extensive analysis: ## Analysis of the Fence Leak Fix ### 1. **Bug Description** The commit fixes a resource leak in the MSM GPU driver's submit error path. When errors occur after `drm_sched_job_arm()` but before `drm_sched_entity_push_job()`, the scheduler fence (`s_fence`) is leaked because `drm_sched_job_cleanup()` is never called. ### 2. **Code Changes Analysis** The fix adds this critical code to `__msm_gem_submit_destroy()`: ```c /* - In error paths, we could unref the submit without calling - drm_sched_entity_push_job(), so msm_job_free() will never - get called. Since drm_sched_job_cleanup() will NULL out - s_fence, we can use that to detect this case. */ if (submit->base.s_fence) drm_sched_job_cleanup(&submit->base); ``` This is a defensive check that ensures proper cleanup of scheduler resources when the submit is destroyed without going through the normal scheduler path. ### 3. **Why This is a Good Backport Candidate** **a) Fixes a Real Bug:** This addresses a concrete resource leak that affects system stability. Each leaked fence consumes memory and other kernel resources that are never freed. **b) Small and Contained:** The fix is minimal - just 9 lines of code including comments. It doesn't change any APIs or data structures. **c) Clear Error Path Fix:** The fix targets a specific error handling path without affecting the normal execution flow. When `drm_sched_entity_push_job()` is called successfully, `msm_job_free()` handles the cleanup as before. **d) Follows Established Patterns:** The fix uses the same pattern as the hw_fence cleanup (checking refcount) that's already in the code, making it consistent with existing error handling. **e) No Architectural Changes:** This is purely a bug fix that plugs a resource leak. It doesn't introduce new features or change behavior. **f) Critical Subsystem:** GPU drivers are important for system stability and user experience. Memory leaks in GPU submission paths can accumulate quickly under heavy graphics workloads. ### 4. **Comparison with Similar Commits** Looking at the similar commits provided: - The first commit (drm/msm: Fix wait_fence submitqueue leak) with backport status YES also fixes a resource leak in error paths - It's similarly small, contained, and fixes a clear bug without architectural changes ### 5. **Risk Assessment** **Low Risk:** - The added check only executes in the destroy path when `s_fence` is non-NULL - If `drm_sched_entity_push_job()` was called, `msm_job_free()` will have already cleaned up `s_fence`, making it NULL - The fix is defensive and won't cause issues even if called unnecessarily ### 6. **Stable Criteria Met** According to stable tree rules: - ✓ Fixes a bug (fence resource leak) - ✓ Minimal change (9 lines) - ✓ Obviously correct (clear logic with explanatory comment) - ✓ Already tested (has Patchwork link indicating review) - ✓ No new features - ✓ Fixes a real issue users could encounter The fence leak could accumulate over time, especially in scenarios with GPU submission failures, leading to memory exhaustion and system instability. This makes it an important fix for stable kernels. drivers/gpu/drm/msm/msm_gem_submit.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c index 3e9aa2cc38ef9..b2aeaecaa39b3 100644 --- a/drivers/gpu/drm/msm/msm_gem_submit.c +++ b/drivers/gpu/drm/msm/msm_gem_submit.c @@ -85,6 +85,15 @@ void __msm_gem_submit_destroy(struct kref *kref) container_of(kref, struct msm_gem_submit, ref); unsigned i; + /* + * In error paths, we could unref the submit without calling + * drm_sched_entity_push_job(), so msm_job_free() will never + * get called. Since drm_sched_job_cleanup() will NULL out + * s_fence, we can use that to detect this case. + */ + if (submit->base.s_fence) + drm_sched_job_cleanup(&submit->base); + if (submit->fence_id) { spin_lock(&submit->queue->idr_lock); idr_remove(&submit->queue->fence_idr, submit->fence_id); -- 2.39.5