From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 14F6724DCE3; Sat, 25 Oct 2025 16:15:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761408915; cv=none; b=ahGXPoBz4XZydou/aFhp5FVGVzep2q5KFbAul87Aoa63cVUykW0onM/M943hozjH4Yx8IjGKPGh/dpCIUpaR2gGoM4QIJj9anH450r+LrO80RpKNY6w3GuG9vO6mj0BMJ4iJO0SpTpkGroEEBWdejUaZLcEnIuY5nr2z5LCfcV8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761408915; c=relaxed/simple; bh=78ERbFKaeO0h6PkzV58N49XWB06R/jyEXp38t/OS7tw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=e7ub4rrJM1orV0PLRDvjCIlwE5E/P1It38tsgtOXsgiz+lOy7iN8FvfkP9VE1r1jWfAGPCasbHHV45QF6X0zl48IO2Y1/TjrYPUwnhwwcjc9z7EPigP+afitYTGfTWzzP7YfFthPiAxJoDX8RbpdtR5g/qUOqRGzkePuDRiqYVg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=ej+8x8/Y; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="ej+8x8/Y" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 8EE70C4CEFB; Sat, 25 Oct 2025 16:15:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1761408913; bh=78ERbFKaeO0h6PkzV58N49XWB06R/jyEXp38t/OS7tw=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=ej+8x8/Y8+G0SLKDGSSSod+6qKvBWEOz3JNDqH/DIsM4OPnegVAE8PHpdjv1gHESJ XFdIRfHD1qXBAn9fG/rcwE0awD57lxGNYWmmJmiPfJVaDLdJCp41WGWR7cJYqbu1fj w766HjEXF9kPUJkK5U9Iy2Zx0HWkHpwcf7yKdlhFjKADX5KBgCw01ekXx5bugyRI2A 
 vLsm89iBpFzSze7iW9+D4iybCD3BYQfdC/ALTG573AS8A4/GgZxMr79nBETN+HI8Lw
 mVpdX8GPgueKWyedKOkhEbRumitCPGSgMzM/IlUXYIh6Rn2ylBkLOhxdOMFggOneeX
 vTWaUJnz5EfCg==
From: Sasha Levin
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Antonino Maniscalco, Akhil P Oommen, Rob Clark, Sasha Levin,
 linux-arm-msm@vger.kernel.org, dri-devel@lists.freedesktop.org,
 freedreno@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17-5.15] drm/msm: make sure to not queue up recovery more than once
Date: Sat, 25 Oct 2025 11:55:58 -0400
Message-ID: <20251025160905.3857885-127-sashal@kernel.org>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>
References: <20251025160905.3857885-1-sashal@kernel.org>
Precedence: bulk
X-Mailing-List: patches@lists.linux.dev
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
X-stable: review
X-Patchwork-Hint: Ignore
X-stable-base: Linux 6.17.5
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Antonino Maniscalco

[ Upstream commit 10fb1b2fcaee5545a5e54db1ed4d7b15c2db50c8 ]

If two fault IRQs arrive in short succession recovery work will be
queued up twice. When recovery runs a second time it may end up killing
an unrelated context. Prevent this by masking off interrupts when
triggering recovery.

Signed-off-by: Antonino Maniscalco
Reviewed-by: Akhil P Oommen
Patchwork: https://patchwork.freedesktop.org/patch/670023/
Signed-off-by: Rob Clark
Signed-off-by: Sasha Levin
---

LLM Generated explanations, may be completely bogus:

YES

- What it fixes: Two fault IRQs arriving back-to-back can queue
  `recover_work` twice; the second recovery may kill an unrelated
  context. The change masks interrupts before queuing recovery so
  subsequent fault IRQs don’t re-queue recovery.
- Core change: In `drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1821-1824`,
  `a6xx_fault_detect_irq()` adds:
  - `gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 0);` to mask all RBBM
    interrupts before
    `kthread_queue_work(gpu->worker, &gpu->recover_work);`
  - This follows the hangcheck timer being disabled, ensuring no further
    spurious recovery triggers while the first recovery proceeds.
- Call path impact: `a6xx_irq()` invokes `a6xx_fault_detect_irq()` on
  hang detect (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1891-1900`). With
  the new mask, subsequent IRQs won’t retrigger recovery for the same
  incident.
- Interrupts are safely restored: During recovery, the GPU is
  reinitialized and interrupts are re-enabled in `a6xx_hw_init()` via
  `REG_A6XX_RBBM_INT_0_MASK` (sets `A6XX_INT_MASK`/`A7XX_INT_MASK`) at
  `drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1410-1413`. `a6xx_recover()`
  calls `msm_gpu_hw_init(gpu)` to perform this re-init
  (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1614`). Additionally, the
  top-level wrapper brackets `hw_init` with `disable_irq/enable_irq`
  (`drivers/gpu/drm/msm/msm_gpu.c:168-174`), so the flow cleanly unmasks
  after reset.
- Consistency with existing patterns: A similar mask-on-fault pattern
  already exists for a7xx SW fuse violations
  (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1831-1834`), indicating this
  is the established approach to prevent repeated fault handling.
- Stable suitability:
  - User-visible bugfix: Prevents erroneous second recovery that can
    kill unrelated contexts.
  - Small and contained: One register write in an error path; no ABI or
    feature changes.
  - Low regression risk: Interrupts are restored during the normal
    recovery/reinit path; only affects a6xx hang/fault handling.
  - No architectural churn, limited to DRM/MSM Adreno a6xx driver.

Conclusion: This is a minimal, targeted fix for a real correctness issue
with low risk and clear recovery restore points, making it a good
candidate for backporting to all supported stable kernels that include
the a6xx driver.
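
The mask-before-queue idea described above can be sketched in plain
userspace C. This is only an illustrative model, not driver code: every
`fake_*` and `FAKE_*` name is hypothetical, the interrupt mask is a plain
integer rather than a hardware register, and a counter stands in for
`kthread_queue_work()` (it counts every request and does not model the
worker's own pending-state handling).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names exist in the msm driver. */
#define FAKE_INT_MASK_ALL    0xffffffffu
#define FAKE_INT_HANG_DETECT (1u << 0)

struct fake_gpu {
	uint32_t int_mask;    /* models REG_A6XX_RBBM_INT_0_MASK */
	unsigned int queued;  /* number of recovery requests seen */
};

/* Models a6xx_fault_detect_irq(): optionally mask, then request recovery. */
static void fake_fault_detect_irq(struct fake_gpu *gpu, bool mask_on_fault)
{
	if (mask_on_fault)
		gpu->int_mask = 0;  /* like writing 0 to the interrupt mask */

	gpu->queued++;          /* stands in for kthread_queue_work() */
}

/* Models IRQ delivery: a masked source never reaches the fault handler. */
static void fake_irq(struct fake_gpu *gpu, uint32_t status, bool mask_on_fault)
{
	if (status & gpu->int_mask & FAKE_INT_HANG_DETECT)
		fake_fault_detect_irq(gpu, mask_on_fault);
}

/* Models the re-init performed during recovery, which unmasks interrupts. */
static void fake_recover(struct fake_gpu *gpu)
{
	gpu->int_mask = FAKE_INT_MASK_ALL;
}
```

In this model, two back-to-back `fake_irq()` calls with `mask_on_fault`
set to false request recovery twice, while with it set to true the second
call is dropped until `fake_recover()` restores the mask.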

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 45dd5fd1c2bfc..f8992a68df7fb 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1727,6 +1727,9 @@ static void a6xx_fault_detect_irq(struct msm_gpu *gpu)
 	/* Turn off the hangcheck timer to keep it from bothering us */
 	timer_delete(&gpu->hangcheck_timer);
 
+	/* Turn off interrupts to avoid triggering recovery again */
+	gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 0);
+
 	kthread_queue_work(gpu->worker, &gpu->recover_work);
 }
 
-- 
2.51.0