From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id A78B0CD6E75 for ; Thu, 4 Jun 2026 20:25:24 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id E2DBF11A2F1; Thu, 4 Jun 2026 20:25:23 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.b="POpaPPXG"; dkim-atps=neutral Received: from sea.source.kernel.org (sea.source.kernel.org [172.234.252.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id B955611A2F1 for ; Thu, 4 Jun 2026 20:25:22 +0000 (UTC) Received: from smtp.kernel.org (quasi.space.kernel.org [100.103.45.18]) by sea.source.kernel.org (Postfix) with ESMTP id 86D6E43CE1; Thu, 4 Jun 2026 20:25:22 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4CDDF1F00893; Thu, 4 Jun 2026 20:25:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1780604722; bh=tCbLSCKRFTAFnnrwh+NCRcDKno79825XHp03R9a4B5w=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=POpaPPXGaDzlU4D+SPF3gZznt/tUC6HSM7ge9+8H9j5NCHqGjNd2w+dfoDAkM/ypm gVM5s5U5Ox9O2JDLGVmCYkr1tvuklFmSxOsbcOg2BM4pogzGWeK39vW/BnbJK6YNZ3 5PtoYEmRUgy/bY7Wg2DmiNs63s+a1pH0FCkY9GmywbpbKCHbbjON7UvHoiwUwzk0Ax 6J8cfhyj9/b/UchC8nQ3GMdnTWxm6QvJ+nYF4AehHbFnQnnVWF6lLilOXFZ63gfE1x KtUootvBAO9E7Rw7ZRGfh3RzpcjwH5dInIJXAH5tWQDvSok3d2WAnMCcPlUMQKLyMd 6xa9mPHp1v9FQ== From: sashiko-bot@kernel.org Subject: Re: [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test To: "Akhil P Oommen" Cc: dri-devel@lists.freedesktop.org In-Reply-To: <20260605-assorted-fixes-june-v1-5-2caa04f7287c@oss.qualcomm.com> References: <20260605-assorted-fixes-june-v1-5-2caa04f7287c@oss.qualcomm.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Thu, 04 Jun 2026 20:25:21 +0000 Message-Id: <20260604202522.4CDDF1F00893@smtp.kernel.org> X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: sashiko-reviews@lists.linux.dev Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Thank you for your contribution! Sashiko AI review found 3 potential issue(= s) to consider: - [High] Toggling `disable_err_irq` back to false via debugfs leaves hardwa= re error interrupts permanently disabled until a GPU reset. - [High] Writing a restrictive interrupt mask in `a6xx_irq` and `a8xx_irq` = drops the `CP_SW` interrupt, actively breaking preemption. - [Medium] The `recover_worker` function unilaterally clears the user-contr= olled `disable_err_irq` debugfs boolean during GPU recovery. -- commit c227655841882ac95655f120b99ea1eb3c758dc2 Author: Jie Zhang drm/msm/a6xx: Fix IRQ storm during msm_recovery test This commit masks GPU error IRQs during the msm_recovery test to prevent an interrupt storm when a hang is triggered. It then attempts to re-enable the interrupts during GPU recovery. > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/= adreno/a5xx_gpu.c > index 2c0bbac43c52b..f1df2514c6132 100644 > --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c [ ... ] > @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu) > status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR); > =20 > if (priv->disable_err_irq) { > + /* Turn off interrupts to avoid interrupt storm */ > + gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK, > + A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | > + A5XX_RBBM_INT_0_MASK_CP_SW); > + > status &=3D A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | > A5XX_RBBM_INT_0_MASK_CP_SW; > } > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/= adreno/a6xx_gpu.c > index 8b3bb2fd433ba..9a4f9d0e17800 100644 > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu) > =20 > gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status); > =20 > - if (priv->disable_err_irq) > + if (priv->disable_err_irq) { > + /* Turn off interrupts to avoid interrupt storm */ > + gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE= _FLUSH_TS); [Severity: High] If priv->disable_err_irq is managed via debugfs_create_bool(), setting it o= nly updates the variable in memory. If a user sets it to true, we mask the hardware here. However, if they subsequently set it to false, there is no callback or mechanism to restore = the hardware mask back to its default state. Does this leave hardware error detection (like AHB errors and faults) permanently crippled until a full GPU reset occurs? [Severity: High] Also, writing this mask drops the CP_SW interrupt entirely. Both a6xx (which handles A7xx) and a8xx rely on CP_SW for preemption comple= tion events. Physically disabling it in hardware means subsequent preemptions wi= ll never be signaled to the CPU as complete. Will dropping it cause a spurious GPU hang during the test window, instead of correctly preserving both CP_CACHE_FLUSH_TS and CP_SW like a5xx_irq() do= es? > status &=3D A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS; > + } > =20 > if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT) > a6xx_fault_detect_irq(gpu); [ ... ] > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c > index 9ac7740a87f01..48ac51f4119be 100644 > --- a/drivers/gpu/drm/msm/msm_gpu.c > +++ b/drivers/gpu/drm/msm/msm_gpu.c > @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work) > msm_update_fence(ring->fctx, fence); > } > =20 > + priv->disable_err_irq =3D false; [Severity: Medium] The disable_err_irq flag is exposed via debugfs for userspace testing tools (like IGT msm_recovery) to intentionally bypass hardware error interrupts a= nd validate the software hangcheck fallback. By forcibly clearing it here upon the first recovery, the kernel silently overrides the user's explicit configuration. If a test expects this flag to remain active across multiple subtests or submissions, won't subsequent hangs unexpectedly fall back to hardware error detection? Since hardware interrupts are natively re-enabled by the hardware reset dur= ing recovery, should this debugfs flag remain strictly under user control inste= ad of being unilaterally cleared? > + > gpu->funcs->recover(gpu); > =20 > /* retire completed submits, plus the one that hung: */ --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260605-assorted-f= ixes-june-v1-0-2caa04f7287c@oss.qualcomm.com?part=3D5