Re: [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Akhil P Oommen" <akhilpo@oss.qualcomm.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test
Date: Thu, 04 Jun 2026 20:25:21 +0000	[thread overview]
Message-ID: <20260604202522.4CDDF1F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260605-assorted-fixes-june-v1-5-2caa04f7287c@oss.qualcomm.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Toggling `disable_err_irq` back to false via debugfs leaves hardware error interrupts permanently disabled until a GPU reset.
- [High] Writing a restrictive interrupt mask in `a6xx_irq` and `a8xx_irq` drops the `CP_SW` interrupt, actively breaking preemption.
- [Medium] The `recover_worker` function unilaterally clears the user-controlled `disable_err_irq` debugfs boolean during GPU recovery.
--

commit c227655841882ac95655f120b99ea1eb3c758dc2
Author: Jie Zhang <jie.zhang@oss.qualcomm.com>

drm/msm/a6xx: Fix IRQ storm during msm_recovery test

This commit masks GPU error IRQs during the msm_recovery test to prevent
an interrupt storm when a hang is triggered. It then attempts to re-enable
the interrupts during GPU recovery.

> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index 2c0bbac43c52b..f1df2514c6132 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
[ ... ]
> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
>  		status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
>  
>  	if (priv->disable_err_irq) {
> +		/* Turn off interrupts to avoid interrupt storm */
> +		gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
> +			       A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
> +			       A5XX_RBBM_INT_0_MASK_CP_SW);
> +
>  		status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>  			  A5XX_RBBM_INT_0_MASK_CP_SW;
>  	}
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 8b3bb2fd433ba..9a4f9d0e17800 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
>  
>  	gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
>  
> -	if (priv->disable_err_irq)
> +	if (priv->disable_err_irq) {
> +		/* Turn off interrupts to avoid interrupt storm */
> +		gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);

[Severity: High]
If priv->disable_err_irq is managed via debugfs_create_bool(), setting it only
updates the variable in memory.

If a user sets it to true, we mask the hardware here. However, if they
subsequently set it to false, there is no callback or mechanism to restore the
hardware mask back to its default state.

Does this leave hardware error detection (like AHB errors and faults)
permanently crippled until a full GPU reset occurs?

[Severity: High]
Also, writing this mask drops the CP_SW interrupt entirely.

Both a6xx (which handles A7xx) and a8xx rely on CP_SW for preemption completion
events. Physically disabling it in hardware means subsequent preemptions will
never be signaled to the CPU as complete.

Will dropping it cause a spurious GPU hang during the test window, instead
of correctly preserving both CP_CACHE_FLUSH_TS and CP_SW like a5xx_irq() does?

>  		status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
> +	}
>  
>  	if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>  		a6xx_fault_detect_irq(gpu);
[ ... ]
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 9ac7740a87f01..48ac51f4119be 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
>  		msm_update_fence(ring->fctx, fence);
>  	}
>  
> +	priv->disable_err_irq = false;

[Severity: Medium]
The disable_err_irq flag is exposed via debugfs for userspace testing tools
(like IGT msm_recovery) to intentionally bypass hardware error interrupts and
validate the software hangcheck fallback.

By forcibly clearing it here upon the first recovery, the kernel silently
overrides the user's explicit configuration.

If a test expects this flag to remain active across multiple subtests or
submissions, won't subsequent hangs unexpectedly fall back to hardware
error detection?

Since hardware interrupts are natively re-enabled by the hardware reset during
recovery, should this debugfs flag remain strictly under user control instead
of being unilaterally cleared?

> +
>  	gpu->funcs->recover(gpu);
>  
>  	/* retire completed submits, plus the one that hung: */

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605-assorted-fixes-june-v1-0-2caa04f7287c@oss.qualcomm.com?part=5

next prev parent reply	other threads:[~2026-06-04 20:25 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
2026-06-04 20:21   ` sashiko-bot
2026-06-05 13:09   ` Neil Armstrong
2026-06-06 12:04   ` Dmitry Baryshkov
2026-06-04 20:08 ` [PATCH 2/6] drm/msm: Recover HW before retire hung submit Akhil P Oommen
2026-06-04 20:27   ` sashiko-bot
2026-06-04 20:08 ` [PATCH 3/6] drm/msm/a6xx: Fix A663 GPUCC register list for state capture Akhil P Oommen
2026-06-06 12:04   ` Dmitry Baryshkov
2026-06-04 20:08 ` [PATCH 4/6] drm/msm/a6xx: Fix A621 " Akhil P Oommen
2026-06-06 12:05   ` Dmitry Baryshkov
2026-06-04 20:08 ` [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test Akhil P Oommen
2026-06-04 20:25   ` sashiko-bot [this message]
2026-06-05  6:50   ` Rob Clark
2026-06-04 20:08 ` [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker Akhil P Oommen
2026-06-04 20:28   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260604202522.4CDDF1F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=akhilpo@oss.qualcomm.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.