[PATCH 0/6] drm/msm: Assorted fixes


* [PATCH 0/6] drm/msm: Assorted fixes - June/26
@ 2026-06-04 20:08 Akhil P Oommen
  2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Akhil P Oommen @ 2026-06-04 20:08 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, David Airlie,
	Simona Vetter, Puranam V G Tejaswi, Jie Zhang, Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Akhil P Oommen,
	Shivam Rawat, Jie Zhang

Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
---
Jie Zhang (5):
      drm/msm: Recover HW before retire hung submit
      drm/msm/a6xx: Fix A663 GPUCC register list for state capture
      drm/msm/a6xx: Fix A621 GPUCC register list for state capture
      drm/msm/a6xx: Fix IRQ storm during msm_recovery test
      drm/msm: Fix task_struct reference leak in recover_worker

Shivam Rawat (1):
      drm/msm/a6xx: Fix stale rpmh votes after suspend

 drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  5 +++++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c       |  5 ++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c       |  5 ++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  5 ++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.h | 12 ++++++++++++
 drivers/gpu/drm/msm/adreno/a8xx_gpu.c       |  5 ++++-
 drivers/gpu/drm/msm/msm_gpu.c               |  8 ++++++--
 7 files changed, 39 insertions(+), 6 deletions(-)
---
base-commit: ef8274b9c19a4b614e10ce95553d0d363dc1c1f8
change-id: 20260605-assorted-fixes-june-61940864a97f

Best regards,
-- 
Akhil P Oommen <akhilpo@oss.qualcomm.com>



* [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend
  2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
@ 2026-06-04 20:08 ` Akhil P Oommen
  2026-06-04 20:21   ` sashiko-bot
  2026-06-05 13:09   ` Neil Armstrong
  2026-06-04 20:08 ` [PATCH 2/6] drm/msm: Recover HW before retire hung submit Akhil P Oommen
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Akhil P Oommen @ 2026-06-04 20:08 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, David Airlie,
	Simona Vetter, Puranam V G Tejaswi, Jie Zhang, Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Akhil P Oommen,
	Shivam Rawat

From: Shivam Rawat <shivrawa@qti.qualcomm.com>

Stale RPMh votes (BCM votes) are observed after GMU suspend. This is
because the RPMh stop sequence is skipped during GMU suspend: the
early-return check in a6xx_rpmh_stop() is inverted. Fix the check and
also move the GMU to reset state to avoid any further activity.

Fixes: f248d5d5159a ("drm/msm/a6xx: Fix PDC sleep sequence")
Signed-off-by: Shivam Rawat <shivrawa@qti.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 2e5d7b53a0c3..a2f6918c4f7f 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -642,7 +642,7 @@ static void a6xx_rpmh_stop(struct a6xx_gmu *gmu)
 	int ret;
 	u32 val;
 
-	if (test_and_clear_bit(GMU_STATUS_FW_START, &gmu->status))
+	if (!test_and_clear_bit(GMU_STATUS_FW_START, &gmu->status))
 		return;
 
 	if (adreno_is_a840(adreno_gpu))
@@ -1465,6 +1465,9 @@ static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
 	/* Stop the interrupts and mask the hardware */
 	a6xx_gmu_irq_disable(gmu);
 
+	/* Halt the gmu cm3 core */
+	gmu_write(gmu, REG_A6XX_GMU_CM3_SYSRESET, 1);
+
 	/* Tell RPMh to power off the GPU */
 	a6xx_rpmh_stop(gmu);
 

-- 
2.51.0
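
The effect of the inverted check in a6xx_rpmh_stop() can be sketched in
userspace. This is only an illustration under stated assumptions:
fake_test_and_clear_bit() is a hypothetical, non-atomic stand-in for the
kernel's test_and_clear_bit(), the bit index is made up, and the two
rpmh_stop_*() helpers merely report whether the stop sequence would run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit index, for illustration only; not the driver's value. */
#define FAKE_GMU_STATUS_FW_START 0

/*
 * Non-atomic stand-in for the kernel's test_and_clear_bit():
 * returns the previous value of the bit and clears it.
 */
static bool fake_test_and_clear_bit(unsigned int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << nr;
	bool was_set = (*addr & mask) != 0;

	*addr &= ~mask;
	return was_set;
}

/* Old check: returns early exactly when the firmware had been started. */
static bool rpmh_stop_old(unsigned long *status)
{
	if (fake_test_and_clear_bit(FAKE_GMU_STATUS_FW_START, status))
		return false;	/* stop sequence skipped */
	return true;		/* stop sequence would run */
}

/* Patched check: returns early only when the firmware was never started. */
static bool rpmh_stop_new(unsigned long *status)
{
	if (!fake_test_and_clear_bit(FAKE_GMU_STATUS_FW_START, status))
		return false;	/* nothing to stop */
	return true;		/* stop sequence would run */
}
```

With the bit set, rpmh_stop_old() reports that the sequence is skipped while
rpmh_stop_new() reports that it runs, which matches the stale-vote symptom
described in the commit message.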



* [PATCH 2/6] drm/msm: Recover HW before retire hung submit
  2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
  2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
@ 2026-06-04 20:08 ` Akhil P Oommen
  2026-06-04 20:27   ` sashiko-bot
  2026-06-04 20:08 ` [PATCH 3/6] drm/msm/a6xx: Fix A663 GPUCC register list for state capture Akhil P Oommen
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Akhil P Oommen @ 2026-06-04 20:08 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, David Airlie,
	Simona Vetter, Puranam V G Tejaswi, Jie Zhang, Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Akhil P Oommen,
	Jie Zhang

From: Jie Zhang <jie.zhang@oss.qualcomm.com>

During recovery, it is not safe to retire the hung submit before we
recover the GPU. Retiring the submit triggers BO free and that can
result in GPU pagefaults since the GPU may be actively accessing those
BOs.

To fix this, retire the submits after gpu recovery is complete in
recover_worker().

Fixes: 1a370be9ac51 ("drm/msm: restart queued submits after hang")
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
---
 drivers/gpu/drm/msm/msm_gpu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 18ed00e5f143..9ac7740a87f0 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -552,11 +552,11 @@ static void recover_worker(struct kthread_work *work)
 		msm_update_fence(ring->fctx, fence);
 	}
 
+	gpu->funcs->recover(gpu);
+
 	/* retire completed submits, plus the one that hung: */
 	retire_submits(gpu);
 
-	gpu->funcs->recover(gpu);
-
 	/*
 	 * Replay all remaining submits starting with highest priority
 	 * ring

-- 
2.51.0



* [PATCH 3/6] drm/msm/a6xx: Fix A663 GPUCC register list for state capture
  2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
  2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
  2026-06-04 20:08 ` [PATCH 2/6] drm/msm: Recover HW before retire hung submit Akhil P Oommen
@ 2026-06-04 20:08 ` Akhil P Oommen
  2026-06-04 20:08 ` [PATCH 4/6] drm/msm/a6xx: Fix A621 " Akhil P Oommen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Akhil P Oommen @ 2026-06-04 20:08 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, David Airlie,
	Simona Vetter, Puranam V G Tejaswi, Jie Zhang, Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Akhil P Oommen,
	Jie Zhang

From: Jie Zhang <jie.zhang@oss.qualcomm.com>

The GPUCC register list for A663 is incorrect, which can cause
out-of-bounds register access during GPU state capture.

Update it to use the correct register ranges.

Fixes: 5773cce8615c ("drm/msm/a6xx: Add support for A663")
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
---
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
index 166365359fa6..2a62a22077f9 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
@@ -1244,7 +1244,9 @@ static void a6xx_get_gmu_registers(struct msm_gpu *gpu,
 	_a6xx_get_gmu_registers(gpu, a6xx_state, &a6xx_gmu_reglist[1],
 		&a6xx_state->gmu_registers[1], true);
 
-	if (adreno_is_a621(adreno_gpu) || adreno_is_a623(adreno_gpu))
+	if (adreno_is_a621(adreno_gpu) ||
+	    adreno_is_a623(adreno_gpu) ||
+	    adreno_is_a663(adreno_gpu))
 		_a6xx_get_gmu_registers(gpu, a6xx_state, &a621_gpucc_reg,
 			&a6xx_state->gmu_registers[2], false);
 	else

-- 
2.51.0



* [PATCH 4/6] drm/msm/a6xx: Fix A621 GPUCC register list for state capture
  2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
                   ` (2 preceding siblings ...)
  2026-06-04 20:08 ` [PATCH 3/6] drm/msm/a6xx: Fix A663 GPUCC register list for state capture Akhil P Oommen
@ 2026-06-04 20:08 ` Akhil P Oommen
  2026-06-04 20:08 ` [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test Akhil P Oommen
  2026-06-04 20:08 ` [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker Akhil P Oommen
  5 siblings, 0 replies; 13+ messages in thread
From: Akhil P Oommen @ 2026-06-04 20:08 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, David Airlie,
	Simona Vetter, Puranam V G Tejaswi, Jie Zhang, Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Akhil P Oommen,
	Jie Zhang

From: Jie Zhang <jie.zhang@oss.qualcomm.com>

A621 uses an incorrect GPUCC register list during state capture.

The existing list matches A623/A663. Rename it accordingly and add a
dedicated A621 GPUCC register list.

Fixes: 11cdb81b3c1b ("drm/msm/a6xx: Fix gpucc register block for A621")
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
---
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  7 ++++---
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.h | 12 ++++++++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
index 2a62a22077f9..3ea8ff8c7404 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
@@ -1244,11 +1244,12 @@ static void a6xx_get_gmu_registers(struct msm_gpu *gpu,
 	_a6xx_get_gmu_registers(gpu, a6xx_state, &a6xx_gmu_reglist[1],
 		&a6xx_state->gmu_registers[1], true);
 
-	if (adreno_is_a621(adreno_gpu) ||
-	    adreno_is_a623(adreno_gpu) ||
-	    adreno_is_a663(adreno_gpu))
+	if (adreno_is_a621(adreno_gpu))
 		_a6xx_get_gmu_registers(gpu, a6xx_state, &a621_gpucc_reg,
 			&a6xx_state->gmu_registers[2], false);
+	else if (adreno_is_a623(adreno_gpu) || adreno_is_a663(adreno_gpu))
+		_a6xx_get_gmu_registers(gpu, a6xx_state, &a623_gpucc_reg,
+			&a6xx_state->gmu_registers[2], false);
 	else
 		_a6xx_get_gmu_registers(gpu, a6xx_state, &a6xx_gpucc_reg,
 			&a6xx_state->gmu_registers[2], false);
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.h b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.h
index b49d8427b59e..0a13a65f89ac 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.h
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.h
@@ -377,6 +377,17 @@ static const u32 a6xx_gmu_gpucc_registers[] = {
 };
 
 static const u32 a621_gmu_gpucc_registers[] = {
+	/* GPU CC */
+	0x24000, 0x2400e, 0x24400, 0x2440e, 0x24800, 0x24805, 0x24c00, 0x24cff,
+	0x25800, 0x25804, 0x25c00, 0x25c04, 0x26000, 0x26004, 0x26400, 0x26405,
+	0x26414, 0x2641d, 0x2642a, 0x26430, 0x26432, 0x26432, 0x26441, 0x26455,
+	0x26466, 0x26468, 0x26478, 0x2647a, 0x26489, 0x2648a, 0x2649c, 0x2649e,
+	0x264a0, 0x264a3, 0x264b3, 0x264b5, 0x264c5, 0x264c7, 0x264d6, 0x264d8,
+	0x264e8, 0x264e9, 0x264f9, 0x264fc, 0x2650b, 0x2650c, 0x2651c, 0x2651e,
+	0x26540, 0x26570, 0x26600, 0x26616, 0x26620, 0x2662d,
+};
+
+static const u32 a623_gmu_gpucc_registers[] = {
 	/* GPU CC */
 	0x24000, 0x2400e, 0x24400, 0x2440e, 0x25800, 0x25804, 0x25c00, 0x25c04,
 	0x26000, 0x26004, 0x26400, 0x26405, 0x26414, 0x2641d, 0x2642a, 0x26430,
@@ -402,6 +413,7 @@ static const struct a6xx_registers a6xx_gmu_reglist[] = {
 
 static const struct a6xx_registers a6xx_gpucc_reg = REGS(a6xx_gmu_gpucc_registers, 0, 0);
 static const struct a6xx_registers a621_gpucc_reg = REGS(a621_gmu_gpucc_registers, 0, 0);
+static const struct a6xx_registers a623_gpucc_reg = REGS(a623_gmu_gpucc_registers, 0, 0);
 
 static u32 a6xx_get_cp_roq_size(struct msm_gpu *gpu);
 static u32 a7xx_get_cp_roq_size(struct msm_gpu *gpu);

-- 
2.51.0
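
As a reading aid for the register tables above, the sketch below counts how
many registers a list covers. It assumes the arrays are consumed as
inclusive (start, end) pairs, which is how the state-capture code appears to
treat them; count_regs() and sample_ranges are hypothetical names, and the
sample holds only the first three pairs of the new A621 list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First three (start, end) pairs copied from the A621 list in the patch. */
static const uint32_t sample_ranges[] = {
	0x24000, 0x2400e, 0x24400, 0x2440e, 0x24800, 0x24805,
};

/*
 * Sum the sizes of inclusive (start, end) register ranges.
 * Assumption: entries come in pairs and end >= start for every pair.
 */
static unsigned int count_regs(const uint32_t *pairs, size_t n_values)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i + 1 < n_values; i += 2)
		total += pairs[i + 1] - pairs[i] + 1;

	return total;
}
```

A wrong end value in any pair changes this count and widens the window that
state capture reads, which is the kind of out-of-range access the
register-list fixes in this series are meant to avoid.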



* [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test
  2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
                   ` (3 preceding siblings ...)
  2026-06-04 20:08 ` [PATCH 4/6] drm/msm/a6xx: Fix A621 " Akhil P Oommen
@ 2026-06-04 20:08 ` Akhil P Oommen
  2026-06-04 20:25   ` sashiko-bot
  2026-06-05  6:50   ` Rob Clark
  2026-06-04 20:08 ` [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker Akhil P Oommen
  5 siblings, 2 replies; 13+ messages in thread
From: Akhil P Oommen @ 2026-06-04 20:08 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, David Airlie,
	Simona Vetter, Puranam V G Tejaswi, Jie Zhang, Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Akhil P Oommen,
	Jie Zhang

From: Jie Zhang <jie.zhang@oss.qualcomm.com>

Once a hang is triggered by the msm_recovery test, the gpu error irq
remains asserted and triggers an interrupt storm. In the worst case,
this IRQ storm lands on the CPU core where the hangcheck timer is
scheduled, blocking it from running. This eventually leads to CPU
watchdog timeouts.

To fix this, mask the GPU error IRQs during the msm_recovery test and
re-enable them during recovery.

Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling")
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
---
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++-
 drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++-
 drivers/gpu/drm/msm/msm_gpu.c         | 2 ++
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index 2c0bbac43c52..f1df2514c613 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
 		status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
 
 	if (priv->disable_err_irq) {
+		/* Turn off interrupts to avoid interrupt storm */
+		gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
+			       A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
+			       A5XX_RBBM_INT_0_MASK_CP_SW);
+
 		status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
 			  A5XX_RBBM_INT_0_MASK_CP_SW;
 	}
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 8b3bb2fd433b..9a4f9d0e1780 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
 
 	gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
 
-	if (priv->disable_err_irq)
+	if (priv->disable_err_irq) {
+		/* Turn off interrupts to avoid interrupt storm */
+		gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
 		status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
+	}
 
 	if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
 		a6xx_fault_detect_irq(gpu);
diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
index 9e44fd1ae634..0f6fd35bd587 100644
--- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
@@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu)
 
 	gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status);
 
-	if (priv->disable_err_irq)
+	if (priv->disable_err_irq) {
+		/* Turn off interrupts to avoid interrupt storm */
+		gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
 		status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
+	}
 
 	if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
 		a8xx_fault_detect_irq(gpu);
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 9ac7740a87f0..48ac51f4119b 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
 		msm_update_fence(ring->fctx, fence);
 	}
 
+	priv->disable_err_irq = false;
+
 	gpu->funcs->recover(gpu);
 
 	/* retire completed submits, plus the one that hung: */

-- 
2.51.0



* [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker
  2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
                   ` (4 preceding siblings ...)
  2026-06-04 20:08 ` [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test Akhil P Oommen
@ 2026-06-04 20:08 ` Akhil P Oommen
  2026-06-04 20:28   ` sashiko-bot
  5 siblings, 1 reply; 13+ messages in thread
From: Akhil P Oommen @ 2026-06-04 20:08 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, David Airlie,
	Simona Vetter, Puranam V G Tejaswi, Jie Zhang, Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Akhil P Oommen,
	Jie Zhang

From: Jie Zhang <jie.zhang@oss.qualcomm.com>

get_pid_task() increments the task reference count, but the
corresponding put_task_struct() was missing in the else branch,
leaking a reference on every GPU hang recovery.

Fixes: 25654a1756a4 ("drm/msm: Update global fault counter when faulty process has already ended")
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
---
 drivers/gpu/drm/msm/msm_gpu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 48ac51f4119b..03c057856065 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -505,6 +505,8 @@ static void recover_worker(struct kthread_work *work)
 		 */
 		if (!vm->managed)
 			msm_gem_vm_unusable(submit->vm);
+
+		put_task_struct(task);
 	}
 
 	noreclaim_flag = memalloc_noreclaim_save();

-- 
2.51.0
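
The leak pattern described in the commit message can be modelled with a
plain counter. This is a userspace sketch, not driver code: struct
fake_task, fake_get_task() and fake_put_task() are hypothetical stand-ins
for task_struct, get_pid_task() and put_task_struct().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a refcounted task; not the kernel's task_struct. */
struct fake_task {
	int usage;
};

/* Models get_pid_task(): returns the task with one extra reference, or NULL. */
static struct fake_task *fake_get_task(struct fake_task *t)
{
	if (t)
		t->usage++;
	return t;
}

/* Models put_task_struct(): drops one reference. */
static void fake_put_task(struct fake_task *t)
{
	t->usage--;
}

/*
 * One pass through a lookup. Returns how many references are still held
 * afterwards; anything above zero is leaked on every call.
 */
static int recover_once(struct fake_task *t, int put_in_branch)
{
	int before = t->usage;
	struct fake_task *task = fake_get_task(t);

	if (task && put_in_branch)
		fake_put_task(task);

	return t->usage - before;
}
```

Calling recover_once() without the put leaves one reference behind per call,
mirroring the one leaked reference per hang recovery that the patch removes.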



* Re: [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend
  2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
@ 2026-06-04 20:21   ` sashiko-bot
  2026-06-05 13:09   ` Neil Armstrong
  1 sibling, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-06-04 20:21 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] PDC sleep sequence is triggered even when GMU firmware initialization fails due to unconditional status flag setting.
--

commit 82656c1e5d01b9521acaeb8de8ad4bf1a538f52b
Author: Shivam Rawat <shivrawa@qti.qualcomm.com>

drm/msm/a6xx: Fix stale rpmh votes after suspend

This commit addresses stale RPMH (BCM) votes observed after GMU suspend by
ensuring RPMH stop sequences are not skipped. It also puts the GMU into a
reset state to prevent further activity.

> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> index 2e5d7b53a0c38..a2f6918c4f7f2 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> @@ -642,7 +642,7 @@ static void a6xx_rpmh_stop(struct a6xx_gmu *gmu)
>  	int ret;
>  	u32 val;
>  
> -	if (test_and_clear_bit(GMU_STATUS_FW_START, &gmu->status))
> +	if (!test_and_clear_bit(GMU_STATUS_FW_START, &gmu->status))
>  		return;

[Severity: High]
This isn't a bug introduced by this patch, but this logic change appears to
expose a pre-existing issue in a6xx_gmu_start().

If a6xx_gmu_start() times out waiting for the firmware, it logs an error but
still sets the status bit unconditionally:

drivers/gpu/drm/msm/adreno/a6xx_gmu.c:a6xx_gmu_start() {
	...
	if (ret)
		DRM_DEV_ERROR(gmu->dev, "GMU firmware initialization timed out\n");

	set_bit(GMU_STATUS_FW_START, &gmu->status);
	...
}

When a6xx_gmu_start() returns this error, the caller jumps to its error
handling path and calls a6xx_rpmh_stop() to clean up.

Now that a6xx_rpmh_stop() correctly proceeds when the bit is set, won't it
issue the RSCC sleep request even though the firmware failed to initialize?

Could this risk hanging the PDC subsystem, since the sleep sequence requires
the GMU firmware to have successfully initialized certain registers?

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605-assorted-fixes-june-v1-0-2caa04f7287c@oss.qualcomm.com?part=1


* Re: [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test
  2026-06-04 20:08 ` [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test Akhil P Oommen
@ 2026-06-04 20:25   ` sashiko-bot
  2026-06-05  6:50   ` Rob Clark
  1 sibling, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-06-04 20:25 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Toggling `disable_err_irq` back to false via debugfs leaves hardware error interrupts permanently disabled until a GPU reset.
- [High] Writing a restrictive interrupt mask in `a6xx_irq` and `a8xx_irq` drops the `CP_SW` interrupt, actively breaking preemption.
- [Medium] The `recover_worker` function unilaterally clears the user-controlled `disable_err_irq` debugfs boolean during GPU recovery.
--

commit c227655841882ac95655f120b99ea1eb3c758dc2
Author: Jie Zhang <jie.zhang@oss.qualcomm.com>

drm/msm/a6xx: Fix IRQ storm during msm_recovery test

This commit masks GPU error IRQs during the msm_recovery test to prevent
an interrupt storm when a hang is triggered. It then attempts to re-enable
the interrupts during GPU recovery.

> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index 2c0bbac43c52b..f1df2514c6132 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
[ ... ]
> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
>  		status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
>  
>  	if (priv->disable_err_irq) {
> +		/* Turn off interrupts to avoid interrupt storm */
> +		gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
> +			       A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
> +			       A5XX_RBBM_INT_0_MASK_CP_SW);
> +
>  		status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>  			  A5XX_RBBM_INT_0_MASK_CP_SW;
>  	}
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 8b3bb2fd433ba..9a4f9d0e17800 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
>  
>  	gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
>  
> -	if (priv->disable_err_irq)
> +	if (priv->disable_err_irq) {
> +		/* Turn off interrupts to avoid interrupt storm */
> +		gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);

[Severity: High]
If priv->disable_err_irq is managed via debugfs_create_bool(), setting it only
updates the variable in memory.

If a user sets it to true, we mask the hardware here. However, if they
subsequently set it to false, there is no callback or mechanism to restore the
hardware mask back to its default state.

Does this leave hardware error detection (like AHB errors and faults)
permanently crippled until a full GPU reset occurs?

[Severity: High]
Also, writing this mask drops the CP_SW interrupt entirely.

Both a6xx (which handles A7xx) and a8xx rely on CP_SW for preemption completion
events. Physically disabling it in hardware means subsequent preemptions will
never be signaled to the CPU as complete.

Will dropping it cause a spurious GPU hang during the test window, instead
of correctly preserving both CP_CACHE_FLUSH_TS and CP_SW like a5xx_irq() does?
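
The difference between the two mask choices can be illustrated with plain
bit arithmetic. The bit positions below are hypothetical and chosen only for
the example; they are not the hardware's register layout, and the helper
names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define FAKE_INT_CP_SW             (1u << 0)
#define FAKE_INT_CP_CACHE_FLUSH_TS (1u << 1)
#define FAKE_INT_HANG_DETECT       (1u << 2)

/* Mask as written in the a6xx/a8xx hunks: only the timestamp interrupt. */
static uint32_t mask_a6xx_style(void)
{
	return FAKE_INT_CP_CACHE_FLUSH_TS;
}

/* Mask as written in the a5xx hunk: timestamp plus CP_SW. */
static uint32_t mask_a5xx_style(void)
{
	return FAKE_INT_CP_CACHE_FLUSH_TS | FAKE_INT_CP_SW;
}

/* An interrupt source reaches the CPU only if its bit is set in the mask. */
static int irq_delivered(uint32_t mask, uint32_t source)
{
	return (mask & source) != 0;
}
```

Under these assumptions both masks suppress the error source, but only the
a5xx-style mask still lets CP_SW through.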

>  		status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
> +	}
>  
>  	if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>  		a6xx_fault_detect_irq(gpu);
[ ... ]
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 9ac7740a87f01..48ac51f4119be 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
>  		msm_update_fence(ring->fctx, fence);
>  	}
>  
> +	priv->disable_err_irq = false;

[Severity: Medium]
The disable_err_irq flag is exposed via debugfs for userspace testing tools
(like IGT msm_recovery) to intentionally bypass hardware error interrupts and
validate the software hangcheck fallback.

By forcibly clearing it here upon the first recovery, the kernel silently
overrides the user's explicit configuration.

If a test expects this flag to remain active across multiple subtests or
submissions, won't subsequent hangs unexpectedly fall back to hardware
error detection?

Since hardware interrupts are natively re-enabled by the hardware reset during
recovery, should this debugfs flag remain strictly under user control instead
of being unilaterally cleared?

> +
>  	gpu->funcs->recover(gpu);
>  
>  	/* retire completed submits, plus the one that hung: */

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605-assorted-fixes-june-v1-0-2caa04f7287c@oss.qualcomm.com?part=5


* Re: [PATCH 2/6] drm/msm: Recover HW before retire hung submit
  2026-06-04 20:08 ` [PATCH 2/6] drm/msm: Recover HW before retire hung submit Akhil P Oommen
@ 2026-06-04 20:27   ` sashiko-bot
  0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-06-04 20:27 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Early fence signaling allows concurrent unmapping of shared BOs before the hung GPU is stopped, reproducing the exact pagefaults the patch intended to fix.
- [High] Concurrent GPU progress during hang recovery causes the recovery logic to double-increment the fence and silently drop a subsequent, unexecuted submit.
--

commit 3c515db3794c4f1061da14c3adffcf64bbdd2fc5
Author: Jie Zhang <jie.zhang@oss.qualcomm.com>

drm/msm: Recover HW before retire hung submit

During recovery, it is not safe to retire the hung submit before we
recover the GPU. Retiring the submit triggers BO free and that can
result in GPU pagefaults since the GPU may be actively accessing those
BOs.

To fix this, retire the submits after gpu recovery is complete in
recover_worker().

> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 18ed00e5f143b..9ac7740a87f01 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -552,11 +552,11 @@ static void recover_worker(struct kthread_work *work)
>  		msm_update_fence(ring->fctx, fence);
>  	}

[Severity: High]
This isn't a bug introduced by this patch, but can concurrent GPU progress
during hang recovery cause the recovery logic to double-increment the fence
and silently drop a subsequent, unexecuted submit?

In recover_worker(), the hung submit is identified early using:

    submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);

Because the GPU is not halted during the time it takes to capture the crash
state, it may legitimately finish the hung submit and concurrently write
the next fence value to ring->memptrs->fence.

Later in the function (just above this hunk), the loop reads the updated
fence:

    uint32_t fence = ring->memptrs->fence;
    ...
    if (ring == cur_ring)
        ring->memptrs->fence = ++fence;

This explicitly increments the already-updated fence and writes the next
value back to shared memory. retire_submits() will then retire the
unexecuted submit corresponding to that next fence, skipping it entirely
while userspace receives a success signal.
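
The arithmetic behind this scenario can be sketched as follows. It is a
simplified single-ring model with hypothetical helper names, assuming that
seqnos are consecutive and that the only inputs are the fence value seen
when the hung submit is identified and the value seen later by the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Seqno of the submit treated as hung: the one after the last completed. */
static uint32_t hung_seqno(uint32_t fence_at_detect)
{
	return fence_at_detect + 1;
}

/* Value the loop writes back: it re-reads the fence, then pre-increments. */
static uint32_t fence_after_loop(uint32_t fence_at_loop)
{
	uint32_t fence = fence_at_loop;

	return ++fence;
}

/*
 * Number of submits beyond the hung one that end up marked complete
 * without having executed.
 */
static uint32_t skipped_submits(uint32_t fence_at_detect, uint32_t fence_at_loop)
{
	return fence_after_loop(fence_at_loop) - hung_seqno(fence_at_detect);
}
```

If the fence does not move between the two reads, nothing extra is retired.
If the GPU completes the supposedly hung submit in between, the model
reports one additional submit retired without running.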

[Severity: High]
This is a pre-existing issue, but does early fence signaling still allow
concurrent unmapping of shared BOs before the hung GPU is stopped,
reproducing the exact pagefaults this patch intended to fix?

While this patch moves retire_submits() to happen after
gpu->funcs->recover(), the call to msm_update_fence() remains before the
recovery.

Because msm_update_fence() advances the fence context's completed_fence to
include the hung submit, any concurrent thread calling
dma_fence_is_signaled() (such as userspace polling a sync_file) will
observe the updated completed_fence.

This causes dma_fence_is_signaled() to wake up all waiters and allows them
to unmap the shared memory from the GPU page tables while the GPU is still
actively executing.

Should the loop containing msm_update_fence() also be moved after
gpu->funcs->recover() to completely close this window?

>  
> +	gpu->funcs->recover(gpu);
> +
>  	/* retire completed submits, plus the one that hung: */
>  	retire_submits(gpu);
>  
> -	gpu->funcs->recover(gpu);
> -
>  	/*
>  	 * Replay all remaining submits starting with highest priority
>  	 * ring

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605-assorted-fixes-june-v1-0-2caa04f7287c@oss.qualcomm.com?part=2


* Re: [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker
  2026-06-04 20:08 ` [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker Akhil P Oommen
@ 2026-06-04 20:28   ` sashiko-bot
  0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-06-04 20:28 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Direct access to ctx->vm without lazy initialization via msm_context_vm() causes a NULL pointer dereference if ioctls are called before the VM is created.
--

commit 9d77c7114e9006cacfe1802a7ff8d83c9c1a6bac
Author: Jie Zhang <jie.zhang@oss.qualcomm.com>

drm/msm: Fix task_struct reference leak in recover_worker

This commit fixes a missing put_task_struct() in the else branch of
recover_worker. This resolves a task struct reference leak that occurred
on every GPU hang recovery.

> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 48ac51f4119be..03c0578560658 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
[ ... ]
> @@ -505,6 +505,8 @@ static void recover_worker(struct kthread_work *work)
>  		 */
>  		if (!vm->managed)
>  			msm_gem_vm_unusable(submit->vm);
> +
> +		put_task_struct(task);
>  	}
>  
>  	noreclaim_flag = memalloc_noreclaim_save();

[Severity: Critical]
This is a pre-existing issue, but while reviewing context VM usage in this
driver, I noticed that direct access to ctx->vm without lazy initialization
occurs in a few places.

In drivers/gpu/drm/msm/msm_gem_submit.c:msm_ioctl_gem_submit():

	if (to_msm_vm(ctx->vm)->unusable)
		return UERR(EPIPE, dev, "context is unusable");

If a userspace process opens the DRM device and immediately calls the
submit ioctl before the VM is lazily created, won't ctx->vm be NULL here?
Since base is the first element of struct msm_gem_vm, to_msm_vm(NULL)
evaluates to NULL, which would cause an immediate NULL pointer dereference
when accessing ->unusable.

Similarly, in drivers/gpu/drm/msm/msm_gem.c:msm_gem_new_handle():

		struct drm_gem_object *r_obj = drm_gpuvm_resv_obj(ctx->vm);

Could this also trigger a crash if ctx->vm is NULL, since
drm_gpuvm_resv_obj() expands to a direct dereference of its argument?
Should these paths be using msm_context_vm() to ensure the VM is
initialized first?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605-assorted-fixes-june-v1-0-2caa04f7287c@oss.qualcomm.com?part=6


* Re: [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test
  2026-06-04 20:08 ` [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test Akhil P Oommen
  2026-06-04 20:25   ` sashiko-bot
@ 2026-06-05  6:50   ` Rob Clark
  1 sibling, 0 replies; 13+ messages in thread
From: Rob Clark @ 2026-06-05  6:50 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Sean Paul, Konrad Dybcio, Dmitry Baryshkov, Abhinav Kumar,
	Jessica Zhang, Marijn Suijten, David Airlie, Simona Vetter,
	Puranam V G Tejaswi, Jie Zhang, Maíra Canal, linux-arm-msm,
	dri-devel, freedreno, linux-kernel, Jie Zhang

On Thu, Jun 4, 2026 at 1:10 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>
> From: Jie Zhang <jie.zhang@oss.qualcomm.com>
>
> Once a hang is triggered by the msm_recovery test, the gpu error irq
> remains asserted and triggers an interrupt storm. In the worst case,
> this IRQ storm lands on the CPU core where the hangcheck timer is
> scheduled, blocking it from running. This eventually leads to CPU
> watchdog timeouts.
>
> To fix this, mask the gpu error irqs during msm_recovery test and
> enable them back during the recovery.
>
> Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling")
> Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
> Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
> ---
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++-
>  drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++-
>  drivers/gpu/drm/msm/msm_gpu.c         | 2 ++
>  4 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index 2c0bbac43c52..f1df2514c613 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
>                 status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
>
>         if (priv->disable_err_irq) {
> +               /* Turn off interrupts to avoid interrupt storm */
> +               gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
> +                              A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
> +                              A5XX_RBBM_INT_0_MASK_CP_SW);
> +
>                 status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>                           A5XX_RBBM_INT_0_MASK_CP_SW;
>         }
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 8b3bb2fd433b..9a4f9d0e1780 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
>
>         gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
>
> -       if (priv->disable_err_irq)
> +       if (priv->disable_err_irq) {
> +               /* Turn off interrupts to avoid interrupt storm */
> +               gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
> +       }
>
>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>                 a6xx_fault_detect_irq(gpu);
> diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
> index 9e44fd1ae634..0f6fd35bd587 100644
> --- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
> @@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu)
>
>         gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status);
>
> -       if (priv->disable_err_irq)
> +       if (priv->disable_err_irq) {
> +               /* Turn off interrupts to avoid interrupt storm */
> +               gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
> +       }
>
>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>                 a8xx_fault_detect_irq(gpu);
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 9ac7740a87f0..48ac51f4119b 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
>                 msm_update_fence(ring->fctx, fence);
>         }
>
> +       priv->disable_err_irq = false;

Ok, so we rely on recovery to re-enable the error irqs..  that is
probably ok, given the intended purpose of the debugfs file.  And,
well, it is debugfs.  But why do we clear disable_err_irq here?

BR,
-R

> +
>         gpu->funcs->recover(gpu);
>
>         /* retire completed submits, plus the one that hung: */
>
> --
> 2.51.0
>
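
The masking approach in the quoted patch can be sketched in user space as
follows. This is an illustration under stated assumptions, not the driver's
code: the bit layout, struct fake_gpu and the fake_* helpers are all
hypothetical, and the error source is modelled as a status bit that can
reassert after being acknowledged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout; the real register definitions differ. */
#define FAKE_INT_CACHE_FLUSH_TS	(1u << 0)
#define FAKE_INT_HANG_DETECT	(1u << 1)
#define FAKE_INT_DEFAULT_MASK	(FAKE_INT_CACHE_FLUSH_TS | FAKE_INT_HANG_DETECT)

struct fake_gpu {
	uint32_t int_mask;	/* models the interrupt mask register */
	uint32_t int_status;	/* models the latched interrupt status */
	bool disable_err_irq;	/* models the debugfs knob */
};

/* Bits that would actually raise the interrupt line. */
static uint32_t fake_pending(const struct fake_gpu *gpu)
{
	return gpu->int_status & gpu->int_mask;
}

/* Handler: returns the status bits it chose to act on. */
static uint32_t fake_irq(struct fake_gpu *gpu)
{
	uint32_t status;

	assert(gpu != NULL);
	status = fake_pending(gpu);

	/* Acknowledge; a still-asserted error source may set the bit again. */
	gpu->int_status &= ~status;

	if (gpu->disable_err_irq) {
		/* Narrow the mask so a reasserted error cannot fire again. */
		gpu->int_mask = FAKE_INT_CACHE_FLUSH_TS;
		status &= FAKE_INT_CACHE_FLUSH_TS;
	}
	return status;
}

/* Recovery: restore the default mask, as a hardware re-init would. */
static void fake_recover(struct fake_gpu *gpu)
{
	assert(gpu != NULL);
	gpu->int_status = 0;
	gpu->int_mask = FAKE_INT_DEFAULT_MASK;
}
```

The sketch leaves disable_err_irq untouched in fake_recover(), since
whether recovery should clear it is the open question raised in this reply.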


* Re: [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend
  2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
  2026-06-04 20:21   ` sashiko-bot
@ 2026-06-05 13:09   ` Neil Armstrong
  1 sibling, 0 replies; 13+ messages in thread
From: Neil Armstrong @ 2026-06-05 13:09 UTC (permalink / raw)
  To: Akhil P Oommen, Rob Clark, Sean Paul, Konrad Dybcio,
	Dmitry Baryshkov, Abhinav Kumar, Jessica Zhang, Marijn Suijten,
	David Airlie, Simona Vetter, Puranam V G Tejaswi, Jie Zhang,
	Maíra Canal
  Cc: linux-arm-msm, dri-devel, freedreno, linux-kernel, Shivam Rawat

On 6/4/26 22:08, Akhil P Oommen wrote:
> From: Shivam Rawat <shivrawa@qti.qualcomm.com>
> 
> There are stale RPMH votes (BCM votes) observed after GMU suspend. This
> is because the rpmh stop sequences are skipped during gmu suspend. Fix
> this and also move GMU to reset state to avoid any further activity.
> 
> Fixes: f248d5d5159a ("drm/msm/a6xx: Fix PDC sleep sequence")
> Signed-off-by: Shivam Rawat <shivrawa@qti.qualcomm.com>
> Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
> ---
>   drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> index 2e5d7b53a0c3..a2f6918c4f7f 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> @@ -642,7 +642,7 @@ static void a6xx_rpmh_stop(struct a6xx_gmu *gmu)
>   	int ret;
>   	u32 val;
>   
> -	if (test_and_clear_bit(GMU_STATUS_FW_START, &gmu->status))
> +	if (!test_and_clear_bit(GMU_STATUS_FW_START, &gmu->status))
>   		return;
>   
>   	if (adreno_is_a840(adreno_gpu))
> @@ -1465,6 +1465,9 @@ static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
>   	/* Stop the interrupts and mask the hardware */
>   	a6xx_gmu_irq_disable(gmu);
>   
> +	/* Halt the gmu cm3 core */
> +	gmu_write(gmu, REG_A6XX_GMU_CM3_SYSRESET, 1);
> +
>   	/* Tell RPMh to power off the GPU */
>   	a6xx_rpmh_stop(gmu);
>   
> 

Thanks !

Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-HDK
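
For readers following the one-character change in a6xx_rpmh_stop(), here is
a user-space sketch of why the missing negation skipped the stop sequences.
The names (fake_gmu, FAKE_STATUS_FW_START, fake_test_and_clear_bit,
fake_rpmh_stop_old, fake_rpmh_stop_new) are hypothetical, and the helper is
only a C11-atomics analogue of the kernel's test_and_clear_bit(), which
clears the bit and returns its previous value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_STATUS_FW_START	0	/* hypothetical bit number */

struct fake_gmu {
	atomic_ulong status;
	int stop_sequences_run;
};

/* User-space analogue: clear the bit and report whether it was set. */
static bool fake_test_and_clear_bit(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_and(addr, ~mask) & mask) != 0;
}

/* Pre-patch shape: bails out exactly when the firmware had started. */
static void fake_rpmh_stop_old(struct fake_gmu *gmu)
{
	assert(gmu != NULL);
	if (fake_test_and_clear_bit(FAKE_STATUS_FW_START, &gmu->status))
		return;
	gmu->stop_sequences_run++;
}

/* Patched shape: bails out only when the firmware never started. */
static void fake_rpmh_stop_new(struct fake_gmu *gmu)
{
	assert(gmu != NULL);
	if (!fake_test_and_clear_bit(FAKE_STATUS_FW_START, &gmu->status))
		return;
	gmu->stop_sequences_run++;
}
```

In the old shape the early return is taken in the one case where the stop
sequences are needed, which is consistent with the stale votes described in
the commit message.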



Thread overview: 13+ messages
2026-06-04 20:08 [PATCH 0/6] drm/msm: Assorted fixes - June/26 Akhil P Oommen
2026-06-04 20:08 ` [PATCH 1/6] drm/msm/a6xx: Fix stale rpmh votes after suspend Akhil P Oommen
2026-06-04 20:21   ` sashiko-bot
2026-06-05 13:09   ` Neil Armstrong
2026-06-04 20:08 ` [PATCH 2/6] drm/msm: Recover HW before retire hung submit Akhil P Oommen
2026-06-04 20:27   ` sashiko-bot
2026-06-04 20:08 ` [PATCH 3/6] drm/msm/a6xx: Fix A663 GPUCC register list for state capture Akhil P Oommen
2026-06-04 20:08 ` [PATCH 4/6] drm/msm/a6xx: Fix A621 " Akhil P Oommen
2026-06-04 20:08 ` [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test Akhil P Oommen
2026-06-04 20:25   ` sashiko-bot
2026-06-05  6:50   ` Rob Clark
2026-06-04 20:08 ` [PATCH 6/6] drm/msm: Fix task_struct reference leak in recover_worker Akhil P Oommen
2026-06-04 20:28   ` sashiko-bot
