[PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU

linux-arm-msm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU
@ 2025-07-18 13:50 Connor Abbott
  2025-07-21 15:32 ` Rob Clark
  0 siblings, 1 reply; 5+ messages in thread
From: Connor Abbott @ 2025-07-18 13:50 UTC (permalink / raw)
  To: Rob Clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten
  Cc: linux-arm-msm, freedreno, Connor Abbott

If there is a flood of faults then the MMU can become saturated while it
waits for the kernel to process the first fault and resume it, so that
the GMU becomes blocked. This is mainly a problem when the kernel reads
the state of the GPU for a devcoredump, because this takes a while. If
we timeout waiting for the GMU, check if this has happened and retry
after we're finished.

Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c   | 21 ++++++++++++++++++---
 drivers/gpu/drm/msm/adreno/a6xx_hfi.c   | 21 ++++++++++++++++++---
 drivers/gpu/drm/msm/adreno/adreno_gpu.c | 11 +++++++++++
 drivers/gpu/drm/msm/adreno/adreno_gpu.h |  2 ++
 4 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 28e6705c6da682c7b41c748e375dda59a6551898..6ec396fab22d194481a76d30b2d36ea5fb662241 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -340,6 +340,7 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
 	int ret;
 	u32 val;
 	int request, ack;
+	struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
 
 	WARN_ON_ONCE(!mutex_is_locked(&gmu->lock));
 
@@ -363,9 +364,23 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
 	/* Trigger the equested OOB operation */
 	gmu_write(gmu, REG_A6XX_GMU_HOST2GMU_INTR_SET, 1 << request);
 
-	/* Wait for the acknowledge interrupt */
-	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
-		val & (1 << ack), 100, 10000);
+	do {
+		/* Wait for the acknowledge interrupt */
+		ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
+			val & (1 << ack), 100, 10000);
+
+		if (!ret)
+			break;
+
+		if (completion_done(&a6xx_gpu->base.fault_coredump_done))
+			break;
+
+		/* We may timeout because the GMU is temporarily wedged from
+		 * pending faults from the GPU and we are taking a devcoredump.
+		 * Wait until the MMU is resumed and try again.
+		 */
+		wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
+	} while (true);
 
 	if (ret)
 		DRM_DEV_ERROR(gmu->dev,
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
index 8e69b1e8465711837151725c8f70e7b4b16a368e..4e775ca757ce3649ac238d25cebfd7eb693fda61 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
@@ -104,10 +104,25 @@ static int a6xx_hfi_wait_for_msg_interrupt(struct a6xx_gmu *gmu, u32 id, u32 seq
 {
 	int ret;
 	u32 val;
+	struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
+
+	do {
+		/* Wait for a response */
+		ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
+			val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
+
+		if (!ret)
+			break;
 
-	/* Wait for a response */
-	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
-		val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
+		if (completion_done(&a6xx_gpu->base.fault_coredump_done))
+			break;
+
+		/* We may timeout because the GMU is temporarily wedged from
+		 * pending faults from the GPU and we are taking a devcoredump.
+		 * Wait until the MMU is resumed and try again.
+		 */
+		wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
+	} while (true);
 
 	if (ret) {
 		DRM_DEV_ERROR(gmu->dev,
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index f1230465bf0d0840274a6eb03a10c4df3a7a68d3..19181b6fddfd518e2f60324da1a7087458115e40 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -293,6 +293,7 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
 			 struct adreno_smmu_fault_info *info, const char *block,
 			 u32 scratch[4])
 {
+	struct adreno_gpu *adreno_gpu = container_of(gpu, struct adreno_gpu, base);
 	struct msm_drm_private *priv = gpu->dev->dev_private;
 	struct msm_mmu *mmu = to_msm_vm(gpu->vm)->mmu;
 	const char *type = "UNKNOWN";
@@ -345,6 +346,11 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
 		/* Turn off the hangcheck timer to keep it from bothering us */
 		timer_delete(&gpu->hangcheck_timer);
 
+		/* Let any concurrent GMU transactions know that the MMU may be
+		 * blocked for a while and they should wait on us.
+		 */
+		reinit_completion(&adreno_gpu->fault_coredump_done);
+
 		fault_info.ttbr0 = info->ttbr0;
 		fault_info.iova  = iova;
 		fault_info.flags = flags;
@@ -352,6 +358,8 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
 		fault_info.block = block;
 
 		msm_gpu_fault_crashstate_capture(gpu, &fault_info);
+
+		complete_all(&adreno_gpu->fault_coredump_done);
 	}
 
 	return 0;
@@ -1238,6 +1246,9 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	if (ret)
 		return ret;
 
+	init_completion(&adreno_gpu->fault_coredump_done);
+	complete_all(&adreno_gpu->fault_coredump_done);
+
 	pm_runtime_set_autosuspend_delay(dev,
 		adreno_gpu->info->inactive_period);
 	pm_runtime_use_autosuspend(dev);
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
index 9dc93c247196d5b8b3659157f7aeea81809d4056..f16556c6f2921708e740ecd47f5b4668e87700aa 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
@@ -179,6 +179,8 @@ struct adreno_gpu {
 	uint16_t speedbin;
 	const struct adreno_gpu_funcs *funcs;
 
+	struct completion fault_coredump_done;
+
 	/* interesting register offsets to dump: */
 	const unsigned int *registers;
 

---
base-commit: 8290d37ad2b087bbcfe65fa5bcaf260e184b250a
change-id: 20250718-msm-gmu-fault-wait-465543a718fa

Best regards,
-- 
Connor Abbott <cwabbott0@gmail.com>


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU
  2025-07-18 13:50 [PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU Connor Abbott
@ 2025-07-21 15:32 ` Rob Clark
  2025-07-24 19:48   ` Akhil P Oommen
  0 siblings, 1 reply; 5+ messages in thread
From: Rob Clark @ 2025-07-21 15:32 UTC (permalink / raw)
  To: Connor Abbott
  Cc: Sean Paul, Konrad Dybcio, Dmitry Baryshkov, Abhinav Kumar,
	Jessica Zhang, Marijn Suijten, linux-arm-msm, freedreno,
	Akhil P Oommen

On Fri, Jul 18, 2025 at 6:50 AM Connor Abbott <cwabbott0@gmail.com> wrote:
>
> If there is a flood of faults then the MMU can become saturated while it
> waits for the kernel to process the first fault and resume it, so that
> the GMU becomes blocked. This is mainly a problem when the kernel reads
> the state of the GPU for a devcoredump, because this takes a while. If
> we timeout waiting for the GMU, check if this has happened and retry
> after we're finished.
>
> Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c   | 21 ++++++++++++++++++---
>  drivers/gpu/drm/msm/adreno/a6xx_hfi.c   | 21 ++++++++++++++++++---
>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 11 +++++++++++
>  drivers/gpu/drm/msm/adreno/adreno_gpu.h |  2 ++
>  4 files changed, 49 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> index 28e6705c6da682c7b41c748e375dda59a6551898..6ec396fab22d194481a76d30b2d36ea5fb662241 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> @@ -340,6 +340,7 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>         int ret;
>         u32 val;
>         int request, ack;
> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
>
>         WARN_ON_ONCE(!mutex_is_locked(&gmu->lock));
>
> @@ -363,9 +364,23 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>         /* Trigger the equested OOB operation */
>         gmu_write(gmu, REG_A6XX_GMU_HOST2GMU_INTR_SET, 1 << request);
>
> -       /* Wait for the acknowledge interrupt */
> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> -               val & (1 << ack), 100, 10000);
> +       do {
> +               /* Wait for the acknowledge interrupt */
> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> +                       val & (1 << ack), 100, 10000);
> +
> +               if (!ret)
> +                       break;
> +
> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
> +                       break;
> +
> +               /* We may timeout because the GMU is temporarily wedged from
> +                * pending faults from the GPU and we are taking a devcoredump.
> +                * Wait until the MMU is resumed and try again.
> +                */
> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
> +       } while (true);

It is a bit sad to duplicate this nearly identical code twice.  And I
wonder if other gmu_poll_timeout()'s need similar treatment?  Maybe
Akhil has an opinion about whether we should just build this into
gmu_poll_timeout() instead?

BR,
-R

>
>         if (ret)
>                 DRM_DEV_ERROR(gmu->dev,
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
> index 8e69b1e8465711837151725c8f70e7b4b16a368e..4e775ca757ce3649ac238d25cebfd7eb693fda61 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
> @@ -104,10 +104,25 @@ static int a6xx_hfi_wait_for_msg_interrupt(struct a6xx_gmu *gmu, u32 id, u32 seq
>  {
>         int ret;
>         u32 val;
> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
> +
> +       do {
> +               /* Wait for a response */
> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> +                       val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
> +
> +               if (!ret)
> +                       break;
>
> -       /* Wait for a response */
> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> -               val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
> +                       break;
> +
> +               /* We may timeout because the GMU is temporarily wedged from
> +                * pending faults from the GPU and we are taking a devcoredump.
> +                * Wait until the MMU is resumed and try again.
> +                */
> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
> +       } while (true);
>
>         if (ret) {
>                 DRM_DEV_ERROR(gmu->dev,
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index f1230465bf0d0840274a6eb03a10c4df3a7a68d3..19181b6fddfd518e2f60324da1a7087458115e40 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -293,6 +293,7 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
>                          struct adreno_smmu_fault_info *info, const char *block,
>                          u32 scratch[4])
>  {
> +       struct adreno_gpu *adreno_gpu = container_of(gpu, struct adreno_gpu, base);
>         struct msm_drm_private *priv = gpu->dev->dev_private;
>         struct msm_mmu *mmu = to_msm_vm(gpu->vm)->mmu;
>         const char *type = "UNKNOWN";
> @@ -345,6 +346,11 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
>                 /* Turn off the hangcheck timer to keep it from bothering us */
>                 timer_delete(&gpu->hangcheck_timer);
>
> +               /* Let any concurrent GMU transactions know that the MMU may be
> +                * blocked for a while and they should wait on us.
> +                */
> +               reinit_completion(&adreno_gpu->fault_coredump_done);
> +
>                 fault_info.ttbr0 = info->ttbr0;
>                 fault_info.iova  = iova;
>                 fault_info.flags = flags;
> @@ -352,6 +358,8 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
>                 fault_info.block = block;
>
>                 msm_gpu_fault_crashstate_capture(gpu, &fault_info);
> +
> +               complete_all(&adreno_gpu->fault_coredump_done);
>         }
>
>         return 0;
> @@ -1238,6 +1246,9 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>         if (ret)
>                 return ret;
>
> +       init_completion(&adreno_gpu->fault_coredump_done);
> +       complete_all(&adreno_gpu->fault_coredump_done);
> +
>         pm_runtime_set_autosuspend_delay(dev,
>                 adreno_gpu->info->inactive_period);
>         pm_runtime_use_autosuspend(dev);
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> index 9dc93c247196d5b8b3659157f7aeea81809d4056..f16556c6f2921708e740ecd47f5b4668e87700aa 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> @@ -179,6 +179,8 @@ struct adreno_gpu {
>         uint16_t speedbin;
>         const struct adreno_gpu_funcs *funcs;
>
> +       struct completion fault_coredump_done;
> +
>         /* interesting register offsets to dump: */
>         const unsigned int *registers;
>
>
> ---
> base-commit: 8290d37ad2b087bbcfe65fa5bcaf260e184b250a
> change-id: 20250718-msm-gmu-fault-wait-465543a718fa
>
> Best regards,
> --
> Connor Abbott <cwabbott0@gmail.com>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU
  2025-07-21 15:32 ` Rob Clark
@ 2025-07-24 19:48   ` Akhil P Oommen
  2025-07-24 20:01     ` Connor Abbott
  0 siblings, 1 reply; 5+ messages in thread
From: Akhil P Oommen @ 2025-07-24 19:48 UTC (permalink / raw)
  To: rob.clark, Connor Abbott
  Cc: Sean Paul, Konrad Dybcio, Dmitry Baryshkov, Abhinav Kumar,
	Jessica Zhang, Marijn Suijten, linux-arm-msm, freedreno

On 7/21/2025 9:02 PM, Rob Clark wrote:
> On Fri, Jul 18, 2025 at 6:50 AM Connor Abbott <cwabbott0@gmail.com> wrote:
>>
>> If there is a flood of faults then the MMU can become saturated while it
>> waits for the kernel to process the first fault and resume it, so that
>> the GMU becomes blocked. This is mainly a problem when the kernel reads
>> the state of the GPU for a devcoredump, because this takes a while. If
>> we timeout waiting for the GMU, check if this has happened and retry
>> after we're finished.
>>
>> Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
>> ---
>>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c   | 21 ++++++++++++++++++---
>>  drivers/gpu/drm/msm/adreno/a6xx_hfi.c   | 21 ++++++++++++++++++---
>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 11 +++++++++++
>>  drivers/gpu/drm/msm/adreno/adreno_gpu.h |  2 ++
>>  4 files changed, 49 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> index 28e6705c6da682c7b41c748e375dda59a6551898..6ec396fab22d194481a76d30b2d36ea5fb662241 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> @@ -340,6 +340,7 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>         int ret;
>>         u32 val;
>>         int request, ack;
>> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
>>
>>         WARN_ON_ONCE(!mutex_is_locked(&gmu->lock));
>>
>> @@ -363,9 +364,23 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>         /* Trigger the equested OOB operation */
>>         gmu_write(gmu, REG_A6XX_GMU_HOST2GMU_INTR_SET, 1 << request);
>>
>> -       /* Wait for the acknowledge interrupt */
>> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>> -               val & (1 << ack), 100, 10000);
>> +       do {
>> +               /* Wait for the acknowledge interrupt */
>> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>> +                       val & (1 << ack), 100, 10000);
>> +
>> +               if (!ret)
>> +                       break;
>> +
>> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))

I didn't get why this is required. Could you please add a comment?

>> +                       break;
>> +
>> +               /* We may timeout because the GMU is temporarily wedged from
>> +                * pending faults from the GPU and we are taking a devcoredump.
>> +                * Wait until the MMU is resumed and try again.
>> +                */
>> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
>> +       } while (true);
> 
> It is a bit sad to duplicate this nearly identical code twice.  And I
> wonder if other gmu_poll_timeout()'s need similar treatment?  Maybe
> Akhil has an opinion about whether we should just build this into
> gmu_poll_timeout() instead?

Yeah. That make sense. A potential issue I see is that we might be
holding both gpu and gmu locks here and the crashstate capture in the pf
handler tries to lock gpu, which can result in a dead lock.

-Akhil.

> 
> BR,
> -R
> 
>>
>>         if (ret)
>>                 DRM_DEV_ERROR(gmu->dev,
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
>> index 8e69b1e8465711837151725c8f70e7b4b16a368e..4e775ca757ce3649ac238d25cebfd7eb693fda61 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
>> @@ -104,10 +104,25 @@ static int a6xx_hfi_wait_for_msg_interrupt(struct a6xx_gmu *gmu, u32 id, u32 seq
>>  {
>>         int ret;
>>         u32 val;
>> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
>> +
>> +       do {
>> +               /* Wait for a response */
>> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>> +                       val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
>> +
>> +               if (!ret)
>> +                       break;
>>
>> -       /* Wait for a response */
>> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>> -               val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
>> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
>> +                       break;
>> +
>> +               /* We may timeout because the GMU is temporarily wedged from
>> +                * pending faults from the GPU and we are taking a devcoredump.
>> +                * Wait until the MMU is resumed and try again.
>> +                */
>> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
>> +       } while (true);
>>
>>         if (ret) {
>>                 DRM_DEV_ERROR(gmu->dev,
>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> index f1230465bf0d0840274a6eb03a10c4df3a7a68d3..19181b6fddfd518e2f60324da1a7087458115e40 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> @@ -293,6 +293,7 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
>>                          struct adreno_smmu_fault_info *info, const char *block,
>>                          u32 scratch[4])
>>  {
>> +       struct adreno_gpu *adreno_gpu = container_of(gpu, struct adreno_gpu, base);
>>         struct msm_drm_private *priv = gpu->dev->dev_private;
>>         struct msm_mmu *mmu = to_msm_vm(gpu->vm)->mmu;
>>         const char *type = "UNKNOWN";
>> @@ -345,6 +346,11 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
>>                 /* Turn off the hangcheck timer to keep it from bothering us */
>>                 timer_delete(&gpu->hangcheck_timer);
>>
>> +               /* Let any concurrent GMU transactions know that the MMU may be
>> +                * blocked for a while and they should wait on us.
>> +                */
>> +               reinit_completion(&adreno_gpu->fault_coredump_done);
>> +
>>                 fault_info.ttbr0 = info->ttbr0;
>>                 fault_info.iova  = iova;
>>                 fault_info.flags = flags;
>> @@ -352,6 +358,8 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
>>                 fault_info.block = block;
>>
>>                 msm_gpu_fault_crashstate_capture(gpu, &fault_info);
>> +
>> +               complete_all(&adreno_gpu->fault_coredump_done);
>>         }
>>
>>         return 0;
>> @@ -1238,6 +1246,9 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>>         if (ret)
>>                 return ret;
>>
>> +       init_completion(&adreno_gpu->fault_coredump_done);
>> +       complete_all(&adreno_gpu->fault_coredump_done);
>> +
>>         pm_runtime_set_autosuspend_delay(dev,
>>                 adreno_gpu->info->inactive_period);
>>         pm_runtime_use_autosuspend(dev);
>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
>> index 9dc93c247196d5b8b3659157f7aeea81809d4056..f16556c6f2921708e740ecd47f5b4668e87700aa 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
>> @@ -179,6 +179,8 @@ struct adreno_gpu {
>>         uint16_t speedbin;
>>         const struct adreno_gpu_funcs *funcs;
>>
>> +       struct completion fault_coredump_done;
>> +
>>         /* interesting register offsets to dump: */
>>         const unsigned int *registers;
>>
>>
>> ---
>> base-commit: 8290d37ad2b087bbcfe65fa5bcaf260e184b250a
>> change-id: 20250718-msm-gmu-fault-wait-465543a718fa
>>
>> Best regards,
>> --
>> Connor Abbott <cwabbott0@gmail.com>
>>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU
  2025-07-24 19:48   ` Akhil P Oommen
@ 2025-07-24 20:01     ` Connor Abbott
  2025-07-25 22:12       ` Akhil P Oommen
  0 siblings, 1 reply; 5+ messages in thread
From: Connor Abbott @ 2025-07-24 20:01 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: rob.clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, linux-arm-msm,
	freedreno

On Thu, Jul 24, 2025 at 3:48 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>
> On 7/21/2025 9:02 PM, Rob Clark wrote:
> > On Fri, Jul 18, 2025 at 6:50 AM Connor Abbott <cwabbott0@gmail.com> wrote:
> >>
> >> If there is a flood of faults then the MMU can become saturated while it
> >> waits for the kernel to process the first fault and resume it, so that
> >> the GMU becomes blocked. This is mainly a problem when the kernel reads
> >> the state of the GPU for a devcoredump, because this takes a while. If
> >> we timeout waiting for the GMU, check if this has happened and retry
> >> after we're finished.
> >>
> >> Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
> >> ---
> >>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c   | 21 ++++++++++++++++++---
> >>  drivers/gpu/drm/msm/adreno/a6xx_hfi.c   | 21 ++++++++++++++++++---
> >>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 11 +++++++++++
> >>  drivers/gpu/drm/msm/adreno/adreno_gpu.h |  2 ++
> >>  4 files changed, 49 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> >> index 28e6705c6da682c7b41c748e375dda59a6551898..6ec396fab22d194481a76d30b2d36ea5fb662241 100644
> >> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> >> @@ -340,6 +340,7 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
> >>         int ret;
> >>         u32 val;
> >>         int request, ack;
> >> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
> >>
> >>         WARN_ON_ONCE(!mutex_is_locked(&gmu->lock));
> >>
> >> @@ -363,9 +364,23 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
> >>         /* Trigger the equested OOB operation */
> >>         gmu_write(gmu, REG_A6XX_GMU_HOST2GMU_INTR_SET, 1 << request);
> >>
> >> -       /* Wait for the acknowledge interrupt */
> >> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> >> -               val & (1 << ack), 100, 10000);
> >> +       do {
> >> +               /* Wait for the acknowledge interrupt */
> >> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> >> +                       val & (1 << ack), 100, 10000);
> >> +
> >> +               if (!ret)
> >> +                       break;
> >> +
> >> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
>
> I didn't get why this is required. Could you please add a comment?

Without this, if the GMU timed out for some other reason not related
to SMMU then we'd loop infinitely. This gives up if there isn't
currently a crashstate pending.

>
> >> +                       break;
> >> +
> >> +               /* We may timeout because the GMU is temporarily wedged from
> >> +                * pending faults from the GPU and we are taking a devcoredump.
> >> +                * Wait until the MMU is resumed and try again.
> >> +                */
> >> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
> >> +       } while (true);
> >
> > It is a bit sad to duplicate this nearly identical code twice.  And I
> > wonder if other gmu_poll_timeout()'s need similar treatment?  Maybe
> > Akhil has an opinion about whether we should just build this into
> > gmu_poll_timeout() instead?
>
> Yeah. That make sense. A potential issue I see is that we might be
> holding both gpu and gmu locks here and the crashstate capture in the pf
> handler tries to lock gpu, which can result in a dead lock.

I think there would already be a deadlock, or at least timeout in that
situation now. Any task waiting for the GMU to complete while holding
the GPU lock would block the crashstate capture from completing and
allowing the GMU to continue.

Connor

>
> -Akhil.
>
> >
> > BR,
> > -R
> >
> >>
> >>         if (ret)
> >>                 DRM_DEV_ERROR(gmu->dev,
> >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
> >> index 8e69b1e8465711837151725c8f70e7b4b16a368e..4e775ca757ce3649ac238d25cebfd7eb693fda61 100644
> >> --- a/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
> >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_hfi.c
> >> @@ -104,10 +104,25 @@ static int a6xx_hfi_wait_for_msg_interrupt(struct a6xx_gmu *gmu, u32 id, u32 seq
> >>  {
> >>         int ret;
> >>         u32 val;
> >> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
> >> +
> >> +       do {
> >> +               /* Wait for a response */
> >> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> >> +                       val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
> >> +
> >> +               if (!ret)
> >> +                       break;
> >>
> >> -       /* Wait for a response */
> >> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
> >> -               val & A6XX_GMU_GMU2HOST_INTR_INFO_MSGQ, 100, 1000000);
> >> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
> >> +                       break;
> >> +
> >> +               /* We may timeout because the GMU is temporarily wedged from
> >> +                * pending faults from the GPU and we are taking a devcoredump.
> >> +                * Wait until the MMU is resumed and try again.
> >> +                */
> >> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);
> >> +       } while (true);
> >>
> >>         if (ret) {
> >>                 DRM_DEV_ERROR(gmu->dev,
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> index f1230465bf0d0840274a6eb03a10c4df3a7a68d3..19181b6fddfd518e2f60324da1a7087458115e40 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> @@ -293,6 +293,7 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
> >>                          struct adreno_smmu_fault_info *info, const char *block,
> >>                          u32 scratch[4])
> >>  {
> >> +       struct adreno_gpu *adreno_gpu = container_of(gpu, struct adreno_gpu, base);
> >>         struct msm_drm_private *priv = gpu->dev->dev_private;
> >>         struct msm_mmu *mmu = to_msm_vm(gpu->vm)->mmu;
> >>         const char *type = "UNKNOWN";
> >> @@ -345,6 +346,11 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
> >>                 /* Turn off the hangcheck timer to keep it from bothering us */
> >>                 timer_delete(&gpu->hangcheck_timer);
> >>
> >> +               /* Let any concurrent GMU transactions know that the MMU may be
> >> +                * blocked for a while and they should wait on us.
> >> +                */
> >> +               reinit_completion(&adreno_gpu->fault_coredump_done);
> >> +
> >>                 fault_info.ttbr0 = info->ttbr0;
> >>                 fault_info.iova  = iova;
> >>                 fault_info.flags = flags;
> >> @@ -352,6 +358,8 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
> >>                 fault_info.block = block;
> >>
> >>                 msm_gpu_fault_crashstate_capture(gpu, &fault_info);
> >> +
> >> +               complete_all(&adreno_gpu->fault_coredump_done);
> >>         }
> >>
> >>         return 0;
> >> @@ -1238,6 +1246,9 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> >>         if (ret)
> >>                 return ret;
> >>
> >> +       init_completion(&adreno_gpu->fault_coredump_done);
> >> +       complete_all(&adreno_gpu->fault_coredump_done);
> >> +
> >>         pm_runtime_set_autosuspend_delay(dev,
> >>                 adreno_gpu->info->inactive_period);
> >>         pm_runtime_use_autosuspend(dev);
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> >> index 9dc93c247196d5b8b3659157f7aeea81809d4056..f16556c6f2921708e740ecd47f5b4668e87700aa 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> >> @@ -179,6 +179,8 @@ struct adreno_gpu {
> >>         uint16_t speedbin;
> >>         const struct adreno_gpu_funcs *funcs;
> >>
> >> +       struct completion fault_coredump_done;
> >> +
> >>         /* interesting register offsets to dump: */
> >>         const unsigned int *registers;
> >>
> >>
> >> ---
> >> base-commit: 8290d37ad2b087bbcfe65fa5bcaf260e184b250a
> >> change-id: 20250718-msm-gmu-fault-wait-465543a718fa
> >>
> >> Best regards,
> >> --
> >> Connor Abbott <cwabbott0@gmail.com>
> >>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU
  2025-07-24 20:01     ` Connor Abbott
@ 2025-07-25 22:12       ` Akhil P Oommen
  0 siblings, 0 replies; 5+ messages in thread
From: Akhil P Oommen @ 2025-07-25 22:12 UTC (permalink / raw)
  To: Connor Abbott
  Cc: rob.clark, Sean Paul, Konrad Dybcio, Dmitry Baryshkov,
	Abhinav Kumar, Jessica Zhang, Marijn Suijten, linux-arm-msm,
	freedreno

On 7/25/2025 1:31 AM, Connor Abbott wrote:
> On Thu, Jul 24, 2025 at 3:48 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>
>> On 7/21/2025 9:02 PM, Rob Clark wrote:
>>> On Fri, Jul 18, 2025 at 6:50 AM Connor Abbott <cwabbott0@gmail.com> wrote:
>>>>
>>>> If there is a flood of faults then the MMU can become saturated while it
>>>> waits for the kernel to process the first fault and resume it, so that
>>>> the GMU becomes blocked. This is mainly a problem when the kernel reads
>>>> the state of the GPU for a devcoredump, because this takes a while. If
>>>> we timeout waiting for the GMU, check if this has happened and retry
>>>> after we're finished.
>>>>
>>>> Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
>>>> ---
>>>>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c   | 21 ++++++++++++++++++---
>>>>  drivers/gpu/drm/msm/adreno/a6xx_hfi.c   | 21 ++++++++++++++++++---
>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 11 +++++++++++
>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.h |  2 ++
>>>>  4 files changed, 49 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> index 28e6705c6da682c7b41c748e375dda59a6551898..6ec396fab22d194481a76d30b2d36ea5fb662241 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> @@ -340,6 +340,7 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>>>         int ret;
>>>>         u32 val;
>>>>         int request, ack;
>>>> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
>>>>
>>>>         WARN_ON_ONCE(!mutex_is_locked(&gmu->lock));
>>>>
>>>> @@ -363,9 +364,23 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>>>         /* Trigger the equested OOB operation */
>>>>         gmu_write(gmu, REG_A6XX_GMU_HOST2GMU_INTR_SET, 1 << request);
>>>>
>>>> -       /* Wait for the acknowledge interrupt */
>>>> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>>>> -               val & (1 << ack), 100, 10000);
>>>> +       do {
>>>> +               /* Wait for the acknowledge interrupt */
>>>> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>>>> +                       val & (1 << ack), 100, 10000);
>>>> +
>>>> +               if (!ret)
>>>> +                       break;
>>>> +
>>>> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
>>
>> I didn't get why this is required. Could you please add a comment?
> 
> Without this, if the GMU timed out for some other reason not related
> to SMMU then we'd loop infinitely. This gives up if there isn't
> currently a crashstate pending.

Ah! That api doc somehow confused me.

> 
>>
>>>> +                       break;
>>>> +
>>>> +               /* We may timeout because the GMU is temporarily wedged from
>>>> +                * pending faults from the GPU and we are taking a devcoredump.
>>>> +                * Wait until the MMU is resumed and try again.
>>>> +                */
>>>> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);

use the interruptible version? we may reach here from a process context.

>>>> +       } while (true);
>>>
>>> It is a bit sad to duplicate this nearly identical code twice.  And I
>>> wonder if other gmu_poll_timeout()'s need similar treatment?  Maybe
>>> Akhil has an opinion about whether we should just build this into
>>> gmu_poll_timeout() instead?
>>
>> Yeah. That make sense. A potential issue I see is that we might be
>> holding both gpu and gmu locks here and the crashstate capture in the pf
>> handler tries to lock gpu, which can result in a dead lock.
> 
> I think there would already be a deadlock, or at least timeout in that
> situation now. Any task waiting for the GMU to complete while holding
> the GPU lock would block the crashstate capture from completing and
> allowing the GMU to continue.

Timeout is fine as there is progress eventually. But deadlock is not
acceptable. Also, userspace can easily trigger this deadlock which makes
it a security issue.

I agree, we need to improve the gmu error handling situation overall. I
thought about this a few years ago actually. At that time, I thought it
would be simpler if we always did coredump/recovery from a single
thread. Not sure if that idea still makes sense.

On a related topic, stall-on-fault cannot be used in production. GMU is
very critical as it interacts directly with SoC power management
subsystems and also every year, there is an additional responsibility on
GMU to do a very time critical mitigation like CLX, thermal, BCL etc.
And these mitigations should be handled within a few microseconds. So
GMU should never be blocked, even for microseconds. Apart from that,
even GPU's internal bus can get locked up in rare cases which can lead
to a fatal system bus/NoC error when KMD access a register in the GPU space.

But stall-on-fault is useful while debugging. So any improvements in
this area is useful.

-Akhil.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-07-25 22:12 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-18 13:50 [PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU Connor Abbott
2025-07-21 15:32 ` Rob Clark
2025-07-24 19:48   ` Akhil P Oommen
2025-07-24 20:01     ` Connor Abbott
2025-07-25 22:12       ` Akhil P Oommen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).