[PATCH] drm/msm/a6xx: Fix recovery vs runpm race

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] drm/msm/a6xx: Fix recovery vs runpm race
@ 2023-12-18 15:59 Rob Clark
  2023-12-22 19:57 ` Akhil P Oommen
  0 siblings, 1 reply; 3+ messages in thread
From: Rob Clark @ 2023-12-18 15:59 UTC (permalink / raw)
  To: dri-devel
  Cc: freedreno, linux-arm-msm, Rob Clark, David Heidelberg, Rob Clark,
	Abhinav Kumar, Dmitry Baryshkov, Sean Paul, Marijn Suijten,
	David Airlie, Daniel Vetter, Konrad Dybcio, Akhil P Oommen,
	Danylo Piliaiev, Bjorn Andersson, open list

From: Rob Clark <robdclark@chromium.org>

a6xx_recover() is relying on the gpu lock to serialize against incoming
submits doing a runpm get, as it tries to temporarily balance out the
runpm gets with puts in order to power off the GPU.  Unfortunately this
gets worse when we (in a later patch) will move the runpm get out of the
scheduler thread/work to move it out of the fence signaling path.

Instead we can just simplify the whole thing by using force_suspend() /
force_resume() instead of trying to be clever.

Reported-by: David Heidelberg <david.heidelberg@collabora.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 268737e59131..a5660d63535b 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
 	dev_pm_genpd_synced_poweroff(gmu->cxpd);
 
-	/* Drop the rpm refcount from active submits */
-	if (active_submits)
-		pm_runtime_put(&gpu->pdev->dev);
-
-	/* And the final one from recover worker */
-	pm_runtime_put_sync(&gpu->pdev->dev);
+	pm_runtime_force_suspend(&gpu->pdev->dev);
 
 	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
 		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
@@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 
 	pm_runtime_use_autosuspend(&gpu->pdev->dev);
 
-	if (active_submits)
-		pm_runtime_get(&gpu->pdev->dev);
-
-	pm_runtime_get_sync(&gpu->pdev->dev);
+	pm_runtime_force_resume(&gpu->pdev->dev);
 
 	gpu->active_submits = active_submits;
 	mutex_unlock(&gpu->active_lock);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/msm/a6xx: Fix recovery vs runpm race
  2023-12-18 15:59 [PATCH] drm/msm/a6xx: Fix recovery vs runpm race Rob Clark
@ 2023-12-22 19:57 ` Akhil P Oommen
  2023-12-22 20:15   ` Rob Clark
  0 siblings, 1 reply; 3+ messages in thread
From: Akhil P Oommen @ 2023-12-22 19:57 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, freedreno, linux-arm-msm, Rob Clark, David Heidelberg,
	Abhinav Kumar, Dmitry Baryshkov, Sean Paul, Marijn Suijten,
	David Airlie, Daniel Vetter, Konrad Dybcio, Danylo Piliaiev,
	Bjorn Andersson, open list

On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
> 
> From: Rob Clark <robdclark@chromium.org>
> 
> a6xx_recover() is relying on the gpu lock to serialize against incoming
> submits doing a runpm get, as it tries to temporarily balance out the
> runpm gets with puts in order to power off the GPU.  Unfortunately this
> gets worse when we (in a later patch) will move the runpm get out of the
> scheduler thread/work to move it out of the fence signaling path.
> 
> Instead we can just simplify the whole thing by using force_suspend() /
> force_resume() instead of trying to be clever.

At some places, we take a pm_runtime vote and access the gpu
registers assuming it will be powered until we drop the vote.  a6xx_get_timestamp()
is an example. If we do a force suspend, it may cause bus errors from
those threads. Now you have to serialize every place we do runtime_get/put with a
mutex. Or is there a better way to handle the 'later patch' you
mentioned?

-Akhil.

> 
> Reported-by: David Heidelberg <david.heidelberg@collabora.com>
> Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
> Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 268737e59131..a5660d63535b 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
>  	dev_pm_genpd_synced_poweroff(gmu->cxpd);
>  
> -	/* Drop the rpm refcount from active submits */
> -	if (active_submits)
> -		pm_runtime_put(&gpu->pdev->dev);
> -
> -	/* And the final one from recover worker */
> -	pm_runtime_put_sync(&gpu->pdev->dev);
> +	pm_runtime_force_suspend(&gpu->pdev->dev);
>  
>  	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
>  		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
> @@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  
>  	pm_runtime_use_autosuspend(&gpu->pdev->dev);
>  
> -	if (active_submits)
> -		pm_runtime_get(&gpu->pdev->dev);
> -
> -	pm_runtime_get_sync(&gpu->pdev->dev);
> +	pm_runtime_force_resume(&gpu->pdev->dev);
>  
>  	gpu->active_submits = active_submits;
>  	mutex_unlock(&gpu->active_lock);
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/msm/a6xx: Fix recovery vs runpm race
  2023-12-22 19:57 ` Akhil P Oommen
@ 2023-12-22 20:15   ` Rob Clark
  0 siblings, 0 replies; 3+ messages in thread
From: Rob Clark @ 2023-12-22 20:15 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: dri-devel, freedreno, linux-arm-msm, Rob Clark, David Heidelberg,
	Abhinav Kumar, Dmitry Baryshkov, Sean Paul, Marijn Suijten,
	David Airlie, Daniel Vetter, Konrad Dybcio, Danylo Piliaiev,
	Bjorn Andersson, open list

On Fri, Dec 22, 2023 at 11:58 AM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
> >
> > From: Rob Clark <robdclark@chromium.org>
> >
> > a6xx_recover() is relying on the gpu lock to serialize against incoming
> > submits doing a runpm get, as it tries to temporarily balance out the
> > runpm gets with puts in order to power off the GPU.  Unfortunately this
> > gets worse when we (in a later patch) will move the runpm get out of the
> > scheduler thread/work to move it out of the fence signaling path.
> >
> > Instead we can just simplify the whole thing by using force_suspend() /
> > force_resume() instead of trying to be clever.
>
> At some places, we take a pm_runtime vote and access the gpu
> registers assuming it will be powered until we drop the vote.  a6xx_get_timestamp()
> is an example. If we do a force suspend, it may cause bus errors from
> those threads. Now you have to serialize every place we do runtime_get/put with a
> mutex. Or is there a better way to handle the 'later patch' you
> mentioned?

So I was running into issues, when I started adding an igt test to
stress test recovery vs multi-threaded submit, with cxpd not always
suspending and getting "cx gdsc did not collapse", which may be
related.

I was considering using force_suspend() on the gmu and cxpd if
gpu->hang==true, I'm not sure.  I ran out of time to play with this
when I was in the office.

The issue the 'later patch' is trying to deal with is getting memory
allocations out of the "fence signaling path", ie. out from the
drm/sched kthread/worker.  One way to do that, without dragging all of
runpm/device-link/etc into it is to do the runpm get in the submit
ioctl before enqueuing the job to the scheduler.  But then we can hold
a lock to protect against racing with recovery.

BR,
-R

> -Akhil.
>
> >
> > Reported-by: David Heidelberg <david.heidelberg@collabora.com>
> > Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
> > Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 268737e59131..a5660d63535b 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >       dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
> >       dev_pm_genpd_synced_poweroff(gmu->cxpd);
> >
> > -     /* Drop the rpm refcount from active submits */
> > -     if (active_submits)
> > -             pm_runtime_put(&gpu->pdev->dev);
> > -
> > -     /* And the final one from recover worker */
> > -     pm_runtime_put_sync(&gpu->pdev->dev);
> > +     pm_runtime_force_suspend(&gpu->pdev->dev);
> >
> >       if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
> >               DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
> > @@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >
> >       pm_runtime_use_autosuspend(&gpu->pdev->dev);
> >
> > -     if (active_submits)
> > -             pm_runtime_get(&gpu->pdev->dev);
> > -
> > -     pm_runtime_get_sync(&gpu->pdev->dev);
> > +     pm_runtime_force_resume(&gpu->pdev->dev);
> >
> >       gpu->active_submits = active_submits;
> >       mutex_unlock(&gpu->active_lock);
> > --
> > 2.43.0
> >

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-12-22 20:15 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-18 15:59 [PATCH] drm/msm/a6xx: Fix recovery vs runpm race Rob Clark
2023-12-22 19:57 ` Akhil P Oommen
2023-12-22 20:15   ` Rob Clark

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox