dri-devel.lists.freedesktop.org archive mirror
* [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
@ 2025-07-02 23:27 Dave Airlie
  2025-07-03 21:46 ` Danilo Krummrich
  2025-07-03 22:22 ` Danilo Krummrich
  0 siblings, 2 replies; 4+ messages in thread
From: Dave Airlie @ 2025-07-02 23:27 UTC (permalink / raw)
  To: dri-devel, nouveau; +Cc: Dave Airlie, Ben Skeggs, Danilo Krummrich

From: Dave Airlie <airlied@redhat.com>

This fixes a number of command hangs after runtime suspend/resume.

It addresses a regression introduced by code movement in the commit
below. That commit seems only to change timings enough to trigger the
problem, and adding the sleep seems to avoid it.

I've spent some time trying to root-cause it without much success.
It looks like a bug on the firmware side, but it could be a bug
in our RPC handling that I can't find.

Either way, we should land the workaround to fix the problem
while we continue to work out the root cause.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@nvidia.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
---
 drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
index baf42339f93e..ff362a6d9f5c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
@@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
 			nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
 			return ret;
 		}
+
+		/* without this Turing ends up resetting all channels after resume. */
+		msleep(50);
 	}
 
 	ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);
-- 
2.49.0



* Re: [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
  2025-07-02 23:27 [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs Dave Airlie
@ 2025-07-03 21:46 ` Danilo Krummrich
  2025-07-03 21:56   ` David Airlie
  2025-07-03 22:22 ` Danilo Krummrich
  1 sibling, 1 reply; 4+ messages in thread
From: Danilo Krummrich @ 2025-07-03 21:46 UTC (permalink / raw)
  To: Dave Airlie; +Cc: dri-devel, nouveau, Dave Airlie, Ben Skeggs

On 7/3/25 1:27 AM, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> This fixes a bunch of command hangs after runtime suspend/resume.
> 
> This fixes a regression caused by code movement in the commit below,
> the commit seems to just change timings enough to cause this to happen
> now, and adding the sleep seems to avoid it.
> 
> I've spent some time trying to root cause it to no great avail,
> it seems like a bug on the firmware side, but it could be a bug
> in our rpc handling that I can't find.
> 
> Either way, we should land the workaround to fix the problem,
> while we continue to work out the root cause.

I think we should add a TODO above the msleep(); what do you think would be a
good comment here?

I can add it when applying the patch if you want.

> Signed-off-by: Dave Airlie <airlied@redhat.com>
> Cc: Ben Skeggs <bskeggs@nvidia.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
> ---
>   drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> index baf42339f93e..ff362a6d9f5c 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> @@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
>   			nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
>   			return ret;
>   		}
> +
> +		/* without this Turing ends up resetting all channels after resume. */
> +		msleep(50);
>   	}
>   
>   	ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);



* Re: [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
  2025-07-03 21:46 ` Danilo Krummrich
@ 2025-07-03 21:56   ` David Airlie
  0 siblings, 0 replies; 4+ messages in thread
From: David Airlie @ 2025-07-03 21:56 UTC (permalink / raw)
  To: Danilo Krummrich; +Cc: Dave Airlie, dri-devel, nouveau, Ben Skeggs

On Fri, Jul 4, 2025 at 7:46 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On 7/3/25 1:27 AM, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This fixes a bunch of command hangs after runtime suspend/resume.
> >
> > This fixes a regression caused by code movement in the commit below,
> > the commit seems to just change timings enough to cause this to happen
> > now, and adding the sleep seems to avoid it.
> >
> > I've spent some time trying to root cause it to no great avail,
> > it seems like a bug on the firmware side, but it could be a bug
> > in our rpc handling that I can't find.
> >
> > Either way, we should land the workaround to fix the problem,
> > while we continue to work out the root cause.
>
> I think we should add a TODO above the msleep(); what do you think would be a
> good comment here?

TODO: debug the GSP firmware or the RPC handling to find out why this
is happening and why it's Turing-specific.

I don't really have a lot to go on.

Dave.
>
> I can add it when applying the patch if you want.
>
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > Cc: Ben Skeggs <bskeggs@nvidia.com>
> > Cc: Danilo Krummrich <dakr@kernel.org>
> > Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
> > ---
> >   drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
> >   1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> > index baf42339f93e..ff362a6d9f5c 100644
> > --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> > +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> > @@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
> >                       nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
> >                       return ret;
> >               }
> > +
> > +             /* without this Turing ends up resetting all channels after resume. */
> > +             msleep(50);
> >       }
> >
> >       ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);
>



* Re: [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
  2025-07-02 23:27 [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs Dave Airlie
  2025-07-03 21:46 ` Danilo Krummrich
@ 2025-07-03 22:22 ` Danilo Krummrich
  1 sibling, 0 replies; 4+ messages in thread
From: Danilo Krummrich @ 2025-07-03 22:22 UTC (permalink / raw)
  To: Dave Airlie; +Cc: dri-devel, nouveau, Dave Airlie, Ben Skeggs

On Thu, Jul 03, 2025 at 09:27:07AM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
> 
> This fixes a bunch of command hangs after runtime suspend/resume.
> 
> This fixes a regression caused by code movement in the commit below,
> the commit seems to just change timings enough to cause this to happen
> now, and adding the sleep seems to avoid it.
> 
> I've spent some time trying to root cause it to no great avail,
> it seems like a bug on the firmware side, but it could be a bug
> in our rpc handling that I can't find.
> 
> Either way, we should land the workaround to fix the problem,
> while we continue to work out the root cause.
> 
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> Cc: Ben Skeggs <bskeggs@nvidia.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")

Applied to drm-misc-fixes with the following diff.

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
index ff362a6d9f5c..23f80e167705 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
@@ -1745,7 +1745,11 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
                        return ret;
                }

-               /* without this Turing ends up resetting all channels after resume. */
+               /*
+                * TODO: Debug the GSP firmware / RPC handling to find out why
+                * without this Turing (but none of the other architectures)
+                * ends up resetting all channels after resume.
+                */
                msleep(50);
        }

I also changed the 'Fixes' tag to:

Fixes: c21b039715ce ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
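
For readers following the thread, the ordering the workaround enforces can
be sketched in user space. This is only an illustration under stated
assumptions: fake_fbsr_suspend(), fake_unloading_guest_driver(),
fake_msleep() and fake_gsp_fini() are hypothetical stand-ins, not the
nouveau functions, and the delay is recorded rather than actually slept so
the sequence can be checked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical step markers, used only to record call order. */
enum step { STEP_FBSR_SUSPEND, STEP_DELAY, STEP_UNLOAD };

#define TRACE_MAX 8
static enum step trace[TRACE_MAX];
static int trace_len;
static unsigned int last_delay_ms;

static void record(enum step s)
{
	assert(trace_len < TRACE_MAX);
	trace[trace_len++] = s;
}

/* Stand-in for the FBSR suspend RPC; always succeeds in this sketch. */
static int fake_fbsr_suspend(void)
{
	record(STEP_FBSR_SUSPEND);
	return 0;
}

/* Stand-in for the kernel's msleep(); records the request, does not sleep. */
static void fake_msleep(unsigned int ms)
{
	last_delay_ms = ms;
	record(STEP_DELAY);
}

/* Stand-in for the "unloading guest driver" RPC. */
static int fake_unloading_guest_driver(bool suspend)
{
	(void)suspend;
	record(STEP_UNLOAD);
	return 0;
}

/*
 * Mirrors the shape of the patched path: on suspend, the delay sits
 * between the FBSR suspend step and the driver unload step. Without
 * suspend, neither the FBSR step nor the delay runs.
 */
static int fake_gsp_fini(bool suspend)
{
	int ret;

	if (suspend) {
		ret = fake_fbsr_suspend();
		if (ret)
			return ret;

		fake_msleep(50);
	}

	return fake_unloading_guest_driver(suspend);
}
```

Calling fake_gsp_fini(true) records FBSR suspend, then the delay, then the
unload; fake_gsp_fini(false) records only the unload.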

