* [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
@ 2025-07-02 23:27 Dave Airlie
2025-07-03 21:46 ` Danilo Krummrich
2025-07-03 22:22 ` Danilo Krummrich
0 siblings, 2 replies; 4+ messages in thread
From: Dave Airlie @ 2025-07-02 23:27 UTC (permalink / raw)
To: dri-devel, nouveau; +Cc: Dave Airlie, Ben Skeggs, Danilo Krummrich
From: Dave Airlie <airlied@redhat.com>
This fixes a bunch of command hangs after runtime suspend/resume.
This fixes a regression caused by code movement in the commit below,
the commit seems to just change timings enough to cause this to happen
now, and adding the sleep seems to avoid it.
I've spent some time trying to root cause it to no great avail,
it seems like a bug on the firmware side, but it could be a bug
in our rpc handling that I can't find.
Either way, we should land the workaround to fix the problem,
while we continue to work out the root cause.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@nvidia.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
---
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
index baf42339f93e..ff362a6d9f5c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
@@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
 			nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
 			return ret;
 		}
+
+		/* without this Turing ends up resetting all channels after resume. */
+		msleep(50);
 	}
 	ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);
--
2.49.0
* Re: [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
From: Danilo Krummrich @ 2025-07-03 21:46 UTC (permalink / raw)
To: Dave Airlie; +Cc: dri-devel, nouveau, Dave Airlie, Ben Skeggs
On 7/3/25 1:27 AM, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This fixes a bunch of command hangs after runtime suspend/resume.
>
> This fixes a regression caused by code movement in the commit below,
> the commit seems to just change timings enough to cause this to happen
> now, and adding the sleep seems to avoid it.
>
> I've spent some time trying to root cause it to no great avail,
> it seems like a bug on the firmware side, but it could be a bug
> in our rpc handling that I can't find.
>
> Either way, we should land the workaround to fix the problem,
> while we continue to work out the root cause.
I think we should add a TODO above the msleep(); what do you think would be a
good comment here?
I can add it when applying the patch if you want.
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> Cc: Ben Skeggs <bskeggs@nvidia.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
> ---
> drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> index baf42339f93e..ff362a6d9f5c 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> @@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
> nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
> return ret;
> }
> +
> + /* without this Turing ends up resetting all channels after resume. */
> + msleep(50);
> }
>
> ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);
* Re: [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
From: David Airlie @ 2025-07-03 21:56 UTC (permalink / raw)
To: Danilo Krummrich; +Cc: Dave Airlie, dri-devel, nouveau, Ben Skeggs
On Fri, Jul 4, 2025 at 7:46 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On 7/3/25 1:27 AM, Dave Airlie wrote:
> > From: Dave Airlie <airlied@redhat.com>
> >
> > This fixes a bunch of command hangs after runtime suspend/resume.
> >
> > This fixes a regression caused by code movement in the commit below,
> > the commit seems to just change timings enough to cause this to happen
> > now, and adding the sleep seems to avoid it.
> >
> > I've spent some time trying to root cause it to no great avail,
> > it seems like a bug on the firmware side, but it could be a bug
> > in our rpc handling that I can't find.
> >
> > Either way, we should land the workaround to fix the problem,
> > while we continue to work out the root cause.
>
> I think we should add a TODO above the msleep(); what do you think would be a
> good comment here?
TODO: debug the gsp firmware or the rpc handling to find out why this
is happening and why it's Turing specific.
Don't really have a lot to go on,
Dave.
>
> I can add it when applying the patch if you want.
>
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > Cc: Ben Skeggs <bskeggs@nvidia.com>
> > Cc: Danilo Krummrich <dakr@kernel.org>
> > Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
> > ---
> > drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> > index baf42339f93e..ff362a6d9f5c 100644
> > --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> > +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> > @@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
> > nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
> > return ret;
> > }
> > +
> > + /* without this Turing ends up resetting all channels after resume. */
> > + msleep(50);
> > }
> >
> > ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);
>
* Re: [PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
From: Danilo Krummrich @ 2025-07-03 22:22 UTC (permalink / raw)
To: Dave Airlie; +Cc: dri-devel, nouveau, Dave Airlie, Ben Skeggs
On Thu, Jul 03, 2025 at 09:27:07AM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This fixes a bunch of command hangs after runtime suspend/resume.
>
> This fixes a regression caused by code movement in the commit below,
> the commit seems to just change timings enough to cause this to happen
> now, and adding the sleep seems to avoid it.
>
> I've spent some time trying to root cause it to no great avail,
> it seems like a bug on the firmware side, but it could be a bug
> in our rpc handling that I can't find.
>
> Either way, we should land the workaround to fix the problem,
> while we continue to work out the root cause.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> Cc: Ben Skeggs <bskeggs@nvidia.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
Applied to drm-misc-fixes with the following diff.
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
index ff362a6d9f5c..23f80e167705 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
@@ -1745,7 +1745,11 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
 			return ret;
 		}
-		/* without this Turing ends up resetting all channels after resume. */
+		/*
+		 * TODO: Debug the GSP firmware / RPC handling to find out why
+		 * without this Turing (but none of the other architectures)
+		 * ends up resetting all channels after resume.
+		 */
 		msleep(50);
 	}
I also changed the 'Fixes' tag to:
Fixes: c21b039715ce ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")