* [PATCH v3 0/3] drm/nouveau/gsp/r570: Fix runtime PM
@ 2026-07-01 18:17 Lyude Paul
2026-07-01 18:17 ` [PATCH v3 1/3] Revert "nouveau/gsp: fix suspend/resume regression on r570 firmware" Lyude Paul
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Lyude Paul @ 2026-07-01 18:17 UTC (permalink / raw)
To: nouveau, dri-devel, linux-kernel
Cc: Timur Tabi, Dave Airlie, Andy Shevchenko, Maarten Lankhorst,
Ben Skeggs, Kees Cook, Simona Vetter, David Airlie,
Thomas Zimmermann, Maxime Ripard, Mel Henning, Danilo Krummrich,
Lyude Paul
Runtime PM has been kind of unreliable with GSP for a while. It works
well enough to shut the GPU off and turn it back on, but more often then
not it ends up leaving the GPU in a broken state on resume - which makes
it impossible to really do anything else with it.
Recently however, I discovered that it's been failing much harder on
some ampere systems. This lead me down a rabbit hole that lead me to
figure out one of our previous fixes for runtime PM wasn't correct.
After fixing that and combining it with some fixes we had tried in the
past without success, I finally managed to get nouveau to handle runtime
PM with the GSP perfectly. These are those fixes.
Patch-series wide changes since V2:
* Drop the patch for switching oldLevel from LEVEL_3 to LEVEL_4. If I'm
understanding OpenRM correctly, LEVEL_3 corresponds to PCI D3 power
state and is intended for use on systems with ACPI. To be clear, I
mean ACPI support generally - it doesn't appear to specifically refer to
laptop-specific ACPI handles like _DSM or the like.
In an unexpected twist, it appears LEVEL_4 actually corresponds to APM
(as in "Advanced Power Management", e.g. pre-ACPI)! After re-testing
my local setups I wasn't able to find any significant behavior
difference between LEVEL_3 and LEVEL_4, so let's be conservative and
leave this be for now.
It still may be worth looking into the future why I've seen OpenRM
specifically call-out APM suspend being requested in debug builds
instead of S3. I wouldn't expect APM to actually work with modern
Nvidia hardware, but it's always possible there's some other reason
they still use this
Tested on a Lenovo P16 G1 (100 runtime PM cycles!), and RTX4000.
Previous version of the patch series:
https://patchwork.freedesktop.org/series/169457/
Lyude Paul (3):
Revert "nouveau/gsp: fix suspend/resume regression on r570 firmware"
drm/nouveau/gsp/r570: Never enter Gcoff state
drm/nouveau/gsp/r570: Add missing state flags to GSP resume arguments
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/fbsr.c | 2 +-
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 2 +-
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c | 8 ++++----
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/gsp.c | 3 ++-
.../gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/nvrm/gsp.h | 8 ++++++++
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/rm.h | 2 +-
6 files changed, 17 insertions(+), 8 deletions(-)
base-commit: cff96362794a5c1f3adb013b4a46c7233149a629
--
2.54.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v3 1/3] Revert "nouveau/gsp: fix suspend/resume regression on r570 firmware"
2026-07-01 18:17 [PATCH v3 0/3] drm/nouveau/gsp/r570: Fix runtime PM Lyude Paul
@ 2026-07-01 18:17 ` Lyude Paul
2026-07-01 18:17 ` [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state Lyude Paul
2026-07-01 18:17 ` [PATCH v3 3/3] drm/nouveau/gsp/r570: Add missing state flags to GSP resume arguments Lyude Paul
2 siblings, 0 replies; 8+ messages in thread
From: Lyude Paul @ 2026-07-01 18:17 UTC (permalink / raw)
To: nouveau, dri-devel, linux-kernel
Cc: stable, Timur Tabi, Dave Airlie, Andy Shevchenko,
Maarten Lankhorst, Ben Skeggs, Kees Cook, Simona Vetter,
David Airlie, Thomas Zimmermann, Maxime Ripard, Mel Henning,
Danilo Krummrich, Lyude Paul
This reverts commit 8302d0afeaec0bc57d951dd085e0cffe997d4d18.
It turns out this looked like the right fix on some systems, but it's not -
as this causes runtime PM to actually fail on many a laptop.
Fixes: 8302d0afeaec ("nouveau/gsp: fix suspend/resume regression on r570 firmware")
Cc: <stable@vger.kernel.org> # v6.19+
Signed-off-by: Lyude Paul <lyude@redhat.com>
---
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/fbsr.c | 2 +-
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 2 +-
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c | 8 ++++----
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/rm.h | 2 +-
4 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/fbsr.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/fbsr.c
index 700cea5def357..035b575722db7 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/fbsr.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/fbsr.c
@@ -208,7 +208,7 @@ r535_fbsr_resume(struct nvkm_gsp *gsp)
}
static int
-r535_fbsr_suspend(struct nvkm_gsp *gsp, bool runtime)
+r535_fbsr_suspend(struct nvkm_gsp *gsp)
{
struct nvkm_subdev *subdev = &gsp->subdev;
struct nvkm_device *device = subdev->device;
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
index f544afa12b6bb..4a3b771ded255 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
@@ -1749,7 +1749,7 @@ r535_gsp_fini(struct nvkm_gsp *gsp, enum nvkm_suspend_state suspend)
sr->sysmemAddrOfSuspendResumeData = gsp->sr.radix3.lvl0.addr;
sr->sizeOfSuspendResumeData = len;
- ret = rm->api->fbsr->suspend(gsp, suspend == NVKM_RUNTIME_SUSPEND);
+ ret = rm->api->fbsr->suspend(gsp);
if (ret) {
nvkm_gsp_mem_dtor(&gsp->sr.meta);
nvkm_gsp_radix3_dtor(gsp, &gsp->sr.radix3);
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
index 8ef8b4f655883..2945d5b4e5707 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
@@ -62,7 +62,7 @@ r570_fbsr_resume(struct nvkm_gsp *gsp)
}
static int
-r570_fbsr_init(struct nvkm_gsp *gsp, struct sg_table *sgt, u64 size, bool runtime)
+r570_fbsr_init(struct nvkm_gsp *gsp, struct sg_table *sgt, u64 size)
{
NV2080_CTRL_INTERNAL_FBSR_INIT_PARAMS *ctrl;
struct nvkm_gsp_object memlist;
@@ -81,7 +81,7 @@ r570_fbsr_init(struct nvkm_gsp *gsp, struct sg_table *sgt, u64 size, bool runtim
ctrl->hClient = gsp->internal.client.object.handle;
ctrl->hSysMem = memlist.handle;
ctrl->sysmemAddrOfSuspendResumeData = gsp->sr.meta.addr;
- ctrl->bEnteringGcoffState = runtime ? 1 : 0;
+ ctrl->bEnteringGcoffState = 1;
ret = nvkm_gsp_rm_ctrl_wr(&gsp->internal.device.subdevice, ctrl);
if (ret)
@@ -92,7 +92,7 @@ r570_fbsr_init(struct nvkm_gsp *gsp, struct sg_table *sgt, u64 size, bool runtim
}
static int
-r570_fbsr_suspend(struct nvkm_gsp *gsp, bool runtime)
+r570_fbsr_suspend(struct nvkm_gsp *gsp)
{
struct nvkm_subdev *subdev = &gsp->subdev;
struct nvkm_device *device = subdev->device;
@@ -133,7 +133,7 @@ r570_fbsr_suspend(struct nvkm_gsp *gsp, bool runtime)
return ret;
/* Initialise FBSR on RM. */
- ret = r570_fbsr_init(gsp, &gsp->sr.fbsr, size, runtime);
+ ret = r570_fbsr_init(gsp, &gsp->sr.fbsr, size);
if (ret) {
nvkm_gsp_sg_free(device, &gsp->sr.fbsr);
return ret;
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/rm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/rm.h
index a9af94adf9efc..0fb0e67406c67 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/rm.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/rm.h
@@ -78,7 +78,7 @@ struct nvkm_rm_api {
} *device;
const struct nvkm_rm_api_fbsr {
- int (*suspend)(struct nvkm_gsp *, bool runtime);
+ int (*suspend)(struct nvkm_gsp *);
void (*resume)(struct nvkm_gsp *);
} *fbsr;
--
2.54.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state
2026-07-01 18:17 [PATCH v3 0/3] drm/nouveau/gsp/r570: Fix runtime PM Lyude Paul
2026-07-01 18:17 ` [PATCH v3 1/3] Revert "nouveau/gsp: fix suspend/resume regression on r570 firmware" Lyude Paul
@ 2026-07-01 18:17 ` Lyude Paul
2026-07-02 0:27 ` Danilo Krummrich
2026-07-01 18:17 ` [PATCH v3 3/3] drm/nouveau/gsp/r570: Add missing state flags to GSP resume arguments Lyude Paul
2 siblings, 1 reply; 8+ messages in thread
From: Lyude Paul @ 2026-07-01 18:17 UTC (permalink / raw)
To: nouveau, dri-devel, linux-kernel
Cc: stable, Timur Tabi, Dave Airlie, Andy Shevchenko,
Maarten Lankhorst, Ben Skeggs, Kees Cook, Simona Vetter,
David Airlie, Thomas Zimmermann, Maxime Ripard, Mel Henning,
Danilo Krummrich, Lyude Paul
It turns out that the only reason our previous fixes looked like they
worked for this was because we would occasionally set the Gcoff state to 0
in the normal S3 path, which fixed suspend/resume on desktops - but not on
machines using runtime suspend.
The proper fix is to just never set this flag. Our current guess for the
reasoning behind this is that Gcoff likely coincides with GC6, and not
literally power off.
Signed-off-by: Lyude Paul <lyude@redhat.com>
Fixes: 53dac0623853 ("drm/nouveau/gsp: add support for 570.144")
Cc: <stable@vger.kernel.org> # v6.16+
---
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
index 2945d5b4e5707..af5aa5065c3dd 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
@@ -81,7 +81,7 @@ r570_fbsr_init(struct nvkm_gsp *gsp, struct sg_table *sgt, u64 size)
ctrl->hClient = gsp->internal.client.object.handle;
ctrl->hSysMem = memlist.handle;
ctrl->sysmemAddrOfSuspendResumeData = gsp->sr.meta.addr;
- ctrl->bEnteringGcoffState = 1;
+ ctrl->bEnteringGcoffState = 0;
ret = nvkm_gsp_rm_ctrl_wr(&gsp->internal.device.subdevice, ctrl);
if (ret)
--
2.54.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v3 3/3] drm/nouveau/gsp/r570: Add missing state flags to GSP resume arguments
2026-07-01 18:17 [PATCH v3 0/3] drm/nouveau/gsp/r570: Fix runtime PM Lyude Paul
2026-07-01 18:17 ` [PATCH v3 1/3] Revert "nouveau/gsp: fix suspend/resume regression on r570 firmware" Lyude Paul
2026-07-01 18:17 ` [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state Lyude Paul
@ 2026-07-01 18:17 ` Lyude Paul
2 siblings, 0 replies; 8+ messages in thread
From: Lyude Paul @ 2026-07-01 18:17 UTC (permalink / raw)
To: nouveau, dri-devel, linux-kernel
Cc: stable, Timur Tabi, Dave Airlie, Andy Shevchenko,
Maarten Lankhorst, Ben Skeggs, Kees Cook, Simona Vetter,
David Airlie, Thomas Zimmermann, Maxime Ripard, Mel Henning,
Danilo Krummrich, Lyude Paul
When resuming the GPU, we need to let GSP know that we need it to avoid
destructive state transitions when bringing the GPU back up. Otherwise, we
will end up with a plethora of strange behavior after coming out of resume
- such as push buffers timing out whenever we try to do rendering.
Signed-off-by: Lyude Paul <lyude@redhat.com>
Fixes: 53dac0623853 ("drm/nouveau/gsp: add support for 570.144")
Cc: <stable@vger.kernel.org> # v6.16+
---
drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/gsp.c | 3 ++-
.../gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/nvrm/gsp.h | 8 ++++++++
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/gsp.c
index 996941c668ba9..3e391646d8f7d 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/gsp.c
@@ -198,7 +198,8 @@ r570_gsp_set_rmargs(struct nvkm_gsp *gsp, bool resume)
args->srInitArguments.bInPMTransition = 0;
} else {
args->srInitArguments.oldLevel = NV2080_CTRL_GPU_SET_POWER_STATE_GPU_LEVEL_3;
- args->srInitArguments.flags = 0;
+ args->srInitArguments.flags =
+ GPU_STATE_FLAGS_PRESERVING | GPU_STATE_FLAGS_PM_TRANSITION;
args->srInitArguments.bInPMTransition = 1;
}
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/nvrm/gsp.h b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/nvrm/gsp.h
index b6075021e74f5..c458569af9d72 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/nvrm/gsp.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/nvrm/gsp.h
@@ -523,6 +523,14 @@ typedef struct
#define NV2080_CTRL_GPU_SET_POWER_STATE_GPU_LEVEL_3 (0x00000003U)
+#define GPU_STATE_FLAGS_PRESERVING BIT(0) // GPU state is preserved
+#define GPU_STATE_FLAGS_VGA_TRANSITION BIT(1) // To be used with GPU_STATE_FLAGS_PRESERVING.
+#define GPU_STATE_FLAGS_PM_TRANSITION BIT(2) // To be used with GPU_STATE_FLAGS_PRESERVING.
+#define GPU_STATE_FLAGS_PM_SUSPEND BIT(3)
+#define GPU_STATE_FLAGS_PM_HIBERNATE BIT(4)
+#define GPU_STATE_FLAGS_GC6_TRANSITION BIT(5) // To be used with GPU_STATE_FLAGS_PRESERVING.
+#define GPU_STATE_FLAGS_FAST_UNLOAD BIT(6) // Used during windows restart, skips stateDestroy steps
+
typedef struct
{
// Magic for verification by secure ucode
--
2.54.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state
2026-07-01 18:17 ` [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state Lyude Paul
@ 2026-07-02 0:27 ` Danilo Krummrich
2026-07-02 0:30 ` David Airlie
0 siblings, 1 reply; 8+ messages in thread
From: Danilo Krummrich @ 2026-07-02 0:27 UTC (permalink / raw)
To: Lyude Paul
Cc: nouveau, dri-devel, linux-kernel, stable, Timur Tabi, Dave Airlie,
Andy Shevchenko, Maarten Lankhorst, Kees Cook, Simona Vetter,
David Airlie, Thomas Zimmermann, Maxime Ripard, Mel Henning,
John Hubbard
(Cc: John)
On Wed Jul 1, 2026 at 8:17 PM CEST, Lyude Paul wrote:
> It turns out that the only reason our previous fixes looked like they
> worked for this was because we would occasionally set the Gcoff state to 0
> in the normal S3 path, which fixed suspend/resume on desktops - but not on
> machines using runtime suspend.
>
> The proper fix is to just never set this flag. Our current guess for the
> reasoning behind this is that Gcoff likely coincides with GC6, and not
> literally power off.
I don't think GcOff coincides with GC6, it should actually be a power off.
From a quick glance in OpenRM, it seems that with bEnteringGcoffState = 1 it
also saves off buffers flagged as MEMDESC_FLAGS_LOST_ON_SUSPEND.
My guess would be that with bEnteringGcoffState = 1, GSP's resume path expects
certain kernel-driver-allocated buffers to still be in place that nouveau didn't
save off, or rather never had in the first place.
John, do you have some details about this?
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Fixes: 53dac0623853 ("drm/nouveau/gsp: add support for 570.144")
> Cc: <stable@vger.kernel.org> # v6.16+
> ---
> drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
> index 2945d5b4e5707..af5aa5065c3dd 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r570/fbsr.c
> @@ -81,7 +81,7 @@ r570_fbsr_init(struct nvkm_gsp *gsp, struct sg_table *sgt, u64 size)
> ctrl->hClient = gsp->internal.client.object.handle;
> ctrl->hSysMem = memlist.handle;
> ctrl->sysmemAddrOfSuspendResumeData = gsp->sr.meta.addr;
> - ctrl->bEnteringGcoffState = 1;
> + ctrl->bEnteringGcoffState = 0;
>
> ret = nvkm_gsp_rm_ctrl_wr(&gsp->internal.device.subdevice, ctrl);
> if (ret)
> --
> 2.54.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state
2026-07-02 0:27 ` Danilo Krummrich
@ 2026-07-02 0:30 ` David Airlie
2026-07-02 0:47 ` Danilo Krummrich
0 siblings, 1 reply; 8+ messages in thread
From: David Airlie @ 2026-07-02 0:30 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Lyude Paul, nouveau, dri-devel, linux-kernel, stable, Timur Tabi,
Andy Shevchenko, Maarten Lankhorst, Kees Cook, Simona Vetter,
David Airlie, Thomas Zimmermann, Maxime Ripard, Mel Henning,
John Hubbard
On Thu, Jul 2, 2026 at 10:27 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> (Cc: John)
>
> On Wed Jul 1, 2026 at 8:17 PM CEST, Lyude Paul wrote:
> > It turns out that the only reason our previous fixes looked like they
> > worked for this was because we would occasionally set the Gcoff state to 0
> > in the normal S3 path, which fixed suspend/resume on desktops - but not on
> > machines using runtime suspend.
> >
> > The proper fix is to just never set this flag. Our current guess for the
> > reasoning behind this is that Gcoff likely coincides with GC6, and not
> > literally power off.
>
> I don't think GcOff coincides with GC6, it should actually be a power off.
>
> From a quick glance in OpenRM, it seems that with bEnteringGcoffState = 1 it
> also saves off buffers flagged as MEMDESC_FLAGS_LOST_ON_SUSPEND.
>
> My guess would be that with bEnteringGcoffState = 1, GSP's resume path expects
> certain kernel-driver-allocated buffers to still be in place that nouveau didn't
> save off, or rather never had in the first place.
>
> John, do you have some details about this?
>
In nouveau we have the INST_SR_LOST target, for buffers that aren't
preserved, I wonder did something change between 535 and 570 around
what needs to be kept around.
Dave.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state
2026-07-02 0:30 ` David Airlie
@ 2026-07-02 0:47 ` Danilo Krummrich
2026-07-02 22:46 ` John Hubbard
0 siblings, 1 reply; 8+ messages in thread
From: Danilo Krummrich @ 2026-07-02 0:47 UTC (permalink / raw)
To: David Airlie
Cc: Lyude Paul, nouveau, dri-devel, linux-kernel, stable, Timur Tabi,
Andy Shevchenko, Maarten Lankhorst, Kees Cook, Simona Vetter,
David Airlie, Thomas Zimmermann, Maxime Ripard, Mel Henning,
John Hubbard
On Thu Jul 2, 2026 at 2:30 AM CEST, David Airlie wrote:
> On Thu, Jul 2, 2026 at 10:27 AM Danilo Krummrich <dakr@kernel.org> wrote:
>>
>> (Cc: John)
>>
>> On Wed Jul 1, 2026 at 8:17 PM CEST, Lyude Paul wrote:
>> > It turns out that the only reason our previous fixes looked like they
>> > worked for this was because we would occasionally set the Gcoff state to 0
>> > in the normal S3 path, which fixed suspend/resume on desktops - but not on
>> > machines using runtime suspend.
>> >
>> > The proper fix is to just never set this flag. Our current guess for the
>> > reasoning behind this is that Gcoff likely coincides with GC6, and not
>> > literally power off.
>>
>> I don't think GcOff coincides with GC6, it should actually be a power off.
>>
>> From a quick glance in OpenRM, it seems that with bEnteringGcoffState = 1 it
>> also saves off buffers flagged as MEMDESC_FLAGS_LOST_ON_SUSPEND.
>>
>> My guess would be that with bEnteringGcoffState = 1, GSP's resume path expects
>> certain kernel-driver-allocated buffers to still be in place that nouveau didn't
>> save off, or rather never had in the first place.
>>
>> John, do you have some details about this?
>>
>
> In nouveau we have the INST_SR_LOST target, for buffers that aren't
> preserved, I wonder did something change between 535 and 570 around
> what needs to be kept around.
The r535 code never set bEnteringGcoffState in the first place. In r535 OpenRM
seems to do the exact same thing.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state
2026-07-02 0:47 ` Danilo Krummrich
@ 2026-07-02 22:46 ` John Hubbard
0 siblings, 0 replies; 8+ messages in thread
From: John Hubbard @ 2026-07-02 22:46 UTC (permalink / raw)
To: Danilo Krummrich, David Airlie
Cc: Lyude Paul, nouveau, dri-devel, linux-kernel, stable, Timur Tabi,
Andy Shevchenko, Maarten Lankhorst, Kees Cook, Simona Vetter,
David Airlie, Thomas Zimmermann, Maxime Ripard, Mel Henning
On 7/1/26 2:47 PM, Danilo Krummrich wrote:
> On Thu Jul 2, 2026 at 2:30 AM CEST, David Airlie wrote:
>> On Thu, Jul 2, 2026 at 10:27 AM Danilo Krummrich <dakr@kernel.org> wrote:
>>>
>>> (Cc: John)
Also Cc: Aaron Plattner. I've provided answers below, but Aaron
has actual experience in debugging suspend-resume on our Linux
drivers.
These answers are the result of my moderately long session with
our best AI tools, using Open RM, GSP-RM, and Nouveau sources
as a reference. I'm not actually experienced in this suspend-resume
area, much, but this makes sense from what anecdotal things I've
seen before.
>>>
>>> On Wed Jul 1, 2026 at 8:17 PM CEST, Lyude Paul wrote:
>>>> It turns out that the only reason our previous fixes looked like they
>>>> worked for this was because we would occasionally set the Gcoff state to 0
>>>> in the normal S3 path, which fixed suspend/resume on desktops - but not on
>>>> machines using runtime suspend.
>>>>
>>>> The proper fix is to just never set this flag. Our current guess for the
>>>> reasoning behind this is that Gcoff likely coincides with GC6, and not
>>>> literally power off.
>>>
>>> I don't think GcOff coincides with GC6, it should actually be a power off.
You're right, it's the other way around from the commit message guess.
In the RM sources GC6 and GCOFF are two distinct GCx targets. GC6 keeps
video memory alive in self-refresh. GCOFF is a full power-off where
video memory content is lost, so RM copies the used framebuffer out to
sysmem before entering it, and it reports vidmem power as off while in
GCOFF. GCOFF is the power-off case, GC6 is not.
>>>
>>> From a quick glance in OpenRM, it seems that with bEnteringGcoffState = 1 it
>>> also saves off buffers flagged as MEMDESC_FLAGS_LOST_ON_SUSPEND.
That matches what I see, and it's the key point. bEnteringGcoffState is
not a GC6-versus-off selector at the FBSR layer. It becomes the
PDB_PROP_GPU_GCOFF_STATE_ENTERING property on the RM side, and that
property widens the set of allocations RM saves and restores across
suspend (memmgrAddMemNodes, through its bSaveAllRmAllocations argument).
With it set:
* RM reserved regions get saved, unless they are LOST_ON_SUSPEND.
* RM channel-context and kernel-client buffers get saved even when
they are LOST_ON_SUSPEND.
With it clear, the reserved regions are skipped and the channel and
kernel-client buffers are saved only when they are not LOST_ON_SUSPEND.
So =1 is a strict superset of =0, and it does include the
LOST_ON_SUSPEND buffers you found.
The part that matters for nouveau: in the full driver that property is
never just a standalone flag. RM sets it only when it has decided to do
a GCOFF as part of its own RTD3 policy, after it has reserved correctly
sized sysmem for the save and turned on comptag backing-store
preservation for the state unload and load. Setting the flag in the
FBSR init RPC on its own, the way nouveau does, gives GSP the wider save
and restore set without any of that surrounding GCOFF handling.
So I would adjust the guess slightly. It is not that nouveau never
saved those buffers or never had them. nouveau provides the sysmem and
GSP-RM does the copy into it. The problem is the reverse: with =1, GSP
saves and then restores buffers that were meant to be reinitialized on
resume, and it does so without the comptag and state-load handling a
real GCOFF pairs with them. So the accurate framing is "buffers that
should have been reinitialized get restored instead", not "buffers
nouveau never saved".
>>>
>>> My guess would be that with bEnteringGcoffState = 1, GSP's resume path expects
>>> certain kernel-driver-allocated buffers to still be in place that nouveau didn't
>>> save off, or rather never had in the first place.
>>>
>>> John, do you have some details about this?
>>>
>>
>> In nouveau we have the INST_SR_LOST target, for buffers that aren't
>> preserved, I wonder did something change between 535 and 570 around
>> what needs to be kept around.
>
> The r535 code never set bEnteringGcoffState in the first place. In r535 OpenRM
> seems to do the exact same thing.
The set of buffers did not change. The FBSR client ABI did. In 535
nouveau enumerates the exact VRAM regions and sends them to RM one at a
time, and it never sets the gcoff field, so the flag is a no-op on 535.
In 570 nouveau passes RM a single sysmem buffer for the whole heap and
lets GSP build the region list itself, and the gcoff flag is the only
control nouveau has over which regions GSP picks. Forcing it to 0 makes
the 570 GSP-built set match what 535 effectively saved, which is why 535
looks like it does the same thing. So 0 is the right value for how
nouveau drives suspend today. RM derives this per transition from its
RTD3 policy, and 570 setting it to 1 was the deviation, not 0.
On patch 3 (the resume state flags), I looked at that as well, and here
is what the firmware actually does with it. In the 570 GSP firmware the
resume state load already runs with GPU_STATE_FLAGS_PRESERVING |
GPU_STATE_FLAGS_PM_TRANSITION. That is set unconditionally in the resume
path, and it is gated on the bInPMTransition field of the SR init
arguments, which nouveau already sets on resume. The firmware does not
derive those flags from srInitArguments.flags. That field is read in
only one place on the resume path, an unrelated display workaround gated
on the PM_SUSPEND bit. Neither 0 nor PRESERVING | PM_TRANSITION sets
that bit. And the value the open driver itself puts in that field on a
standby or RTD3 resume is GPU_STATE_FLAGS_PM_SUSPEND, which is a PM-type
indicator, not the state-load flags.
So from the 570 sources I do not see a path by which patch 3 changes
what the firmware does on resume. That points to patches 1 and 2, the
revert plus never entering the gcoff save path, as what actually fixes
the push-buffer timeouts. Your 100-cycle RTD3 result is consistent with
that: those two are what stop GSP from doing the wide GCOFF-style save
and restore.
I want to be clear about the limits of what I checked. I confirmed the
resume-side firmware behavior against the 570 release (latest) sources
rather
than the exact 570.144 build, so I am not claiming patch 3 is provably
inert on 570.144, only that I do not see how it changes behavior. And I
have the mechanism for the =1 breakage but not the single allocation
behind the timeout. I can see that =1 restores LOST_ON_SUSPEND RM
buffers that should have been reinitialized, without the matching
state-load handling, but I have not isolated the exact buffer that
produces the failure.
My bottom line: patch 2 (=0) is correct and is the right value for how
nouveau drives suspend today, and patch 1 is needed with it. Patch 3 is
harmless, and from the sources I do not expect it to change anything on
570.144.
Assisted-by: Cursor :)
thanks,
--
John Hubbard
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2026-07-02 22:46 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-07-01 18:17 [PATCH v3 0/3] drm/nouveau/gsp/r570: Fix runtime PM Lyude Paul
2026-07-01 18:17 ` [PATCH v3 1/3] Revert "nouveau/gsp: fix suspend/resume regression on r570 firmware" Lyude Paul
2026-07-01 18:17 ` [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state Lyude Paul
2026-07-02 0:27 ` Danilo Krummrich
2026-07-02 0:30 ` David Airlie
2026-07-02 0:47 ` Danilo Krummrich
2026-07-02 22:46 ` John Hubbard
2026-07-01 18:17 ` [PATCH v3 3/3] drm/nouveau/gsp/r570: Add missing state flags to GSP resume arguments Lyude Paul
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox