From: John Hubbard <jhubbard@nvidia.com>
To: Danilo Krummrich <dakr@kernel.org>, David Airlie <airlied@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>,
nouveau@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org,
Timur Tabi <ttabi@nvidia.com>,
Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
Kees Cook <kees@kernel.org>, Simona Vetter <simona@ffwll.ch>,
David Airlie <airlied@gmail.com>,
Thomas Zimmermann <tzimmermann@suse.de>,
Maxime Ripard <mripard@kernel.org>,
Mel Henning <mhenning@darkrefraction.com>
Subject: Re: [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state
Date: Thu, 2 Jul 2026 12:46:26 -1000 [thread overview]
Message-ID: <b5d08cfe-aead-45f2-937d-6e9ef4dfea50@nvidia.com> (raw)
In-Reply-To: <DJNO6SIE8T88.1F0ZUILIRVDJC@kernel.org>
On 7/1/26 2:47 PM, Danilo Krummrich wrote:
> On Thu Jul 2, 2026 at 2:30 AM CEST, David Airlie wrote:
>> On Thu, Jul 2, 2026 at 10:27 AM Danilo Krummrich <dakr@kernel.org> wrote:
>>>
>>> (Cc: John)
Also Cc: Aaron Plattner. I've provided answers below, but Aaron
has actual experience in debugging suspend-resume on our Linux
drivers.
These answers are the result of my moderately long session with
our best AI tools, using Open RM, GSP-RM, and Nouveau sources
as a reference. I'm not actually experienced in this suspend-resume
area, much, but this makes sense from what anecdotal things I've
seen before.
>>>
>>> On Wed Jul 1, 2026 at 8:17 PM CEST, Lyude Paul wrote:
>>>> It turns out that the only reason our previous fixes looked like they
>>>> worked for this was because we would occasionally set the Gcoff state to 0
>>>> in the normal S3 path, which fixed suspend/resume on desktops - but not on
>>>> machines using runtime suspend.
>>>>
>>>> The proper fix is to just never set this flag. Our current guess for the
>>>> reasoning behind this is that Gcoff likely coincides with GC6, and not
>>>> literally power off.
>>>
>>> I don't think GcOff coincides with GC6, it should actually be a power off.
You're right, it's the other way around from the commit message guess.
In the RM sources GC6 and GCOFF are two distinct GCx targets. GC6 keeps
video memory alive in self-refresh. GCOFF is a full power-off where
video memory content is lost, so RM copies the used framebuffer out to
sysmem before entering it, and it reports vidmem power as off while in
GCOFF. GCOFF is the power-off case, GC6 is not.
>>>
>>> From a quick glance in OpenRM, it seems that with bEnteringGcoffState = 1 it
>>> also saves off buffers flagged as MEMDESC_FLAGS_LOST_ON_SUSPEND.
That matches what I see, and it's the key point. bEnteringGcoffState is
not a GC6-versus-off selector at the FBSR layer. It becomes the
PDB_PROP_GPU_GCOFF_STATE_ENTERING property on the RM side, and that
property widens the set of allocations RM saves and restores across
suspend (memmgrAddMemNodes, through its bSaveAllRmAllocations argument).
With it set:
* RM reserved regions get saved, unless they are LOST_ON_SUSPEND.
* RM channel-context and kernel-client buffers get saved even when
they are LOST_ON_SUSPEND.
With it clear, the reserved regions are skipped and the channel and
kernel-client buffers are saved only when they are not LOST_ON_SUSPEND.
So =1 is a strict superset of =0, and it does include the
LOST_ON_SUSPEND buffers you found.
The part that matters for nouveau: in the full driver that property is
never just a standalone flag. RM sets it only when it has decided to do
a GCOFF as part of its own RTD3 policy, after it has reserved correctly
sized sysmem for the save and turned on comptag backing-store
preservation for the state unload and load. Setting the flag in the
FBSR init RPC on its own, the way nouveau does, gives GSP the wider save
and restore set without any of that surrounding GCOFF handling.
So I would adjust the guess slightly. It is not that nouveau never
saved those buffers or never had them. nouveau provides the sysmem and
GSP-RM does the copy into it. The problem is the reverse: with =1, GSP
saves and then restores buffers that were meant to be reinitialized on
resume, and it does so without the comptag and state-load handling a
real GCOFF pairs with them. So the accurate framing is "buffers that
should have been reinitialized get restored instead", not "buffers
nouveau never saved".
>>>
>>> My guess would be that with bEnteringGcoffState = 1, GSP's resume path expects
>>> certain kernel-driver-allocated buffers to still be in place that nouveau didn't
>>> save off, or rather never had in the first place.
>>>
>>> John, do you have some details about this?
>>>
>>
>> In nouveau we have the INST_SR_LOST target, for buffers that aren't
>> preserved, I wonder did something change between 535 and 570 around
>> what needs to be kept around.
>
> The r535 code never set bEnteringGcoffState in the first place. In r535 OpenRM
> seems to do the exact same thing.
The set of buffers did not change. The FBSR client ABI did. In 535
nouveau enumerates the exact VRAM regions and sends them to RM one at a
time, and it never sets the gcoff field, so the flag is a no-op on 535.
In 570 nouveau passes RM a single sysmem buffer for the whole heap and
lets GSP build the region list itself, and the gcoff flag is the only
control nouveau has over which regions GSP picks. Forcing it to 0 makes
the 570 GSP-built set match what 535 effectively saved, which is why 535
looks like it does the same thing. So 0 is the right value for how
nouveau drives suspend today. RM derives this per transition from its
RTD3 policy, and 570 setting it to 1 was the deviation, not 0.
On patch 3 (the resume state flags), I looked at that as well, and here
is what the firmware actually does with it. In the 570 GSP firmware the
resume state load already runs with GPU_STATE_FLAGS_PRESERVING |
GPU_STATE_FLAGS_PM_TRANSITION. That is set unconditionally in the resume
path, and it is gated on the bInPMTransition field of the SR init
arguments, which nouveau already sets on resume. The firmware does not
derive those flags from srInitArguments.flags. That field is read in
only one place on the resume path, an unrelated display workaround gated
on the PM_SUSPEND bit. Neither 0 nor PRESERVING | PM_TRANSITION sets
that bit. And the value the open driver itself puts in that field on a
standby or RTD3 resume is GPU_STATE_FLAGS_PM_SUSPEND, which is a PM-type
indicator, not the state-load flags.
So from the 570 sources I do not see a path by which patch 3 changes
what the firmware does on resume. That points to patches 1 and 2, the
revert plus never entering the gcoff save path, as what actually fixes
the push-buffer timeouts. Your 100-cycle RTD3 result is consistent with
that: those two are what stop GSP from doing the wide GCOFF-style save
and restore.
I want to be clear about the limits of what I checked. I confirmed the
resume-side firmware behavior against the 570 release (latest) sources
rather
than the exact 570.144 build, so I am not claiming patch 3 is provably
inert on 570.144, only that I do not see how it changes behavior. And I
have the mechanism for the =1 breakage but not the single allocation
behind the timeout. I can see that =1 restores LOST_ON_SUSPEND RM
buffers that should have been reinitialized, without the matching
state-load handling, but I have not isolated the exact buffer that
produces the failure.
My bottom line: patch 2 (=0) is correct and is the right value for how
nouveau drives suspend today, and patch 1 is needed with it. Patch 3 is
harmless, and from the sources I do not expect it to change anything on
570.144.
Assisted-by: Cursor :)
thanks,
--
John Hubbard
next prev parent reply other threads:[~2026-07-02 22:46 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-07-01 18:17 [PATCH v3 0/3] drm/nouveau/gsp/r570: Fix runtime PM Lyude Paul
2026-07-01 18:17 ` [PATCH v3 1/3] Revert "nouveau/gsp: fix suspend/resume regression on r570 firmware" Lyude Paul
2026-07-01 18:17 ` [PATCH v3 2/3] drm/nouveau/gsp/r570: Never enter Gcoff state Lyude Paul
2026-07-02 0:27 ` Danilo Krummrich
2026-07-02 0:30 ` David Airlie
2026-07-02 0:47 ` Danilo Krummrich
2026-07-02 22:46 ` John Hubbard [this message]
2026-07-01 18:17 ` [PATCH v3 3/3] drm/nouveau/gsp/r570: Add missing state flags to GSP resume arguments Lyude Paul
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b5d08cfe-aead-45f2-937d-6e9ef4dfea50@nvidia.com \
--to=jhubbard@nvidia.com \
--cc=airlied@gmail.com \
--cc=airlied@redhat.com \
--cc=andriy.shevchenko@linux.intel.com \
--cc=dakr@kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=kees@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lyude@redhat.com \
--cc=maarten.lankhorst@linux.intel.com \
--cc=mhenning@darkrefraction.com \
--cc=mripard@kernel.org \
--cc=nouveau@lists.freedesktop.org \
--cc=simona@ffwll.ch \
--cc=stable@vger.kernel.org \
--cc=ttabi@nvidia.com \
--cc=tzimmermann@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox