From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: "Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>,
"intel-gfx@lists.freedesktop.org"
<intel-gfx@lists.freedesktop.org>
Cc: "Jana, Mousumi" <mousumi.jana@intel.com>,
"intel.com@freedesktop.org" <intel.com@freedesktop.org>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"Vivi, Rodrigo" <rodrigo.vivi@intel.com>
Subject: Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
Date: Wed, 25 Oct 2023 13:58:48 +0100 [thread overview]
Message-ID: <ace7375e-d9a8-4b6f-aa92-6360ca3bfa96@linux.intel.com> (raw)
In-Reply-To: <0c1e1e713fc46bf0783ca6e0a72a39d6671a6b57.camel@intel.com>
On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
>> On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
>>> Thanks for taking the time to review this Tvrtko, replies inline below.
> alan:snip
>
>>>>
>>>> Main concern is that we need to be sure there are no possible
>>>> ill-effects, like letting the GPU/GuC scribble on some memory we
>>>> unmapped (or will unmap), having let the suspend continue after timing
>>>> out, and not perhaps doing the forced wedge like wait_for_suspend() does
>>>> on the existing timeout path.
>>> alan: this will not happen because the held wakeref is never force-released
>>> after the timeout - so what happens is the kernel would bail the suspend.
>>
>> How does it know to fail the suspend when there is no error code
>> returned with this timeout? Maybe a stupid question.. my knowledge of
>> suspend-resume paths was not great even before I forgot it all.
> alan:Tvrtko, you and I both sir. (apologies for the tardy response yet again busy week).
> So i did trace back the gt->wakeref before i posted these patches and discovered that
> runtime get/put call, i believe that the first 'get' leads to __intel_wakeref_get_first
> which calls intel_runtime_pm_get via rpm_get helper and eventually executes a
> pm_runtime_get_sync(rpm->kdev); (hanging off i915). (ofc, there is a corresponding
> for '_put_last') - so non-first, non-last increases the counter for the gt...
> but this last miss will mean kernel knows i915 hasnt 'put' everything.
>
> alan:snip
>>>
>>> Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak,
>>> we dont get invalid guc-accesses, but without this patch, we wait forever,
>>> and with this patch, we get some messages and eventually bail the suspend.
>>
>> It is not possible to wait for lost G2H in something like
>> intel_uc_suspend() and simply declare "bad things happened" if it times
>> out there, and forcibly clean it all up? (Which would include releasing
>> all the abandoned pm refs, so this patch wouldn't be needed.)
>>
> alan: I'm not sure if intel_uc_suspend should be held up by gt-level wakeref
> check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref.
>
> As we already know, what we do know from a uc-perspective:
> - ensure the outstanding guc related workers is flushed which we didnt before
> (addressed by patch #1).
> - any further late H2G-SchedDisable is not leaking wakerefs when calling H2G
> and not realizing it failed (addressed by patch #2).
> - (we already), "forcibly clean it all up" at the end of the intel_uc_suspend
> when we do the guc reset and cleanup all guc-ids. (pre-existing upstream code)
> - we previously didnt have a coherrent guarantee that "this is the end" i.e. no
> more new request after intel_uc_suspend. I mean by code logic, we thought we did
> (thats why intel_uc_suspend ends wth a guc reset), but we now know otherwise.
> So we that fix by adding the additional rcu_barrier (also part of patch #2).
It is not clear to me from the above if that includes cleaning up the
outstanding CT replies or no. But anyway..
>
> That said, patch-3 is NOT fixing a bug in guc -its about "if we ever have
> a future racy gt-wakeref late-leak somewhere - no matter which subsystem
> took it (guc is not the only subsystem taking gt wakerefs), we at
> least don't hang forever in this code. Ofc, based on that, even without
> patch-3 i am confident the issue is resolved anyway.
> So we could just drop patch-3 is you prefer?
.. given this it does sound to me that if you are confident patch 3
isn't fixing anything today that it should be dropped.
Regards,
Tvrtko
next prev parent reply other threads:[~2023-10-25 12:59 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-26 19:05 [Intel-gfx] [PATCH v4 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 1/3] drm/i915/guc: Flush context destruction worker at suspend Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 2/3] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn
2023-10-04 6:34 ` Gupta, Anshuman
2023-10-04 16:46 ` Teres Alexis, Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending Alan Previn
2023-09-27 9:02 ` Tvrtko Ursulin
2023-09-27 16:36 ` Teres Alexis, Alan Previn
2023-09-28 12:46 ` Tvrtko Ursulin
2023-10-04 17:59 ` Teres Alexis, Alan Previn
2023-10-25 12:58 ` Tvrtko Ursulin [this message]
2023-11-13 17:57 ` Teres Alexis, Alan Previn
2023-11-14 17:27 ` Tvrtko Ursulin
2023-11-14 17:36 ` Rodrigo Vivi
2023-11-14 19:34 ` Teres Alexis, Alan Previn
2023-11-14 17:37 ` Teres Alexis, Alan Previn
2023-11-14 17:52 ` Tvrtko Ursulin
2023-11-14 19:48 ` Teres Alexis, Alan Previn
2023-11-16 10:19 ` Tvrtko Ursulin
2023-09-27 1:23 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev4) Patchwork
2023-09-27 1:37 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-09-27 13:17 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ace7375e-d9a8-4b6f-aa92-6360ca3bfa96@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=alan.previn.teres.alexis@intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=intel.com@freedesktop.org \
--cc=mousumi.jana@intel.com \
--cc=rodrigo.vivi@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox