All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: "Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>,
	"intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>
Cc: "Jana, Mousumi" <mousumi.jana@intel.com>,
	"intel.com@freedesktop.org" <intel.com@freedesktop.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"Vivi, Rodrigo" <rodrigo.vivi@intel.com>
Subject: Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
Date: Wed, 25 Oct 2023 13:58:48 +0100	[thread overview]
Message-ID: <ace7375e-d9a8-4b6f-aa92-6360ca3bfa96@linux.intel.com> (raw)
In-Reply-To: <0c1e1e713fc46bf0783ca6e0a72a39d6671a6b57.camel@intel.com>


On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
>> On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
>>> Thanks for taking the time to review this Tvrtko, replies inline below.
> alan:snip
> 
>>>>
>>>> Main concern is that we need to be sure there are no possible
>>>> ill-effects, like letting the GPU/GuC scribble on some memory we
>>>> unmapped (or will unmap), having let the suspend continue after timing
>>>> out, and not perhaps doing the forced wedge like wait_for_suspend() does
>>>> on the existing timeout path.
>>> alan: this will not happen because the held wakeref is never force-released
>>> after the timeout - so what happens is the kernel would bail the suspend.
>>
>> How does it know to fail the suspend when there is no error code
>> returned with this timeout? Maybe a stupid question.. my knowledge of
>> suspend-resume paths was not great even before I forgot it all.
> alan:Tvrtko, you and I both sir. (apologies for the tardy response yet again busy week).
> So i did trace back the gt->wakeref before i posted these patches and discovered that
> runtime get/put call, i believe that the first 'get' leads to __intel_wakeref_get_first
> which calls intel_runtime_pm_get via rpm_get helper and eventually executes a
> pm_runtime_get_sync(rpm->kdev); (hanging off i915). (ofc, there is a corresponding
> for '_put_last') - so non-first, non-last increases the counter for the gt...
> but this last miss will mean kernel knows i915 hasnt 'put' everything.
> 
> alan:snip
>>>
>>> Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak,
>>> we dont get invalid guc-accesses, but without this patch, we wait forever,
>>> and with this patch, we get some messages and eventually bail the suspend.
>>
>> It is not possible to wait for lost G2H in something like
>> intel_uc_suspend() and simply declare "bad things happened" if it times
>> out there, and forcibly clean it all up? (Which would include releasing
>> all the abandoned pm refs, so this patch wouldn't be needed.)
>>
> alan: I'm not sure if intel_uc_suspend should be held up by gt-level wakeref
> check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref.
> 
> As we already know, what we do know from a uc-perspective:
> -  ensure the outstanding guc related workers is flushed which we didnt before
> (addressed by patch #1).
> - any further late H2G-SchedDisable is not leaking wakerefs when calling H2G
> and not realizing it failed (addressed by patch #2).
> - (we already), "forcibly clean it all up" at the end of the intel_uc_suspend
> when we do the guc reset and cleanup all guc-ids. (pre-existing upstream code)
> - we previously didnt have a coherrent guarantee that "this is the end" i.e. no
> more new request after intel_uc_suspend. I mean by code logic, we thought we did
> (thats why intel_uc_suspend ends wth a guc reset), but we now know otherwise.
> So we that fix by adding the additional rcu_barrier (also part of patch #2).

It is not clear to me from the above if that includes cleaning up the 
outstanding CT replies or no. But anyway..

> 
> That said, patch-3 is NOT fixing a bug in guc -its about "if we ever have
> a future racy gt-wakeref late-leak somewhere - no matter which subsystem
> took it (guc is not the only subsystem taking gt wakerefs), we at
> least don't hang forever in this code. Ofc, based on that, even without
> patch-3 i am confident the issue is resolved anyway.
> So we could just drop patch-3 is you prefer?

.. given this it does sound to me that if you are confident patch 3 
isn't fixing anything today that it should be dropped.

Regards,

Tvrtko

  reply	other threads:[~2023-10-25 12:59 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-26 19:05 [Intel-gfx] [PATCH v4 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Alan Previn
2023-09-26 19:05 ` Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 1/3] drm/i915/guc: Flush context destruction worker at suspend Alan Previn
2023-09-26 19:05   ` Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 2/3] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn
2023-09-26 19:05   ` Alan Previn
2023-10-04  6:34   ` [Intel-gfx] " Gupta, Anshuman
2023-10-04  6:34     ` Gupta, Anshuman
2023-10-04 16:46     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-10-04 16:46       ` Teres Alexis, Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending Alan Previn
2023-09-26 19:05   ` Alan Previn
2023-09-27  9:02   ` [Intel-gfx] " Tvrtko Ursulin
2023-09-27 16:36     ` Teres Alexis, Alan Previn
2023-09-28 12:46       ` Tvrtko Ursulin
2023-10-04 17:59         ` Teres Alexis, Alan Previn
2023-10-25 12:58           ` Tvrtko Ursulin [this message]
2023-11-13 17:57             ` Teres Alexis, Alan Previn
2023-11-14 17:27               ` Tvrtko Ursulin
2023-11-14 17:36                 ` Rodrigo Vivi
2023-11-14 19:34                   ` Teres Alexis, Alan Previn
2023-11-14 17:37                 ` Teres Alexis, Alan Previn
2023-11-14 17:52                   ` Tvrtko Ursulin
2023-11-14 19:48                     ` Teres Alexis, Alan Previn
2023-11-16 10:19                       ` Tvrtko Ursulin
2023-09-27  1:23 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev4) Patchwork
2023-09-27  1:37 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-09-27 13:17 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ace7375e-d9a8-4b6f-aa92-6360ca3bfa96@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=alan.previn.teres.alexis@intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel.com@freedesktop.org \
    --cc=mousumi.jana@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.