Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>
To: "tvrtko.ursulin@linux.intel.com" <tvrtko.ursulin@linux.intel.com>,
	"intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>
Cc: "Jana, Mousumi" <mousumi.jana@intel.com>,
	"intel.com@freedesktop.org" <intel.com@freedesktop.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"Vivi, Rodrigo" <rodrigo.vivi@intel.com>
Subject: Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
Date: Wed, 27 Sep 2023 16:36:21 +0000	[thread overview]
Message-ID: <123edf6b37aa982de20279d64c213156a2dc8c2e.camel@intel.com> (raw)
In-Reply-To: <9ca17c5c-7bb4-ff6b-69cb-3983299729c1@linux.intel.com>

Thanks for taking the time to review this Tvrtko, replies inline below.

On Wed, 2023-09-27 at 10:02 +0100, Tvrtko Ursulin wrote:
> On 26/09/2023 20:05, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > indicative of a bug elsewhere in the driver),
> > driver will at least complete the suspend-resume
> > cycle, (albeit not hitting all the targets for
> > low power hw counters), instead of hanging in the kernel.
alan:snip

> >   {
> > +	int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 10000;
> 
> CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT is in ns so assigning it to _ms is 
> a bit to arbitrary.
> 
> Why not the existing I915_GT_SUSPEND_IDLE_TIMEOUT for instance?
alan: good catch, my bad. However I915_GT_SUSPEND_IDLE_TIMEOUT is only half a sec
which i feel is too quick. I guess i could create a new #define or a multiple
of I915_GT_SUSPEND_IDLE_TIMEOUT?

> >   	/*
> >   	 * On rare occasions, we've observed the fence completion trigger
> >   	 * free_engines asynchronously via rcu_call. Ensure those are done.
> > @@ -308,7 +309,10 @@ static void wait_for_suspend(struct intel_gt *gt)
> >   		intel_gt_retire_requests(gt);
> >   	}
> >   
> > -	intel_gt_pm_wait_for_idle(gt);
> > +	/* we are suspending, so we shouldn't be waiting forever */
> > +	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIMEDOUT)
> > +		gt_warn(gt, "bailing from %s after %d milisec timeout\n",
> > +			__func__, timeout_ms);
> 
> Does the timeout in intel_gt_pm_wait_timeout_for_idle always comes in 
> pair with the timeout first in intel_gt_wait_for_idle?
alan: Not necessarily, ... IIRC in nearly all cases, the first wait call
complete in time (i.e. no more ongoing work) and the second wait
does wait only if the last bit of work 'just-finished', then that second
wait may take a touch bit longer because of possible async gt-pm-put calls.
(which appear in several places in the driver as part of regular runtime
operation including request completion). NOTE: this not high end workloads
so the
> 
> Also, is the timeout here hit from the intel_gt_suspend_prepare, 
> intel_gt_suspend_late, or can be both?
> 
> Main concern is that we need to be sure there are no possible 
> ill-effects, like letting the GPU/GuC scribble on some memory we 
> unmapped (or will unmap), having let the suspend continue after timing 
> out, and not perhaps doing the forced wedge like wait_for_suspend() does 
> on the existing timeout path.
alan: this will not happen because the held wakeref is never force-released
after the timeout - so what happens is the kernel would bail the suspend.
> 
> Would it be possible to handle the lost G2H events directly in the 
> respective component instead of here? Like apply the timeout during the 
> step which explicitly idles the CT for suspend (presumably that 
> exists?), and so cleanup from there once declared a lost event.
alan: So yes, we don't solve the problem with this patch - Patch#2
is addressing the root cause - and verification is still ongoing - because its
hard to thoroughly test / reproduce. (i.e. Patch 2 could get re-rev'd).

Intent of this patch, is to be informed that gt-pm wait failed in prep for
suspending and kernel will eventually bail the suspend, that's all.
Because without this timed-out version of gt-pm-wait, if we did have a leaky
gt-wakeref, kernel will hang here forever and we will need to get serial
console or ramoops to eventually discover the kernel's cpu hung error with
call-stack pointing to this location. So the goal here is to help speed up
future debug process (if let's say some future code change with another
async gt-pm-put came along and caused new problems after Patch #2 resolved
this time).

Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak,
we dont get invalid guc-accesses, but without this patch, we wait forever,
and with this patch, we get some messages and eventually bail the suspend.

alan:snip


  reply	other threads:[~2023-09-27 16:38 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-26 19:05 [Intel-gfx] [PATCH v4 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 1/3] drm/i915/guc: Flush context destruction worker at suspend Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 2/3] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn
2023-10-04  6:34   ` Gupta, Anshuman
2023-10-04 16:46     ` Teres Alexis, Alan Previn
2023-09-26 19:05 ` [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending Alan Previn
2023-09-27  9:02   ` Tvrtko Ursulin
2023-09-27 16:36     ` Teres Alexis, Alan Previn [this message]
2023-09-28 12:46       ` Tvrtko Ursulin
2023-10-04 17:59         ` Teres Alexis, Alan Previn
2023-10-25 12:58           ` Tvrtko Ursulin
2023-11-13 17:57             ` Teres Alexis, Alan Previn
2023-11-14 17:27               ` Tvrtko Ursulin
2023-11-14 17:36                 ` Rodrigo Vivi
2023-11-14 19:34                   ` Teres Alexis, Alan Previn
2023-11-14 17:37                 ` Teres Alexis, Alan Previn
2023-11-14 17:52                   ` Tvrtko Ursulin
2023-11-14 19:48                     ` Teres Alexis, Alan Previn
2023-11-16 10:19                       ` Tvrtko Ursulin
2023-09-27  1:23 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev4) Patchwork
2023-09-27  1:37 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-09-27 13:17 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=123edf6b37aa982de20279d64c213156a2dc8c2e.camel@intel.com \
    --to=alan.previn.teres.alexis@intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel.com@freedesktop.org \
    --cc=mousumi.jana@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox