From: Alan Previn <alan.previn.teres.alexis@intel.com>
To: intel-gfx@lists.freedesktop.org
Cc: , Alan Previn <alan.previn.teres.alexis@intel.com>,
dri-devel@lists.freedesktop.org,
Rodrigo Vivi <rodrigo.vivi@intel.com>,
intel.com@freedesktop.org
Subject: [Intel-gfx] [PATCH v3 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
Date: Sat, 9 Sep 2023 20:58:43 -0700 [thread overview]
Message-ID: <20230910035846.493766-1-alan.previn.teres.alexis@intel.com> (raw)
This series is the result of debugging issues root caused to
races between the GuC's destroyed_worker_func being triggered
vs repeating suspend-resume cycles with concurrent delayed
fence signals for engine-freeing.
The reproduction steps require that an app is launched right
before the start of the suspend cycle where it creates a
new gem context and submits a tiny workload that would
complete in the middle of the suspend cycle. However this
app uses dma-buffer sharing or dma-fence with non-GPU
objects or signals that eventually triggers a FENCE_FREE
via__i915_sw_fence_notify that connects to engines_notify ->
free_engines_rcu -> intel_context_put ->
kref_put(&ce->ref..) that queues the worker after the GuCs
CTB has been disabled (i.e. after i915-gem's suspend-late).
This sequence is a corner-case and required repeating this
app->suspend->resume cycle ~1500 times across 4 identical
systems to see it once. That said, based on above callstack,
it is clear that merely flushing the context destruction worker,
which is obviously missing and needed, isn't sufficient.
Because of that, this series adds additional patches besides
the obvious (Patch #1) flushing of the worker during the
suspend flows. It also includes (Patch #2) closing a race
between sending the context-deregistration H2G vs the CTB
getting disabled in the midst of it (by detecing the failure
and unrolling the guc-lrc-unpin flow) and (Patch #32) not
infinitely waiting in intel_gt_pm_wait_timeout_for_idle
when in the suspend-flow.
NOTE: We are still observing one more wakeref leak from gt
but not necessarilty guc. We are still debugging this so
this series will very likely get rev'd up again.
Changes from prior revs:
v2: - Patch #2 Restructure code in guc_lrc_desc_unpin so
it's more readible to differentiate (1)direct guc-id
cleanup ..vs (2) sending the H2G ctx-destroy action ..
vs (3) the unrolling steps if the H2G fails.
- Patch #2 Add a check to close the race sooner by checking
for intel_guc_is_ready from destroyed_worker_func.
- Patch #2 When guc_submission_send_busy_loop gets a
failure from intel_guc_send_busy_loop, we need to undo
i.e. decrement the outstanding_submission_g2h.
- Patch #3 In wait_for_suspend, fix checking of return from
intel_gt_pm_wait_timeout_for_idle to now use -ETIMEDOUT
and add documentation for intel_wakeref_wait_for_idle.
(Rodrigo).
Alan Previn (3):
drm/i915/guc: Flush context destruction worker at suspend
drm/i915/guc: Close deregister-context race against CT-loss
drm/i915/gt: Timeout when waiting for idle in suspending
drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +-
drivers/gpu/drm/i915/gt/intel_gt_pm.c | 7 +-
drivers/gpu/drm/i915/gt/intel_gt_pm.h | 7 +-
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 81 ++++++++++++++++---
.../gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 +
drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +
drivers/gpu/drm/i915/intel_wakeref.c | 14 +++-
drivers/gpu/drm/i915/intel_wakeref.h | 6 +-
8 files changed, 101 insertions(+), 20 deletions(-)
base-commit: f8d21cb17a99b75862196036bb4bb93ee9637b74
--
2.39.0
next reply other threads:[~2023-09-10 3:58 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-10 3:58 Alan Previn [this message]
2023-09-10 3:58 ` [Intel-gfx] [PATCH v3 1/3] drm/i915/guc: Flush context destruction worker at suspend Alan Previn
2023-09-14 15:35 ` Rodrigo Vivi
2023-09-22 16:54 ` Teres Alexis, Alan Previn
2023-09-10 3:58 ` [Intel-gfx] [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn
2023-09-14 15:34 ` Rodrigo Vivi
2023-09-22 18:02 ` Teres Alexis, Alan Previn
2023-09-23 4:00 ` Gupta, Anshuman
2023-09-26 17:19 ` Teres Alexis, Alan Previn
2023-09-26 18:21 ` Teres Alexis, Alan Previn
2023-09-10 3:58 ` [Intel-gfx] [PATCH v3 3/3] drm/i915/gt: Timeout when waiting for idle in suspending Alan Previn
2023-09-10 4:27 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev3) Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230910035846.493766-1-alan.previn.teres.alexis@intel.com \
--to=alan.previn.teres.alexis@intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=intel.com@freedesktop.org \
--cc=rodrigo.vivi@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).