* [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
@ 2023-08-15  1:12 ` Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Alan Previn, Rodrigo Vivi

This series is the result of debugging issues root-caused to
races between the GuC's destroyed_worker_func being triggered
and repeated suspend-resume cycles with concurrent delayed
fence signals for engine-freeing.

The reproduction steps require that an app is launched right
before the start of the suspend cycle, where it creates a
new gem context and submits a tiny workload that would
complete in the middle of the suspend cycle. However, this
app uses dma-buffer sharing or dma-fence with non-GPU
objects or signals that eventually triggers a FENCE_FREE
via __i915_sw_fence_notify that connects to engines_notify ->
free_engines_rcu -> intel_context_put ->
kref_put(&ce->ref..) that queues the worker after the GuC's
CTB has been disabled (i.e. after i915-gem's suspend-late).
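
The ordering problem described above can be modelled in plain
userspace C. This is only an illustrative sketch, not driver code:
every type and function name below is a hypothetical stand-in
(struct channel for the CTB, struct context for the context with
its ce->ref count), and the deferred worker is run by hand rather
than through a real workqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are i915 types. */
struct channel {
	bool enabled;		/* models whether the GuC CTB is usable */
};

struct context {
	int refcount;		/* models ce->ref */
	bool destroy_queued;	/* models the queued destroy worker */
	bool deregistered;	/* set once the deregistration is "sent" */
};

/* Models sending the deregistration H2G: fails once the channel is off. */
int send_deregister(struct channel *ch, struct context *ce)
{
	if (!ch->enabled)
		return -1;
	ce->deregistered = true;
	return 0;
}

/* Models the final put: it only queues deferred destruction. */
void context_put(struct context *ce)
{
	assert(ce->refcount > 0);
	if (--ce->refcount == 0)
		ce->destroy_queued = true;
}

/* Models the destroy worker running; returns 0 if nothing was queued. */
int run_destroy_worker(struct channel *ch, struct context *ce)
{
	if (!ce->destroy_queued)
		return 0;
	ce->destroy_queued = false;
	return send_deregister(ch, ce);
}

/* A late fence signal drops the last reference after suspend-late. */
int simulate_late_put(void)
{
	struct channel ch = { .enabled = true };
	struct context ce = { .refcount = 1 };

	ch.enabled = false;	/* suspend path disables the channel */
	context_put(&ce);	/* delayed fence signal: final put */
	return run_destroy_worker(&ch, &ce);	/* send fails here */
}
```

In this model the worker is queued only after the channel is
already disabled, so a flush performed earlier in the suspend
path would have found nothing to run.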

This sequence is a corner-case and required repeating this
app->suspend->resume cycle ~1500 times across 4 identical
systems to see it once. That said, based on the above call stack,
it is clear that merely flushing the context destruction worker,
which is obviously missing and needed, isn't sufficient.

Because of that, this series adds additional patches besides
the obvious (Patch #1) flushing of the worker during the
suspend flows. It also includes (Patch #2) closing a race
between sending the context-deregistration H2G vs the CTB
getting disabled in the midst of it (by detecting the failure
and unrolling the guc-lrc-unpin flow) and (Patch #3) not
infinitely waiting in intel_gt_pm_wait_timeout_for_idle
when in the suspend-flow.
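
The three ideas above (flush before the channel goes away, unroll
when the send fails, bound the idle wait) can be sketched in a
small userspace model. This is an assumption-laden illustration
and not the code in the patches: all names are hypothetical, the
"pinned" flag merely stands in for whatever state the
guc-lrc-unpin flow changes, and the bounded wait polls a flag
instead of sleeping on a wakeref.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are i915 types. */
struct ctb_model {
	bool enabled;		/* models whether the CTB is usable */
};

struct ctx_model {
	bool destroy_queued;	/* pending destroy work */
	bool pinned;		/* stand-in for the lrc pin state */
	bool deregistered;	/* set once deregistration is "sent" */
};

/* Models the deregistration H2G; fails when the channel is off. */
int try_deregister(struct ctb_model *ctb, struct ctx_model *ctx)
{
	if (!ctb->enabled)
		return -1;
	ctx->deregistered = true;
	return 0;
}

/* Patch #2 idea: detect the failed send and restore prior state. */
int destroy_or_unroll(struct ctb_model *ctb, struct ctx_model *ctx)
{
	int err;

	ctx->pinned = false;
	err = try_deregister(ctb, ctx);
	if (err)
		ctx->pinned = true;	/* unroll: leave state consistent */
	return err;
}

/* Patch #1 idea: run any pending destroy work to completion. */
void flush_destroy_work(struct ctb_model *ctb, struct ctx_model *ctx)
{
	if (ctx->destroy_queued) {
		ctx->destroy_queued = false;
		destroy_or_unroll(ctb, ctx);
	}
}

/* Patch #3 idea: give up after max_polls instead of waiting forever. */
int wait_idle_bounded(const struct ctx_model *ctx, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (ctx->deregistered)
			return 0;
		/* a real wait would sleep here between checks */
	}
	return -1;	/* timed out */
}

/* Suspend ordering in this model: flush first, then disable. */
int suspend_model(struct ctb_model *ctb, struct ctx_model *ctx,
		  int max_polls)
{
	flush_destroy_work(ctb, ctx);
	ctb->enabled = false;
	return wait_idle_bounded(ctx, max_polls);
}
```

With work already queued, suspend_model() flushes it while the
channel is still enabled and the wait returns immediately. If the
work arrives only after the channel is disabled, the send fails,
the state is unrolled, and the bounded wait times out rather than
hanging the suspend.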

Alan Previn (3):
  drm/i915/guc: Flush context destruction worker at suspend
  drm/i915/guc: Close deregister-context race against CT-loss
  drm/i915/gt: Timeout when waiting for idle in suspending

 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  7 ++-
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  7 ++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +++++++++++++++++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
 drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  2 +
 drivers/gpu/drm/i915/intel_wakeref.c          | 14 ++++--
 drivers/gpu/drm/i915/intel_wakeref.h          |  5 ++-
 8 files changed, 71 insertions(+), 13 deletions(-)


base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f
-- 
2.39.0



end of thread, other threads:[~2023-08-28 21:06 UTC | newest]

Thread overview: 29+ messages
2023-08-15  1:12 [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Alan Previn
2023-08-15  1:12 ` Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:53   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:53     ` Rodrigo Vivi
2023-08-25 18:48     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-25 18:48       ` Teres Alexis, Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:56   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:56     ` Rodrigo Vivi
2023-08-15 19:08     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-15 19:08       ` Teres Alexis, Alan Previn
2023-08-25 18:54       ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-25 18:54         ` Teres Alexis, Alan Previn
2023-08-28 21:06         ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-28 21:06           ` Teres Alexis, Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:51   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:51     ` Rodrigo Vivi
2023-08-15 19:00     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-15 19:00       ` Teres Alexis, Alan Previn
2023-08-15  1:20 ` [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Teres Alexis, Alan Previn
2023-08-15  1:20   ` Teres Alexis, Alan Previn
2023-08-15  1:56 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev2) Patchwork
2023-08-15  1:56 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-08-15  2:15 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
