[Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
@ 2023-08-15  1:12 ` Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Alan Previn, Rodrigo Vivi

This series is the result of debugging issues root caused to
races between the GuC's destroyed_worker_func being triggered
vs repeating suspend-resume cycles with concurrent delayed
fence signals for engine-freeing.

The reproduction steps require that an app is launched right
before the start of the suspend cycle where it creates a
new gem context and submits a tiny workload that would
complete in the middle of the suspend cycle. However this
app uses dma-buffer sharing or dma-fence with non-GPU
objects or signals that eventually triggers a FENCE_FREE
via__i915_sw_fence_notify that connects to engines_notify ->
free_engines_rcu -> intel_context_put ->
kref_put(&ce->ref..) that queues the worker after the GuCs
CTB has been disabled (i.e. after i915-gem's suspend-late).

This sequence is a corner-case and required repeating this
app->suspend->resume cycle ~1500 times across 4 identical
systems to see it once. That said, based on above callstack,
it is clear that merely flushing the context destruction worker,
which is obviously missing and needed, isn't sufficient.

Because of that, this series adds additional patches besides
the obvious (Patch #1) flushing of the worker during the
suspend flows. It also includes (Patch #2) closing a race
between sending the context-deregistration H2G vs the CTB
getting disabled in the midst of it (by detecing the failure
and unrolling the guc-lrc-unpin flow) and (Patch #32) not
infinitely waiting in intel_gt_pm_wait_timeout_for_idle
when in the suspend-flow.

Alan Previn (3):
  drm/i915/guc: Flush context destruction worker at suspend
  drm/i915/guc: Close deregister-context race against CT-loss
  drm/i915/gt: Timeout when waiting for idle in suspending

 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  7 ++-
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  7 ++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +++++++++++++++++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
 drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  2 +
 drivers/gpu/drm/i915/intel_wakeref.c          | 14 ++++--
 drivers/gpu/drm/i915/intel_wakeref.h          |  5 ++-
 8 files changed, 71 insertions(+), 13 deletions(-)

base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f
-- 
2.39.0

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
@ 2023-08-15  1:12 ` Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx
  Cc: John Harrison, Daniele Ceraolo Spurio, dri-devel, Alan Previn,
	Rodrigo Vivi

This series is the result of debugging issues root caused to
races between the GuC's destroyed_worker_func being triggered
vs repeating suspend-resume cycles with concurrent delayed
fence signals for engine-freeing.

The reproduction steps require that an app is launched right
before the start of the suspend cycle where it creates a
new gem context and submits a tiny workload that would
complete in the middle of the suspend cycle. However this
app uses dma-buffer sharing or dma-fence with non-GPU
objects or signals that eventually triggers a FENCE_FREE
via__i915_sw_fence_notify that connects to engines_notify ->
free_engines_rcu -> intel_context_put ->
kref_put(&ce->ref..) that queues the worker after the GuCs
CTB has been disabled (i.e. after i915-gem's suspend-late).

This sequence is a corner-case and required repeating this
app->suspend->resume cycle ~1500 times across 4 identical
systems to see it once. That said, based on above callstack,
it is clear that merely flushing the context destruction worker,
which is obviously missing and needed, isn't sufficient.

Because of that, this series adds additional patches besides
the obvious (Patch #1) flushing of the worker during the
suspend flows. It also includes (Patch #2) closing a race
between sending the context-deregistration H2G vs the CTB
getting disabled in the midst of it (by detecing the failure
and unrolling the guc-lrc-unpin flow) and (Patch #32) not
infinitely waiting in intel_gt_pm_wait_timeout_for_idle
when in the suspend-flow.

Alan Previn (3):
  drm/i915/guc: Flush context destruction worker at suspend
  drm/i915/guc: Close deregister-context race against CT-loss
  drm/i915/gt: Timeout when waiting for idle in suspending

 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  7 ++-
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  7 ++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +++++++++++++++++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
 drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  2 +
 drivers/gpu/drm/i915/intel_wakeref.c          | 14 ++++--
 drivers/gpu/drm/i915/intel_wakeref.h          |  5 ++-
 8 files changed, 71 insertions(+), 13 deletions(-)

base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f
-- 
2.39.0

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend
  2023-08-15  1:12 ` Alan Previn
@ 2023-08-15  1:12   ` Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Alan Previn, Rodrigo Vivi

When suspending, flush the context-guc-id
deregistration worker at the final stages of
intel_gt_suspend_late when we finally call gt_sanitize
that eventually leads down to __uc_sanitize so that
the deregistration worker doesn't fire off later as
we reset the GuC microcontroller.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 ++
 drivers/gpu/drm/i915/gt/uc/intel_uc.c             | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..050572bb8dbe 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1578,6 +1578,11 @@ static void guc_flush_submissions(struct intel_guc *guc)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+void intel_guc_submission_flush_work(struct intel_guc *guc)
+{
+	flush_work(&guc->submission_state.destroyed_worker);
+}
+
 static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
index c57b29cdb1a6..b6df75622d3b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
@@ -38,6 +38,8 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
 				   bool interruptible,
 				   long timeout);
 
+void intel_guc_submission_flush_work(struct intel_guc *guc);
+
 static inline bool intel_guc_submission_is_supported(struct intel_guc *guc)
 {
 	return guc->submission_supported;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
index 98b103375b7a..eb3554cb5ea4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
@@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
 		return;
 	}
 
+	intel_guc_submission_flush_work(guc);
+
 	with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) {
 		err = intel_guc_suspend(guc);
 		if (err)
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend
@ 2023-08-15  1:12   ` Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx
  Cc: John Harrison, Daniele Ceraolo Spurio, dri-devel, Alan Previn,
	Rodrigo Vivi

When suspending, flush the context-guc-id
deregistration worker at the final stages of
intel_gt_suspend_late when we finally call gt_sanitize
that eventually leads down to __uc_sanitize so that
the deregistration worker doesn't fire off later as
we reset the GuC microcontroller.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 ++
 drivers/gpu/drm/i915/gt/uc/intel_uc.c             | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..050572bb8dbe 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1578,6 +1578,11 @@ static void guc_flush_submissions(struct intel_guc *guc)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+void intel_guc_submission_flush_work(struct intel_guc *guc)
+{
+	flush_work(&guc->submission_state.destroyed_worker);
+}
+
 static void guc_flush_destroyed_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
index c57b29cdb1a6..b6df75622d3b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
@@ -38,6 +38,8 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
 				   bool interruptible,
 				   long timeout);
 
+void intel_guc_submission_flush_work(struct intel_guc *guc);
+
 static inline bool intel_guc_submission_is_supported(struct intel_guc *guc)
 {
 	return guc->submission_supported;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
index 98b103375b7a..eb3554cb5ea4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
@@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
 		return;
 	}
 
+	intel_guc_submission_flush_work(guc);
+
 	with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) {
 		err = intel_guc_suspend(guc);
 		if (err)
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
  2023-08-15  1:12 ` Alan Previn
@ 2023-08-15  1:12   ` Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Alan Previn, Rodrigo Vivi

If we are at the end of suspend or very early in resume
its possible an async fence signal could lead us to the
execution of the context destruction worker (after the
prior worker flush).

Even if checking that the CT is enabled before calling
destroyed_worker_func, guc_lrc_desc_unpin may still fail
because in corner cases, as we traverse the GuC's
context-destroy list, the CT could get disabled in the mid
of it right before calling the GuC's CT send function.

We've witnessed this race condition once every ~6000-8000
suspend-resume cycles while ensuring workloads that render
something onscreen is continuously started just before
we suspend (and the workload is small enough to complete
and trigger the queued engine/context free-up either very
late in suspend or very early in resume).

In such a case, we need to unroll the unpin process because
guc-lrc-unpin takes a gt wakeref which only gets released in
the G2H IRQ reply that never comes through in this corner
case. That will cascade into a kernel hang later at the tail
end of suspend in this function:

   intel_wakeref_wait_for_idle(&gt->wakeref)
   (called by) - intel_gt_pm_wait_for_idle
   (called by) - wait_for_suspend

Doing this unroll and keeping the context in the GuC's
destroy-list will allow the context to get picked up on
the next destroy worker invocation or purged as part of a
major GuC sanitization or reset flow.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 40 +++++++++++++++++--
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 050572bb8dbe..ddb4ee4c4e51 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -235,6 +235,13 @@ set_context_destroyed(struct intel_context *ce)
 	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
 }
 
+static inline void
+clr_context_destroyed(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
+}
+
 static inline bool context_pending_disable(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
@@ -3175,7 +3182,7 @@ static void guc_context_close(struct intel_context *ce)
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 }
 
-static inline void guc_lrc_desc_unpin(struct intel_context *ce)
+static inline int guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 	struct intel_gt *gt = guc_to_gt(guc);
@@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
-		return;
+		return 0;
+	}
+
+	if (deregister_context(ce, ce->guc_id.id)) {
+		/* Seal race with concurrent suspend by unrolling */
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		set_context_registered(ce);
+		clr_context_destroyed(ce);
+		intel_gt_pm_put(gt);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		return -ENODEV;
 	}
 
-	deregister_context(ce, ce->guc_id.id);
+	return 0;
 }
 
 static void __guc_context_destroy(struct intel_context *ce)
@@ -3270,7 +3287,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
 		if (!ce)
 			break;
 
-		guc_lrc_desc_unpin(ce);
+		if (guc_lrc_desc_unpin(ce)) {
+			/*
+			 * This means GuC's CT link severed mid-way which could happen
+			 * in suspend-resume corner cases. In this case, put the
+			 * context back into the destroyed_contexts list which will
+			 * get picked up on the next context deregistration event or
+			 * purged in a GuC sanitization event (reset/unload/wedged/...).
+			 */
+			spin_lock_irqsave(&guc->submission_state.lock, flags);
+			list_add_tail(&ce->destroyed_link,
+				      &guc->submission_state.destroyed_contexts);
+			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+			/* Bail now since the list might never be emptied if h2gs fail */
+			break;
+		}
+
 	}
 }
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
@ 2023-08-15  1:12   ` Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx
  Cc: John Harrison, Daniele Ceraolo Spurio, dri-devel, Alan Previn,
	Rodrigo Vivi

If we are at the end of suspend or very early in resume
its possible an async fence signal could lead us to the
execution of the context destruction worker (after the
prior worker flush).

Even if checking that the CT is enabled before calling
destroyed_worker_func, guc_lrc_desc_unpin may still fail
because in corner cases, as we traverse the GuC's
context-destroy list, the CT could get disabled in the mid
of it right before calling the GuC's CT send function.

We've witnessed this race condition once every ~6000-8000
suspend-resume cycles while ensuring workloads that render
something onscreen is continuously started just before
we suspend (and the workload is small enough to complete
and trigger the queued engine/context free-up either very
late in suspend or very early in resume).

In such a case, we need to unroll the unpin process because
guc-lrc-unpin takes a gt wakeref which only gets released in
the G2H IRQ reply that never comes through in this corner
case. That will cascade into a kernel hang later at the tail
end of suspend in this function:

   intel_wakeref_wait_for_idle(&gt->wakeref)
   (called by) - intel_gt_pm_wait_for_idle
   (called by) - wait_for_suspend

Doing this unroll and keeping the context in the GuC's
destroy-list will allow the context to get picked up on
the next destroy worker invocation or purged as part of a
major GuC sanitization or reset flow.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 40 +++++++++++++++++--
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 050572bb8dbe..ddb4ee4c4e51 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -235,6 +235,13 @@ set_context_destroyed(struct intel_context *ce)
 	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
 }
 
+static inline void
+clr_context_destroyed(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
+}
+
 static inline bool context_pending_disable(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
@@ -3175,7 +3182,7 @@ static void guc_context_close(struct intel_context *ce)
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 }
 
-static inline void guc_lrc_desc_unpin(struct intel_context *ce)
+static inline int guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 	struct intel_gt *gt = guc_to_gt(guc);
@@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
-		return;
+		return 0;
+	}
+
+	if (deregister_context(ce, ce->guc_id.id)) {
+		/* Seal race with concurrent suspend by unrolling */
+		spin_lock_irqsave(&ce->guc_state.lock, flags);
+		set_context_registered(ce);
+		clr_context_destroyed(ce);
+		intel_gt_pm_put(gt);
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		return -ENODEV;
 	}
 
-	deregister_context(ce, ce->guc_id.id);
+	return 0;
 }
 
 static void __guc_context_destroy(struct intel_context *ce)
@@ -3270,7 +3287,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
 		if (!ce)
 			break;
 
-		guc_lrc_desc_unpin(ce);
+		if (guc_lrc_desc_unpin(ce)) {
+			/*
+			 * This means GuC's CT link severed mid-way which could happen
+			 * in suspend-resume corner cases. In this case, put the
+			 * context back into the destroyed_contexts list which will
+			 * get picked up on the next context deregistration event or
+			 * purged in a GuC sanitization event (reset/unload/wedged/...).
+			 */
+			spin_lock_irqsave(&guc->submission_state.lock, flags);
+			list_add_tail(&ce->destroyed_link,
+				      &guc->submission_state.destroyed_contexts);
+			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+			/* Bail now since the list might never be emptied if h2gs fail */
+			break;
+		}
+
 	}
 }
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [Intel-gfx] [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
  2023-08-15  1:12 ` Alan Previn
@ 2023-08-15  1:12   ` Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Alan Previn, Rodrigo Vivi

When suspending, add a timeout when calling
intel_gt_pm_wait_for_idle else if we have a lost
G2H event that holds a wakeref (which would be
indicative of a bug elsewhere in the driver),
driver will at least complete the suspend-resume
cycle, (albeit not hitting all the targets for
low power hw counters), instead of hanging in the kernel.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c     |  7 ++++++-
 drivers/gpu/drm/i915/gt/intel_gt_pm.h     |  7 ++++++-
 drivers/gpu/drm/i915/intel_wakeref.c      | 14 ++++++++++----
 drivers/gpu/drm/i915/intel_wakeref.h      |  5 +++--
 5 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index ee15486fed0d..090438eb8682 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -688,7 +688,7 @@ void intel_engines_release(struct intel_gt *gt)
 		if (!engine->release)
 			continue;
 
-		intel_wakeref_wait_for_idle(&engine->wakeref);
+		intel_wakeref_wait_for_idle(&engine->wakeref, 0);
 		GEM_BUG_ON(intel_engine_pm_is_awake(engine));
 
 		engine->release(engine);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 5a942af0a14e..e8b006c3ef29 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -289,6 +289,8 @@ int intel_gt_resume(struct intel_gt *gt)
 
 static void wait_for_suspend(struct intel_gt *gt)
 {
+	int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 10000;
+
 	if (!intel_gt_pm_is_awake(gt))
 		return;
 
@@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
 		intel_gt_retire_requests(gt);
 	}
 
-	intel_gt_pm_wait_for_idle(gt);
+	/* we are suspending, so we shouldn't be waiting forever */
+	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)
+		gt_warn(gt, "bailing from %s after %d milisec timeout\n",
+			__func__, timeout_ms);
 }
 
 void intel_gt_suspend_prepare(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 6c9a46452364..5358acc2b5b1 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -68,7 +68,12 @@ static inline void intel_gt_pm_might_put(struct intel_gt *gt)
 
 static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
 {
-	return intel_wakeref_wait_for_idle(&gt->wakeref);
+	return intel_wakeref_wait_for_idle(&gt->wakeref, 0);
+}
+
+static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, int timeout_ms)
+{
+	return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
 }
 
 void intel_gt_pm_init_early(struct intel_gt *gt);
diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
index 718f2f1b6174..383a37521415 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
@@ -111,14 +111,20 @@ void __intel_wakeref_init(struct intel_wakeref *wf,
 			 "wakeref.work", &key->work, 0);
 }
 
-int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
+int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
 {
-	int err;
+	int err = 0;
 
 	might_sleep();
 
-	err = wait_var_event_killable(&wf->wakeref,
-				      !intel_wakeref_is_active(wf));
+	if (!timeout_ms)
+		err = wait_var_event_killable(&wf->wakeref,
+					      !intel_wakeref_is_active(wf));
+	else if (wait_var_event_timeout(&wf->wakeref,
+					!intel_wakeref_is_active(wf),
+					msecs_to_jiffies(timeout_ms)) < 1)
+		err = -ETIMEDOUT;
+
 	if (err)
 		return err;
 
diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
index ec881b097368..6fbb7a2fb6ea 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.h
+++ b/drivers/gpu/drm/i915/intel_wakeref.h
@@ -251,15 +251,16 @@ __intel_wakeref_defer_park(struct intel_wakeref *wf)
 /**
  * intel_wakeref_wait_for_idle: Wait until the wakeref is idle
  * @wf: the wakeref
+ * @timeout_ms: timeout to wait in milisecs, zero means forever
  *
  * Wait for the earlier asynchronous release of the wakeref. Note
  * this will wait for any third party as well, so make sure you only wait
  * when you have control over the wakeref and trust no one else is acquiring
  * it.
  *
- * Return: 0 on success, error code if killed.
+ * Return: 0 on success, error code if killed, -ETIME if timed-out.
  */
-int intel_wakeref_wait_for_idle(struct intel_wakeref *wf);
+int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms);
 
 struct intel_wakeref_auto {
 	struct drm_i915_private *i915;
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
@ 2023-08-15  1:12   ` Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Previn @ 2023-08-15  1:12 UTC (permalink / raw)
  To: intel-gfx
  Cc: John Harrison, Daniele Ceraolo Spurio, dri-devel, Alan Previn,
	Rodrigo Vivi

When suspending, add a timeout when calling
intel_gt_pm_wait_for_idle else if we have a lost
G2H event that holds a wakeref (which would be
indicative of a bug elsewhere in the driver),
driver will at least complete the suspend-resume
cycle, (albeit not hitting all the targets for
low power hw counters), instead of hanging in the kernel.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c     |  7 ++++++-
 drivers/gpu/drm/i915/gt/intel_gt_pm.h     |  7 ++++++-
 drivers/gpu/drm/i915/intel_wakeref.c      | 14 ++++++++++----
 drivers/gpu/drm/i915/intel_wakeref.h      |  5 +++--
 5 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index ee15486fed0d..090438eb8682 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -688,7 +688,7 @@ void intel_engines_release(struct intel_gt *gt)
 		if (!engine->release)
 			continue;
 
-		intel_wakeref_wait_for_idle(&engine->wakeref);
+		intel_wakeref_wait_for_idle(&engine->wakeref, 0);
 		GEM_BUG_ON(intel_engine_pm_is_awake(engine));
 
 		engine->release(engine);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 5a942af0a14e..e8b006c3ef29 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -289,6 +289,8 @@ int intel_gt_resume(struct intel_gt *gt)
 
 static void wait_for_suspend(struct intel_gt *gt)
 {
+	int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 10000;
+
 	if (!intel_gt_pm_is_awake(gt))
 		return;
 
@@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
 		intel_gt_retire_requests(gt);
 	}
 
-	intel_gt_pm_wait_for_idle(gt);
+	/* we are suspending, so we shouldn't be waiting forever */
+	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)
+		gt_warn(gt, "bailing from %s after %d milisec timeout\n",
+			__func__, timeout_ms);
 }
 
 void intel_gt_suspend_prepare(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 6c9a46452364..5358acc2b5b1 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -68,7 +68,12 @@ static inline void intel_gt_pm_might_put(struct intel_gt *gt)
 
 static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
 {
-	return intel_wakeref_wait_for_idle(&gt->wakeref);
+	return intel_wakeref_wait_for_idle(&gt->wakeref, 0);
+}
+
+static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, int timeout_ms)
+{
+	return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
 }
 
 void intel_gt_pm_init_early(struct intel_gt *gt);
diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
index 718f2f1b6174..383a37521415 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
@@ -111,14 +111,20 @@ void __intel_wakeref_init(struct intel_wakeref *wf,
 			 "wakeref.work", &key->work, 0);
 }
 
-int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
+int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
 {
-	int err;
+	int err = 0;
 
 	might_sleep();
 
-	err = wait_var_event_killable(&wf->wakeref,
-				      !intel_wakeref_is_active(wf));
+	if (!timeout_ms)
+		err = wait_var_event_killable(&wf->wakeref,
+					      !intel_wakeref_is_active(wf));
+	else if (wait_var_event_timeout(&wf->wakeref,
+					!intel_wakeref_is_active(wf),
+					msecs_to_jiffies(timeout_ms)) < 1)
+		err = -ETIMEDOUT;
+
 	if (err)
 		return err;
 
diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
index ec881b097368..6fbb7a2fb6ea 100644
--- a/drivers/gpu/drm/i915/intel_wakeref.h
+++ b/drivers/gpu/drm/i915/intel_wakeref.h
@@ -251,15 +251,16 @@ __intel_wakeref_defer_park(struct intel_wakeref *wf)
 /**
  * intel_wakeref_wait_for_idle: Wait until the wakeref is idle
  * @wf: the wakeref
+ * @timeout_ms: timeout to wait in milisecs, zero means forever
  *
  * Wait for the earlier asynchronous release of the wakeref. Note
  * this will wait for any third party as well, so make sure you only wait
  * when you have control over the wakeref and trust no one else is acquiring
  * it.
  *
- * Return: 0 on success, error code if killed.
+ * Return: 0 on success, error code if killed, -ETIME if timed-out.
  */
-int intel_wakeref_wait_for_idle(struct intel_wakeref *wf);
+int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms);
 
 struct intel_wakeref_auto {
 	struct drm_i915_private *i915;
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
  2023-08-15  1:12 ` Alan Previn
@ 2023-08-15  1:20   ` Teres Alexis, Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-15  1:20 UTC (permalink / raw)
  To: intel-gfx@lists.freedesktop.org
  Cc: dri-devel@lists.freedesktop.org, Vivi, Rodrigo

On Mon, 2023-08-14 at 18:12 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
alan: forgot credit:
Tested-by: Mousumi Jana <mousumi.jana@intel.com>

alan:snip.
> 
> 
> Alan Previn (3):
>   drm/i915/guc: Flush context destruction worker at suspend
>   drm/i915/guc: Close deregister-context race against CT-loss
>   drm/i915/gt: Timeout when waiting for idle in suspending
> 
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  2 +-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  7 ++-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  7 ++-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +++++++++++++++++--
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  2 +
>  drivers/gpu/drm/i915/intel_wakeref.c          | 14 ++++--
>  drivers/gpu/drm/i915/intel_wakeref.h          |  5 ++-
>  8 files changed, 71 insertions(+), 13 deletions(-)
> 
> 
> base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
@ 2023-08-15  1:20   ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-15  1:20 UTC (permalink / raw)
  To: intel-gfx@lists.freedesktop.org
  Cc: Ceraolo Spurio, Daniele, Harrison, John C,
	dri-devel@lists.freedesktop.org, Vivi, Rodrigo

On Mon, 2023-08-14 at 18:12 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
alan: forgot credit:
Tested-by: Mousumi Jana <mousumi.jana@intel.com>

alan:snip.
> 
> 
> Alan Previn (3):
>   drm/i915/guc: Flush context destruction worker at suspend
>   drm/i915/guc: Close deregister-context race against CT-loss
>   drm/i915/gt: Timeout when waiting for idle in suspending
> 
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  2 +-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  7 ++-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  7 ++-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +++++++++++++++++--
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  2 +
>  drivers/gpu/drm/i915/intel_wakeref.c          | 14 ++++--
>  drivers/gpu/drm/i915/intel_wakeref.h          |  5 ++-
>  8 files changed, 71 insertions(+), 13 deletions(-)
> 
> 
> base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev2)
  2023-08-15  1:12 ` Alan Previn
                   ` (4 preceding siblings ...)
  (?)
@ 2023-08-15  1:56 ` Patchwork
  -1 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2023-08-15  1:56 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

== Series Details ==

Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev2)
URL   : https://patchwork.freedesktop.org/series/121916/
State : warning

== Summary ==

Error: dim checkpatch failed
/home/kbuild2/linux/maintainer-tools/dim: line 50: /home/kbuild2/.dimrc: No such file or directory



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev2)
  2023-08-15  1:12 ` Alan Previn
                   ` (5 preceding siblings ...)
  (?)
@ 2023-08-15  1:56 ` Patchwork
  -1 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2023-08-15  1:56 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

== Series Details ==

Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev2)
URL   : https://patchwork.freedesktop.org/series/121916/
State : warning

== Summary ==

Error: dim sparse failed
/home/kbuild2/linux/maintainer-tools/dim: line 50: /home/kbuild2/.dimrc: No such file or directory



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Resolve suspend-resume racing with GuC destroy-context-worker (rev2)
  2023-08-15  1:12 ` Alan Previn
                   ` (6 preceding siblings ...)
  (?)
@ 2023-08-15  2:15 ` Patchwork
  -1 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2023-08-15  2:15 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 8797 bytes --]

== Series Details ==

Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev2)
URL   : https://patchwork.freedesktop.org/series/121916/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_13517 -> Patchwork_121916v2
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_121916v2 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_121916v2, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/index.html

Participating hosts (41 -> 40)
------------------------------

  Missing    (1): fi-snb-2520m 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_121916v2:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_addfb_basic@size-max:
    - fi-kbl-soraka:      [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/fi-kbl-soraka/igt@kms_addfb_basic@size-max.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/fi-kbl-soraka/igt@kms_addfb_basic@size-max.html

  
#### Warnings ####

  * igt@i915_suspend@basic-s3-without-i915:
    - bat-adlm-1:         [INCOMPLETE][3] ([i915#7443]) -> [INCOMPLETE][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-adlm-1/igt@i915_suspend@basic-s3-without-i915.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-adlm-1/igt@i915_suspend@basic-s3-without-i915.html

  
Known issues
------------

  Here are the changes found in Patchwork_121916v2 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@core_auth@basic-auth:
    - bat-adlp-11:        NOTRUN -> [ABORT][5] ([i915#8011])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-adlp-11/igt@core_auth@basic-auth.html

  * igt@i915_selftest@live@reset:
    - bat-rpls-2:         [PASS][6] -> [ABORT][7] ([i915#4983] / [i915#7461] / [i915#7913] / [i915#7981] / [i915#8347])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-rpls-2/igt@i915_selftest@live@reset.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-rpls-2/igt@i915_selftest@live@reset.html

  * igt@i915_selftest@live@workarounds:
    - bat-rpls-1:         [PASS][8] -> [DMESG-FAIL][9] ([i915#6763])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-rpls-1/igt@i915_selftest@live@workarounds.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-rpls-1/igt@i915_selftest@live@workarounds.html

  * igt@kms_psr@cursor_plane_move:
    - bat-rplp-1:         NOTRUN -> [SKIP][10] ([i915#1072])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-rplp-1/igt@kms_psr@cursor_plane_move.html

  * igt@kms_psr@sprite_plane_onoff:
    - bat-rplp-1:         NOTRUN -> [ABORT][11] ([i915#8442] / [i915#8668] / [i915#8712])
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-rplp-1/igt@kms_psr@sprite_plane_onoff.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@gt_mocs:
    - bat-mtlp-8:         [DMESG-FAIL][12] ([i915#7059]) -> [PASS][13]
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-mtlp-8/igt@i915_selftest@live@gt_mocs.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-mtlp-8/igt@i915_selftest@live@gt_mocs.html
    - bat-mtlp-6:         [DMESG-FAIL][14] ([i915#7059]) -> [PASS][15]
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-mtlp-6/igt@i915_selftest@live@gt_mocs.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-mtlp-6/igt@i915_selftest@live@gt_mocs.html

  * igt@i915_selftest@live@migrate:
    - bat-adlp-9:         [DMESG-FAIL][16] ([i915#7699] / [i915#7913]) -> [PASS][17]
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-adlp-9/igt@i915_selftest@live@migrate.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-adlp-9/igt@i915_selftest@live@migrate.html

  * igt@i915_selftest@live@requests:
    - bat-mtlp-8:         [DMESG-FAIL][18] ([i915#8497]) -> [PASS][19]
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-mtlp-8/igt@i915_selftest@live@requests.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-mtlp-8/igt@i915_selftest@live@requests.html

  * igt@i915_selftest@live@slpc:
    - bat-mtlp-8:         [DMESG-WARN][20] ([i915#6367]) -> [PASS][21]
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-mtlp-8/igt@i915_selftest@live@slpc.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-mtlp-8/igt@i915_selftest@live@slpc.html
    - bat-rpls-1:         [DMESG-WARN][22] ([i915#6367]) -> [PASS][23]
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-rpls-1/igt@i915_selftest@live@slpc.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-rpls-1/igt@i915_selftest@live@slpc.html

  
#### Warnings ####

  * igt@i915_module_load@load:
    - bat-adlp-11:        [ABORT][24] ([i915#4423]) -> [DMESG-WARN][25] ([i915#4423])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-adlp-11/igt@i915_module_load@load.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-adlp-11/igt@i915_module_load@load.html

  * igt@i915_selftest@live@migrate:
    - bat-dg2-11:         [DMESG-WARN][26] ([i915#7699]) -> [DMESG-FAIL][27] ([i915#7699] / [i915#7913])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-dg2-11/igt@i915_selftest@live@migrate.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-dg2-11/igt@i915_selftest@live@migrate.html

  * igt@kms_psr@primary_page_flip:
    - bat-rplp-1:         [ABORT][28] ([i915#8442] / [i915#8668] / [i915#8860]) -> [SKIP][29] ([i915#1072])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13517/bat-rplp-1/igt@kms_psr@primary_page_flip.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/bat-rplp-1/igt@kms_psr@primary_page_flip.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [Intel XE#486]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/486
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#4423]: https://gitlab.freedesktop.org/drm/intel/issues/4423
  [i915#4983]: https://gitlab.freedesktop.org/drm/intel/issues/4983
  [i915#6367]: https://gitlab.freedesktop.org/drm/intel/issues/6367
  [i915#6763]: https://gitlab.freedesktop.org/drm/intel/issues/6763
  [i915#7059]: https://gitlab.freedesktop.org/drm/intel/issues/7059
  [i915#7359]: https://gitlab.freedesktop.org/drm/intel/issues/7359
  [i915#7443]: https://gitlab.freedesktop.org/drm/intel/issues/7443
  [i915#7461]: https://gitlab.freedesktop.org/drm/intel/issues/7461
  [i915#7699]: https://gitlab.freedesktop.org/drm/intel/issues/7699
  [i915#7913]: https://gitlab.freedesktop.org/drm/intel/issues/7913
  [i915#7953]: https://gitlab.freedesktop.org/drm/intel/issues/7953
  [i915#7981]: https://gitlab.freedesktop.org/drm/intel/issues/7981
  [i915#8011]: https://gitlab.freedesktop.org/drm/intel/issues/8011
  [i915#8347]: https://gitlab.freedesktop.org/drm/intel/issues/8347
  [i915#8442]: https://gitlab.freedesktop.org/drm/intel/issues/8442
  [i915#8497]: https://gitlab.freedesktop.org/drm/intel/issues/8497
  [i915#8668]: https://gitlab.freedesktop.org/drm/intel/issues/8668
  [i915#8712]: https://gitlab.freedesktop.org/drm/intel/issues/8712
  [i915#8860]: https://gitlab.freedesktop.org/drm/intel/issues/8860
  [i915#8879]: https://gitlab.freedesktop.org/drm/intel/issues/8879


Build changes
-------------

  * Linux: CI_DRM_13517 -> Patchwork_121916v2

  CI-20190529: 20190529
  CI_DRM_13517: 85f20fb339f05ec4221bb295c13e46061c5c566f @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7435: b6eaa6bfdc94c94b34ec80f437c4b6125eb19357 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_121916v2: 85f20fb339f05ec4221bb295c13e46061c5c566f @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

6ea698474beb drm/i915/gt: Timeout when waiting for idle in suspending
3d3cfd7f01b9 drm/i915/guc: Close deregister-context race against CT-loss
8781064449ef drm/i915/guc: Flush context destruction worker at suspend

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v2/index.html

[-- Attachment #2: Type: text/html, Size: 10119 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
  2023-08-15  1:12   ` Alan Previn
@ 2023-08-15 13:51     ` Rodrigo Vivi
  -1 siblings, 0 replies; 29+ messages in thread
From: Rodrigo Vivi @ 2023-08-15 13:51 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx, dri-devel

On Mon, Aug 14, 2023 at 06:12:10PM -0700, Alan Previn wrote:
> When suspending, add a timeout when calling
> intel_gt_pm_wait_for_idle else if we have a lost
> G2H event that holds a wakeref (which would be
> indicative of a bug elsewhere in the driver),
> driver will at least complete the suspend-resume
> cycle, (albeit not hitting all the targets for
> low power hw counters), instead of hanging in the kernel.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c     |  7 ++++++-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.h     |  7 ++++++-
>  drivers/gpu/drm/i915/intel_wakeref.c      | 14 ++++++++++----
>  drivers/gpu/drm/i915/intel_wakeref.h      |  5 +++--
>  5 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index ee15486fed0d..090438eb8682 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -688,7 +688,7 @@ void intel_engines_release(struct intel_gt *gt)
>  		if (!engine->release)
>  			continue;
>  
> -		intel_wakeref_wait_for_idle(&engine->wakeref);
> +		intel_wakeref_wait_for_idle(&engine->wakeref, 0);
>  		GEM_BUG_ON(intel_engine_pm_is_awake(engine));
>  
>  		engine->release(engine);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index 5a942af0a14e..e8b006c3ef29 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -289,6 +289,8 @@ int intel_gt_resume(struct intel_gt *gt)
>  
>  static void wait_for_suspend(struct intel_gt *gt)
>  {
> +	int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 10000;
> +
>  	if (!intel_gt_pm_is_awake(gt))
>  		return;
>  
> @@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
>  		intel_gt_retire_requests(gt);
>  	}
>  
> -	intel_gt_pm_wait_for_idle(gt);
> +	/* we are suspending, so we shouldn't be waiting forever */
> +	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)

you forgot to change the error code here..........................^

but maybe we don't even need this here and a simple
if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms)) should be enough
since the error from the killable one is unlikely and the only place
I error I could check on that path would be a catastrophic -ERESTARTSYS.

but up to you.

> +		gt_warn(gt, "bailing from %s after %d milisec timeout\n",
> +			__func__, timeout_ms);
>  }
>  
>  void intel_gt_suspend_prepare(struct intel_gt *gt)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index 6c9a46452364..5358acc2b5b1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -68,7 +68,12 @@ static inline void intel_gt_pm_might_put(struct intel_gt *gt)
>  
>  static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
>  {
> -	return intel_wakeref_wait_for_idle(&gt->wakeref);
> +	return intel_wakeref_wait_for_idle(&gt->wakeref, 0);
> +}
> +
> +static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, int timeout_ms)
> +{
> +	return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
>  }

I was going to ask why you created a single use function here, but then I
noticed the above one. So it makes sense.
Then I was going to ask why in here you didn't use the same change of
timeout = 0 in the existent function like you used below, but then I
noticed that the above function is called in multiple places and the
patch with this change is much cleaner and the function is static inline
so your approach was good here.

>  
>  void intel_gt_pm_init_early(struct intel_gt *gt);
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
> index 718f2f1b6174..383a37521415 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.c
> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> @@ -111,14 +111,20 @@ void __intel_wakeref_init(struct intel_wakeref *wf,
>  			 "wakeref.work", &key->work, 0);
>  }
>  

Please add a documentation for this function making sure you have the following
mentions:

/**
[snip]
* @timeout_ms: Timeout in ums, 0 means never timeout.
*
* Returns 0 on success, -ETIMEDOUT upon a timeout, or the unlikely
* error propagation from wait_var_event_killable if timeout_ms is 0.
*/

with the return error fixed above and the documentation in place:

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

> -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
>  {
> -	int err;
> +	int err = 0;
>  
>  	might_sleep();
>  
> -	err = wait_var_event_killable(&wf->wakeref,
> -				      !intel_wakeref_is_active(wf));
> +	if (!timeout_ms)
> +		err = wait_var_event_killable(&wf->wakeref,
> +					      !intel_wakeref_is_active(wf));
> +	else if (wait_var_event_timeout(&wf->wakeref,
> +					!intel_wakeref_is_active(wf),
> +					msecs_to_jiffies(timeout_ms)) < 1)
> +		err = -ETIMEDOUT;
> +
>  	if (err)
>  		return err;
>  
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> index ec881b097368..6fbb7a2fb6ea 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.h
> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> @@ -251,15 +251,16 @@ __intel_wakeref_defer_park(struct intel_wakeref *wf)
>  /**
>   * intel_wakeref_wait_for_idle: Wait until the wakeref is idle
>   * @wf: the wakeref
> + * @timeout_ms: timeout to wait in milisecs, zero means forever
>   *
>   * Wait for the earlier asynchronous release of the wakeref. Note
>   * this will wait for any third party as well, so make sure you only wait
>   * when you have control over the wakeref and trust no one else is acquiring
>   * it.
>   *
> - * Return: 0 on success, error code if killed.
> + * Return: 0 on success, error code if killed, -ETIME if timed-out.
>   */
> -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf);
> +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms);
>  
>  struct intel_wakeref_auto {
>  	struct drm_i915_private *i915;
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
@ 2023-08-15 13:51     ` Rodrigo Vivi
  0 siblings, 0 replies; 29+ messages in thread
From: Rodrigo Vivi @ 2023-08-15 13:51 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx, Daniele Ceraolo Spurio, John Harrison, dri-devel

On Mon, Aug 14, 2023 at 06:12:10PM -0700, Alan Previn wrote:
> When suspending, add a timeout when calling
> intel_gt_pm_wait_for_idle else if we have a lost
> G2H event that holds a wakeref (which would be
> indicative of a bug elsewhere in the driver),
> driver will at least complete the suspend-resume
> cycle, (albeit not hitting all the targets for
> low power hw counters), instead of hanging in the kernel.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c     |  7 ++++++-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.h     |  7 ++++++-
>  drivers/gpu/drm/i915/intel_wakeref.c      | 14 ++++++++++----
>  drivers/gpu/drm/i915/intel_wakeref.h      |  5 +++--
>  5 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index ee15486fed0d..090438eb8682 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -688,7 +688,7 @@ void intel_engines_release(struct intel_gt *gt)
>  		if (!engine->release)
>  			continue;
>  
> -		intel_wakeref_wait_for_idle(&engine->wakeref);
> +		intel_wakeref_wait_for_idle(&engine->wakeref, 0);
>  		GEM_BUG_ON(intel_engine_pm_is_awake(engine));
>  
>  		engine->release(engine);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index 5a942af0a14e..e8b006c3ef29 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -289,6 +289,8 @@ int intel_gt_resume(struct intel_gt *gt)
>  
>  static void wait_for_suspend(struct intel_gt *gt)
>  {
> +	int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 10000;
> +
>  	if (!intel_gt_pm_is_awake(gt))
>  		return;
>  
> @@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
>  		intel_gt_retire_requests(gt);
>  	}
>  
> -	intel_gt_pm_wait_for_idle(gt);
> +	/* we are suspending, so we shouldn't be waiting forever */
> +	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)

you forgot to change the error code here..........................^

but maybe we don't even need this here and a simple
if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms)) should be enough
since the error from the killable one is unlikely and the only place
I error I could check on that path would be a catastrophic -ERESTARTSYS.

but up to you.

> +		gt_warn(gt, "bailing from %s after %d milisec timeout\n",
> +			__func__, timeout_ms);
>  }
>  
>  void intel_gt_suspend_prepare(struct intel_gt *gt)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> index 6c9a46452364..5358acc2b5b1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
> @@ -68,7 +68,12 @@ static inline void intel_gt_pm_might_put(struct intel_gt *gt)
>  
>  static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt)
>  {
> -	return intel_wakeref_wait_for_idle(&gt->wakeref);
> +	return intel_wakeref_wait_for_idle(&gt->wakeref, 0);
> +}
> +
> +static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, int timeout_ms)
> +{
> +	return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
>  }

I was going to ask why you created a single use function here, but then I
noticed the above one. So it makes sense.
Then I was going to ask why in here you didn't use the same change of
timeout = 0 in the existent function like you used below, but then I
noticed that the above function is called in multiple places and the
patch with this change is much cleaner and the function is static inline
so your approach was good here.

>  
>  void intel_gt_pm_init_early(struct intel_gt *gt);
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.c b/drivers/gpu/drm/i915/intel_wakeref.c
> index 718f2f1b6174..383a37521415 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.c
> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> @@ -111,14 +111,20 @@ void __intel_wakeref_init(struct intel_wakeref *wf,
>  			 "wakeref.work", &key->work, 0);
>  }
>  

Please add a documentation for this function making sure you have the following
mentions:

/**
[snip]
* @timeout_ms: Timeout in ums, 0 means never timeout.
*
* Returns 0 on success, -ETIMEDOUT upon a timeout, or the unlikely
* error propagation from wait_var_event_killable if timeout_ms is 0.
*/

with the return error fixed above and the documentation in place:

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

> -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
>  {
> -	int err;
> +	int err = 0;
>  
>  	might_sleep();
>  
> -	err = wait_var_event_killable(&wf->wakeref,
> -				      !intel_wakeref_is_active(wf));
> +	if (!timeout_ms)
> +		err = wait_var_event_killable(&wf->wakeref,
> +					      !intel_wakeref_is_active(wf));
> +	else if (wait_var_event_timeout(&wf->wakeref,
> +					!intel_wakeref_is_active(wf),
> +					msecs_to_jiffies(timeout_ms)) < 1)
> +		err = -ETIMEDOUT;
> +
>  	if (err)
>  		return err;
>  
> diff --git a/drivers/gpu/drm/i915/intel_wakeref.h b/drivers/gpu/drm/i915/intel_wakeref.h
> index ec881b097368..6fbb7a2fb6ea 100644
> --- a/drivers/gpu/drm/i915/intel_wakeref.h
> +++ b/drivers/gpu/drm/i915/intel_wakeref.h
> @@ -251,15 +251,16 @@ __intel_wakeref_defer_park(struct intel_wakeref *wf)
>  /**
>   * intel_wakeref_wait_for_idle: Wait until the wakeref is idle
>   * @wf: the wakeref
> + * @timeout_ms: timeout to wait in milisecs, zero means forever
>   *
>   * Wait for the earlier asynchronous release of the wakeref. Note
>   * this will wait for any third party as well, so make sure you only wait
>   * when you have control over the wakeref and trust no one else is acquiring
>   * it.
>   *
> - * Return: 0 on success, error code if killed.
> + * Return: 0 on success, error code if killed, -ETIME if timed-out.
>   */
> -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf);
> +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms);
>  
>  struct intel_wakeref_auto {
>  	struct drm_i915_private *i915;
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend
  2023-08-15  1:12   ` Alan Previn
@ 2023-08-15 13:53     ` Rodrigo Vivi
  -1 siblings, 0 replies; 29+ messages in thread
From: Rodrigo Vivi @ 2023-08-15 13:53 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx, dri-devel

On Mon, Aug 14, 2023 at 06:12:08PM -0700, Alan Previn wrote:
> When suspending, flush the context-guc-id
> deregistration worker at the final stages of
> intel_gt_suspend_late when we finally call gt_sanitize
> that eventually leads down to __uc_sanitize so that
> the deregistration worker doesn't fire off later as
> we reset the GuC microcontroller.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 ++
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c             | 2 ++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index a0e3ef1c65d2..050572bb8dbe 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1578,6 +1578,11 @@ static void guc_flush_submissions(struct intel_guc *guc)
>  	spin_unlock_irqrestore(&sched_engine->lock, flags);
>  }
>  
> +void intel_guc_submission_flush_work(struct intel_guc *guc)
> +{
> +	flush_work(&guc->submission_state.destroyed_worker);
> +}
> +
>  static void guc_flush_destroyed_contexts(struct intel_guc *guc);
>  
>  void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> index c57b29cdb1a6..b6df75622d3b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> @@ -38,6 +38,8 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
>  				   bool interruptible,
>  				   long timeout);
>  
> +void intel_guc_submission_flush_work(struct intel_guc *guc);
> +
>  static inline bool intel_guc_submission_is_supported(struct intel_guc *guc)
>  {
>  	return guc->submission_supported;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index 98b103375b7a..eb3554cb5ea4 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
>  		return;
>  	}
>  
> +	intel_guc_submission_flush_work(guc);
> +

what happens if a new job comes exactly here?
This still sounds a bit racy, although this already looks
much cleaner than the previous version.

>  	with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) {
>  		err = intel_guc_suspend(guc);
>  		if (err)
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend
@ 2023-08-15 13:53     ` Rodrigo Vivi
  0 siblings, 0 replies; 29+ messages in thread
From: Rodrigo Vivi @ 2023-08-15 13:53 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx, Daniele Ceraolo Spurio, John Harrison, dri-devel

On Mon, Aug 14, 2023 at 06:12:08PM -0700, Alan Previn wrote:
> When suspending, flush the context-guc-id
> deregistration worker at the final stages of
> intel_gt_suspend_late when we finally call gt_sanitize
> that eventually leads down to __uc_sanitize so that
> the deregistration worker doesn't fire off later as
> we reset the GuC microcontroller.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 ++
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c             | 2 ++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index a0e3ef1c65d2..050572bb8dbe 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1578,6 +1578,11 @@ static void guc_flush_submissions(struct intel_guc *guc)
>  	spin_unlock_irqrestore(&sched_engine->lock, flags);
>  }
>  
> +void intel_guc_submission_flush_work(struct intel_guc *guc)
> +{
> +	flush_work(&guc->submission_state.destroyed_worker);
> +}
> +
>  static void guc_flush_destroyed_contexts(struct intel_guc *guc);
>  
>  void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> index c57b29cdb1a6..b6df75622d3b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> @@ -38,6 +38,8 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
>  				   bool interruptible,
>  				   long timeout);
>  
> +void intel_guc_submission_flush_work(struct intel_guc *guc);
> +
>  static inline bool intel_guc_submission_is_supported(struct intel_guc *guc)
>  {
>  	return guc->submission_supported;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index 98b103375b7a..eb3554cb5ea4 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
>  		return;
>  	}
>  
> +	intel_guc_submission_flush_work(guc);
> +

what happens if a new job comes exactly here?
This still sounds a bit racy, although this already looks
much cleaner than the previous version.

>  	with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) {
>  		err = intel_guc_suspend(guc);
>  		if (err)
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
  2023-08-15  1:12   ` Alan Previn
@ 2023-08-15 13:56     ` Rodrigo Vivi
  -1 siblings, 0 replies; 29+ messages in thread
From: Rodrigo Vivi @ 2023-08-15 13:56 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx, dri-devel

On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> If we are at the end of suspend or very early in resume
> its possible an async fence signal could lead us to the
> execution of the context destruction worker (after the
> prior worker flush).
> 
> Even if checking that the CT is enabled before calling
> destroyed_worker_func, guc_lrc_desc_unpin may still fail
> because in corner cases, as we traverse the GuC's
> context-destroy list, the CT could get disabled in the mid
> of it right before calling the GuC's CT send function.
> 
> We've witnessed this race condition once every ~6000-8000
> suspend-resume cycles while ensuring workloads that render
> something onscreen is continuously started just before
> we suspend (and the workload is small enough to complete
> and trigger the queued engine/context free-up either very
> late in suspend or very early in resume).
> 
> In such a case, we need to unroll the unpin process because
> guc-lrc-unpin takes a gt wakeref which only gets released in
> the G2H IRQ reply that never comes through in this corner
> case. That will cascade into a kernel hang later at the tail
> end of suspend in this function:
> 
>    intel_wakeref_wait_for_idle(&gt->wakeref)
>    (called by) - intel_gt_pm_wait_for_idle
>    (called by) - wait_for_suspend
> 
> Doing this unroll and keeping the context in the GuC's
> destroy-list will allow the context to get picked up on
> the next destroy worker invocation or purged as part of a
> major GuC sanitization or reset flow.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 40 +++++++++++++++++--
>  1 file changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 050572bb8dbe..ddb4ee4c4e51 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -235,6 +235,13 @@ set_context_destroyed(struct intel_context *ce)
>  	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
>  }
>  
> +static inline void
> +clr_context_destroyed(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
> +}
> +
>  static inline bool context_pending_disable(struct intel_context *ce)
>  {
>  	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
> @@ -3175,7 +3182,7 @@ static void guc_context_close(struct intel_context *ce)
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  }
>  
> -static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> +static inline int guc_lrc_desc_unpin(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
>  	struct intel_gt *gt = guc_to_gt(guc);
> @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>  	if (unlikely(disabled)) {
>  		release_guc_id(guc, ce);
>  		__guc_context_destroy(ce);
> -		return;
> +		return 0;
> +	}
> +
> +	if (deregister_context(ce, ce->guc_id.id)) {
> +		/* Seal race with concurrent suspend by unrolling */
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> +		set_context_registered(ce);
> +		clr_context_destroyed(ce);
> +		intel_gt_pm_put(gt);

how sure we are this is not calling unbalanced puts?
could we wrap this in some existent function to make this clear?

> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		return -ENODEV;
>  	}
>  
> -	deregister_context(ce, ce->guc_id.id);
> +	return 0;
>  }
>  
>  static void __guc_context_destroy(struct intel_context *ce)
> @@ -3270,7 +3287,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>  		if (!ce)
>  			break;
>  
> -		guc_lrc_desc_unpin(ce);
> +		if (guc_lrc_desc_unpin(ce)) {
> +			/*
> +			 * This means GuC's CT link severed mid-way which could happen
> +			 * in suspend-resume corner cases. In this case, put the
> +			 * context back into the destroyed_contexts list which will
> +			 * get picked up on the next context deregistration event or
> +			 * purged in a GuC sanitization event (reset/unload/wedged/...).
> +			 */
> +			spin_lock_irqsave(&guc->submission_state.lock, flags);
> +			list_add_tail(&ce->destroyed_link,
> +				      &guc->submission_state.destroyed_contexts);
> +			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +			/* Bail now since the list might never be emptied if h2gs fail */
> +			break;
> +		}
> +
>  	}
>  }
>  
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
@ 2023-08-15 13:56     ` Rodrigo Vivi
  0 siblings, 0 replies; 29+ messages in thread
From: Rodrigo Vivi @ 2023-08-15 13:56 UTC (permalink / raw)
  To: Alan Previn; +Cc: intel-gfx, Daniele Ceraolo Spurio, John Harrison, dri-devel

On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> If we are at the end of suspend or very early in resume
> its possible an async fence signal could lead us to the
> execution of the context destruction worker (after the
> prior worker flush).
> 
> Even if checking that the CT is enabled before calling
> destroyed_worker_func, guc_lrc_desc_unpin may still fail
> because in corner cases, as we traverse the GuC's
> context-destroy list, the CT could get disabled in the mid
> of it right before calling the GuC's CT send function.
> 
> We've witnessed this race condition once every ~6000-8000
> suspend-resume cycles while ensuring workloads that render
> something onscreen is continuously started just before
> we suspend (and the workload is small enough to complete
> and trigger the queued engine/context free-up either very
> late in suspend or very early in resume).
> 
> In such a case, we need to unroll the unpin process because
> guc-lrc-unpin takes a gt wakeref which only gets released in
> the G2H IRQ reply that never comes through in this corner
> case. That will cascade into a kernel hang later at the tail
> end of suspend in this function:
> 
>    intel_wakeref_wait_for_idle(&gt->wakeref)
>    (called by) - intel_gt_pm_wait_for_idle
>    (called by) - wait_for_suspend
> 
> Doing this unroll and keeping the context in the GuC's
> destroy-list will allow the context to get picked up on
> the next destroy worker invocation or purged as part of a
> major GuC sanitization or reset flow.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 40 +++++++++++++++++--
>  1 file changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 050572bb8dbe..ddb4ee4c4e51 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -235,6 +235,13 @@ set_context_destroyed(struct intel_context *ce)
>  	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
>  }
>  
> +static inline void
> +clr_context_destroyed(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
> +}
> +
>  static inline bool context_pending_disable(struct intel_context *ce)
>  {
>  	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
> @@ -3175,7 +3182,7 @@ static void guc_context_close(struct intel_context *ce)
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  }
>  
> -static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> +static inline int guc_lrc_desc_unpin(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
>  	struct intel_gt *gt = guc_to_gt(guc);
> @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>  	if (unlikely(disabled)) {
>  		release_guc_id(guc, ce);
>  		__guc_context_destroy(ce);
> -		return;
> +		return 0;
> +	}
> +
> +	if (deregister_context(ce, ce->guc_id.id)) {
> +		/* Seal race with concurrent suspend by unrolling */
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> +		set_context_registered(ce);
> +		clr_context_destroyed(ce);
> +		intel_gt_pm_put(gt);

how sure we are this is not calling unbalanced puts?
could we wrap this in some existent function to make this clear?

> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		return -ENODEV;
>  	}
>  
> -	deregister_context(ce, ce->guc_id.id);
> +	return 0;
>  }
>  
>  static void __guc_context_destroy(struct intel_context *ce)
> @@ -3270,7 +3287,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>  		if (!ce)
>  			break;
>  
> -		guc_lrc_desc_unpin(ce);
> +		if (guc_lrc_desc_unpin(ce)) {
> +			/*
> +			 * This means GuC's CT link severed mid-way which could happen
> +			 * in suspend-resume corner cases. In this case, put the
> +			 * context back into the destroyed_contexts list which will
> +			 * get picked up on the next context deregistration event or
> +			 * purged in a GuC sanitization event (reset/unload/wedged/...).
> +			 */
> +			spin_lock_irqsave(&guc->submission_state.lock, flags);
> +			list_add_tail(&ce->destroyed_link,
> +				      &guc->submission_state.destroyed_contexts);
> +			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +			/* Bail now since the list might never be emptied if h2gs fail */
> +			break;
> +		}
> +
>  	}
>  }
>  
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
  2023-08-15 13:51     ` Rodrigo Vivi
@ 2023-08-15 19:00       ` Teres Alexis, Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-15 19:00 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org

Thanks Rodrigo - agreed on everything below - will re-rev.

On Tue, 2023-08-15 at 09:51 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:10PM -0700, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > 
[snip]

> > @@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
> >  		intel_gt_retire_requests(gt);
> >  	}
> >  
> > -	intel_gt_pm_wait_for_idle(gt);
> > +	/* we are suspending, so we shouldn't be waiting forever */
> > +	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)
> 
> you forgot to change the error code here..........................^
> 
> but maybe we don't even need this here and a simple
> if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms)) should be enough
> since the error from the killable one is unlikely and the only place
> I error I could check on that path would be a catastrophic -ERESTARTSYS.
> 
> but up to you.
alan: my bad - I'll fix it - but i agree with not needing to check the failure type.
and I'll keep the error the same ("bailing from...")
[snip]

> > +static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, int timeout_ms)
> > +{
> > +	return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
> >  }
> 
> I was going to ask why you created a single use function here, but then I
> noticed the above one. So it makes sense.
> Then I was going to ask why in here you didn't use the same change of
> timeout = 0 in the existent function like you used below, but then I
> noticed that the above function is called in multiple places and the
> patch with this change is much cleaner and the function is static inline
> so your approach was good here.
alan: yes that was my exact reason - thus no impact across other callers.
[snip]


> Please add a documentation for this function making sure you have the following
> mentions:
alan: good catch -will do.

> 
> /**
> [snip]
> * @timeout_ms: Timeout in ums, 0 means never timeout.
> *
> * Returns 0 on success, -ETIMEDOUT upon a timeout, or the unlikely
> * error propagation from wait_var_event_killable if timeout_ms is 0.
> */
> 
> with the return error fixed above and the documentation in place:
> 
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
> >  {
> > -	int err;
> > 
[snip]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
@ 2023-08-15 19:00       ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-15 19:00 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, Ceraolo Spurio, Daniele,
	Harrison, John C, dri-devel@lists.freedesktop.org

Thanks Rodrigo - agreed on everything below - will re-rev.

On Tue, 2023-08-15 at 09:51 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:10PM -0700, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > 
[snip]

> > @@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
> >  		intel_gt_retire_requests(gt);
> >  	}
> >  
> > -	intel_gt_pm_wait_for_idle(gt);
> > +	/* we are suspending, so we shouldn't be waiting forever */
> > +	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)
> 
> you forgot to change the error code here..........................^
> 
> but maybe we don't even need this here and a simple
> if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms)) should be enough
> since the error from the killable one is unlikely and the only place
> I error I could check on that path would be a catastrophic -ERESTARTSYS.
> 
> but up to you.
alan: my bad - I'll fix it - but i agree with not needing to check the failure type.
and I'll keep the error the same ("bailing from...")
[snip]

> > +static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, int timeout_ms)
> > +{
> > +	return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
> >  }
> 
> I was going to ask why you created a single use function here, but then I
> noticed the above one. So it makes sense.
> Then I was going to ask why in here you didn't use the same change of
> timeout = 0 in the existent function like you used below, but then I
> noticed that the above function is called in multiple places and the
> patch with this change is much cleaner and the function is static inline
> so your approach was good here.
alan: yes that was my exact reason - thus no impact across other callers.
[snip]


> Please add a documentation for this function making sure you have the following
> mentions:
alan: good catch -will do.

> 
> /**
> [snip]
> * @timeout_ms: Timeout in ums, 0 means never timeout.
> *
> * Returns 0 on success, -ETIMEDOUT upon a timeout, or the unlikely
> * error propagation from wait_var_event_killable if timeout_ms is 0.
> */
> 
> with the return error fixed above and the documentation in place:
> 
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
> >  {
> > -	int err;
> > 
[snip]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
  2023-08-15 13:56     ` Rodrigo Vivi
@ 2023-08-15 19:08       ` Teres Alexis, Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-15 19:08 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org

On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal could lead us to the
> > execution of the context destruction worker (after the
> > prior worker flush).

[snip]
> 
> > @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >  	if (unlikely(disabled)) {
> >  		release_guc_id(guc, ce);
> >  		__guc_context_destroy(ce);
> > -		return;
> > +		return 0;
> > +	}
> > +
> > +	if (deregister_context(ce, ce->guc_id.id)) {
> > +		/* Seal race with concurrent suspend by unrolling */
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +		set_context_registered(ce);
> > +		clr_context_destroyed(ce);
> > +		intel_gt_pm_put(gt);
> 
> how sure we are this is not calling unbalanced puts?
alan: in this function (guc_lrc_desc_unpin), the summarized flow is:

	check-status-stuff
	if guc-is-not-disabled, take-pm, change ctx-state-bits
	[implied else] if guc-is-disabled, scub-ctx and return
	
thus derigster_context is only called if we didnt exit, i.e. when guc-is-not-disabled, i.e. after the pm was taken.
> could we wrap this in some existent function to make this clear?
alan: yeah - not so readible as it now - let me tweak this function and make it cleaner

> 
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +		return -ENODEV;
> >  	}
> >  
> > -	deregister_context(ce, ce->guc_id.id);
> > +	return 0;
> >  }


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
@ 2023-08-15 19:08       ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-15 19:08 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, Ceraolo Spurio, Daniele,
	Harrison, John C, dri-devel@lists.freedesktop.org

On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal could lead us to the
> > execution of the context destruction worker (after the
> > prior worker flush).

[snip]
> 
> > @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> >  	if (unlikely(disabled)) {
> >  		release_guc_id(guc, ce);
> >  		__guc_context_destroy(ce);
> > -		return;
> > +		return 0;
> > +	}
> > +
> > +	if (deregister_context(ce, ce->guc_id.id)) {
> > +		/* Seal race with concurrent suspend by unrolling */
> > +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +		set_context_registered(ce);
> > +		clr_context_destroyed(ce);
> > +		intel_gt_pm_put(gt);
> 
> how sure we are this is not calling unbalanced puts?
alan: in this function (guc_lrc_desc_unpin), the summarized flow is:

	check-status-stuff
	if guc-is-not-disabled, take-pm, change ctx-state-bits
	[implied else] if guc-is-disabled, scub-ctx and return
	
thus derigster_context is only called if we didnt exit, i.e. when guc-is-not-disabled, i.e. after the pm was taken.
> could we wrap this in some existent function to make this clear?
alan: yeah - not so readible as it now - let me tweak this function and make it cleaner

> 
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +		return -ENODEV;
> >  	}
> >  
> > -	deregister_context(ce, ce->guc_id.id);
> > +	return 0;
> >  }


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend
  2023-08-15 13:53     ` Rodrigo Vivi
@ 2023-08-25 18:48       ` Teres Alexis, Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-25 18:48 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org

Thanks again Rodrigo for reviewing and apologies for my tardy replies.
We are stil testing on shipping platforms and these latest patches seemed
to have reduced the frequency and solved the "system hangs" while suspending
but its still causing issues so we continue to debug. (issue is that
its running thousands of cycles overnight on multiple systems and takes
time). 

[snip]

> > @@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
> >  		return;
> >  	}
> >  
> > +	intel_guc_submission_flush_work(guc);
> > +
> 
> what happens if a new job comes exactly here?
> This still sounds a bit racy, although this already looks
> much cleaner than the previous version.
alan: intel_uc_suspend is called from suspend-late or init-failure.
and the very next function -> intel_guc_suspend will soft reset the guc
and eventually nuke it. this is the most appropriate "latests as
possible" place to put this.
> 
> >  	with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) {
> >  		err = intel_guc_suspend(guc);
> >  		if (err)





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend
@ 2023-08-25 18:48       ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-25 18:48 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, Ceraolo Spurio, Daniele,
	Harrison, John C, dri-devel@lists.freedesktop.org

Thanks again Rodrigo for reviewing and apologies for my tardy replies.
We are stil testing on shipping platforms and these latest patches seemed
to have reduced the frequency and solved the "system hangs" while suspending
but its still causing issues so we continue to debug. (issue is that
its running thousands of cycles overnight on multiple systems and takes
time). 

[snip]

> > @@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
> >  		return;
> >  	}
> >  
> > +	intel_guc_submission_flush_work(guc);
> > +
> 
> what happens if a new job comes exactly here?
> This still sounds a bit racy, although this already looks
> much cleaner than the previous version.
alan: intel_uc_suspend is called from suspend-late or init-failure.
and the very next function -> intel_guc_suspend will soft reset the guc
and eventually nuke it. this is the most appropriate "latests as
possible" place to put this.
> 
> >  	with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) {
> >  		err = intel_guc_suspend(guc);
> >  		if (err)





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
  2023-08-15 19:08       ` Teres Alexis, Alan Previn
@ 2023-08-25 18:54         ` Teres Alexis, Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-25 18:54 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org

just a follow up note-to-self:

On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > 
[snip]

in guc_submission_send_busy_loop, we are incrementing the following
that needs to be decremented if the function fails.

atomic_inc(&guc->outstanding_submission_g2h);

also, it seems that even with thie unroll design - we are still
leaking a wakeref elsewhere. this is despite a cleaner redesign of
flows in function "guc_lrc_desc_unpin"
(discussed earlier that wasnt very readible).

will re-rev today but will probably need more follow ups
tracking that one more leaking gt-wakeref (one in thousands-cycles)
but at least now we are not hanging mid-suspend.. we bail from suspend
with useful kernel messages.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
@ 2023-08-25 18:54         ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-25 18:54 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, Ceraolo Spurio, Daniele,
	Harrison, John C, dri-devel@lists.freedesktop.org

just a follow up note-to-self:

On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > 
[snip]

in guc_submission_send_busy_loop, we are incrementing the following
that needs to be decremented if the function fails.

atomic_inc(&guc->outstanding_submission_g2h);

also, it seems that even with thie unroll design - we are still
leaking a wakeref elsewhere. this is despite a cleaner redesign of
flows in function "guc_lrc_desc_unpin"
(discussed earlier that wasnt very readible).

will re-rev today but will probably need more follow ups
tracking that one more leaking gt-wakeref (one in thousands-cycles)
but at least now we are not hanging mid-suspend.. we bail from suspend
with useful kernel messages.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
  2023-08-25 18:54         ` Teres Alexis, Alan Previn
@ 2023-08-28 21:06           ` Teres Alexis, Alan Previn
  -1 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-28 21:06 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org

Additional update from the most recent testing.

When relying solely on guc_lrc_desc_unpin getting a failure from deregister_context
as a means for identifying that we are in the "deregister-context-vs-suspend-late" race,
it is too late a location to handle this safely. This is because one of the
first things destroyed_worker_func does it to take a gt pm wakeref - which triggers
the gt_unpark function that does a whole lot bunch of other flows including triggering more
workers and taking additional refs. That said, its best to not even call
deregister_destroyed_contexts from the worker when !intel_guc_is_ready (ct-is-disabled).

...alan

On Fri, 2023-08-25 at 11:54 -0700, Teres Alexis, Alan Previn wrote:
> just a follow up note-to-self:
> 
> On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > > 
> [snip]
> 
> in guc_submission_send_busy_loop, we are incrementing the following
> that needs to be decremented if the function fails.
> 
> atomic_inc(&guc->outstanding_submission_g2h);
> 
> also, it seems that even with thie unroll design - we are still
> leaking a wakeref elsewhere. this is despite a cleaner redesign of
> flows in function "guc_lrc_desc_unpin"
> (discussed earlier that wasnt very readible).
> 
> will re-rev today but will probably need more follow ups
> tracking that one more leaking gt-wakeref (one in thousands-cycles)
> but at least now we are not hanging mid-suspend.. we bail from suspend
> with useful kernel messages.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
@ 2023-08-28 21:06           ` Teres Alexis, Alan Previn
  0 siblings, 0 replies; 29+ messages in thread
From: Teres Alexis, Alan Previn @ 2023-08-28 21:06 UTC (permalink / raw)
  To: Vivi, Rodrigo
  Cc: intel-gfx@lists.freedesktop.org, Ceraolo Spurio, Daniele,
	Harrison, John C, dri-devel@lists.freedesktop.org

Additional update from the most recent testing.

When relying solely on guc_lrc_desc_unpin getting a failure from deregister_context
as a means for identifying that we are in the "deregister-context-vs-suspend-late" race,
it is too late a location to handle this safely. This is because one of the
first things destroyed_worker_func does it to take a gt pm wakeref - which triggers
the gt_unpark function that does a whole lot bunch of other flows including triggering more
workers and taking additional refs. That said, its best to not even call
deregister_destroyed_contexts from the worker when !intel_guc_is_ready (ct-is-disabled).

...alan

On Fri, 2023-08-25 at 11:54 -0700, Teres Alexis, Alan Previn wrote:
> just a follow up note-to-self:
> 
> On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > > 
> [snip]
> 
> in guc_submission_send_busy_loop, we are incrementing the following
> that needs to be decremented if the function fails.
> 
> atomic_inc(&guc->outstanding_submission_g2h);
> 
> also, it seems that even with thie unroll design - we are still
> leaking a wakeref elsewhere. this is despite a cleaner redesign of
> flows in function "guc_lrc_desc_unpin"
> (discussed earlier that wasnt very readible).
> 
> will re-rev today but will probably need more follow ups
> tracking that one more leaking gt-wakeref (one in thousands-cycles)
> but at least now we are not hanging mid-suspend.. we bail from suspend
> with useful kernel messages.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-08-28 21:06 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-15  1:12 [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Alan Previn
2023-08-15  1:12 ` Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:53   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:53     ` Rodrigo Vivi
2023-08-25 18:48     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-25 18:48       ` Teres Alexis, Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:56   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:56     ` Rodrigo Vivi
2023-08-15 19:08     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-15 19:08       ` Teres Alexis, Alan Previn
2023-08-25 18:54       ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-25 18:54         ` Teres Alexis, Alan Previn
2023-08-28 21:06         ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-28 21:06           ` Teres Alexis, Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:51   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:51     ` Rodrigo Vivi
2023-08-15 19:00     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-15 19:00       ` Teres Alexis, Alan Previn
2023-08-15  1:20 ` [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Teres Alexis, Alan Previn
2023-08-15  1:20   ` Teres Alexis, Alan Previn
2023-08-15  1:56 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev2) Patchwork
2023-08-15  1:56 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-08-15  2:15 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.