* [Intel-gfx] [PATCH v7 0/2] Resolve suspend-resume racing with GuC destroy-context-worker
@ 2023-11-30 0:20 ` Alan Previn
0 siblings, 0 replies; 18+ messages in thread
From: Alan Previn @ 2023-11-30 0:20 UTC (permalink / raw)
To: intel-gfx; +Cc: Alan Previn, Tvrtko Ursulin, dri-devel, Rodrigo Vivi
This series is the result of debugging issues root caused to
races between the GuC's destroyed_worker_func being triggered
vs repeating suspend-resume cycles with concurrent delayed
fence signals for engine-freeing.
The reproduction steps require that an app is launched right
before the start of the suspend cycle where it creates a
new gem context and submits a tiny workload that would
complete in the middle of the suspend cycle. However this
app uses dma-buffer sharing or dma-fence with non-GPU
objects or signals that eventually triggers a FENCE_FREE
via__i915_sw_fence_notify that connects to engines_notify ->
free_engines_rcu -> intel_context_put ->
kref_put(&ce->ref..) that queues the worker after the GuCs
CTB has been disabled (i.e. after i915-gem's suspend-late).
This sequence is a corner-case and required repeating this
app->suspend->resume cycle ~1500 times across 4 identical
systems to see it once. That said, based on above callstack,
it is clear that merely flushing the context destruction worker,
which is obviously missing and needed, isn't sufficient.
Because of that, this series adds additional patches besides
the obvious (Patch #1) flushing of the worker during the
suspend flows. It also includes (Patch #2) closing a race
between sending the context-deregistration H2G vs the CTB
getting disabled in the midst of it (by detecing the failure
and unrolling the guc-lrc-unpin flow) and adding an additional
rcu_barrier in the gem-suspend flow to purge outstanding
rcu defered tasks that may include context destruction.
This patch was tested and confirmed to be reliably working
after running ~1500 suspend resume cycles on 4 concurrent
machines.
Changes from prior revs:
v6: - Dont hold the spinlock while calling deregister_context
which can take a longer time. (Rodrigo)
- Fix / improve of comments. (Rodrigo)
- Release the ce->guc_state.lock before calling
deregister_context and retake it if we fail. (John/Daniele).
v5: - Remove Patch #3 which doesnt solve this exact bug
but can be a separate patch(Tvrtko).
v4: - In Patch #2, change the position of the calls into
rcu_barrier based on latest testing data. (Alan/Anshuman).
- In Patch #3, fix the timeout value selection for the
final gt-pm idle-wait that was incorrectly using a 'ns'
#define as a milisec timeout.
v3: - In Patch #3, when deregister_context fails, instead
of calling intel_gt_pm_put(that might sleep), call
__intel_wakeref_put (without ASYNC flag) (Rodrigo/Anshuman).
- In wait_for_suspend add an rcu_barrier before we
proceed to wait for idle. (Anshuman)
v2: - Patch #2 Restructure code in guc_lrc_desc_unpin so
it's more readible to differentiate (1)direct guc-id
cleanup ..vs (2) sending the H2G ctx-destroy action ..
vs (3) the unrolling steps if the H2G fails.
- Patch #2 Add a check to close the race sooner by checking
for intel_guc_is_ready from destroyed_worker_func.
- Patch #2 When guc_submission_send_busy_loop gets a
failure from intel_guc_send_busy_loop, we need to undo
i.e. decrement the outstanding_submission_g2h.
- Patch #3 In wait_for_suspend, fix checking of return from
intel_gt_pm_wait_timeout_for_idle to now use -ETIMEDOUT
and add documentation for intel_wakeref_wait_for_idle.
(Rodrigo).
Alan Previn (2):
drm/i915/guc: Flush context destruction worker at suspend
drm/i915/guc: Close deregister-context race against CT-loss
drivers/gpu/drm/i915/gem/i915_gem_pm.c | 10 +++
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 78 +++++++++++++++++--
.../gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 +
drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +
4 files changed, 87 insertions(+), 5 deletions(-)
base-commit: 436cb0ff9f20fadc99ec3b70c4d2ac6cb2e4410a
--
2.39.0
^ permalink raw reply [flat|nested] 18+ messages in thread* [PATCH v7 0/2] Resolve suspend-resume racing with GuC destroy-context-worker @ 2023-11-30 0:20 ` Alan Previn 0 siblings, 0 replies; 18+ messages in thread From: Alan Previn @ 2023-11-30 0:20 UTC (permalink / raw) To: intel-gfx Cc: Alan Previn, Tvrtko Ursulin, Anshuman Gupta, dri-devel, Daniele Ceraolo Spurio, Rodrigo Vivi, John Harrison This series is the result of debugging issues root caused to races between the GuC's destroyed_worker_func being triggered vs repeating suspend-resume cycles with concurrent delayed fence signals for engine-freeing. The reproduction steps require that an app is launched right before the start of the suspend cycle where it creates a new gem context and submits a tiny workload that would complete in the middle of the suspend cycle. However this app uses dma-buffer sharing or dma-fence with non-GPU objects or signals that eventually triggers a FENCE_FREE via__i915_sw_fence_notify that connects to engines_notify -> free_engines_rcu -> intel_context_put -> kref_put(&ce->ref..) that queues the worker after the GuCs CTB has been disabled (i.e. after i915-gem's suspend-late). This sequence is a corner-case and required repeating this app->suspend->resume cycle ~1500 times across 4 identical systems to see it once. That said, based on above callstack, it is clear that merely flushing the context destruction worker, which is obviously missing and needed, isn't sufficient. Because of that, this series adds additional patches besides the obvious (Patch #1) flushing of the worker during the suspend flows. It also includes (Patch #2) closing a race between sending the context-deregistration H2G vs the CTB getting disabled in the midst of it (by detecing the failure and unrolling the guc-lrc-unpin flow) and adding an additional rcu_barrier in the gem-suspend flow to purge outstanding rcu defered tasks that may include context destruction. This patch was tested and confirmed to be reliably working after running ~1500 suspend resume cycles on 4 concurrent machines. Changes from prior revs: v6: - Dont hold the spinlock while calling deregister_context which can take a longer time. (Rodrigo) - Fix / improve of comments. (Rodrigo) - Release the ce->guc_state.lock before calling deregister_context and retake it if we fail. (John/Daniele). v5: - Remove Patch #3 which doesnt solve this exact bug but can be a separate patch(Tvrtko). v4: - In Patch #2, change the position of the calls into rcu_barrier based on latest testing data. (Alan/Anshuman). - In Patch #3, fix the timeout value selection for the final gt-pm idle-wait that was incorrectly using a 'ns' #define as a milisec timeout. v3: - In Patch #3, when deregister_context fails, instead of calling intel_gt_pm_put(that might sleep), call __intel_wakeref_put (without ASYNC flag) (Rodrigo/Anshuman). - In wait_for_suspend add an rcu_barrier before we proceed to wait for idle. (Anshuman) v2: - Patch #2 Restructure code in guc_lrc_desc_unpin so it's more readible to differentiate (1)direct guc-id cleanup ..vs (2) sending the H2G ctx-destroy action .. vs (3) the unrolling steps if the H2G fails. - Patch #2 Add a check to close the race sooner by checking for intel_guc_is_ready from destroyed_worker_func. - Patch #2 When guc_submission_send_busy_loop gets a failure from intel_guc_send_busy_loop, we need to undo i.e. decrement the outstanding_submission_g2h. - Patch #3 In wait_for_suspend, fix checking of return from intel_gt_pm_wait_timeout_for_idle to now use -ETIMEDOUT and add documentation for intel_wakeref_wait_for_idle. (Rodrigo). Alan Previn (2): drm/i915/guc: Flush context destruction worker at suspend drm/i915/guc: Close deregister-context race against CT-loss drivers/gpu/drm/i915/gem/i915_gem_pm.c | 10 +++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 78 +++++++++++++++++-- .../gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 + drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 + 4 files changed, 87 insertions(+), 5 deletions(-) base-commit: 436cb0ff9f20fadc99ec3b70c4d2ac6cb2e4410a -- 2.39.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Intel-gfx] [PATCH v7 1/2] drm/i915/guc: Flush context destruction worker at suspend 2023-11-30 0:20 ` Alan Previn @ 2023-11-30 0:20 ` Alan Previn -1 siblings, 0 replies; 18+ messages in thread From: Alan Previn @ 2023-11-30 0:20 UTC (permalink / raw) To: intel-gfx Cc: Alan Previn, Tvrtko Ursulin, dri-devel, Rodrigo Vivi, Mousumi Jana When suspending, flush the context-guc-id deregistration worker at the final stages of intel_gt_suspend_late when we finally call gt_sanitize that eventually leads down to __uc_sanitize so that the deregistration worker doesn't fire off later as we reset the GuC microcontroller. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Tested-by: Mousumi Jana <mousumi.jana@intel.com> --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++ drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 ++ drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 ++ 3 files changed, 9 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 04f8377fd7a3..6e3fb2fcce4f 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1613,6 +1613,11 @@ static void guc_flush_submissions(struct intel_guc *guc) spin_unlock_irqrestore(&sched_engine->lock, flags); } +void intel_guc_submission_flush_work(struct intel_guc *guc) +{ + flush_work(&guc->submission_state.destroyed_worker); +} + static void guc_flush_destroyed_contexts(struct intel_guc *guc); void intel_guc_submission_reset_prepare(struct intel_guc *guc) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h index c57b29cdb1a6..b6df75622d3b 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h @@ -38,6 +38,8 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc, bool interruptible, long timeout); +void intel_guc_submission_flush_work(struct intel_guc *guc); + static inline bool intel_guc_submission_is_supported(struct intel_guc *guc) { return guc->submission_supported; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c index 3872d309ed31..b8b09b1bee3e 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c @@ -690,6 +690,8 @@ void intel_uc_suspend(struct intel_uc *uc) return; } + intel_guc_submission_flush_work(guc); + with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) { err = intel_guc_suspend(guc); if (err) -- 2.39.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v7 1/2] drm/i915/guc: Flush context destruction worker at suspend @ 2023-11-30 0:20 ` Alan Previn 0 siblings, 0 replies; 18+ messages in thread From: Alan Previn @ 2023-11-30 0:20 UTC (permalink / raw) To: intel-gfx Cc: Alan Previn, Tvrtko Ursulin, Anshuman Gupta, dri-devel, Daniele Ceraolo Spurio, Rodrigo Vivi, Mousumi Jana, John Harrison When suspending, flush the context-guc-id deregistration worker at the final stages of intel_gt_suspend_late when we finally call gt_sanitize that eventually leads down to __uc_sanitize so that the deregistration worker doesn't fire off later as we reset the GuC microcontroller. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Tested-by: Mousumi Jana <mousumi.jana@intel.com> --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++ drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 ++ drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 ++ 3 files changed, 9 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 04f8377fd7a3..6e3fb2fcce4f 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1613,6 +1613,11 @@ static void guc_flush_submissions(struct intel_guc *guc) spin_unlock_irqrestore(&sched_engine->lock, flags); } +void intel_guc_submission_flush_work(struct intel_guc *guc) +{ + flush_work(&guc->submission_state.destroyed_worker); +} + static void guc_flush_destroyed_contexts(struct intel_guc *guc); void intel_guc_submission_reset_prepare(struct intel_guc *guc) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h index c57b29cdb1a6..b6df75622d3b 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h @@ -38,6 +38,8 @@ int intel_guc_wait_for_pending_msg(struct intel_guc *guc, bool interruptible, long timeout); +void intel_guc_submission_flush_work(struct intel_guc *guc); + static inline bool intel_guc_submission_is_supported(struct intel_guc *guc) { return guc->submission_supported; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c index 3872d309ed31..b8b09b1bee3e 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c @@ -690,6 +690,8 @@ void intel_uc_suspend(struct intel_uc *uc) return; } + intel_guc_submission_flush_work(guc); + with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) { err = intel_guc_suspend(guc); if (err) -- 2.39.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss 2023-11-30 0:20 ` Alan Previn @ 2023-11-30 0:20 ` Alan Previn -1 siblings, 0 replies; 18+ messages in thread From: Alan Previn @ 2023-11-30 0:20 UTC (permalink / raw) To: intel-gfx Cc: Alan Previn, Tvrtko Ursulin, dri-devel, Rodrigo Vivi, Mousumi Jana If we are at the end of suspend or very early in resume its possible an async fence signal (via rcu_call) is triggered to free_engines which could lead us to the execution of the context destruction worker (after a prior worker flush). Thus, when suspending, insert rcu_barriers at the start of i915_gem_suspend (part of driver's suspend prepare) and again in i915_gem_suspend_late so that all such cases have completed and context destruction list isn't missing anything. In destroyed_worker_func, close the race against CT-loss by checking that CT is enabled before calling into deregister_destroyed_contexts. Based on testing, guc_lrc_desc_unpin may still race and fail as we traverse the GuC's context-destroy list because the CT could be disabled right before calling GuC's CT send function. We've witnessed this race condition once every ~6000-8000 suspend-resume cycles while ensuring workloads that render something onscreen is continuously started just before we suspend (and the workload is small enough to complete and trigger the queued engine/context free-up either very late in suspend or very early in resume). In such a case, we need to unroll the entire process because guc-lrc-unpin takes a gt wakeref which only gets released in the G2H IRQ reply that never comes through in this corner case. Without the unroll, the taken wakeref is leaked and will cascade into a kernel hang later at the tail end of suspend in this function: intel_wakeref_wait_for_idle(>->wakeref) (called by) - intel_gt_pm_wait_for_idle (called by) - wait_for_suspend Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_- contexts if guc_lrc_desc_unpin fails due to CT send falure. When unrolling, keep the context in the GuC's destroy-list so it can get picked up on the next destroy worker invocation (if suspend aborted) or get fully purged as part of a GuC sanitization (end of suspend) or a reset flow. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Tested-by: Mousumi Jana <mousumi.jana@intel.com> --- drivers/gpu/drm/i915/gem/i915_gem_pm.c | 10 +++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 73 +++++++++++++++++-- 2 files changed, 78 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_pm.c index 0d812f4d787d..3b27218aabe2 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pm.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pm.c @@ -28,6 +28,13 @@ void i915_gem_suspend(struct drm_i915_private *i915) GEM_TRACE("%s\n", dev_name(i915->drm.dev)); intel_wakeref_auto(&i915->runtime_pm.userfault_wakeref, 0); + /* + * On rare occasions, we've observed the fence completion triggers + * free_engines asynchronously via rcu_call. Ensure those are done. + * This path is only called on suspend, so it's an acceptable cost. + */ + rcu_barrier(); + flush_workqueue(i915->wq); /* @@ -160,6 +167,9 @@ void i915_gem_suspend_late(struct drm_i915_private *i915) * machine in an unusable condition. */ + /* Like i915_gem_suspend, flush tasks staged from fence triggers */ + rcu_barrier(); + for_each_gt(gt, i915, i) intel_gt_suspend_late(gt); diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 6e3fb2fcce4f..a7228bebec39 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -236,6 +236,13 @@ set_context_destroyed(struct intel_context *ce) ce->guc_state.sched_state |= SCHED_STATE_DESTROYED; } +static inline void +clr_context_destroyed(struct intel_context *ce) +{ + lockdep_assert_held(&ce->guc_state.lock); + ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED; +} + static inline bool context_pending_disable(struct intel_context *ce) { return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE; @@ -613,6 +620,8 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, u32 g2h_len_dw, bool loop) { + int ret; + /* * We always loop when a send requires a reply (i.e. g2h_len_dw > 0), * so we don't handle the case where we don't get a reply because we @@ -623,7 +632,11 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, if (g2h_len_dw) atomic_inc(&guc->outstanding_submission_g2h); - return intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); + ret = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); + if (ret) + atomic_dec(&guc->outstanding_submission_g2h); + + return ret; } int intel_guc_wait_for_pending_msg(struct intel_guc *guc, @@ -3288,12 +3301,13 @@ static void guc_context_close(struct intel_context *ce) spin_unlock_irqrestore(&ce->guc_state.lock, flags); } -static inline void guc_lrc_desc_unpin(struct intel_context *ce) +static inline int guc_lrc_desc_unpin(struct intel_context *ce) { struct intel_guc *guc = ce_to_guc(ce); struct intel_gt *gt = guc_to_gt(guc); unsigned long flags; bool disabled; + int ret; GEM_BUG_ON(!intel_gt_pm_is_awake(gt)); GEM_BUG_ON(!ctx_id_mapped(guc, ce->guc_id.id)); @@ -3304,18 +3318,41 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce) spin_lock_irqsave(&ce->guc_state.lock, flags); disabled = submission_disabled(guc); if (likely(!disabled)) { + /* + * Take a gt-pm ref and change context state to be destroyed. + * NOTE: a G2H IRQ that comes after will put this gt-pm ref back + */ __intel_gt_pm_get(gt); set_context_destroyed(ce); clr_context_registered(ce); } spin_unlock_irqrestore(&ce->guc_state.lock, flags); + if (unlikely(disabled)) { release_guc_id(guc, ce); __guc_context_destroy(ce); - return; + return 0; } - deregister_context(ce, ce->guc_id.id); + /* + * GuC is active, lets destroy this context, but at this point we can still be racing + * with suspend, so we undo everything if the H2G fails in deregister_context so + * that GuC reset will find this context during clean up. + */ + ret = deregister_context(ce, ce->guc_id.id); + if (ret) { + spin_lock(&ce->guc_state.lock); + set_context_registered(ce); + clr_context_destroyed(ce); + spin_unlock(&ce->guc_state.lock); + /* + * As gt-pm is awake at function entry, intel_wakeref_put_async merely decrements + * the wakeref immediately but per function spec usage call this after unlock. + */ + intel_wakeref_put_async(>->wakeref); + } + + return ret; } static void __guc_context_destroy(struct intel_context *ce) @@ -3383,7 +3420,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc) if (!ce) break; - guc_lrc_desc_unpin(ce); + if (guc_lrc_desc_unpin(ce)) { + /* + * This means GuC's CT link severed mid-way which could happen + * in suspend-resume corner cases. In this case, put the + * context back into the destroyed_contexts list which will + * get picked up on the next context deregistration event or + * purged in a GuC sanitization event (reset/unload/wedged/...). + */ + spin_lock_irqsave(&guc->submission_state.lock, flags); + list_add_tail(&ce->destroyed_link, + &guc->submission_state.destroyed_contexts); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); + /* Bail now since the list might never be emptied if h2gs fail */ + break; + } + } } @@ -3394,6 +3446,17 @@ static void destroyed_worker_func(struct work_struct *w) struct intel_gt *gt = guc_to_gt(guc); intel_wakeref_t wakeref; + /* + * In rare cases we can get here via async context-free fence-signals that + * come very late in suspend flow or very early in resume flows. In these + * cases, GuC won't be ready but just skipping it here is fine as these + * pending-destroy-contexts get destroyed totally at GuC reset time at the + * end of suspend.. OR.. this worker can be picked up later on the next + * context destruction trigger after resume-completes + */ + if (!intel_guc_is_ready(guc)) + return; + with_intel_gt_pm(gt, wakeref) deregister_destroyed_contexts(guc); } -- 2.39.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss @ 2023-11-30 0:20 ` Alan Previn 0 siblings, 0 replies; 18+ messages in thread From: Alan Previn @ 2023-11-30 0:20 UTC (permalink / raw) To: intel-gfx Cc: Alan Previn, Tvrtko Ursulin, Anshuman Gupta, dri-devel, Daniele Ceraolo Spurio, Rodrigo Vivi, Mousumi Jana, John Harrison If we are at the end of suspend or very early in resume its possible an async fence signal (via rcu_call) is triggered to free_engines which could lead us to the execution of the context destruction worker (after a prior worker flush). Thus, when suspending, insert rcu_barriers at the start of i915_gem_suspend (part of driver's suspend prepare) and again in i915_gem_suspend_late so that all such cases have completed and context destruction list isn't missing anything. In destroyed_worker_func, close the race against CT-loss by checking that CT is enabled before calling into deregister_destroyed_contexts. Based on testing, guc_lrc_desc_unpin may still race and fail as we traverse the GuC's context-destroy list because the CT could be disabled right before calling GuC's CT send function. We've witnessed this race condition once every ~6000-8000 suspend-resume cycles while ensuring workloads that render something onscreen is continuously started just before we suspend (and the workload is small enough to complete and trigger the queued engine/context free-up either very late in suspend or very early in resume). In such a case, we need to unroll the entire process because guc-lrc-unpin takes a gt wakeref which only gets released in the G2H IRQ reply that never comes through in this corner case. Without the unroll, the taken wakeref is leaked and will cascade into a kernel hang later at the tail end of suspend in this function: intel_wakeref_wait_for_idle(>->wakeref) (called by) - intel_gt_pm_wait_for_idle (called by) - wait_for_suspend Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_- contexts if guc_lrc_desc_unpin fails due to CT send falure. When unrolling, keep the context in the GuC's destroy-list so it can get picked up on the next destroy worker invocation (if suspend aborted) or get fully purged as part of a GuC sanitization (end of suspend) or a reset flow. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Tested-by: Mousumi Jana <mousumi.jana@intel.com> --- drivers/gpu/drm/i915/gem/i915_gem_pm.c | 10 +++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 73 +++++++++++++++++-- 2 files changed, 78 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_pm.c index 0d812f4d787d..3b27218aabe2 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pm.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pm.c @@ -28,6 +28,13 @@ void i915_gem_suspend(struct drm_i915_private *i915) GEM_TRACE("%s\n", dev_name(i915->drm.dev)); intel_wakeref_auto(&i915->runtime_pm.userfault_wakeref, 0); + /* + * On rare occasions, we've observed the fence completion triggers + * free_engines asynchronously via rcu_call. Ensure those are done. + * This path is only called on suspend, so it's an acceptable cost. + */ + rcu_barrier(); + flush_workqueue(i915->wq); /* @@ -160,6 +167,9 @@ void i915_gem_suspend_late(struct drm_i915_private *i915) * machine in an unusable condition. */ + /* Like i915_gem_suspend, flush tasks staged from fence triggers */ + rcu_barrier(); + for_each_gt(gt, i915, i) intel_gt_suspend_late(gt); diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 6e3fb2fcce4f..a7228bebec39 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -236,6 +236,13 @@ set_context_destroyed(struct intel_context *ce) ce->guc_state.sched_state |= SCHED_STATE_DESTROYED; } +static inline void +clr_context_destroyed(struct intel_context *ce) +{ + lockdep_assert_held(&ce->guc_state.lock); + ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED; +} + static inline bool context_pending_disable(struct intel_context *ce) { return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE; @@ -613,6 +620,8 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, u32 g2h_len_dw, bool loop) { + int ret; + /* * We always loop when a send requires a reply (i.e. g2h_len_dw > 0), * so we don't handle the case where we don't get a reply because we @@ -623,7 +632,11 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, if (g2h_len_dw) atomic_inc(&guc->outstanding_submission_g2h); - return intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); + ret = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); + if (ret) + atomic_dec(&guc->outstanding_submission_g2h); + + return ret; } int intel_guc_wait_for_pending_msg(struct intel_guc *guc, @@ -3288,12 +3301,13 @@ static void guc_context_close(struct intel_context *ce) spin_unlock_irqrestore(&ce->guc_state.lock, flags); } -static inline void guc_lrc_desc_unpin(struct intel_context *ce) +static inline int guc_lrc_desc_unpin(struct intel_context *ce) { struct intel_guc *guc = ce_to_guc(ce); struct intel_gt *gt = guc_to_gt(guc); unsigned long flags; bool disabled; + int ret; GEM_BUG_ON(!intel_gt_pm_is_awake(gt)); GEM_BUG_ON(!ctx_id_mapped(guc, ce->guc_id.id)); @@ -3304,18 +3318,41 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce) spin_lock_irqsave(&ce->guc_state.lock, flags); disabled = submission_disabled(guc); if (likely(!disabled)) { + /* + * Take a gt-pm ref and change context state to be destroyed. + * NOTE: a G2H IRQ that comes after will put this gt-pm ref back + */ __intel_gt_pm_get(gt); set_context_destroyed(ce); clr_context_registered(ce); } spin_unlock_irqrestore(&ce->guc_state.lock, flags); + if (unlikely(disabled)) { release_guc_id(guc, ce); __guc_context_destroy(ce); - return; + return 0; } - deregister_context(ce, ce->guc_id.id); + /* + * GuC is active, lets destroy this context, but at this point we can still be racing + * with suspend, so we undo everything if the H2G fails in deregister_context so + * that GuC reset will find this context during clean up. + */ + ret = deregister_context(ce, ce->guc_id.id); + if (ret) { + spin_lock(&ce->guc_state.lock); + set_context_registered(ce); + clr_context_destroyed(ce); + spin_unlock(&ce->guc_state.lock); + /* + * As gt-pm is awake at function entry, intel_wakeref_put_async merely decrements + * the wakeref immediately but per function spec usage call this after unlock. + */ + intel_wakeref_put_async(>->wakeref); + } + + return ret; } static void __guc_context_destroy(struct intel_context *ce) @@ -3383,7 +3420,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc) if (!ce) break; - guc_lrc_desc_unpin(ce); + if (guc_lrc_desc_unpin(ce)) { + /* + * This means GuC's CT link severed mid-way which could happen + * in suspend-resume corner cases. In this case, put the + * context back into the destroyed_contexts list which will + * get picked up on the next context deregistration event or + * purged in a GuC sanitization event (reset/unload/wedged/...). + */ + spin_lock_irqsave(&guc->submission_state.lock, flags); + list_add_tail(&ce->destroyed_link, + &guc->submission_state.destroyed_contexts); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); + /* Bail now since the list might never be emptied if h2gs fail */ + break; + } + } } @@ -3394,6 +3446,17 @@ static void destroyed_worker_func(struct work_struct *w) struct intel_gt *gt = guc_to_gt(guc); intel_wakeref_t wakeref; + /* + * In rare cases we can get here via async context-free fence-signals that + * come very late in suspend flow or very early in resume flows. In these + * cases, GuC won't be ready but just skipping it here is fine as these + * pending-destroy-contexts get destroyed totally at GuC reset time at the + * end of suspend.. OR.. this worker can be picked up later on the next + * context destruction trigger after resume-completes + */ + if (!intel_guc_is_ready(guc)) + return; + with_intel_gt_pm(gt, wakeref) deregister_destroyed_contexts(guc); } -- 2.39.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss 2023-11-30 0:20 ` Alan Previn @ 2023-11-30 21:18 ` Rodrigo Vivi -1 siblings, 0 replies; 18+ messages in thread From: Rodrigo Vivi @ 2023-11-30 21:18 UTC (permalink / raw) To: Alan Previn; +Cc: Tvrtko Ursulin, intel-gfx, dri-devel, Mousumi Jana On Wed, Nov 29, 2023 at 04:20:13PM -0800, Alan Previn wrote: > If we are at the end of suspend or very early in resume > its possible an async fence signal (via rcu_call) is triggered > to free_engines which could lead us to the execution of > the context destruction worker (after a prior worker flush). > > Thus, when suspending, insert rcu_barriers at the start > of i915_gem_suspend (part of driver's suspend prepare) and > again in i915_gem_suspend_late so that all such cases have > completed and context destruction list isn't missing anything. > > In destroyed_worker_func, close the race against CT-loss > by checking that CT is enabled before calling into > deregister_destroyed_contexts. > > Based on testing, guc_lrc_desc_unpin may still race and fail > as we traverse the GuC's context-destroy list because the > CT could be disabled right before calling GuC's CT send function. > > We've witnessed this race condition once every ~6000-8000 > suspend-resume cycles while ensuring workloads that render > something onscreen is continuously started just before > we suspend (and the workload is small enough to complete > and trigger the queued engine/context free-up either very > late in suspend or very early in resume). > > In such a case, we need to unroll the entire process because > guc-lrc-unpin takes a gt wakeref which only gets released in > the G2H IRQ reply that never comes through in this corner > case. Without the unroll, the taken wakeref is leaked and will > cascade into a kernel hang later at the tail end of suspend in > this function: > > intel_wakeref_wait_for_idle(>->wakeref) > (called by) - intel_gt_pm_wait_for_idle > (called by) - wait_for_suspend > > Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_- > contexts if guc_lrc_desc_unpin fails due to CT send falure. > When unrolling, keep the context in the GuC's destroy-list so > it can get picked up on the next destroy worker invocation > (if suspend aborted) or get fully purged as part of a GuC > sanitization (end of suspend) or a reset flow. > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> > Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> > Tested-by: Mousumi Jana <mousumi.jana@intel.com> > --- > drivers/gpu/drm/i915/gem/i915_gem_pm.c | 10 +++ > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 73 +++++++++++++++++-- > 2 files changed, 78 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_pm.c > index 0d812f4d787d..3b27218aabe2 100644 > --- a/drivers/gpu/drm/i915/gem/i915_gem_pm.c > +++ b/drivers/gpu/drm/i915/gem/i915_gem_pm.c > @@ -28,6 +28,13 @@ void i915_gem_suspend(struct drm_i915_private *i915) > GEM_TRACE("%s\n", dev_name(i915->drm.dev)); > > intel_wakeref_auto(&i915->runtime_pm.userfault_wakeref, 0); > + /* > + * On rare occasions, we've observed the fence completion triggers > + * free_engines asynchronously via rcu_call. Ensure those are done. > + * This path is only called on suspend, so it's an acceptable cost. > + */ > + rcu_barrier(); > + > flush_workqueue(i915->wq); > > /* > @@ -160,6 +167,9 @@ void i915_gem_suspend_late(struct drm_i915_private *i915) > * machine in an unusable condition. > */ > > + /* Like i915_gem_suspend, flush tasks staged from fence triggers */ > + rcu_barrier(); > + > for_each_gt(gt, i915, i) > intel_gt_suspend_late(gt); > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > index 6e3fb2fcce4f..a7228bebec39 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > @@ -236,6 +236,13 @@ set_context_destroyed(struct intel_context *ce) > ce->guc_state.sched_state |= SCHED_STATE_DESTROYED; > } > > +static inline void > +clr_context_destroyed(struct intel_context *ce) > +{ > + lockdep_assert_held(&ce->guc_state.lock); > + ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED; > +} > + > static inline bool context_pending_disable(struct intel_context *ce) > { > return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE; > @@ -613,6 +620,8 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, > u32 g2h_len_dw, > bool loop) > { > + int ret; > + > /* > * We always loop when a send requires a reply (i.e. g2h_len_dw > 0), > * so we don't handle the case where we don't get a reply because we > @@ -623,7 +632,11 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, > if (g2h_len_dw) > atomic_inc(&guc->outstanding_submission_g2h); > > - return intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); > + ret = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); > + if (ret) > + atomic_dec(&guc->outstanding_submission_g2h); > + > + return ret; > } > > int intel_guc_wait_for_pending_msg(struct intel_guc *guc, > @@ -3288,12 +3301,13 @@ static void guc_context_close(struct intel_context *ce) > spin_unlock_irqrestore(&ce->guc_state.lock, flags); > } > > -static inline void guc_lrc_desc_unpin(struct intel_context *ce) > +static inline int guc_lrc_desc_unpin(struct intel_context *ce) > { > struct intel_guc *guc = ce_to_guc(ce); > struct intel_gt *gt = guc_to_gt(guc); > unsigned long flags; > bool disabled; > + int ret; > > GEM_BUG_ON(!intel_gt_pm_is_awake(gt)); > GEM_BUG_ON(!ctx_id_mapped(guc, ce->guc_id.id)); > @@ -3304,18 +3318,41 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce) > spin_lock_irqsave(&ce->guc_state.lock, flags); > disabled = submission_disabled(guc); > if (likely(!disabled)) { > + /* > + * Take a gt-pm ref and change context state to be destroyed. > + * NOTE: a G2H IRQ that comes after will put this gt-pm ref back > + */ > __intel_gt_pm_get(gt); > set_context_destroyed(ce); > clr_context_registered(ce); > } > spin_unlock_irqrestore(&ce->guc_state.lock, flags); > + > if (unlikely(disabled)) { > release_guc_id(guc, ce); > __guc_context_destroy(ce); > - return; > + return 0; is success the right return case here? > } > > - deregister_context(ce, ce->guc_id.id); > + /* > + * GuC is active, lets destroy this context, but at this point we can still be racing > + * with suspend, so we undo everything if the H2G fails in deregister_context so > + * that GuC reset will find this context during clean up. > + */ > + ret = deregister_context(ce, ce->guc_id.id); > + if (ret) { > + spin_lock(&ce->guc_state.lock); > + set_context_registered(ce); > + clr_context_destroyed(ce); > + spin_unlock(&ce->guc_state.lock); > + /* > + * As gt-pm is awake at function entry, intel_wakeref_put_async merely decrements > + * the wakeref immediately but per function spec usage call this after unlock. > + */ > + intel_wakeref_put_async(>->wakeref); > + } > + > + return ret; > } > > static void __guc_context_destroy(struct intel_context *ce) > @@ -3383,7 +3420,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc) > if (!ce) > break; > > - guc_lrc_desc_unpin(ce); > + if (guc_lrc_desc_unpin(ce)) { > + /* > + * This means GuC's CT link severed mid-way which could happen > + * in suspend-resume corner cases. In this case, put the > + * context back into the destroyed_contexts list which will > + * get picked up on the next context deregistration event or > + * purged in a GuC sanitization event (reset/unload/wedged/...). > + */ > + spin_lock_irqsave(&guc->submission_state.lock, flags); > + list_add_tail(&ce->destroyed_link, > + &guc->submission_state.destroyed_contexts); > + spin_unlock_irqrestore(&guc->submission_state.lock, flags); > + /* Bail now since the list might never be emptied if h2gs fail */ > + break; > + } > + > } > } > > @@ -3394,6 +3446,17 @@ static void destroyed_worker_func(struct work_struct *w) > struct intel_gt *gt = guc_to_gt(guc); > intel_wakeref_t wakeref; > > + /* > + * In rare cases we can get here via async context-free fence-signals that > + * come very late in suspend flow or very early in resume flows. In these > + * cases, GuC won't be ready but just skipping it here is fine as these > + * pending-destroy-contexts get destroyed totally at GuC reset time at the > + * end of suspend.. OR.. this worker can be picked up later on the next > + * context destruction trigger after resume-completes > + */ > + if (!intel_guc_is_ready(guc)) > + return; > + > with_intel_gt_pm(gt, wakeref) > deregister_destroyed_contexts(guc); > } > -- > 2.39.0 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss @ 2023-11-30 21:18 ` Rodrigo Vivi 0 siblings, 0 replies; 18+ messages in thread From: Rodrigo Vivi @ 2023-11-30 21:18 UTC (permalink / raw) To: Alan Previn Cc: Tvrtko Ursulin, Anshuman Gupta, intel-gfx, dri-devel, Daniele Ceraolo Spurio, Mousumi Jana, John Harrison On Wed, Nov 29, 2023 at 04:20:13PM -0800, Alan Previn wrote: > If we are at the end of suspend or very early in resume > its possible an async fence signal (via rcu_call) is triggered > to free_engines which could lead us to the execution of > the context destruction worker (after a prior worker flush). > > Thus, when suspending, insert rcu_barriers at the start > of i915_gem_suspend (part of driver's suspend prepare) and > again in i915_gem_suspend_late so that all such cases have > completed and context destruction list isn't missing anything. > > In destroyed_worker_func, close the race against CT-loss > by checking that CT is enabled before calling into > deregister_destroyed_contexts. > > Based on testing, guc_lrc_desc_unpin may still race and fail > as we traverse the GuC's context-destroy list because the > CT could be disabled right before calling GuC's CT send function. > > We've witnessed this race condition once every ~6000-8000 > suspend-resume cycles while ensuring workloads that render > something onscreen is continuously started just before > we suspend (and the workload is small enough to complete > and trigger the queued engine/context free-up either very > late in suspend or very early in resume). > > In such a case, we need to unroll the entire process because > guc-lrc-unpin takes a gt wakeref which only gets released in > the G2H IRQ reply that never comes through in this corner > case. Without the unroll, the taken wakeref is leaked and will > cascade into a kernel hang later at the tail end of suspend in > this function: > > intel_wakeref_wait_for_idle(>->wakeref) > (called by) - intel_gt_pm_wait_for_idle > (called by) - wait_for_suspend > > Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_- > contexts if guc_lrc_desc_unpin fails due to CT send falure. > When unrolling, keep the context in the GuC's destroy-list so > it can get picked up on the next destroy worker invocation > (if suspend aborted) or get fully purged as part of a GuC > sanitization (end of suspend) or a reset flow. > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> > Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> > Tested-by: Mousumi Jana <mousumi.jana@intel.com> > --- > drivers/gpu/drm/i915/gem/i915_gem_pm.c | 10 +++ > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 73 +++++++++++++++++-- > 2 files changed, 78 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_pm.c > index 0d812f4d787d..3b27218aabe2 100644 > --- a/drivers/gpu/drm/i915/gem/i915_gem_pm.c > +++ b/drivers/gpu/drm/i915/gem/i915_gem_pm.c > @@ -28,6 +28,13 @@ void i915_gem_suspend(struct drm_i915_private *i915) > GEM_TRACE("%s\n", dev_name(i915->drm.dev)); > > intel_wakeref_auto(&i915->runtime_pm.userfault_wakeref, 0); > + /* > + * On rare occasions, we've observed the fence completion triggers > + * free_engines asynchronously via rcu_call. Ensure those are done. > + * This path is only called on suspend, so it's an acceptable cost. > + */ > + rcu_barrier(); > + > flush_workqueue(i915->wq); > > /* > @@ -160,6 +167,9 @@ void i915_gem_suspend_late(struct drm_i915_private *i915) > * machine in an unusable condition. > */ > > + /* Like i915_gem_suspend, flush tasks staged from fence triggers */ > + rcu_barrier(); > + > for_each_gt(gt, i915, i) > intel_gt_suspend_late(gt); > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > index 6e3fb2fcce4f..a7228bebec39 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > @@ -236,6 +236,13 @@ set_context_destroyed(struct intel_context *ce) > ce->guc_state.sched_state |= SCHED_STATE_DESTROYED; > } > > +static inline void > +clr_context_destroyed(struct intel_context *ce) > +{ > + lockdep_assert_held(&ce->guc_state.lock); > + ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED; > +} > + > static inline bool context_pending_disable(struct intel_context *ce) > { > return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE; > @@ -613,6 +620,8 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, > u32 g2h_len_dw, > bool loop) > { > + int ret; > + > /* > * We always loop when a send requires a reply (i.e. g2h_len_dw > 0), > * so we don't handle the case where we don't get a reply because we > @@ -623,7 +632,11 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc, > if (g2h_len_dw) > atomic_inc(&guc->outstanding_submission_g2h); > > - return intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); > + ret = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop); > + if (ret) > + atomic_dec(&guc->outstanding_submission_g2h); > + > + return ret; > } > > int intel_guc_wait_for_pending_msg(struct intel_guc *guc, > @@ -3288,12 +3301,13 @@ static void guc_context_close(struct intel_context *ce) > spin_unlock_irqrestore(&ce->guc_state.lock, flags); > } > > -static inline void guc_lrc_desc_unpin(struct intel_context *ce) > +static inline int guc_lrc_desc_unpin(struct intel_context *ce) > { > struct intel_guc *guc = ce_to_guc(ce); > struct intel_gt *gt = guc_to_gt(guc); > unsigned long flags; > bool disabled; > + int ret; > > GEM_BUG_ON(!intel_gt_pm_is_awake(gt)); > GEM_BUG_ON(!ctx_id_mapped(guc, ce->guc_id.id)); > @@ -3304,18 +3318,41 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce) > spin_lock_irqsave(&ce->guc_state.lock, flags); > disabled = submission_disabled(guc); > if (likely(!disabled)) { > + /* > + * Take a gt-pm ref and change context state to be destroyed. > + * NOTE: a G2H IRQ that comes after will put this gt-pm ref back > + */ > __intel_gt_pm_get(gt); > set_context_destroyed(ce); > clr_context_registered(ce); > } > spin_unlock_irqrestore(&ce->guc_state.lock, flags); > + > if (unlikely(disabled)) { > release_guc_id(guc, ce); > __guc_context_destroy(ce); > - return; > + return 0; is success the right return case here? > } > > - deregister_context(ce, ce->guc_id.id); > + /* > + * GuC is active, lets destroy this context, but at this point we can still be racing > + * with suspend, so we undo everything if the H2G fails in deregister_context so > + * that GuC reset will find this context during clean up. > + */ > + ret = deregister_context(ce, ce->guc_id.id); > + if (ret) { > + spin_lock(&ce->guc_state.lock); > + set_context_registered(ce); > + clr_context_destroyed(ce); > + spin_unlock(&ce->guc_state.lock); > + /* > + * As gt-pm is awake at function entry, intel_wakeref_put_async merely decrements > + * the wakeref immediately but per function spec usage call this after unlock. > + */ > + intel_wakeref_put_async(>->wakeref); > + } > + > + return ret; > } > > static void __guc_context_destroy(struct intel_context *ce) > @@ -3383,7 +3420,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc) > if (!ce) > break; > > - guc_lrc_desc_unpin(ce); > + if (guc_lrc_desc_unpin(ce)) { > + /* > + * This means GuC's CT link severed mid-way which could happen > + * in suspend-resume corner cases. In this case, put the > + * context back into the destroyed_contexts list which will > + * get picked up on the next context deregistration event or > + * purged in a GuC sanitization event (reset/unload/wedged/...). > + */ > + spin_lock_irqsave(&guc->submission_state.lock, flags); > + list_add_tail(&ce->destroyed_link, > + &guc->submission_state.destroyed_contexts); > + spin_unlock_irqrestore(&guc->submission_state.lock, flags); > + /* Bail now since the list might never be emptied if h2gs fail */ > + break; > + } > + > } > } > > @@ -3394,6 +3446,17 @@ static void destroyed_worker_func(struct work_struct *w) > struct intel_gt *gt = guc_to_gt(guc); > intel_wakeref_t wakeref; > > + /* > + * In rare cases we can get here via async context-free fence-signals that > + * come very late in suspend flow or very early in resume flows. In these > + * cases, GuC won't be ready but just skipping it here is fine as these > + * pending-destroy-contexts get destroyed totally at GuC reset time at the > + * end of suspend.. OR.. this worker can be picked up later on the next > + * context destruction trigger after resume-completes > + */ > + if (!intel_guc_is_ready(guc)) > + return; > + > with_intel_gt_pm(gt, wakeref) > deregister_destroyed_contexts(guc); > } > -- > 2.39.0 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss 2023-11-30 21:18 ` Rodrigo Vivi @ 2023-12-01 0:09 ` Teres Alexis, Alan Previn -1 siblings, 0 replies; 18+ messages in thread From: Teres Alexis, Alan Previn @ 2023-12-01 0:09 UTC (permalink / raw) To: Vivi, Rodrigo Cc: Ursulin, Tvrtko, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Jana, Mousumi On Thu, 2023-11-30 at 16:18 -0500, Vivi, Rodrigo wrote: > On Wed, Nov 29, 2023 at 04:20:13PM -0800, Alan Previn wrote: alan:snip > > + > > if (unlikely(disabled)) { > > release_guc_id(guc, ce); > > __guc_context_destroy(ce); > > - return; > > + return 0; > > is success the right return case here? alan: yes: we may discover "disabled == true" if submission_disabled found that gt-is-wedged. I dont believe such a case will happen as part of flushing destroyed_worker_func during suspend but may occur as part of regular runtime context closing that just happens to get queued in the middle of a gt-reset/wedging process. In such a case, the reset-prepare code will sanitize everytihng including cleaning up the pending-destructoin-contexts-link-list. So its either we pick it from here and dump it ... or reset picks it first and dumps it there (where both dumpings only occur if guc got disabled first). Supplemental: How regular context cleanup leads to the same path --> i915_sw_fence_notify -> engines_notify -> free_engines_rcu -> intel_context_put -> kref_put(&ce->ref..) -> ce->ops->destroy -> (where ce->ops = engine->cops and engine->cops = guc_context_ops) *and, guc_context_ops->destroy == guc_context_destroy so -> ce->ops->destroy -> guc_context_destroy -> queue_work(..&guc->submission_state.destroyed_worker); -> [eventually] -> the same guc_lrc_unpin above However with additional "if (!intel_guc_is_ready(guc))" in destroyed_worker_func as part of this patch, hitting this "disabled==true" case will be even less likely. As far as i can tell, its only if we started resetting / wedging right after this queued worker got started. (i ran out of time to check if reset can ever occur from within the same system_unbound_wq but then i recall we also have CI using debugfs to force wedging for select (/all?) igt tests) so i suspect it can still happen in parallel. NOTE: the original checking of the "is disabled" is not new code - its the original code. ...alan P.S.- oh man, that took a lot of code tracing as i can't remember these paths by heart. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss @ 2023-12-01 0:09 ` Teres Alexis, Alan Previn 0 siblings, 0 replies; 18+ messages in thread From: Teres Alexis, Alan Previn @ 2023-12-01 0:09 UTC (permalink / raw) To: Vivi, Rodrigo Cc: Ursulin, Tvrtko, Gupta, Anshuman, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Ceraolo Spurio, Daniele, Jana, Mousumi, Harrison, John C On Thu, 2023-11-30 at 16:18 -0500, Vivi, Rodrigo wrote: > On Wed, Nov 29, 2023 at 04:20:13PM -0800, Alan Previn wrote: alan:snip > > + > > if (unlikely(disabled)) { > > release_guc_id(guc, ce); > > __guc_context_destroy(ce); > > - return; > > + return 0; > > is success the right return case here? alan: yes: we may discover "disabled == true" if submission_disabled found that gt-is-wedged. I dont believe such a case will happen as part of flushing destroyed_worker_func during suspend but may occur as part of regular runtime context closing that just happens to get queued in the middle of a gt-reset/wedging process. In such a case, the reset-prepare code will sanitize everytihng including cleaning up the pending-destructoin-contexts-link-list. So its either we pick it from here and dump it ... or reset picks it first and dumps it there (where both dumpings only occur if guc got disabled first). Supplemental: How regular context cleanup leads to the same path --> i915_sw_fence_notify -> engines_notify -> free_engines_rcu -> intel_context_put -> kref_put(&ce->ref..) -> ce->ops->destroy -> (where ce->ops = engine->cops and engine->cops = guc_context_ops) *and, guc_context_ops->destroy == guc_context_destroy so -> ce->ops->destroy -> guc_context_destroy -> queue_work(..&guc->submission_state.destroyed_worker); -> [eventually] -> the same guc_lrc_unpin above However with additional "if (!intel_guc_is_ready(guc))" in destroyed_worker_func as part of this patch, hitting this "disabled==true" case will be even less likely. As far as i can tell, its only if we started resetting / wedging right after this queued worker got started. (i ran out of time to check if reset can ever occur from within the same system_unbound_wq but then i recall we also have CI using debugfs to force wedging for select (/all?) igt tests) so i suspect it can still happen in parallel. NOTE: the original checking of the "is disabled" is not new code - its the original code. ...alan P.S.- oh man, that took a lot of code tracing as i can't remember these paths by heart. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss 2023-12-01 0:09 ` Teres Alexis, Alan Previn @ 2023-12-01 0:10 ` Teres Alexis, Alan Previn -1 siblings, 0 replies; 18+ messages in thread From: Teres Alexis, Alan Previn @ 2023-12-01 0:10 UTC (permalink / raw) To: Vivi, Rodrigo, Ceraolo Spurio, Daniele Cc: Ursulin, Tvrtko, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Jana, Mousumi > As far as i can tell, its only if we started resetting / wedging right after this > queued worker got started. alan: hope Daniele can proof read my tracing and confirm if got it right. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss @ 2023-12-01 0:10 ` Teres Alexis, Alan Previn 0 siblings, 0 replies; 18+ messages in thread From: Teres Alexis, Alan Previn @ 2023-12-01 0:10 UTC (permalink / raw) To: Vivi, Rodrigo, Ceraolo Spurio, Daniele Cc: Ursulin, Tvrtko, Gupta, Anshuman, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Jana, Mousumi, Harrison, John C > As far as i can tell, its only if we started resetting / wedging right after this > queued worker got started. alan: hope Daniele can proof read my tracing and confirm if got it right. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss 2023-12-01 0:10 ` Teres Alexis, Alan Previn @ 2023-12-06 21:47 ` Daniele Ceraolo Spurio -1 siblings, 0 replies; 18+ messages in thread From: Daniele Ceraolo Spurio @ 2023-12-06 21:47 UTC (permalink / raw) To: Teres Alexis, Alan Previn, Vivi, Rodrigo Cc: Ursulin, Tvrtko, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Jana, Mousumi On 11/30/2023 4:10 PM, Teres Alexis, Alan Previn wrote: >> As far as i can tell, its only if we started resetting / wedging right after this >> queued worker got started. > alan: hope Daniele can proof read my tracing and confirm if got it right. Yup, we don't flush the worker in reset prepare, so there is a chance that it might run parallel to the reset/wedge code, which we handle by checking the submission status. The list manipulation is protected by spinlock so we're safe on that side. The rest of the approach also LGTM: Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Daniele ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss @ 2023-12-06 21:47 ` Daniele Ceraolo Spurio 0 siblings, 0 replies; 18+ messages in thread From: Daniele Ceraolo Spurio @ 2023-12-06 21:47 UTC (permalink / raw) To: Teres Alexis, Alan Previn, Vivi, Rodrigo Cc: Ursulin, Tvrtko, Gupta, Anshuman, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Jana, Mousumi, Harrison, John C On 11/30/2023 4:10 PM, Teres Alexis, Alan Previn wrote: >> As far as i can tell, its only if we started resetting / wedging right after this >> queued worker got started. > alan: hope Daniele can proof read my tracing and confirm if got it right. Yup, we don't flush the worker in reset prepare, so there is a chance that it might run parallel to the reset/wedge code, which we handle by checking the submission status. The list manipulation is protected by spinlock so we're safe on that side. The rest of the approach also LGTM: Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Daniele ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Intel-gfx] ✗ Fi.CI.BAT: failure for Resolve suspend-resume racing with GuC destroy-context-worker (rev7) 2023-11-30 0:20 ` Alan Previn ` (2 preceding siblings ...) (?) @ 2023-11-30 7:50 ` Patchwork -1 siblings, 0 replies; 18+ messages in thread From: Patchwork @ 2023-11-30 7:50 UTC (permalink / raw) To: Teres Alexis, Alan Previn; +Cc: intel-gfx [-- Attachment #1: Type: text/plain, Size: 9645 bytes --] == Series Details == Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev7) URL : https://patchwork.freedesktop.org/series/121916/ State : failure == Summary == CI Bug Log - changes from CI_DRM_13951 -> Patchwork_121916v7 ==================================================== Summary ------- **FAILURE** Serious unknown changes coming with Patchwork_121916v7 absolutely need to be verified manually. If you think the reported changes have nothing to do with the changes introduced in Patchwork_121916v7, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them to document this new failure mode, which will reduce false positives in CI. External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/index.html Participating hosts (36 -> 35) ------------------------------ Additional (2): bat-kbl-2 bat-mtlp-8 Missing (3): fi-pnv-d510 fi-snb-2520m bat-dg1-5 Possible new issues ------------------- Here are the unknown changes that may have been introduced in Patchwork_121916v7: ### IGT changes ### #### Possible regressions #### * igt@i915_selftest@live@gt_timelines: - bat-rpls-1: [PASS][1] -> [INCOMPLETE][2] [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13951/bat-rpls-1/igt@i915_selftest@live@gt_timelines.html [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-rpls-1/igt@i915_selftest@live@gt_timelines.html Known issues ------------ Here are the changes found in Patchwork_121916v7 that come from known issues: ### IGT changes ### #### Issues hit #### * igt@debugfs_test@basic-hwmon: - bat-mtlp-8: NOTRUN -> [SKIP][3] ([i915#9318]) [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@debugfs_test@basic-hwmon.html * igt@fbdev@info: - bat-kbl-2: NOTRUN -> [SKIP][4] ([fdo#109271] / [i915#1849]) [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-kbl-2/igt@fbdev@info.html * igt@gem_lmem_swapping@parallel-random-engines: - bat-kbl-2: NOTRUN -> [SKIP][5] ([fdo#109271]) +20 other tests skip [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-kbl-2/igt@gem_lmem_swapping@parallel-random-engines.html * igt@gem_lmem_swapping@verify-random: - bat-mtlp-8: NOTRUN -> [SKIP][6] ([i915#4613]) +3 other tests skip [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@gem_lmem_swapping@verify-random.html * igt@gem_mmap@basic: - bat-mtlp-8: NOTRUN -> [SKIP][7] ([i915#4083]) [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@gem_mmap@basic.html * igt@gem_mmap_gtt@basic: - bat-mtlp-8: NOTRUN -> [SKIP][8] ([i915#4077]) +2 other tests skip [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@gem_mmap_gtt@basic.html * igt@gem_render_tiled_blits@basic: - bat-mtlp-8: NOTRUN -> [SKIP][9] ([i915#4079]) +1 other test skip [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@gem_render_tiled_blits@basic.html * igt@i915_pm_rps@basic-api: - bat-mtlp-8: NOTRUN -> [SKIP][10] ([i915#6621]) [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@i915_pm_rps@basic-api.html * igt@i915_suspend@basic-s3-without-i915: - bat-mtlp-8: NOTRUN -> [SKIP][11] ([i915#6645]) [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@i915_suspend@basic-s3-without-i915.html * igt@kms_addfb_basic@addfb25-y-tiled-small-legacy: - bat-mtlp-8: NOTRUN -> [SKIP][12] ([i915#5190]) [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@kms_addfb_basic@addfb25-y-tiled-small-legacy.html * igt@kms_addfb_basic@basic-y-tiled-legacy: - bat-mtlp-8: NOTRUN -> [SKIP][13] ([i915#4212]) +8 other tests skip [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@kms_addfb_basic@basic-y-tiled-legacy.html * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy: - bat-mtlp-8: NOTRUN -> [SKIP][14] ([i915#4213]) +1 other test skip [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html * igt@kms_dsc@dsc-basic: - bat-mtlp-8: NOTRUN -> [SKIP][15] ([i915#3555] / [i915#3840] / [i915#4098] / [i915#9159]) [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@kms_dsc@dsc-basic.html * igt@kms_force_connector_basic@force-load-detect: - bat-mtlp-8: NOTRUN -> [SKIP][16] ([fdo#109285]) [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@kms_force_connector_basic@force-load-detect.html * igt@kms_force_connector_basic@prune-stale-modes: - bat-mtlp-8: NOTRUN -> [SKIP][17] ([i915#5274]) [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@kms_force_connector_basic@prune-stale-modes.html * igt@kms_pipe_crc_basic@read-crc-frame-sequence: - bat-kbl-2: NOTRUN -> [SKIP][18] ([fdo#109271] / [i915#1845]) +14 other tests skip [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-kbl-2/igt@kms_pipe_crc_basic@read-crc-frame-sequence.html * igt@kms_setmode@basic-clone-single-crtc: - bat-mtlp-8: NOTRUN -> [SKIP][19] ([i915#3555] / [i915#8809]) [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@kms_setmode@basic-clone-single-crtc.html * igt@prime_vgem@basic-fence-mmap: - bat-mtlp-8: NOTRUN -> [SKIP][20] ([i915#3708] / [i915#4077]) +1 other test skip [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@prime_vgem@basic-fence-mmap.html * igt@prime_vgem@basic-fence-read: - bat-mtlp-8: NOTRUN -> [SKIP][21] ([i915#3708]) +2 other tests skip [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-mtlp-8/igt@prime_vgem@basic-fence-read.html #### Possible fixes #### * igt@i915_selftest@live@gt_heartbeat: - fi-apl-guc: [DMESG-FAIL][22] ([i915#5334]) -> [PASS][23] [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13951/fi-apl-guc/igt@i915_selftest@live@gt_heartbeat.html [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/fi-apl-guc/igt@i915_selftest@live@gt_heartbeat.html * igt@kms_hdmi_inject@inject-audio: - fi-kbl-guc: [FAIL][24] ([IGT#3]) -> [PASS][25] [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13951/fi-kbl-guc/igt@kms_hdmi_inject@inject-audio.html [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/fi-kbl-guc/igt@kms_hdmi_inject@inject-audio.html * igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-d-edp-1: - bat-rplp-1: [ABORT][26] ([i915#8668]) -> [PASS][27] [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13951/bat-rplp-1/igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-d-edp-1.html [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/bat-rplp-1/igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-d-edp-1.html {name}: This element is suppressed. This means it is ignored when computing the status of the difference (SUCCESS, WARNING, or FAILURE). [IGT#3]: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/3 [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271 [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285 [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845 [i915#1849]: https://gitlab.freedesktop.org/drm/intel/issues/1849 [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555 [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708 [i915#3840]: https://gitlab.freedesktop.org/drm/intel/issues/3840 [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077 [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079 [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083 [i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098 [i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212 [i915#4213]: https://gitlab.freedesktop.org/drm/intel/issues/4213 [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613 [i915#5190]: https://gitlab.freedesktop.org/drm/intel/issues/5190 [i915#5274]: https://gitlab.freedesktop.org/drm/intel/issues/5274 [i915#5334]: https://gitlab.freedesktop.org/drm/intel/issues/5334 [i915#6621]: https://gitlab.freedesktop.org/drm/intel/issues/6621 [i915#6645]: https://gitlab.freedesktop.org/drm/intel/issues/6645 [i915#8668]: https://gitlab.freedesktop.org/drm/intel/issues/8668 [i915#8809]: https://gitlab.freedesktop.org/drm/intel/issues/8809 [i915#9159]: https://gitlab.freedesktop.org/drm/intel/issues/9159 [i915#9318]: https://gitlab.freedesktop.org/drm/intel/issues/9318 Build changes ------------- * Linux: CI_DRM_13951 -> Patchwork_121916v7 CI-20190529: 20190529 CI_DRM_13951: 7bd342323d5bd405b02fd21cd78f157f363278ac @ git://anongit.freedesktop.org/gfx-ci/linux IGT_7610: 7610 Patchwork_121916v7: 7bd342323d5bd405b02fd21cd78f157f363278ac @ git://anongit.freedesktop.org/gfx-ci/linux ### Linux commits 7bdae4807cf4 drm/i915/guc: Close deregister-context race against CT-loss 2d7c8c27c131 drm/i915/guc: Flush context destruction worker at suspend == Logs == For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v7/index.html [-- Attachment #2: Type: text/html, Size: 11191 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev8) 2023-11-30 0:20 ` Alan Previn ` (3 preceding siblings ...) (?) @ 2023-12-01 1:33 ` Patchwork -1 siblings, 0 replies; 18+ messages in thread From: Patchwork @ 2023-12-01 1:33 UTC (permalink / raw) To: Teres Alexis, Alan Previn; +Cc: intel-gfx == Series Details == Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev8) URL : https://patchwork.freedesktop.org/series/121916/ State : warning == Summary == Error: dim sparse failed Sparse version: v0.6.2 Fast mode used, each commit won't be checked separately. ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Intel-gfx] ✗ Fi.CI.BAT: failure for Resolve suspend-resume racing with GuC destroy-context-worker (rev8) 2023-11-30 0:20 ` Alan Previn ` (4 preceding siblings ...) (?) @ 2023-12-01 2:20 ` Patchwork 2023-12-01 5:10 ` Teres Alexis, Alan Previn -1 siblings, 1 reply; 18+ messages in thread From: Patchwork @ 2023-12-01 2:20 UTC (permalink / raw) To: Teres Alexis, Alan Previn; +Cc: intel-gfx [-- Attachment #1: Type: text/plain, Size: 5292 bytes --] == Series Details == Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev8) URL : https://patchwork.freedesktop.org/series/121916/ State : failure == Summary == CI Bug Log - changes from CI_DRM_13956 -> Patchwork_121916v8 ==================================================== Summary ------- **FAILURE** Serious unknown changes coming with Patchwork_121916v8 absolutely need to be verified manually. If you think the reported changes have nothing to do with the changes introduced in Patchwork_121916v8, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them to document this new failure mode, which will reduce false positives in CI. External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/index.html Participating hosts (39 -> 38) ------------------------------ Missing (1): fi-snb-2520m Possible new issues ------------------- Here are the unknown changes that may have been introduced in Patchwork_121916v8: ### IGT changes ### #### Possible regressions #### * igt@i915_selftest@live@requests: - bat-mtlp-6: [PASS][1] -> [DMESG-FAIL][2] [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13956/bat-mtlp-6/igt@i915_selftest@live@requests.html [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/bat-mtlp-6/igt@i915_selftest@live@requests.html Known issues ------------ Here are the changes found in Patchwork_121916v8 that come from known issues: ### CI changes ### #### Issues hit #### * boot: - fi-bsw-n3050: [PASS][3] -> [FAIL][4] ([i915#8293]) [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13956/fi-bsw-n3050/boot.html [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/fi-bsw-n3050/boot.html ### IGT changes ### #### Issues hit #### * igt@gem_lmem_swapping@basic: - fi-apl-guc: NOTRUN -> [SKIP][5] ([fdo#109271] / [i915#4613]) +3 other tests skip [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/fi-apl-guc/igt@gem_lmem_swapping@basic.html * igt@kms_hdmi_inject@inject-audio: - fi-kbl-guc: [PASS][6] -> [FAIL][7] ([IGT#3]) [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13956/fi-kbl-guc/igt@kms_hdmi_inject@inject-audio.html [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/fi-kbl-guc/igt@kms_hdmi_inject@inject-audio.html * igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-d-edp-1: - bat-rplp-1: [PASS][8] -> [ABORT][9] ([i915#8668]) [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13956/bat-rplp-1/igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-d-edp-1.html [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/bat-rplp-1/igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-d-edp-1.html * igt@kms_pipe_crc_basic@suspend-read-crc: - bat-rpls-1: NOTRUN -> [SKIP][10] ([i915#1845]) [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/bat-rpls-1/igt@kms_pipe_crc_basic@suspend-read-crc.html #### Possible fixes #### * igt@core_hotunplug@unbind-rebind: - fi-apl-guc: [ABORT][11] ([i915#8213] / [i915#8668]) -> [PASS][12] [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13956/fi-apl-guc/igt@core_hotunplug@unbind-rebind.html [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/fi-apl-guc/igt@core_hotunplug@unbind-rebind.html * igt@gem_exec_suspend@basic-s3@smem: - bat-rpls-1: [ABORT][13] ([i915#7978]) -> [PASS][14] [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13956/bat-rpls-1/igt@gem_exec_suspend@basic-s3@smem.html [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/bat-rpls-1/igt@gem_exec_suspend@basic-s3@smem.html {name}: This element is suppressed. This means it is ignored when computing the status of the difference (SUCCESS, WARNING, or FAILURE). [IGT#3]: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/3 [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271 [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845 [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613 [i915#7359]: https://gitlab.freedesktop.org/drm/intel/issues/7359 [i915#7978]: https://gitlab.freedesktop.org/drm/intel/issues/7978 [i915#8213]: https://gitlab.freedesktop.org/drm/intel/issues/8213 [i915#8293]: https://gitlab.freedesktop.org/drm/intel/issues/8293 [i915#8668]: https://gitlab.freedesktop.org/drm/intel/issues/8668 [i915#8981]: https://gitlab.freedesktop.org/drm/intel/issues/8981 Build changes ------------- * Linux: CI_DRM_13956 -> Patchwork_121916v8 CI-20190529: 20190529 CI_DRM_13956: b59a0a6520764f36a79ba6b4c590e243ac6b913d @ git://anongit.freedesktop.org/gfx-ci/linux IGT_7612: b5c47966901ee1060bcb9d4bccdd3ccec9651ef4 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git Patchwork_121916v8: b59a0a6520764f36a79ba6b4c590e243ac6b913d @ git://anongit.freedesktop.org/gfx-ci/linux ### Linux commits 44d3e6811ff3 drm/i915/guc: Close deregister-context race against CT-loss f073df82cd83 drm/i915/guc: Flush context destruction worker at suspend == Logs == For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/index.html [-- Attachment #2: Type: text/html, Size: 6046 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for Resolve suspend-resume racing with GuC destroy-context-worker (rev8) 2023-12-01 2:20 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork @ 2023-12-01 5:10 ` Teres Alexis, Alan Previn 0 siblings, 0 replies; 18+ messages in thread From: Teres Alexis, Alan Previn @ 2023-12-01 5:10 UTC (permalink / raw) To: intel-gfx@lists.freedesktop.org On Fri, 2023-12-01 at 02:20 +0000, Patchwork wrote: > Patch Details > Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev8) > URL: https://patchwork.freedesktop.org/series/121916/ > State: failure > Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/index.html alan:snip > Here are the unknown changes that may have been introduced in Patchwork_121916v8: > > IGT changes > Possible regressions > > * igt@i915_selftest@live@requests: > * bat-mtlp-6: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13956/bat-mtlp-6/igt@i915_selftest@live@requests.html> -> DMESG-FAIL<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/bat-mtlp-6/igt@i915_selftest@live@requests.html> alan: inspecting the selftest code that is failing WRT changes in this targets, i really dont see any relation. the series is changing how we handle context deregistratoin either thru the regular context-close or through the flushing during suspend... however the selftest reporting a reset would be an issue where the context is not closed (because it would be hanging) - also its a kernel context - which should not get closed. I will run another retest after i verify trybot sees no issue. ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2023-12-06 21:47 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-11-30 0:20 [Intel-gfx] [PATCH v7 0/2] Resolve suspend-resume racing with GuC destroy-context-worker Alan Previn 2023-11-30 0:20 ` Alan Previn 2023-11-30 0:20 ` [Intel-gfx] [PATCH v7 1/2] drm/i915/guc: Flush context destruction worker at suspend Alan Previn 2023-11-30 0:20 ` Alan Previn 2023-11-30 0:20 ` [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn 2023-11-30 0:20 ` Alan Previn 2023-11-30 21:18 ` [Intel-gfx] " Rodrigo Vivi 2023-11-30 21:18 ` Rodrigo Vivi 2023-12-01 0:09 ` [Intel-gfx] " Teres Alexis, Alan Previn 2023-12-01 0:09 ` Teres Alexis, Alan Previn 2023-12-01 0:10 ` [Intel-gfx] " Teres Alexis, Alan Previn 2023-12-01 0:10 ` Teres Alexis, Alan Previn 2023-12-06 21:47 ` [Intel-gfx] " Daniele Ceraolo Spurio 2023-12-06 21:47 ` Daniele Ceraolo Spurio 2023-11-30 7:50 ` [Intel-gfx] ✗ Fi.CI.BAT: failure for Resolve suspend-resume racing with GuC destroy-context-worker (rev7) Patchwork 2023-12-01 1:33 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev8) Patchwork 2023-12-01 2:20 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork 2023-12-01 5:10 ` Teres Alexis, Alan Previn
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.