Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
Date: Tue, 15 Aug 2023 09:56:04 -0400	[thread overview]
Message-ID: <ZNuD9JdmoYhYJ+G5@intel.com> (raw)
In-Reply-To: <20230815011210.1188379-3-alan.previn.teres.alexis@intel.com>

On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> If we are at the end of suspend or very early in resume
> its possible an async fence signal could lead us to the
> execution of the context destruction worker (after the
> prior worker flush).
> 
> Even if checking that the CT is enabled before calling
> destroyed_worker_func, guc_lrc_desc_unpin may still fail
> because in corner cases, as we traverse the GuC's
> context-destroy list, the CT could get disabled in the mid
> of it right before calling the GuC's CT send function.
> 
> We've witnessed this race condition once every ~6000-8000
> suspend-resume cycles while ensuring workloads that render
> something onscreen is continuously started just before
> we suspend (and the workload is small enough to complete
> and trigger the queued engine/context free-up either very
> late in suspend or very early in resume).
> 
> In such a case, we need to unroll the unpin process because
> guc-lrc-unpin takes a gt wakeref which only gets released in
> the G2H IRQ reply that never comes through in this corner
> case. That will cascade into a kernel hang later at the tail
> end of suspend in this function:
> 
>    intel_wakeref_wait_for_idle(&gt->wakeref)
>    (called by) - intel_gt_pm_wait_for_idle
>    (called by) - wait_for_suspend
> 
> Doing this unroll and keeping the context in the GuC's
> destroy-list will allow the context to get picked up on
> the next destroy worker invocation or purged as part of a
> major GuC sanitization or reset flow.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 40 +++++++++++++++++--
>  1 file changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 050572bb8dbe..ddb4ee4c4e51 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -235,6 +235,13 @@ set_context_destroyed(struct intel_context *ce)
>  	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
>  }
>  
> +static inline void
> +clr_context_destroyed(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
> +}
> +
>  static inline bool context_pending_disable(struct intel_context *ce)
>  {
>  	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
> @@ -3175,7 +3182,7 @@ static void guc_context_close(struct intel_context *ce)
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  }
>  
> -static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> +static inline int guc_lrc_desc_unpin(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
>  	struct intel_gt *gt = guc_to_gt(guc);
> @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>  	if (unlikely(disabled)) {
>  		release_guc_id(guc, ce);
>  		__guc_context_destroy(ce);
> -		return;
> +		return 0;
> +	}
> +
> +	if (deregister_context(ce, ce->guc_id.id)) {
> +		/* Seal race with concurrent suspend by unrolling */
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> +		set_context_registered(ce);
> +		clr_context_destroyed(ce);
> +		intel_gt_pm_put(gt);

how sure we are this is not calling unbalanced puts?
could we wrap this in some existent function to make this clear?

> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		return -ENODEV;
>  	}
>  
> -	deregister_context(ce, ce->guc_id.id);
> +	return 0;
>  }
>  
>  static void __guc_context_destroy(struct intel_context *ce)
> @@ -3270,7 +3287,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>  		if (!ce)
>  			break;
>  
> -		guc_lrc_desc_unpin(ce);
> +		if (guc_lrc_desc_unpin(ce)) {
> +			/*
> +			 * This means GuC's CT link severed mid-way which could happen
> +			 * in suspend-resume corner cases. In this case, put the
> +			 * context back into the destroyed_contexts list which will
> +			 * get picked up on the next context deregistration event or
> +			 * purged in a GuC sanitization event (reset/unload/wedged/...).
> +			 */
> +			spin_lock_irqsave(&guc->submission_state.lock, flags);
> +			list_add_tail(&ce->destroyed_link,
> +				      &guc->submission_state.destroyed_contexts);
> +			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +			/* Bail now since the list might never be emptied if h2gs fail */
> +			break;
> +		}
> +
>  	}
>  }
>  
> -- 
> 2.39.0
>

WARNING: multiple messages have this Message-ID (diff)

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: intel-gfx@lists.freedesktop.org,
	Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>,
	John Harrison <john.c.harrison@intel.com>,
	dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
Date: Tue, 15 Aug 2023 09:56:04 -0400	[thread overview]
Message-ID: <ZNuD9JdmoYhYJ+G5@intel.com> (raw)
In-Reply-To: <20230815011210.1188379-3-alan.previn.teres.alexis@intel.com>

On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> If we are at the end of suspend or very early in resume
> its possible an async fence signal could lead us to the
> execution of the context destruction worker (after the
> prior worker flush).
> 
> Even if checking that the CT is enabled before calling
> destroyed_worker_func, guc_lrc_desc_unpin may still fail
> because in corner cases, as we traverse the GuC's
> context-destroy list, the CT could get disabled in the mid
> of it right before calling the GuC's CT send function.
> 
> We've witnessed this race condition once every ~6000-8000
> suspend-resume cycles while ensuring workloads that render
> something onscreen is continuously started just before
> we suspend (and the workload is small enough to complete
> and trigger the queued engine/context free-up either very
> late in suspend or very early in resume).
> 
> In such a case, we need to unroll the unpin process because
> guc-lrc-unpin takes a gt wakeref which only gets released in
> the G2H IRQ reply that never comes through in this corner
> case. That will cascade into a kernel hang later at the tail
> end of suspend in this function:
> 
>    intel_wakeref_wait_for_idle(&gt->wakeref)
>    (called by) - intel_gt_pm_wait_for_idle
>    (called by) - wait_for_suspend
> 
> Doing this unroll and keeping the context in the GuC's
> destroy-list will allow the context to get picked up on
> the next destroy worker invocation or purged as part of a
> major GuC sanitization or reset flow.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 40 +++++++++++++++++--
>  1 file changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 050572bb8dbe..ddb4ee4c4e51 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -235,6 +235,13 @@ set_context_destroyed(struct intel_context *ce)
>  	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
>  }
>  
> +static inline void
> +clr_context_destroyed(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
> +}
> +
>  static inline bool context_pending_disable(struct intel_context *ce)
>  {
>  	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
> @@ -3175,7 +3182,7 @@ static void guc_context_close(struct intel_context *ce)
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  }
>  
> -static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> +static inline int guc_lrc_desc_unpin(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
>  	struct intel_gt *gt = guc_to_gt(guc);
> @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>  	if (unlikely(disabled)) {
>  		release_guc_id(guc, ce);
>  		__guc_context_destroy(ce);
> -		return;
> +		return 0;
> +	}
> +
> +	if (deregister_context(ce, ce->guc_id.id)) {
> +		/* Seal race with concurrent suspend by unrolling */
> +		spin_lock_irqsave(&ce->guc_state.lock, flags);
> +		set_context_registered(ce);
> +		clr_context_destroyed(ce);
> +		intel_gt_pm_put(gt);

how sure we are this is not calling unbalanced puts?
could we wrap this in some existent function to make this clear?

> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		return -ENODEV;
>  	}
>  
> -	deregister_context(ce, ce->guc_id.id);
> +	return 0;
>  }
>  
>  static void __guc_context_destroy(struct intel_context *ce)
> @@ -3270,7 +3287,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>  		if (!ce)
>  			break;
>  
> -		guc_lrc_desc_unpin(ce);
> +		if (guc_lrc_desc_unpin(ce)) {
> +			/*
> +			 * This means GuC's CT link severed mid-way which could happen
> +			 * in suspend-resume corner cases. In this case, put the
> +			 * context back into the destroyed_contexts list which will
> +			 * get picked up on the next context deregistration event or
> +			 * purged in a GuC sanitization event (reset/unload/wedged/...).
> +			 */
> +			spin_lock_irqsave(&guc->submission_state.lock, flags);
> +			list_add_tail(&ce->destroyed_link,
> +				      &guc->submission_state.destroyed_contexts);
> +			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +			/* Bail now since the list might never be emptied if h2gs fail */
> +			break;
> +		}
> +
>  	}
>  }
>  
> -- 
> 2.39.0
>

next prev parent reply	other threads:[~2023-08-15 13:56 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-15  1:12 [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Alan Previn
2023-08-15  1:12 ` Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:53   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:53     ` Rodrigo Vivi
2023-08-25 18:48     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-25 18:48       ` Teres Alexis, Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:56   ` Rodrigo Vivi [this message]
2023-08-15 13:56     ` Rodrigo Vivi
2023-08-15 19:08     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-15 19:08       ` Teres Alexis, Alan Previn
2023-08-25 18:54       ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-25 18:54         ` Teres Alexis, Alan Previn
2023-08-28 21:06         ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-28 21:06           ` Teres Alexis, Alan Previn
2023-08-15  1:12 ` [Intel-gfx] [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending Alan Previn
2023-08-15  1:12   ` Alan Previn
2023-08-15 13:51   ` [Intel-gfx] " Rodrigo Vivi
2023-08-15 13:51     ` Rodrigo Vivi
2023-08-15 19:00     ` [Intel-gfx] " Teres Alexis, Alan Previn
2023-08-15 19:00       ` Teres Alexis, Alan Previn
2023-08-15  1:20 ` [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker Teres Alexis, Alan Previn
2023-08-15  1:20   ` Teres Alexis, Alan Previn
2023-08-15  1:56 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Resolve suspend-resume racing with GuC destroy-context-worker (rev2) Patchwork
2023-08-15  1:56 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-08-15  2:15 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZNuD9JdmoYhYJ+G5@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=alan.previn.teres.alexis@intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.