intel-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Ramalingam C <ramalingam.c@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [CI 1/5] drm/i915/gt: Try to more gracefully quiesce the system before resets
Date: Thu, 24 Oct 2019 07:02:01 +0530	[thread overview]
Message-ID: <20191024013201.GA23391@intel.com> (raw)
In-Reply-To: <20191023133108.21401-1-chris@chris-wilson.co.uk>

On 2019-10-23 at 14:31:04 +0100, Chris Wilson wrote:
> If we are doing a normal GPU reset triggered after detecting a long
> period of stalled work, we can take our time and allow the engines to
> quiesce. Since we've stopped submission to the engine, and if we wait
> long enough an innocent context should complete, leaving the engine idle.
> So by waiting a short amount of time, we should prevent clobbering other
> users when resetting a stuck context.
> 
> Suggested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Suggested-by: Jon Bloomfield <jon.bloomfield@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/Kconfig.profile         | 11 +++++++++++
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c    | 20 +++++++++++++++++++-
>  drivers/gpu/drm/i915/gt/intel_engine_types.h |  4 ++++
>  3 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 48df8889a88a..3a3881d5e44b 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -25,3 +25,14 @@ config DRM_I915_SPIN_REQUEST
>  	  May be 0 to disable the initial spin. In practice, we estimate
>  	  the cost of enabling the interrupt (if currently disabled) to be
>  	  a few microseconds.
> +
> +config DRM_I915_STOP_TIMEOUT
> +	int "How long to wait for an engine to quiesce gracefully before reset (ms)"
> +	default 100 # milliseconds
> +	help
> +	  By stopping submission and sleeping for a short time before resetting
> +	  the GPU, we allow the innocent contexts also on the system to quiesce.
> +	  It is then less likely for a hanging context to cause collateral
> +	  damage as the system is reset in order to recover. The corollary is
> +	  that the reset itself may take longer and so be more disruptive to
> +	  interactive or low latency workloads.
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 0e20713603ec..e4203eb44139 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -308,6 +308,9 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>  	engine->instance = info->instance;
>  	__sprint_engine_name(engine);
>  
> +	engine->props.stop_timeout_ms =
> +		CONFIG_DRM_I915_STOP_TIMEOUT;
Compare to previous version where you used the CONFIG variable directly,
what is the benefit of using it through a variable? So that we could
alter it in runtime?

-Ram
> +
>  	/*
>  	 * To be overridden by the backend on setup. However to facilitate
>  	 * cleanup on error during setup, we always provide the destroy vfunc.
> @@ -875,6 +878,21 @@ u64 intel_engine_get_last_batch_head(const struct intel_engine_cs *engine)
>  	return bbaddr;
>  }
>  
> +static unsigned long stop_timeout(const struct intel_engine_cs *engine)
> +{
> +	if (in_atomic() || irqs_disabled()) /* inside atomic preempt-reset? */
> +		return 0;
> +
> +	/*
> +	 * If we are doing a normal GPU reset, we can take our time and allow
> +	 * the engine to quiesce. We've stopped submission to the engine, and
> +	 * if we wait long enough an innocent context should complete and
> +	 * leave the engine idle. So they should not be caught unaware by
> +	 * the forthcoming GPU reset (which usually follows the stop_cs)!
> +	 */
> +	return READ_ONCE(engine->props.stop_timeout_ms);
> +}
> +
>  int intel_engine_stop_cs(struct intel_engine_cs *engine)
>  {
>  	struct intel_uncore *uncore = engine->uncore;
> @@ -892,7 +910,7 @@ int intel_engine_stop_cs(struct intel_engine_cs *engine)
>  	err = 0;
>  	if (__intel_wait_for_register_fw(uncore,
>  					 mode, MODE_IDLE, MODE_IDLE,
> -					 1000, 0,
> +					 1000, stop_timeout(engine),
>  					 NULL)) {
>  		GEM_TRACE("%s: timed out on STOP_RING -> IDLE\n", engine->name);
>  		err = -ETIMEDOUT;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 3451be034caf..87d5c4ef3ae7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -542,6 +542,10 @@ struct intel_engine_cs {
>  		 */
>  		ktime_t total;
>  	} stats;
> +
> +	struct {
> +		unsigned long stop_timeout_ms;
> +	} props;
>  };
>  
>  static inline bool
> -- 
> 2.24.0.rc0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

WARNING: multiple messages have this Message-ID (diff)
From: Ramalingam C <ramalingam.c@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [CI 1/5] drm/i915/gt: Try to more gracefully quiesce the system before resets
Date: Thu, 24 Oct 2019 07:02:01 +0530	[thread overview]
Message-ID: <20191024013201.GA23391@intel.com> (raw)
Message-ID: <20191024013201.Gchi0lM5wqhL00rwNn97AxtQ0FmmxxkztULR7-2g5Tc@z> (raw)
In-Reply-To: <20191023133108.21401-1-chris@chris-wilson.co.uk>

On 2019-10-23 at 14:31:04 +0100, Chris Wilson wrote:
> If we are doing a normal GPU reset triggered after detecting a long
> period of stalled work, we can take our time and allow the engines to
> quiesce. Since we've stopped submission to the engine, and if we wait
> long enough an innocent context should complete, leaving the engine idle.
> So by waiting a short amount of time, we should prevent clobbering other
> users when resetting a stuck context.
> 
> Suggested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Suggested-by: Jon Bloomfield <jon.bloomfield@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/Kconfig.profile         | 11 +++++++++++
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c    | 20 +++++++++++++++++++-
>  drivers/gpu/drm/i915/gt/intel_engine_types.h |  4 ++++
>  3 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 48df8889a88a..3a3881d5e44b 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -25,3 +25,14 @@ config DRM_I915_SPIN_REQUEST
>  	  May be 0 to disable the initial spin. In practice, we estimate
>  	  the cost of enabling the interrupt (if currently disabled) to be
>  	  a few microseconds.
> +
> +config DRM_I915_STOP_TIMEOUT
> +	int "How long to wait for an engine to quiesce gracefully before reset (ms)"
> +	default 100 # milliseconds
> +	help
> +	  By stopping submission and sleeping for a short time before resetting
> +	  the GPU, we allow the innocent contexts also on the system to quiesce.
> +	  It is then less likely for a hanging context to cause collateral
> +	  damage as the system is reset in order to recover. The corollary is
> +	  that the reset itself may take longer and so be more disruptive to
> +	  interactive or low latency workloads.
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 0e20713603ec..e4203eb44139 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -308,6 +308,9 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
>  	engine->instance = info->instance;
>  	__sprint_engine_name(engine);
>  
> +	engine->props.stop_timeout_ms =
> +		CONFIG_DRM_I915_STOP_TIMEOUT;
Compare to previous version where you used the CONFIG variable directly,
what is the benefit of using it through a variable? So that we could
alter it in runtime?

-Ram
> +
>  	/*
>  	 * To be overridden by the backend on setup. However to facilitate
>  	 * cleanup on error during setup, we always provide the destroy vfunc.
> @@ -875,6 +878,21 @@ u64 intel_engine_get_last_batch_head(const struct intel_engine_cs *engine)
>  	return bbaddr;
>  }
>  
> +static unsigned long stop_timeout(const struct intel_engine_cs *engine)
> +{
> +	if (in_atomic() || irqs_disabled()) /* inside atomic preempt-reset? */
> +		return 0;
> +
> +	/*
> +	 * If we are doing a normal GPU reset, we can take our time and allow
> +	 * the engine to quiesce. We've stopped submission to the engine, and
> +	 * if we wait long enough an innocent context should complete and
> +	 * leave the engine idle. So they should not be caught unaware by
> +	 * the forthcoming GPU reset (which usually follows the stop_cs)!
> +	 */
> +	return READ_ONCE(engine->props.stop_timeout_ms);
> +}
> +
>  int intel_engine_stop_cs(struct intel_engine_cs *engine)
>  {
>  	struct intel_uncore *uncore = engine->uncore;
> @@ -892,7 +910,7 @@ int intel_engine_stop_cs(struct intel_engine_cs *engine)
>  	err = 0;
>  	if (__intel_wait_for_register_fw(uncore,
>  					 mode, MODE_IDLE, MODE_IDLE,
> -					 1000, 0,
> +					 1000, stop_timeout(engine),
>  					 NULL)) {
>  		GEM_TRACE("%s: timed out on STOP_RING -> IDLE\n", engine->name);
>  		err = -ETIMEDOUT;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 3451be034caf..87d5c4ef3ae7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -542,6 +542,10 @@ struct intel_engine_cs {
>  		 */
>  		ktime_t total;
>  	} stats;
> +
> +	struct {
> +		unsigned long stop_timeout_ms;
> +	} props;
>  };
>  
>  static inline bool
> -- 
> 2.24.0.rc0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  parent reply	other threads:[~2019-10-24  1:31 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-23 13:31 [CI 1/5] drm/i915/gt: Try to more gracefully quiesce the system before resets Chris Wilson
2019-10-23 13:31 ` [Intel-gfx] " Chris Wilson
2019-10-23 13:31 ` [CI 2/5] drm/i915/execlists: Force preemption Chris Wilson
2019-10-23 13:31   ` [Intel-gfx] " Chris Wilson
2019-10-23 13:31 ` [CI 3/5] drm/i915/execlists: Cancel banned contexts on schedule-out Chris Wilson
2019-10-23 13:31   ` [Intel-gfx] " Chris Wilson
2019-10-23 13:31 ` [CI 4/5] drm/i915/gem: Cancel contexts when hangchecking is disabled Chris Wilson
2019-10-23 13:31   ` [Intel-gfx] " Chris Wilson
2019-10-23 13:31 ` [CI 5/5] drm/i915/gt: Replace hangcheck by heartbeats Chris Wilson
2019-10-23 13:31   ` [Intel-gfx] " Chris Wilson
2019-10-24  0:09 ` ✗ Fi.CI.BUILD: failure for series starting with [CI,1/5] drm/i915/gt: Try to more gracefully quiesce the system before resets Patchwork
2019-10-24  0:09   ` [Intel-gfx] " Patchwork
2019-10-24  1:32 ` Ramalingam C [this message]
2019-10-24  1:32   ` [Intel-gfx] [CI 1/5] " Ramalingam C
2019-10-24  6:14   ` Chris Wilson
2019-10-24  6:14     ` [Intel-gfx] " Chris Wilson
2019-10-24  6:37     ` Ramalingam C
2019-10-24  6:37       ` [Intel-gfx] " Ramalingam C

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191024013201.GA23391@intel.com \
    --to=ramalingam.c@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).