public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Michel Thierry <michel.thierry@intel.com>
To: intel-gfx@lists.freedesktop.org
Subject: [PATCH 3/9] drm/i915/tdr: Modify error handler for per engine hang recovery
Date: Fri, 16 Dec 2016 12:20:04 -0800	[thread overview]
Message-ID: <20161216202010.7983-4-michel.thierry@intel.com> (raw)
In-Reply-To: <20161216202010.7983-1-michel.thierry@intel.com>

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This is a preparatory patch which modifies error handler to do per engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare existing recovery function to
adapt to this new function where applicable (which fails at this point
because core implementation is lacking) and continue recovery using legacy
full gpu reset.

A helper function is also added to query the availability of engine
reset.

The error events behaviour that are used to notify user of reset are
adapted to engine reset such that it doesn't break users listening to these
events. In legacy we report an error event, a reset event before resetting
the gpu and a reset done event marking the completion of reset. The same
behaviour is adapted but reset event is only dispatched once even when
multiple engines are hung. Finally once reset is complete we send reset
done event as usual.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still responsible for detecting a
hang, and when it does it will have to request GuC to reset that Engine and
remind the firmware about the outstanding submissions.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 21 +++++++++
 drivers/gpu/drm/i915/i915_drv.h     |  3 ++
 drivers/gpu/drm/i915/i915_irq.c     | 88 +++++++++++++++++++++++++++----------
 drivers/gpu/drm/i915/i915_pci.c     |  5 ++-
 drivers/gpu/drm/i915/intel_uncore.c | 11 +++++
 5 files changed, 103 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6428588518aa..e5688edd62cd 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1824,6 +1824,27 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto wakeup;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	int ret;
+	struct drm_i915_private *dev_priv = engine->i915;
+
+	/* FIXME: replace me with engine reset sequence */
+	ret = -ENODEV;
+
+	/* use full gpu reset to recover on error */
+	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
+
+	return ret;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index fb3c88144aa8..6b4e8ee19905 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -781,6 +781,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_engine_reset); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -3007,6 +3008,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern bool intel_has_engine_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 77caeeb5ee55..da619810f59e 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2594,7 +2594,8 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv)
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
+static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv,
+				  u32 engine_mask)
 {
 	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
@@ -2603,7 +2604,15 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
 	/*
@@ -2614,30 +2623,55 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 	 * simulated reset via debugs, so get an RPM reference.
 	 */
 	intel_runtime_pm_get(dev_priv);
-	intel_prepare_reset(dev_priv);
 
-	do {
-		/*
-		 * All state reset _must_ be completed before we update the
-		 * reset counter, for otherwise waiters might miss the reset
-		 * pending state and not properly drop locks, resulting in
-		 * deadlocks with the reset work.
-		 */
-		if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
-			i915_reset(dev_priv);
-			mutex_unlock(&dev_priv->drm.struct_mutex);
+	/* If hardware supports it (GEN8+), try engine reset first */
+	if (intel_has_engine_reset(dev_priv)) {
+		struct intel_engine_cs *engine;
+		unsigned int tmp, ret;
+
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			ret = i915_reset_engine(engine);
+			/* on failure fallback to full gpu reset for recovery */
+			if (ret)
+				break;
 		}
+	}
 
-		/* We need to wait for anyone holding the lock to wakeup */
-	} while (wait_on_bit_timeout(&dev_priv->gpu_error.flags,
-				     I915_RESET_IN_PROGRESS,
-				     TASK_UNINTERRUPTIBLE,
-				     HZ));
+	/*
+	 * If the waiter already held the struct_mutex lock, it may have already
+	 * triggered the GPU reset and the reset_in_progress can be false.
+	 */
+	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
+		DRM_DEBUG_DRIVER("resetting chip\n");
+		intel_prepare_reset(dev_priv);
+
+		do {
+			/*
+			 * All state reset _must_ be completed before we update
+			 * the reset counter, for otherwise waiters might miss
+			 * the reset pending state and not properly drop locks,
+			 * resulting in deadlocks with the reset work.
+			 */
+			if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
+				i915_reset(dev_priv);
+				mutex_unlock(&dev_priv->drm.struct_mutex);
+			}
+
+			/*
+			 * We need to wait for anyone holding the lock to
+			 * wakeup.
+			 */
+		} while (wait_on_bit_timeout(&dev_priv->gpu_error.flags,
+					     I915_RESET_IN_PROGRESS,
+					     TASK_UNINTERRUPTIBLE,
+					     HZ));
+
+		intel_finish_reset(dev_priv);
+	}
 
-	intel_finish_reset(dev_priv);
 	intel_runtime_pm_put(dev_priv);
 
-	if (!test_bit(I915_WEDGED, &dev_priv->gpu_error.flags))
+	if (!i915_terminally_wedged(&dev_priv->gpu_error))
 		kobject_uevent_env(kobj,
 				   KOBJ_CHANGE, reset_done_event);
 
@@ -2728,9 +2762,15 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	if (!engine_mask)
 		return;
 
-	if (test_and_set_bit(I915_RESET_IN_PROGRESS,
-			     &dev_priv->gpu_error.flags))
-		return;
+	/*
+	 * Engine reset support is only available from Gen8 onwards so if
+	 * it is not available or explicity disabled, use full gpu reset.
+	 */
+	if (!intel_has_engine_reset(dev_priv)) {
+		if (test_and_set_bit(I915_RESET_IN_PROGRESS,
+				     &dev_priv->gpu_error.flags))
+			return;
+	}
 
 	/*
 	 * Wakeup waiting processes so that the reset function
@@ -2746,7 +2786,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	 */
 	i915_error_wake_up(dev_priv);
 
-	i915_reset_and_wakeup(dev_priv);
+	i915_reset_and_wakeup(dev_priv, engine_mask);
 }
 
 /* Called from drm generic code, passed 'crtc' which
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 37244fad9b03..6e9716ae4c21 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -308,7 +308,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_engine_reset = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -339,6 +340,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_engine_reset = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -390,6 +392,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_engine_reset = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 8fc5f29e79a8..97ce324570ad 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1851,6 +1851,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_engine_reset(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_engine_reset &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  parent reply	other threads:[~2016-12-16 20:20 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
2016-12-16 20:20 ` [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together Michel Thierry
2016-12-16 21:40   ` Chris Wilson
2016-12-19  8:22   ` Tvrtko Ursulin
2016-12-16 20:20 ` [PATCH 2/9] drm/i915: Update i915.reset to handle engine resets Michel Thierry
2016-12-16 20:20 ` Michel Thierry [this message]
2016-12-16 20:20 ` [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
2016-12-16 20:45   ` Chris Wilson
2016-12-19  5:02     ` Michel Thierry
2016-12-19  9:51       ` Chris Wilson
2017-01-05 23:55         ` Michel Thierry
2016-12-19 14:24     ` Mika Kuoppala
2016-12-19 14:42       ` Chris Wilson
2016-12-16 20:20 ` [PATCH 5/9] drm/i915: Skip reset request if there is one already Michel Thierry
2016-12-16 20:20 ` [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state Michel Thierry
2016-12-16 20:37   ` Chris Wilson
2016-12-16 21:19     ` Michel Thierry
2016-12-19  8:27   ` Tvrtko Ursulin
2016-12-19 18:09     ` Michel Thierry
2016-12-16 20:20 ` [PATCH 7/9] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
2016-12-16 20:20 ` [PATCH 8/9] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
2016-12-16 20:20 ` [RFC 9/9] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
2016-12-16 20:45 ` ✗ Fi.CI.BAT: warning for Execlist based engine-reset Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161216202010.7983-4-michel.thierry@intel.com \
    --to=michel.thierry@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox