public inbox for intel-gfx@lists.freedesktop.org
* [PATCH 00/11] Execlist based Engine reset patches
@ 2016-07-26 16:40 Arun Siluvery
  2016-07-26 16:40 ` [PATCH 01/11] drm/i915: Update i915.reset to handle engine resets Arun Siluvery
                   ` (12 more replies)
  0 siblings, 13 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx

These patches add an engine reset feature, available from Gen8 onwards. This
is also referred to as Timeout Detection and Recovery (TDR). It complements
the full gpu reset feature available in i915 by allowing a particular engine
to be reset instead of all engines, thus providing a lightweight engine
reset and recovery mechanism.

This implementation is for execlist based submission only, and is hence
limited to Gen8 onwards.

Timeout detection relies on the existing hangcheck, which remains the same;
the main changes are to the recovery mechanism. Once we detect a hang on a
particular engine we identify the request that caused the hang, skip that
request and adjust the head pointers to allow execution to proceed
normally. After some cleanup, submission is restarted to process the
remaining work queued to that engine.

If the engine reset fails to recover the engine correctly, we fall back to
a full gpu reset.
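For reference, the overall flow sketched above can be modelled as a small
standalone C program (function names and return codes here are illustrative
only, not the actual i915 interfaces):

```c
#include <assert.h>

/* Illustrative model of the TDR flow: try a lightweight per-engine
 * reset first, and fall back to a full gpu reset if that fails. */
enum reset_result { RECOVERED_ENGINE, RECOVERED_FULL_GPU, WEDGED };

/* Stand-in for the per-engine reset attempt. */
static int try_engine_reset(int engine_reset_works)
{
	return engine_reset_works ? 0 : -1;
}

static enum reset_result handle_hang(int engine_reset_works,
				     int full_reset_works)
{
	if (try_engine_reset(engine_reset_works) == 0)
		return RECOVERED_ENGINE;	/* lightweight path */
	if (full_reset_works)
		return RECOVERED_FULL_GPU;	/* fallback path */
	return WEDGED;				/* terminally wedged */
}
```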

Arun Siluvery (9):
  drm/i915: Update i915.reset to handle engine resets
  drm/i915/tdr: Update reset_in_progress to account for engine reset
  drm/i915/tdr: Modify error handler for per engine hang recovery
  drm/i915/tdr: Identify hung request and drop it
  drm/i915/tdr: Restart submission after engine reset
  drm/i915/tdr: Add support for per engine reset recovery
  drm/i915/tdr: Add engine reset count to error state
  drm/i915/tdr: Export reset count info to debugfs
  drm/i915/tdr: Enable Engine reset and recovery support

Chris Wilson (1):
  drm/i915: Separate out reset flags from the reset counter

Mika Kuoppala (1):
  drm/i915: Skip reset request if there is one already

 drivers/gpu/drm/i915/i915_debugfs.c     |  19 ++++
 drivers/gpu/drm/i915/i915_drv.c         |  82 ++++++++++++--
 drivers/gpu/drm/i915/i915_drv.h         |  58 +++++-----
 drivers/gpu/drm/i915/i915_gem.c         |   4 +-
 drivers/gpu/drm/i915/i915_gem_request.c |  13 +--
 drivers/gpu/drm/i915/i915_gpu_error.c   |   3 +
 drivers/gpu/drm/i915/i915_irq.c         | 184 +++++++++++++++++++++++---------
 drivers/gpu/drm/i915/i915_params.c      |   6 +-
 drivers/gpu/drm/i915/i915_params.h      |   2 +-
 drivers/gpu/drm/i915/intel_display.c    |  25 +++--
 drivers/gpu/drm/i915/intel_drv.h        |   4 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 161 +++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h        |   6 ++
 drivers/gpu/drm/i915/intel_uncore.c     |  47 +++++++-
 14 files changed, 502 insertions(+), 112 deletions(-)

-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/11] drm/i915: Update i915.reset to handle engine resets
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 16:40 ` [PATCH 02/11] drm/i915: Separate out reset flags from the reset counter Arun Siluvery
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx

In preparation for the engine reset work, update this parameter to handle
more than one type of reset. The default at the moment is still full gpu
reset.

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 6 +++---
 drivers/gpu/drm/i915/i915_params.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index b6e404c..a96df81 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -45,7 +45,7 @@ struct i915_params i915 __read_mostly = {
 	.fastboot = 0,
 	.prefault_disable = 0,
 	.load_detect_test = 0,
-	.reset = true,
+	.reset = 1,
 	.invert_brightness = 0,
 	.disable_display = 0,
 	.enable_cmd_parser = 1,
@@ -111,8 +111,8 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"Override/Ignore selection of SDVO panel mode in the VBT "
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
-module_param_named_unsafe(reset, i915.reset, bool, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (default: true)");
+module_param_named_unsafe(reset, i915.reset, int, 0600);
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
 
 module_param_named_unsafe(enable_hangcheck, i915.enable_hangcheck, bool, 0644);
 MODULE_PARM_DESC(enable_hangcheck,
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 0ad020b..6276fa2 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -34,6 +34,7 @@ struct i915_params {
 	int lvds_channel_mode;
 	int panel_use_ssc;
 	int vbt_sdvo_panel_type;
+	int reset;
 	int enable_rc6;
 	int enable_dc;
 	int enable_fbc;
@@ -57,7 +58,6 @@ struct i915_params {
 	bool fastboot;
 	bool prefault_disable;
 	bool load_detect_test;
-	bool reset;
 	bool disable_display;
 	bool verbose_state_checks;
 	bool nuclear_pageflip;
-- 
1.9.1

* [PATCH 02/11] drm/i915: Separate out reset flags from the reset counter
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
  2016-07-26 16:40 ` [PATCH 01/11] drm/i915: Update i915.reset to handle engine resets Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 16:40 ` [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset Arun Siluvery
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

From: Chris Wilson <chris@chris-wilson.co.uk>

In preparation for introducing a per-engine reset, we can first separate
the mixing of the reset state from the global reset counter.

The loss of atomicity in updating the reset state poses a small problem
for handling the waiters. For requests, this is solved by advancing the
seqno so that a waiter waking up after the reset knows the request is
complete. For pending flips, we still rely on the increment of the
global reset epoch (as well as the reset-in-progress flag) to signify
when the hardware was reset.

The advantage, now that we do not inspect the reset state during reset
itself i.e. we no longer emit requests during reset, is that we can use
the atomic updates of the state flags to ensure that only one reset
worker is active.
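The single-active-reset-worker property follows from the atomic bit
operations on the flags word. A userspace sketch of the same pattern using
C11 atomics (the kernel itself uses test_and_set_bit()/clear_bit() on
error->flags; this model is illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>

#define RESET_IN_PROGRESS_BIT 0

static atomic_ulong error_flags;

/* Returns 1 if this caller won the race and owns the reset, 0 if a
 * reset worker is already active.  Mirrors test_and_set_bit(). */
static int claim_reset(void)
{
	unsigned long old = atomic_fetch_or(&error_flags,
					    1UL << RESET_IN_PROGRESS_BIT);
	return !(old & (1UL << RESET_IN_PROGRESS_BIT));
}

/* Mirrors clear_bit() once the reset has completed. */
static void finish_reset(void)
{
	atomic_fetch_and(&error_flags, ~(1UL << RESET_IN_PROGRESS_BIT));
}
```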

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         |  16 ++---
 drivers/gpu/drm/i915/i915_drv.h         |  46 +++++---------
 drivers/gpu/drm/i915/i915_gem.c         |   2 +-
 drivers/gpu/drm/i915/i915_gem_request.c |  13 ++--
 drivers/gpu/drm/i915/i915_irq.c         | 104 +++++++++++++++-----------------
 drivers/gpu/drm/i915/intel_display.c    |  25 +++++---
 drivers/gpu/drm/i915/intel_drv.h        |   4 +-
 7 files changed, 93 insertions(+), 117 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 83afdd0..fb72842 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1593,7 +1593,7 @@ static int i915_drm_resume(struct drm_device *dev)
 	mutex_lock(&dev->struct_mutex);
 	if (i915_gem_init_hw(dev)) {
 		DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
-		atomic_or(I915_WEDGED, &dev_priv->gpu_error.reset_counter);
+		set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
 	}
 	mutex_unlock(&dev->struct_mutex);
 
@@ -1754,20 +1754,13 @@ int i915_reset(struct drm_i915_private *dev_priv)
 {
 	struct drm_device *dev = &dev_priv->drm;
 	struct i915_gpu_error *error = &dev_priv->gpu_error;
-	unsigned reset_counter;
 	int ret;
 
 	mutex_lock(&dev->struct_mutex);
 
 	/* Clear any previous failed attempts at recovery. Time to try again. */
-	atomic_andnot(I915_WEDGED, &error->reset_counter);
-
-	/* Clear the reset-in-progress flag and increment the reset epoch. */
-	reset_counter = atomic_inc_return(&error->reset_counter);
-	if (WARN_ON(__i915_reset_in_progress(reset_counter))) {
-		ret = -EIO;
-		goto error;
-	}
+	__clear_bit(I915_WEDGED, &error->flags);
+	error->reset_count++;
 
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
 
@@ -1804,6 +1797,7 @@ int i915_reset(struct drm_i915_private *dev_priv)
 		goto error;
 	}
 
+	clear_bit(I915_RESET_IN_PROGRESS, &error->flags);
 	mutex_unlock(&dev->struct_mutex);
 
 	/*
@@ -1817,7 +1811,7 @@ int i915_reset(struct drm_i915_private *dev_priv)
 	return 0;
 
 error:
-	atomic_or(I915_WEDGED, &error->reset_counter);
+	set_bit(I915_WEDGED, &error->flags);
 	mutex_unlock(&dev->struct_mutex);
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9f655e2..11436e7 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1379,9 +1379,10 @@ struct i915_gpu_error {
 	 * State variable controlling the reset flow and count
 	 *
 	 * This is a counter which gets incremented when reset is triggered,
-	 * and again when reset has been handled. So odd values (lowest bit set)
-	 * means that reset is in progress and even values that
-	 * (reset_counter >> 1):th reset was successfully completed.
+	 *
+	 * Before the reset commences, the I915_RESET_IN_PROGRESS bit is set
+	 * meaning that any waiters holding onto the struct_mutex should
+	 * relinquish the lock immediately in order for the reset to start.
 	 *
 	 * If reset is not completed succesfully, the I915_WEDGE bit is
 	 * set meaning that hardware is terminally sour and there is no
@@ -1396,10 +1397,11 @@ struct i915_gpu_error {
 	 * naturally enforces the correct ordering between the bail-out of the
 	 * waiter and the gpu reset work code.
 	 */
-	atomic_t reset_counter;
+	unsigned long reset_count;
 
-#define I915_RESET_IN_PROGRESS_FLAG	1
-#define I915_WEDGED			(1 << 31)
+	unsigned long flags;
+#define I915_RESET_IN_PROGRESS	0
+#define I915_WEDGED		(BITS_PER_LONG-1)
 
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
@@ -3192,44 +3194,24 @@ i915_gem_find_active_request(struct intel_engine_cs *engine);
 void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
 void i915_gem_retire_requests_ring(struct intel_engine_cs *engine);
 
-static inline u32 i915_reset_counter(struct i915_gpu_error *error)
-{
-	return atomic_read(&error->reset_counter);
-}
-
-static inline bool __i915_reset_in_progress(u32 reset)
-{
-	return unlikely(reset & I915_RESET_IN_PROGRESS_FLAG);
-}
-
-static inline bool __i915_reset_in_progress_or_wedged(u32 reset)
-{
-	return unlikely(reset & (I915_RESET_IN_PROGRESS_FLAG | I915_WEDGED));
-}
-
-static inline bool __i915_terminally_wedged(u32 reset)
-{
-	return unlikely(reset & I915_WEDGED);
-}
-
 static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
 {
-	return __i915_reset_in_progress(i915_reset_counter(error));
+	return test_bit(I915_RESET_IN_PROGRESS, &error->flags);
 }
 
-static inline bool i915_reset_in_progress_or_wedged(struct i915_gpu_error *error)
+static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
 {
-	return __i915_reset_in_progress_or_wedged(i915_reset_counter(error));
+	return test_bit(I915_WEDGED, &error->flags);
 }
 
-static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
+static inline bool i915_reset_in_progress_or_wedged(struct i915_gpu_error *error)
 {
-	return __i915_terminally_wedged(i915_reset_counter(error));
+	return i915_reset_in_progress(error) | i915_terminally_wedged(error);
 }
 
 static inline u32 i915_reset_count(struct i915_gpu_error *error)
 {
-	return ((i915_reset_counter(error) & ~I915_WEDGED) + 1) / 2;
+	return READ_ONCE(error->reset_count);
 }
 
 void i915_gem_reset(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c843663..443570f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4604,7 +4604,7 @@ int i915_gem_init(struct drm_device *dev)
 		 * for all other failure, such as an allocation failure, bail.
 		 */
 		DRM_ERROR("Failed to initialize GPU, declaring it wedged\n");
-		atomic_or(I915_WEDGED, &dev_priv->gpu_error.reset_counter);
+		__set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
 		ret = 0;
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 60a3a34..74fa27d 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -204,16 +204,18 @@ void i915_gem_request_retire_upto(struct drm_i915_gem_request *req)
 	WARN_ON(i915_verify_lists(engine->dev));
 }
 
-static int i915_gem_check_wedge(unsigned int reset_counter, bool interruptible)
+static int i915_gem_check_wedge(struct drm_i915_private *dev_priv)
 {
-	if (__i915_terminally_wedged(reset_counter))
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+	if (i915_terminally_wedged(error))
 		return -EIO;
 
-	if (__i915_reset_in_progress(reset_counter)) {
+	if (i915_reset_in_progress(error)) {
 		/* Non-interruptible callers can't handle -EAGAIN, hence return
 		 * -EIO unconditionally for these.
 		 */
-		if (!interruptible)
+		if (!dev_priv->mm.interruptible)
 			return -EIO;
 
 		return -EAGAIN;
@@ -298,7 +300,6 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
 			 struct drm_i915_gem_request **req_out)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
-	unsigned int reset_counter = i915_reset_counter(&dev_priv->gpu_error);
 	struct drm_i915_gem_request *req;
 	u32 seqno;
 	int ret;
@@ -312,7 +313,7 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
 	 * EIO if the GPU is already wedged, or EAGAIN to drop the struct_mutex
 	 * and restart.
 	 */
-	ret = i915_gem_check_wedge(reset_counter, dev_priv->mm.interruptible);
+	ret = i915_gem_check_wedge(dev_priv);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 7104dc1..845d340 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2509,53 +2509,41 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
+	DRM_DEBUG_DRIVER("resetting chip\n");
+	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
+
 	/*
-	 * Note that there's only one work item which does gpu resets, so we
-	 * need not worry about concurrent gpu resets potentially incrementing
-	 * error->reset_counter twice. We only need to take care of another
-	 * racing irq/hangcheck declaring the gpu dead for a second time. A
-	 * quick check for that is good enough: schedule_work ensures the
-	 * correct ordering between hang detection and this work item, and since
-	 * the reset in-progress bit is only ever set by code outside of this
-	 * work we don't need to worry about any other races.
+	 * In most cases it's guaranteed that we get here with an RPM
+	 * reference held, for example because there is a pending GPU
+	 * request that won't finish until the reset is done. This
+	 * isn't the case at least when we get here by doing a
+	 * simulated reset via debugfs, so get an RPM reference.
 	 */
-	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
-		DRM_DEBUG_DRIVER("resetting chip\n");
-		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
-
-		/*
-		 * In most cases it's guaranteed that we get here with an RPM
-		 * reference held, for example because there is a pending GPU
-		 * request that won't finish until the reset is done. This
-		 * isn't the case at least when we get here by doing a
-		 * simulated reset via debugs, so get an RPM reference.
-		 */
-		intel_runtime_pm_get(dev_priv);
+	intel_runtime_pm_get(dev_priv);
 
-		intel_prepare_reset(dev_priv);
+	intel_prepare_reset(dev_priv);
 
-		/*
-		 * All state reset _must_ be completed before we update the
-		 * reset counter, for otherwise waiters might miss the reset
-		 * pending state and not properly drop locks, resulting in
-		 * deadlocks with the reset work.
-		 */
-		ret = i915_reset(dev_priv);
+	/*
+	 * All state reset _must_ be completed before we update the
+	 * reset counter, for otherwise waiters might miss the reset
+	 * pending state and not properly drop locks, resulting in
+	 * deadlocks with the reset work.
+	 */
+	ret = i915_reset(dev_priv);
 
-		intel_finish_reset(dev_priv);
+	intel_finish_reset(dev_priv);
 
-		intel_runtime_pm_put(dev_priv);
+	intel_runtime_pm_put(dev_priv);
 
-		if (ret == 0)
-			kobject_uevent_env(kobj,
-					   KOBJ_CHANGE, reset_done_event);
+	if (ret == 0)
+		kobject_uevent_env(kobj,
+				   KOBJ_CHANGE, reset_done_event);
 
-		/*
-		 * Note: The wake_up also serves as a memory barrier so that
-		 * waiters see the update value of the reset counter atomic_t.
-		 */
-		wake_up_all(&dev_priv->gpu_error.reset_queue);
-	}
+	/*
+	 * Note: The wake_up also serves as a memory barrier so that
+	 * waiters see the update value of the reset counter atomic_t.
+	 */
+	i915_error_wake_up(dev_priv);
 }
 
 static void i915_report_and_clear_eir(struct drm_i915_private *dev_priv)
@@ -2674,25 +2662,27 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	i915_capture_error_state(dev_priv, engine_mask, error_msg);
 	i915_report_and_clear_eir(dev_priv);
 
-	if (engine_mask) {
-		atomic_or(I915_RESET_IN_PROGRESS_FLAG,
-				&dev_priv->gpu_error.reset_counter);
+	if (!engine_mask)
+		return;
 
-		/*
-		 * Wakeup waiting processes so that the reset function
-		 * i915_reset_and_wakeup doesn't deadlock trying to grab
-		 * various locks. By bumping the reset counter first, the woken
-		 * processes will see a reset in progress and back off,
-		 * releasing their locks and then wait for the reset completion.
-		 * We must do this for _all_ gpu waiters that might hold locks
-		 * that the reset work needs to acquire.
-		 *
-		 * Note: The wake_up serves as the required memory barrier to
-		 * ensure that the waiters see the updated value of the reset
-		 * counter atomic_t.
-		 */
-		i915_error_wake_up(dev_priv);
-	}
+	if (test_and_set_bit(I915_RESET_IN_PROGRESS,
+			     &dev_priv->gpu_error.flags))
+		return;
+
+	/*
+	 * Wakeup waiting processes so that the reset function
+	 * i915_reset_and_wakeup doesn't deadlock trying to grab
+	 * various locks. By bumping the reset counter first, the woken
+	 * processes will see a reset in progress and back off,
+	 * releasing their locks and then wait for the reset completion.
+	 * We must do this for _all_ gpu waiters that might hold locks
+	 * that the reset work needs to acquire.
+	 *
+	 * Note: The wake_up serves as the required memory barrier to
+	 * ensure that the waiters see the updated value of the reset
+	 * counter atomic_t.
+	 */
+	i915_error_wake_up(dev_priv);
 
 	i915_reset_and_wakeup(dev_priv);
 }
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 78beb7e..46e904a 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3174,15 +3174,26 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 	drm_modeset_unlock_all(&dev_priv->drm);
 }
 
+static bool abort_flip_on_reset(struct intel_crtc *crtc)
+{
+	struct i915_gpu_error *error = &to_i915(crtc->base.dev)->gpu_error;
+
+	if (i915_reset_in_progress(error))
+		return true;
+
+	if (crtc->reset_count != i915_reset_count(error))
+		return true;
+
+	return false;
+}
+
 static bool intel_crtc_has_pending_flip(struct drm_crtc *crtc)
 {
 	struct drm_device *dev = crtc->dev;
 	struct intel_crtc *intel_crtc = to_intel_crtc(crtc);
-	unsigned reset_counter;
 	bool pending;
 
-	reset_counter = i915_reset_counter(&to_i915(dev)->gpu_error);
-	if (intel_crtc->reset_counter != reset_counter)
+	if (abort_flip_on_reset(intel_crtc))
 		return false;
 
 	spin_lock_irq(&dev->event_lock);
@@ -10979,10 +10990,8 @@ static bool __pageflip_finished_cs(struct intel_crtc *crtc,
 {
 	struct drm_device *dev = crtc->base.dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
-	unsigned reset_counter;
 
-	reset_counter = i915_reset_counter(&dev_priv->gpu_error);
-	if (crtc->reset_counter != reset_counter)
+	if (abort_flip_on_reset(crtc))
 		return true;
 
 	/*
@@ -11662,8 +11671,8 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
 	if (ret)
 		goto cleanup;
 
-	intel_crtc->reset_counter = i915_reset_counter(&dev_priv->gpu_error);
-	if (__i915_reset_in_progress_or_wedged(intel_crtc->reset_counter)) {
+	intel_crtc->reset_count = i915_reset_count(&dev_priv->gpu_error);
+	if (i915_reset_in_progress_or_wedged(&dev_priv->gpu_error)) {
 		ret = -EIO;
 		goto cleanup;
 	}
diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index e74d851..63ce82d 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -683,8 +683,8 @@ struct intel_crtc {
 
 	struct intel_crtc_state *config;
 
-	/* reset counter value when the last flip was submitted */
-	unsigned int reset_counter;
+	/* global reset count when the last flip was submitted */
+	unsigned int reset_count;
 
 	/* Access to these should be protected by dev_priv->irq_lock. */
 	bool cpu_fifo_underrun_disabled;
-- 
1.9.1

* [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
  2016-07-26 16:40 ` [PATCH 01/11] drm/i915: Update i915.reset to handle engine resets Arun Siluvery
  2016-07-26 16:40 ` [PATCH 02/11] drm/i915: Separate out reset flags from the reset counter Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 21:52   ` Chris Wilson
  2016-07-26 16:40 ` [PATCH 04/11] drm/i915/tdr: Modify error handler for per engine hang recovery Arun Siluvery
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx

Now that we track reset progress using a separate set of flags, update it
to account for engine resets as well.

The bit corresponding to engine->id + 1 is set if a reset is in progress
for that engine; bit 0 is reserved for full gpu reset.
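The bit layout described above can be illustrated with a small sketch
(constants here are illustrative; the driver keeps these bits in
error->flags, with I915_RESET_IN_PROGRESS at bit 0 and I915_WEDGED at the
top bit):

```c
#include <assert.h>

#define FULL_GPU_RESET_BIT	0
#define WEDGED_BIT		(sizeof(unsigned long) * 8 - 1)

/* Engine reset flags live at engine->id + 1, since bit 0 is taken by
 * the full gpu reset flag. */
static unsigned long engine_reset_bit(int engine_id)
{
	return 1UL << (engine_id + 1);
}

/* Any flag other than WEDGED being set means some reset (full gpu or
 * per-engine) is in progress. */
static int reset_in_progress(unsigned long flags)
{
	return (flags & ~(1UL << WEDGED_BIT)) != 0;
}
```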

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 11436e7..125fafa 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1403,6 +1403,8 @@ struct i915_gpu_error {
 #define I915_RESET_IN_PROGRESS	0
 #define I915_WEDGED		(BITS_PER_LONG-1)
 
+	unsigned long engine_reset_count[I915_NUM_ENGINES];
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3194,9 +3196,10 @@ i915_gem_find_active_request(struct intel_engine_cs *engine);
 void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
 void i915_gem_retire_requests_ring(struct intel_engine_cs *engine);
 
+/* indicates the progress of engine reset or full gpu reset */
 static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
 {
-	return test_bit(I915_RESET_IN_PROGRESS, &error->flags);
+	return unlikely(READ_ONCE(error->flags) & ~BIT(I915_WEDGED));
 }
 
 static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
@@ -3214,6 +3217,17 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+static inline bool i915_full_gpu_reset_in_progress(struct i915_gpu_error *error)
+{
+	return test_bit(I915_RESET_IN_PROGRESS, &error->flags);
+}
+
+static inline bool i915_engine_reset_in_progress(struct i915_gpu_error *error,
+						 struct intel_engine_cs *engine)
+{
+	return test_bit(engine->id + 1, &error->flags);
+}
+
 void i915_gem_reset(struct drm_device *dev);
 bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
 int __must_check i915_gem_init(struct drm_device *dev);
-- 
1.9.1

* [PATCH 04/11] drm/i915/tdr: Modify error handler for per engine hang recovery
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (2 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 16:40 ` [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it Arun Siluvery
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ian Lister, Tomas Elf

This is a preparatory patch which modifies the error handler to do per
engine hang recovery. The patch that actually implements this sequence
follows later in the series. The aim is to prepare the existing recovery
function to try this new path where applicable (which fails at this point
because the core implementation is lacking) and then continue recovery
using the legacy full gpu reset.

A helper function is also added to query the availability of engine
reset. The engine reset and full gpu reset sections are also extracted
into separate functions; this helps readability and makes branching
between reset types simpler.

The error events used to notify userspace of a reset are adapted to
engine reset so that users listening to these events are not broken. In
the legacy path we report an error event, a reset event before resetting
the gpu, and a reset done event marking the completion of the reset. The
same behaviour is kept, but the reset event is only dispatched once even
when multiple engines are hung. Finally, once the reset is complete, we
send the reset done event as usual.
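The intended event ordering can be sketched as follows (the event names
and the recording mechanism here are illustrative, not the actual uevent
plumbing):

```c
#include <assert.h>
#include <string.h>

static const char *events[8];
static int nevents;

static void send_uevent(const char *ev)
{
	events[nevents++] = ev;
}

/* Model of the dispatch described above: one error event, exactly one
 * reset event no matter how many engines hung, and one reset done
 * event once recovery succeeds. */
static void report_hang(int hung_engines, int reset_ok)
{
	send_uevent("ERROR");
	send_uevent("RESET");		/* sent once, not per engine */
	(void)hung_engines;		/* per-engine resets happen here */
	if (reset_ok)
		send_uevent("RESET_DONE");
}
```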

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     |  23 +++++++
 drivers/gpu/drm/i915/i915_drv.h     |   2 +
 drivers/gpu/drm/i915/i915_irq.c     | 132 +++++++++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/intel_uncore.c |   5 ++
 4 files changed, 144 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index fb72842..5c20e5d 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1816,6 +1816,29 @@ error:
 	return ret;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is fairly simple:
+ *  - force engine to idle
+ *  - save current state which includes head and current request
+ *  - reset engine
+ *  - restore saved state and resubmit context
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	int ret;
+
+	/* FIXME: replace me with engine reset sequence */
+	ret = -ENODEV;
+
+	return ret;
+}
+
 static int i915_pm_suspend(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 125fafa..61e57a8b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2836,6 +2836,8 @@ extern long i915_compat_ioctl(struct file *filp, unsigned int cmd,
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset(struct drm_i915_private *dev_priv);
+extern bool intel_has_engine_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 845d340..49dd658 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2492,6 +2492,53 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv)
 	wake_up_all(&dev_priv->pending_flip_queue);
 }
 
+static int i915_reset_engines(struct drm_i915_private *dev_priv)
+{
+	struct intel_engine_cs *engine;
+
+	for_each_engine(engine, dev_priv) {
+		int ret;
+		struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+		if (!i915_engine_reset_in_progress(error, engine))
+			continue;
+
+		ret = i915_reset_engine(engine);
+		if (ret) {
+			DRM_ERROR("Reset of %s failed! ret=%d",
+				  engine->name, ret);
+			return ret;
+		}
+
+		clear_bit(engine->id + 1, &error->flags);
+		error->engine_reset_count[engine->id]++;
+	}
+
+	return 0;
+}
+
+static int i915_reset_full(struct drm_i915_private *dev_priv)
+{
+	int ret;
+
+	/* ensure device is awake */
+	assert_rpm_wakelock_held(dev_priv);
+
+	intel_prepare_reset(dev_priv);
+
+	/*
+	 * All state reset _must_ be completed before we update the
+	 * reset counter, for otherwise waiters might miss the reset
+	 * pending state and not properly drop locks, resulting in
+	 * deadlocks with the reset work.
+	 */
+	ret = i915_reset(dev_priv);
+
+	intel_finish_reset(dev_priv);
+
+	return ret;
+}
+
 /**
  * i915_reset_and_wakeup - do process context error handling work
  * @dev_priv: i915 device private
@@ -2501,6 +2548,7 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv)
  */
 static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 {
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
@@ -2509,7 +2557,15 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
 	/*
@@ -2521,29 +2577,57 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 	 */
 	intel_runtime_pm_get(dev_priv);
 
-	intel_prepare_reset(dev_priv);
+	if (!i915_full_gpu_reset_in_progress(error)) {
+		ret = i915_reset_engines(dev_priv);
+		if (ret) {
+			struct intel_engine_cs *e;
 
-	/*
-	 * All state reset _must_ be completed before we update the
-	 * reset counter, for otherwise waiters might miss the reset
-	 * pending state and not properly drop locks, resulting in
-	 * deadlocks with the reset work.
-	 */
-	ret = i915_reset(dev_priv);
+			/* attempt full gpu reset to recover */
+			set_bit(I915_RESET_IN_PROGRESS, &error->flags);
 
-	intel_finish_reset(dev_priv);
+			/*
+			 * When engine reset fails we switch to full gpu
+			 * reset, which clears everything. In the case where
+			 * multiple engines are hung we would've already
+			 * scheduled work items, and when they attempt an
+			 * engine reset they won't find any active request
+			 * (the full gpu reset would've cleared it). To let
+			 * the work items exit safely, clear the engine
+			 * reset pending mask.
+			 */
+			for_each_engine(e, dev_priv) {
+				if (i915_engine_reset_in_progress(error, e))
+					clear_bit(e->id + 1, &error->flags);
+			}
+		}
+	}
 
-	intel_runtime_pm_put(dev_priv);
+	/*
+	 * Note that there's only one work item which does gpu resets, so we
+	 * need not worry about concurrent gpu resets potentially incrementing
+	 * error->reset_counter twice. We only need to take care of another
+	 * racing irq/hangcheck declaring the gpu dead for a second time. A
+	 * quick check for that is good enough: schedule_work ensures the
+	 * correct ordering between hang detection and this work item, and since
+	 * the reset in-progress bit is only ever set by code outside of this
+	 * work we don't need to worry about any other races.
+	 */
+	if (i915_full_gpu_reset_in_progress(&dev_priv->gpu_error)) {
+		DRM_DEBUG_DRIVER("resetting chip\n");
 
-	if (ret == 0)
-		kobject_uevent_env(kobj,
-				   KOBJ_CHANGE, reset_done_event);
+		ret = i915_reset_full(dev_priv);
+	}
 
 	/*
 	 * Note: The wake_up also serves as a memory barrier so that
 	 * waiters see the update value of the reset counter atomic_t.
 	 */
-	i915_error_wake_up(dev_priv);
+	if (!i915_terminally_wedged(error)) {
+		wake_up_all(&dev_priv->gpu_error.reset_queue);
+		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+	}
+
+	intel_runtime_pm_put(dev_priv);
 }
 
 static void i915_report_and_clear_eir(struct drm_i915_private *dev_priv)
@@ -2641,6 +2725,8 @@ static void i915_report_and_clear_eir(struct drm_i915_private *dev_priv)
  * i915_handle_error - handle a gpu error
  * @dev_priv: i915 device private
  * @engine_mask: mask representing engines that are hung
+ * @fmt: formatted hang message that gets logged in the captured error state
+ *
  * Do some basic checking of register state at error time and
  * dump it to the syslog.  Also call i915_capture_error_state() to make
  * sure we get a record and make it available in debugfs.  Fire a uevent
@@ -2665,9 +2751,19 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	if (!engine_mask)
 		return;
 
-	if (test_and_set_bit(I915_RESET_IN_PROGRESS,
-			     &dev_priv->gpu_error.flags))
-		return;
+	if (intel_has_engine_reset(dev_priv)) {
+		struct intel_engine_cs *engine;
+		struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+		for_each_engine_masked(engine, dev_priv, engine_mask) {
+			if (i915_engine_reset_in_progress(error, engine))
+				continue;
+
+			set_bit(engine->id + 1, &error->flags);
+		}
+	} else {
+		set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
+	}
 
 	/*
 	 * Wakeup waiting processes so that the reset function
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 43f8339..418fd0d 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1775,6 +1775,11 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+bool intel_has_engine_reset(struct drm_i915_private *dev_priv)
+{
+	return (INTEL_INFO(dev_priv)->gen >= 8 && i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (3 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 04/11] drm/i915/tdr: Modify error handler for per engine hang recovery Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 21:37   ` Chris Wilson
  2016-07-26 16:40 ` [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset Arun Siluvery
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx

The currently active request is the one that caused the hang, so it is
retrieved and removed from the ELSP queue; otherwise we cannot submit
further workloads to the GPU.

A consistency check between HW and driver is performed to ensure that we
are dropping the correct request. Since this request will never be
executed, we also advance the seqno to mark it as complete. The head
pointer is advanced past the offending batch so that the HW resumes
execution of other workloads. If HW and SW don't agree, we don't proceed
with the engine reset; this is treated as an error condition and we fall
back to full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h |   2 +
 2 files changed, 118 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index daf1279..8fc5a3b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
 	i915_gem_context_put(ctx);
 }
 
+static void intel_lr_context_resync(struct i915_gem_context *ctx,
+				    struct intel_engine_cs *engine)
+{
+	u32 head;
+	u32 head_addr, tail_addr;
+	u32 *reg_state;
+	struct intel_ringbuffer *ringbuf;
+	struct drm_i915_private *dev_priv = engine->i915;
+
+	ringbuf = ctx->engine[engine->id].ringbuf;
+	reg_state = ctx->engine[engine->id].lrc_reg_state;
+
+	head = I915_READ_HEAD(engine);
+	head_addr = head & HEAD_ADDR;
+	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;
+
+	/*
+	 * Force the head to advance to the next QWORD. In most cases the
+	 * engine head pointer will automatically advance to the next
+	 * instruction as soon as it has read the current instruction,
+	 * without waiting for it to complete. This seems to be the default
+	 * behaviour; however, an MBOX wait inserted directly into the
+	 * VCS/BCS engines does not behave the same way: the head
+	 * pointer will keep pointing at the MBOX instruction until it
+	 * completes.
+	 */
+	head_addr = roundup(head_addr, 8);
+
+	if (head_addr > tail_addr)
+		head_addr = tail_addr;
+	else if (head_addr >= ringbuf->size)
+		head_addr = 0;
+
+	head &= ~HEAD_ADDR;
+	head |= (head_addr & HEAD_ADDR);
+
+	/* update head in ctx */
+	reg_state[CTX_RING_HEAD+1] = head;
+	I915_WRITE_HEAD(engine, head);
+
+	ringbuf->head = head;
+	ringbuf->last_retired_head = -1;
+	intel_ring_update_space(ringbuf);
+}
+
+/**
+ * intel_execlists_reset_prepare() - identifies the request that is
+ * hung and drops it
+ *
+ * Head is adjusted to skip the batch that caused the hang
+ *
+ * @engine: Engine that is currently hung
+ *
+ * Returns:
+ *   0 - on success, the engine reset can proceed
+ *   nonzero - otherwise; fall back to full gpu reset
+ */
+int intel_execlists_reset_prepare(struct intel_engine_cs *engine)
+{
+	struct drm_i915_gem_request *req;
+	bool continue_with_reset;
+
+	spin_lock_bh(&engine->execlist_lock);
+
+	req = list_first_entry_or_null(&engine->execlist_queue,
+				       struct drm_i915_gem_request,
+				       execlist_link);
+
+	/*
+	 * Only acknowledge the request in the execlist queue if it's actually
+	 * been submitted to hardware, otherwise it cannot cause hang.
+	 */
+	if (req && req->ctx && req->elsp_submitted) {
+		u32 execlist_status;
+		u32 hw_context;
+		u32 hw_active;
+		struct drm_i915_private *dev_priv = engine->i915;
+
+		hw_context = I915_READ(RING_EXECLIST_STATUS_HI(engine));
+		execlist_status = I915_READ(RING_EXECLIST_STATUS_LO(engine));
+		hw_active = ((execlist_status & EXECLIST_STATUS_ELEMENT0_ACTIVE) ||
+			     (execlist_status & EXECLIST_STATUS_ELEMENT1_ACTIVE));
+
+		continue_with_reset = hw_active && hw_context == req->ctx->hw_id;
+		if (!continue_with_reset) {
+			DRM_ERROR("GPU hung when HW is not active!\n");
+			goto unlock;
+		}
+
+		/*
+		 * GPU is now hung and the request that caused it
+		 * will be dropped so mark it as completed
+		 */
+		intel_write_status_page(engine, I915_GEM_HWS_INDEX, req->fence.seqno);
+
+		intel_lr_context_resync(req->ctx, engine);
+
+		/*
+		 * remove the request from the elsp queue so that
+		 * engine can resume execution after reset when new
+		 * requests are submitted
+		 */
+		if (!--req->elsp_submitted) {
+			list_del(&req->execlist_link);
+			i915_gem_request_put(req);
+		}
+	} else {
+		WARN(1, "GPU hang detected with no active request\n");
+		continue_with_reset = false;
+	}
+
+unlock:
+	spin_unlock_bh(&engine->execlist_lock);
+	return !continue_with_reset;
+}
+
 static int intel_logical_ring_workarounds_emit(struct drm_i915_gem_request *req)
 {
 	int ret, i;
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 3828730..1171ea1 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -31,6 +31,8 @@
 /* Execlists regs */
 #define RING_ELSP(engine)			_MMIO((engine)->mmio_base + 0x230)
 #define RING_EXECLIST_STATUS_LO(engine)		_MMIO((engine)->mmio_base + 0x234)
+#define   EXECLIST_STATUS_ELEMENT0_ACTIVE       (1 << 14)
+#define   EXECLIST_STATUS_ELEMENT1_ACTIVE       (1 << 15)
 #define RING_EXECLIST_STATUS_HI(engine)		_MMIO((engine)->mmio_base + 0x234 + 4)
 #define RING_CONTEXT_CONTROL(engine)		_MMIO((engine)->mmio_base + 0x244)
 #define	  CTX_CTRL_INHIBIT_SYN_CTX_SWITCH	(1 << 3)
-- 
1.9.1


* [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (4 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 21:32   ` Chris Wilson
  2016-07-26 16:40 ` [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery Arun Siluvery
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

We stop the engine during reset and recovery, and after a successful reset
the request that caused the hang has been removed from the queue, so we can
restart submissions to the ELSP.

Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 45 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8fc5a3b..7834edc 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -418,7 +418,8 @@ static inline void execlists_context_status_change(
 	atomic_notifier_call_chain(&rq->ctx->status_notifier, status, rq);
 }
 
-static void execlists_context_unqueue(struct intel_engine_cs *engine)
+static void execlists_context_unqueue(struct intel_engine_cs *engine,
+				      bool submission_after_reset)
 {
 	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
 	struct drm_i915_gem_request *cursor, *tmp;
@@ -436,6 +437,27 @@ static void execlists_context_unqueue(struct intel_engine_cs *engine)
 				 execlist_link) {
 		if (!req0) {
 			req0 = cursor;
+
+			/*
+			 * We submit two requests at a time, req0 and req1.
+			 * Assume that req0 is the one that caused the hang
+			 * and req1 is a normal batch.
+			 *
+			 * After the engine reset, once the engine is
+			 * reinitialized, we skip req0 and submit req1
+			 * along with the next request in the queue, so we
+			 * end up incrementing req1->elsp_submitted again.
+			 * But after the reset the HW has already switched
+			 * to req1 and executed it, so just this once submit
+			 * only req1 (which is req0 now) and don't increment
+			 * the submission count. Once this one-off submission
+			 * completes we go back to submitting two requests as usual.
+			 */
+			if (submission_after_reset) {
+				if (req0->elsp_submitted)
+					req0->elsp_submitted--;
+				break;
+			}
 		} else if (req0->ctx == cursor->ctx) {
 			/* Same ctx: ignore first request, as second request
 			 * will update tail past first request's workload */
@@ -600,7 +622,7 @@ static void intel_lrc_irq_handler(unsigned long data)
 	if (submit_contexts) {
 		if (!engine->disable_lite_restore_wa ||
 		    (csb[i][0] & GEN8_CTX_STATUS_ACTIVE_IDLE))
-			execlists_context_unqueue(engine);
+			execlists_context_unqueue(engine, false);
 	}
 
 	spin_unlock(&engine->execlist_lock);
@@ -640,7 +662,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request)
 	list_add_tail(&request->execlist_link, &engine->execlist_queue);
 	request->ctx_hw_id = request->ctx->hw_id;
 	if (num_elements == 0)
-		execlists_context_unqueue(engine);
+		execlists_context_unqueue(engine, false);
 
 	spin_unlock_bh(&engine->execlist_lock);
 }
@@ -1142,6 +1164,23 @@ unlock:
 	return !continue_with_reset;
 }
 
+/**
+ * intel_execlists_restart_submission() - restarts elsp submissions after
+ * reset
+ *
+ * @engine: engine to be re-started
+ *
+ */
+void intel_execlists_restart_submission(struct intel_engine_cs *engine)
+{
+	spin_lock_bh(&engine->execlist_lock);
+
+	if (!list_empty(&engine->execlist_queue))
+		execlists_context_unqueue(engine, true);
+
+	spin_unlock_bh(&engine->execlist_lock);
+}
+
 static int intel_logical_ring_workarounds_emit(struct drm_i915_gem_request *req)
 {
 	int ret, i;
-- 
1.9.1


* [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (5 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 21:51   ` Chris Wilson
  2016-07-26 16:40 ` [PATCH 08/11] drm/i915: Skip reset request if there is one already Arun Siluvery
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

The hangchecker determines which engines are hung and invokes the error
handler to recover from the hang. The error handler schedules recovery for
each of the hung engines. The recovery procedure is as follows,
 - force the engine to idle: this is done by issuing a reset request
 - identify the request that caused the hang and drop it
 - reset and re-init the engine
 - restart submissions to the engine

If the engine reset fails we fall back to the heavyweight full gpu reset,
which resets all engines and reinitializes the complete HW and SW state.

Possible reasons for failure,
 - the engine is not ready for reset
 - the HW and SW don't agree on the context that caused the hang
 - the reset itself failed for some reason

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
 drivers/gpu/drm/i915/i915_drv.h     |  3 ++
 drivers/gpu/drm/i915/i915_gem.c     |  2 +-
 drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
 drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
 5 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5c20e5d..8151aa9 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1824,17 +1824,60 @@ error:
  * Returns zero on successful reset or otherwise an error code.
  *
  * Procedure is fairly simple:
- *  - force engine to idle
- *  - save current state which includes head and current request
- *  - reset engine
- *  - restore saved state and resubmit context
+ *    - force the engine to idle: this is done by issuing a reset request
+ *    - identify the request that caused the hang and retire it
+ *    - reset the engine
+ *    - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
+	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
 
-	/* FIXME: replace me with engine reset sequence */
-	ret = -ENODEV;
+	/* Ensure irq handler finishes or is cancelled. */
+	tasklet_kill(&engine->irq_tasklet);
+
+	i915_gem_reset_engine_status(engine);
+
+	/* Take wake lock to prevent power saving mode */
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	ret = intel_request_for_reset(engine);
+	if (ret) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto out;
+	}
+
+	ret = intel_execlists_reset_prepare(engine);
+	if (ret)
+		goto enable_engine;
+
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		goto enable_engine;
+	}
+
+	ret = engine->init_hw(engine);
+	if (ret)
+		goto out;
+
+	/* Restart submissions to the engine after reset */
+	intel_execlists_restart_submission(engine);
+
+enable_engine:
+	/*
+	 * We only need to re-enable the engine if we could not prepare it
+	 * for reset or the reset itself failed. If the reset succeeds the
+	 * engine is re-enabled automatically, so we can skip this step.
+	 */
+	if (ret)
+		intel_clear_reset_request(engine);
+
+out:
+	/* Wake up anything waiting on this engine's queue */
+	intel_engine_wakeup(engine);
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 61e57a8b..ac7a42c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2837,6 +2837,8 @@ extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset(struct drm_i915_private *dev_priv);
 extern bool intel_has_engine_reset(struct drm_i915_private *dev_priv);
+extern int intel_request_for_reset(struct intel_engine_cs *engine);
+extern int intel_clear_reset_request(struct intel_engine_cs *engine);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
@@ -3231,6 +3233,7 @@ static inline bool i915_engine_reset_in_progress(struct i915_gpu_error *error,
 }
 
 void i915_gem_reset(struct drm_device *dev);
+void i915_gem_reset_engine_status(struct intel_engine_cs *ring);
 bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
 int __must_check i915_gem_init(struct drm_device *dev);
 int __must_check i915_gem_init_hw(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 443570f..e34effb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2468,7 +2468,7 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
 	return NULL;
 }
 
-static void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
+void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 	bool ring_hung;
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1171ea1..4035f63 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -132,4 +132,8 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 
 void intel_execlists_cancel_requests(struct intel_engine_cs *engine);
 
+/* Engine reset */
+int intel_execlists_reset_prepare(struct intel_engine_cs *engine);
+void intel_execlists_restart_submission(struct intel_engine_cs *engine);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 418fd0d..d807871 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1799,6 +1799,39 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_request_for_reset(struct intel_engine_cs *engine)
+{
+	if (!intel_has_engine_reset(engine->i915)) {
+		DRM_ERROR("Engine Reset not supported on Gen%d\n",
+			  INTEL_INFO(engine->i915)->gen);
+		return -EINVAL;
+	}
+
+	return gen8_request_engine_reset(engine);
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+int intel_clear_reset_request(struct intel_engine_cs *engine)
+{
+	if (!intel_has_engine_reset(engine->i915)) {
+		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
+			  INTEL_INFO(engine->i915)->gen);
+		return -EINVAL;
+	}
+
+	gen8_unrequest_engine_reset(engine);
+	return 0;
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);
-- 
1.9.1


* [PATCH 08/11] drm/i915: Skip reset request if there is one already
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (6 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 16:40 ` [PATCH 09/11] drm/i915/tdr: Add engine reset count to error state Arun Siluvery
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

To perform an engine reset we first disable the engine in order to capture
its state. This is done by issuing a reset request. Because we are reusing
the existing infrastructure, when we actually reset the engine the reset
function checks the engine mask and issues the reset request a second time,
which is unnecessary. To avoid this we check whether the engine is already
prepared and, if so, return early.

Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index d807871..0ad0b7f 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1686,10 +1686,15 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 static int gen8_request_engine_reset(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
+	const i915_reg_t reset_ctrl = RING_RESET_CTL(engine->mmio_base);
+	const u32 ready = RESET_CTL_REQUEST_RESET | RESET_CTL_READY_TO_RESET;
 	int ret;
 
-	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
-		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+	/* If engine has been already prepared, we can shortcut here */
+	if ((I915_READ_FW(reset_ctrl) & ready) == ready)
+		return 0;
+
+	I915_WRITE_FW(reset_ctrl, _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
 
 	ret = intel_wait_for_register_fw(dev_priv,
 					 RING_RESET_CTL(engine->mmio_base),
-- 
1.9.1


* [PATCH 09/11] drm/i915/tdr: Add engine reset count to error state
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (7 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 08/11] drm/i915: Skip reset request if there is one already Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 16:40 ` [PATCH 10/11] drm/i915/tdr: Export reset count info to debugfs Arun Siluvery
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx

The driver maintains a count of how many times each engine has been reset;
it is useful to capture this in the error state as well. It gives an idea
of how the engine was coping with its workloads before the error state was
captured.

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h       | 7 +++++++
 drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ac7a42c..654bbcc 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -520,6 +520,7 @@ struct drm_i915_error_state {
 		int hangcheck_score;
 		enum intel_ring_hangcheck_action hangcheck_action;
 		int num_requests;
+		u32 reset_count;
 
 		/* our own tracking of ring head and tail */
 		u32 cpu_ring_head;
@@ -3232,6 +3233,12 @@ static inline bool i915_engine_reset_in_progress(struct i915_gpu_error *error,
 	return test_bit(engine->id + 1, &error->flags);
 }
 
+static inline u32 i915_engine_reset_count(struct i915_gpu_error *error,
+					  struct intel_engine_cs *engine)
+{
+	return READ_ONCE(error->engine_reset_count[engine->id]);
+}
+
 void i915_gem_reset(struct drm_device *dev);
 void i915_gem_reset_engine_status(struct intel_engine_cs *ring);
 bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 4d39c72..ea14b4c 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -303,6 +303,7 @@ static void i915_ring_error_state(struct drm_i915_error_state_buf *m,
 	err_printf(m, "  hangcheck: %s [%d]\n",
 		   hangcheck_action_to_str(ring->hangcheck_action),
 		   ring->hangcheck_score);
+	err_printf(m, "  engine reset count: %u\n", ring->reset_count);
 }
 
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
@@ -1022,6 +1023,8 @@ static void i915_record_ring_state(struct drm_i915_private *dev_priv,
 
 	ering->hangcheck_score = engine->hangcheck.score;
 	ering->hangcheck_action = engine->hangcheck.action;
+	ering->reset_count = i915_engine_reset_count(&dev_priv->gpu_error,
+						     engine);
 
 	if (USES_PPGTT(dev_priv)) {
 		int i;
-- 
1.9.1


* [PATCH 10/11] drm/i915/tdr: Export reset count info to debugfs
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (8 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 09/11] drm/i915/tdr: Add engine reset count to error state Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 16:40 ` [PATCH 11/11] drm/i915/tdr: Enable Engine reset and recovery support Arun Siluvery
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx

A new debugfs entry is added to export the reset counts; this includes the
full gpu reset count and the per-engine reset counts. This is useful for
tests that are expected to trigger a reset: the counts are checked before
and after the test to verify that the reset actually happened.

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 9aa62c5..853da76 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1470,6 +1470,24 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static int i915_reset_info(struct seq_file *m, void *unused)
+{
+	struct drm_info_node *node = m->private;
+	struct drm_device *dev = node->minor->dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct intel_engine_cs *engine;
+
+	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+
+	for_each_engine(engine, dev_priv) {
+		seq_printf(m, "%s = %u\n", engine->name,
+			   i915_engine_reset_count(error, engine));
+	}
+
+	return 0;
+}
+
 static int ironlake_drpc_info(struct seq_file *m)
 {
 	struct drm_info_node *node = m->private;
@@ -5379,6 +5397,7 @@ static const struct drm_info_list i915_debugfs_list[] = {
 	{"i915_guc_log_dump", i915_guc_log_dump, 0},
 	{"i915_frequency_info", i915_frequency_info, 0},
 	{"i915_hangcheck_info", i915_hangcheck_info, 0},
+	{"i915_reset_info", i915_reset_info, 0},
 	{"i915_drpc_info", i915_drpc_info, 0},
 	{"i915_emon_status", i915_emon_status, 0},
 	{"i915_ring_freq_table", i915_ring_freq_table, 0},
-- 
1.9.1


* [PATCH 11/11] drm/i915/tdr: Enable Engine reset and recovery support
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (9 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 10/11] drm/i915/tdr: Export reset count info to debugfs Arun Siluvery
@ 2016-07-26 16:40 ` Arun Siluvery
  2016-07-26 17:11 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches Patchwork
  2016-07-28  5:40 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches (rev4) Patchwork
  12 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-26 16:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

Everything in place, flip the switch.

This feature is available only from Gen8; for previous gen devices the
driver falls back to legacy full gpu reset.
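[The parameter semantics after this flip can be sketched as below; the
RESET_* names and helper functions are illustrative stand-ins, not the
driver's actual code.]

```c
#include <assert.h>

/* Illustrative sketch (not driver code) of how the i915.reset modparam
 * gates hang recovery after this patch: 0 disables resets, 1 keeps the
 * legacy full gpu reset, 2 (the new default) tries a per-engine reset
 * first and falls back to a full reset. The try_* helpers are stubs. */
enum { RESET_DISABLED = 0, RESET_FULL_GPU = 1, RESET_ENGINE = 2 };

static int try_engine_reset(void)   { return 0; } /* stub: pretend success */
static int try_full_gpu_reset(void) { return 0; } /* stub: pretend success */

static int handle_hang(int reset_param)
{
	if (reset_param == RESET_DISABLED)
		return -1;                   /* leave the GPU wedged */
	if (reset_param == RESET_ENGINE && try_engine_reset() == 0)
		return 0;                    /* light-weight recovery worked */
	return try_full_gpu_reset();         /* legacy path, also the fallback */
}
```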

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index a96df81..dab6fc3 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -45,7 +45,7 @@ struct i915_params i915 __read_mostly = {
 	.fastboot = 0,
 	.prefault_disable = 0,
 	.load_detect_test = 0,
-	.reset = 1,
+	.reset = 2,
 	.invert_brightness = 0,
 	.disable_display = 0,
 	.enable_cmd_parser = 1,
@@ -112,7 +112,7 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
 module_param_named_unsafe(reset, i915.reset, int, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset, 2=engine reset [default])");
 
 module_param_named_unsafe(enable_hangcheck, i915.enable_hangcheck, bool, 0644);
 MODULE_PARM_DESC(enable_hangcheck,
-- 
1.9.1


* ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (10 preceding siblings ...)
  2016-07-26 16:40 ` [PATCH 11/11] drm/i915/tdr: Enable Engine reset and recovery support Arun Siluvery
@ 2016-07-26 17:11 ` Patchwork
  2016-07-28  5:40 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches (rev4) Patchwork
  12 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2016-07-26 17:11 UTC (permalink / raw)
  To: arun.siluvery; +Cc: intel-gfx

== Series Details ==

Series: Execlist based Engine reset patches
URL   : https://patchwork.freedesktop.org/series/10283/
State : failure

== Summary ==

Series 10283v1 Execlist based Engine reset patches
http://patchwork.freedesktop.org/api/1.0/series/10283/revisions/1/mbox

Test drv_hangman:
        Subgroup error-state-basic:
                pass       -> DMESG-WARN (ro-bdw-i7-5600u)
Test gem_exec_gttfill:
        Subgroup basic:
                pass       -> SKIP       (fi-snb-i7-2600)
Test gem_ringfill:
        Subgroup basic-default-hang:
                pass       -> DMESG-WARN (ro-bdw-i7-5600u)
Test kms_cursor_legacy:
        Subgroup basic-cursor-vs-flip-legacy:
                fail       -> PASS       (ro-ilk1-i5-650)
        Subgroup basic-flip-vs-cursor-legacy:
                fail       -> PASS       (ro-hsw-i7-4770r)
                fail       -> PASS       (ro-byt-n2820)
                fail       -> DMESG-FAIL (ro-skl3-i5-6260u)
        Subgroup basic-flip-vs-cursor-varying-size:
                pass       -> FAIL       (ro-snb-i7-2620M)
                fail       -> PASS       (ro-bdw-i5-5250u)
Test kms_pipe_crc_basic:
        Subgroup hang-read-crc-pipe-a:
                pass       -> DMESG-WARN (ro-bdw-i7-5600u)
        Subgroup hang-read-crc-pipe-b:
                pass       -> DMESG-WARN (ro-bdw-i7-5600u)
        Subgroup hang-read-crc-pipe-c:
                pass       -> DMESG-WARN (ro-bdw-i7-5600u)

fi-hsw-i7-4770k  total:235  pass:215  dwarn:0   dfail:0   fail:0   skip:20 
fi-kbl-qkkr      total:235  pass:180  dwarn:28  dfail:1   fail:0   skip:26 
fi-skl-i5-6260u  total:235  pass:223  dwarn:0   dfail:0   fail:0   skip:12 
fi-skl-i7-6700k  total:235  pass:209  dwarn:0   dfail:0   fail:0   skip:26 
fi-snb-i7-2600   total:235  pass:194  dwarn:0   dfail:0   fail:0   skip:41 
ro-bdw-i5-5250u  total:239  pass:218  dwarn:4   dfail:0   fail:1   skip:16 
ro-bdw-i7-5600u  total:239  pass:201  dwarn:5   dfail:0   fail:1   skip:32 
ro-bsw-n3050     total:239  pass:193  dwarn:0   dfail:0   fail:4   skip:42 
ro-byt-n2820     total:239  pass:196  dwarn:0   dfail:0   fail:3   skip:40 
ro-hsw-i3-4010u  total:239  pass:213  dwarn:0   dfail:0   fail:0   skip:26 
ro-hsw-i7-4770r  total:239  pass:213  dwarn:0   dfail:0   fail:0   skip:26 
ro-ilk1-i5-650   total:234  pass:172  dwarn:0   dfail:0   fail:2   skip:60 
ro-ivb-i7-3770   total:239  pass:204  dwarn:0   dfail:0   fail:0   skip:35 
ro-ivb2-i7-3770  total:239  pass:208  dwarn:0   dfail:0   fail:0   skip:31 
ro-skl3-i5-6260u total:239  pass:223  dwarn:0   dfail:1   fail:1   skip:14 
ro-snb-i7-2620M  total:239  pass:196  dwarn:0   dfail:0   fail:2   skip:41 
ro-bdw-i7-5557U failed to connect after reboot

Results at /archive/results/CI_IGT_test/RO_Patchwork_1613/

819b812 drm-intel-nightly: 2016y-07m-26d-14h-26m-25s UTC integration manifest
e3836e9 drm/i915/tdr: Enable Engine reset and recovery support
5659355 drm/i915/tdr: Export reset count info to debugfs
6fadd07 drm/i915/tdr: Add engine reset count to error state
6a15b51 drm/i915: Skip reset request if there is one already
5342eeb drm/i915/tdr: Add support for per engine reset recovery
1143c7d drm/i915/tdr: Restart submission after engine reset
d864b8a drm/i915/tdr: Identify hung request and drop it
8bd2892 drm/i915/tdr: Modify error handler for per engine hang recovery
7df335e drm/i915/tdr: Update reset_in_progress to account for engine reset
8513718 drm/i915: Separate out reset flags from the reset counter
8c5bc89 drm/i915: Update i915.reset to handle engine resets


* Re: [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset
  2016-07-26 16:40 ` [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset Arun Siluvery
@ 2016-07-26 21:32   ` Chris Wilson
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2016-07-26 21:32 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Tomas Elf

On Tue, Jul 26, 2016 at 05:40:52PM +0100, Arun Siluvery wrote:
> We stop the engine during reset and recovery, so after a successful reset
> the request that caused the hang will have been removed from the queue and
> we can restart submissions to the elsp.
> 
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 45 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8fc5a3b..7834edc 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -418,7 +418,8 @@ static inline void execlists_context_status_change(
>  	atomic_notifier_call_chain(&rq->ctx->status_notifier, status, rq);
>  }
>  
> -static void execlists_context_unqueue(struct intel_engine_cs *engine)
> +static void execlists_context_unqueue(struct intel_engine_cs *engine,
> +				      bool submission_after_reset)
>  {
>  	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
>  	struct drm_i915_gem_request *cursor, *tmp;
> @@ -436,6 +437,27 @@ static void execlists_context_unqueue(struct intel_engine_cs *engine)
>  				 execlist_link) {
>  		if (!req0) {
>  			req0 = cursor;
> +
> +			/*
> +			 * we submit two requests at a time, req0 and req1.
> +			 * Assume that req0 is the one that causes hang and
> +			 * req1 is a normal batch.
> +			 *
> +			 * After engine reset, once engine is
> +			 * reinitialized, we skip req0 and submit req1
> +			 * along with the next request in the queue so we end up
> +			 * incrementing req1->elsp_submitted again. But
> +			 * after reset HW would've switched to req1 and
> +			 * executed it so just this once, submit only req1
> +			 * (which is req0 now) and don't increment
> +			 * submission count. Once this is removed we submit
> +			 * two requests as usual.
> +			 */
> +			if (submission_after_reset) {
> +				if (req0->elsp_submitted)
> +					req0->elsp_submitted--;
> +				break;

And no. Put the special case handling in the reset to cancel the
submitted hw tracking and re-establish the request queue.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it
  2016-07-26 16:40 ` [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it Arun Siluvery
@ 2016-07-26 21:37   ` Chris Wilson
  2016-07-27 11:54     ` Arun Siluvery
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2016-07-26 21:37 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Tue, Jul 26, 2016 at 05:40:51PM +0100, Arun Siluvery wrote:
> The current active request is the one that caused the hang, so it is
> retrieved and removed from the elsp queue; otherwise we cannot submit
> other workloads to be processed by the GPU.
> 
> A consistency check between HW and driver is performed to ensure that we
> are dropping the correct request. Since this request doesn't get executed
> anymore, we also need to advance the seqno to mark it as complete. The
> head pointer is advanced to skip the offending batch so that the HW
> resumes execution of other workloads. If HW and SW don't agree then we
> won't proceed with engine reset; this is treated as an error condition
> and we fall back to full gpu reset.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.h |   2 +
>  2 files changed, 118 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index daf1279..8fc5a3b 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
>  	i915_gem_context_put(ctx);
>  }
>  
> +static void intel_lr_context_resync(struct i915_gem_context *ctx,
> +				    struct intel_engine_cs *engine)
> +{
> +	u32 head;
> +	u32 head_addr, tail_addr;
> +	u32 *reg_state;
> +	struct intel_ringbuffer *ringbuf;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +
> +	ringbuf = ctx->engine[engine->id].ringbuf;
> +	reg_state = ctx->engine[engine->id].lrc_reg_state;
> +
> +	head = I915_READ_HEAD(engine);
> +	head_addr = head & HEAD_ADDR;
> +	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;

?

We know where we want the head to be to emit the breadcrumb and complete
the request since we can record that when constructing the request. That
also neatly solves the riddle of how to update the hw state.

resync?  intel_lr_context_reset_ring may be more apt, or maybe
intel_execlists_reset_request?
-Chris
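[The idea above can be modelled as follows; the types and the postfix
field are illustrative assumptions, not the driver's request struct.]

```c
#include <assert.h>

/* Sketch of the suggestion: record at request construction where the
 * ring head must point once the request's breadcrumb has executed, so
 * a reset can skip the hung batch without reverse-engineering the
 * position from hw registers. */
struct fake_ring { unsigned int head, tail; };
struct fake_request { unsigned int postfix; /* head after the breadcrumb */ };

static void reset_skip_hung_request(struct fake_ring *ring,
				    const struct fake_request *hung)
{
	/* Execution resumes just past the hung batch's breadcrumb. */
	ring->head = hung->postfix;
}
```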

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery
  2016-07-26 16:40 ` [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery Arun Siluvery
@ 2016-07-26 21:51   ` Chris Wilson
  2016-07-27 11:49     ` Arun Siluvery
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2016-07-26 21:51 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Tomas Elf

On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
> 
> Hangchecker determines which engines are hung and invokes the error
> handler to recover from them. The error handler schedules recovery for
> each of the hung engines. The recovery procedure is as follows:
>  - force engine to idle: this is done by issuing a reset request
>  - identifies the request that caused the hang and it is dropped
>  - reset and re-init engine
>  - restart submissions to the engine
> 
> If engine reset fails then we fall back to a heavyweight full gpu reset
> which resets all engines and reinitializes the complete state of HW and SW.
> 
> Possible reasons for failure,
>  - engine not ready for reset
>  - if the HW and SW don't agree on the context that caused the hang
>  - reset itself failed for some reason
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
>  drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>  5 files changed, 90 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 5c20e5d..8151aa9 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1824,17 +1824,60 @@ error:
>   * Returns zero on successful reset or otherwise an error code.
>   *
>   * Procedure is fairly simple:
> - *  - force engine to idle
> - *  - save current state which includes head and current request
> - *  - reset engine
> - *  - restore saved state and resubmit context
> + *    - force engine to idle: this is done by issuing a reset request
> + *    - identifies the request that caused the hang and it is retired
> + *    - reset engine
> + *    - restart submissions to the engine
>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)
>  {
> +	struct drm_i915_private *dev_priv = engine->i915;
>  	int ret;
>  
> -	/* FIXME: replace me with engine reset sequence */
> -	ret = -ENODEV;
> +	/* Ensure irq handler finishes or is cancelled. */
> +	tasklet_kill(&engine->irq_tasklet);
> +
> +	i915_gem_reset_engine_status(engine);

Not yet safe to be used lockless.

> +	/*Take wake lock to prevent power saving mode */

Why else would you be taking it? We know the wakelock prevents the GT
power well from disappearing; the question is why that is important now
across the entirety of this sequence. That's what you explain in the
comments.

> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +
> +	ret = intel_request_for_reset(engine);
> +	if (ret) {
> +		DRM_ERROR("Failed to disable %s\n", engine->name);
> +		goto out;
> +	}
> +
> +	ret = intel_execlists_reset_prepare(engine);
> +	if (ret)
> +		goto enable_engine;
> +
> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> +	if (ret) {
> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> +		goto enable_engine;
> +	}
> +
> +	ret = engine->init_hw(engine);
> +	if (ret)
> +		goto out;
> +
> +	/* Restart submissions to the engine after reset */
> +	intel_execlists_restart_submission(engine);

Clumsy. See above as we already have an entry point that can initiate
the hw.

> +enable_engine:
> +	/*
> +	 * we only need to enable engine if we cannot prepare engine for
> +	 * reset or reset fails. If the reset is successful, engine gets
> +	 * enabled automatically so we can skip this step.
> +	 */
> +	if (ret)

Then why is the normal path coming through here?

> +		intel_clear_reset_request(engine);
> +
> +out:
> +	/* Wake up anything waiting on this engine's queue */
> +	intel_engine_wakeup(engine);

? Random call with incorrect locking. What do you think you are
achieving here because it's not what the comment implies.

> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>  
>  	return ret;
>  }
> +/*
> + * On gen8+ a reset request has to be issued via the reset control register
> + * before a GPU engine can be reset in order to stop the command streamer
> + * and idle the engine. This replaces the legacy way of stopping an engine
> + * by writing to the stop ring bit in the MI_MODE register.
> + */
> +int intel_request_for_reset(struct intel_engine_cs *engine)

intel_engine_reset_begin()

It would be preferable if we could avoid using "request" here, as we are
talking about resetting the hung request elsewhere in this chain.

> +{
> +	if (!intel_has_engine_reset(engine->i915)) {
> +		DRM_ERROR("Engine Reset not supported on Gen%d\n",
> +			  INTEL_INFO(engine->i915)->gen);

ERROR?

> +		return -EINVAL;
return -ENODEV;

> +	}
> +
> +	return gen8_request_engine_reset(engine);
> +}

gen8_engine_reset_begin()

> +/*
> + * It is possible to back off from a previously issued reset request by simply
> + * clearing the reset request bit in the reset control register.
> + */
> +int intel_clear_reset_request(struct intel_engine_cs *engine)

intel_engine_reset_cancel()

> +{
> +	if (!intel_has_engine_reset(engine->i915)) {
> +		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
> +			  INTEL_INFO(engine->i915)->gen);
> +		return -EINVAL;
> +	}
> +
> +	gen8_unrequest_engine_reset(engine);

gen8_engine_reset_cancel()

(only suggestions for avoiding having confusion over requests)


-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset
  2016-07-26 16:40 ` [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset Arun Siluvery
@ 2016-07-26 21:52   ` Chris Wilson
  2016-07-27 11:16     ` Arun Siluvery
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2016-07-26 21:52 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Tue, Jul 26, 2016 at 05:40:49PM +0100, Arun Siluvery wrote:
> Now that we track reset progress using separate set of flags, update it to
> account for engine reset as well.
> 
> A bit corresponding to engine->id is set if a reset is in progress for
> that engine. Bit 0 is for full gpu reset.
> 
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 11436e7..125fafa 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1403,6 +1403,8 @@ struct i915_gpu_error {
>  #define I915_RESET_IN_PROGRESS	0
>  #define I915_WEDGED		(BITS_PER_LONG-1)
>  
> +	unsigned long engine_reset_count[I915_NUM_ENGINES];
> +
>  	/**
>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>  	 * to release the struct_mutex for the reset to procede.
> @@ -3194,9 +3196,10 @@ i915_gem_find_active_request(struct intel_engine_cs *engine);
>  void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
>  void i915_gem_retire_requests_ring(struct intel_engine_cs *engine);
>  
> +/* indicates the progress of engine reset or full gpu reset */
>  static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
>  {
> -	return test_bit(I915_RESET_IN_PROGRESS, &error->flags);
> +	return unlikely(READ_ONCE(error->flags) & ~BIT(I915_WEDGED));
>  }
>  
>  static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
> @@ -3214,6 +3217,17 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>  	return READ_ONCE(error->reset_count);
>  }
>  
> +static inline bool i915_full_gpu_reset_in_progress(struct i915_gpu_error *error)
> +{
> +	return test_bit(I915_RESET_IN_PROGRESS, &error->flags);
> +}
> +
> +static inline bool i915_engine_reset_in_progress(struct i915_gpu_error *error,
> +						 struct intel_engine_cs *engine)
> +{
> +	return test_bit(engine->id + 1, &error->flags);
> +}

?

A totally unexplained change. If it is because you think you want to break
waiters on struct_mutex, try again.
-Chris
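[The flag layout under discussion can be modelled as below; the bit
positions follow the patch (bit 0 full gpu reset, bits 1..N per-engine,
top bit wedged), but the helpers are illustrative, not the driver's code.]

```c
#include <assert.h>

/* Model of the reset flag word: one "in progress" bit per reset scope. */
#define BITS_PER_ULONG        (8 * (int)sizeof(unsigned long))
#define RESET_IN_PROGRESS_BIT 0
#define WEDGED_BIT            (BITS_PER_ULONG - 1)

static int any_reset_in_progress(unsigned long flags)
{
	/* Any reset bit set, engine or full, ignoring the wedged bit. */
	return (flags & ~(1UL << WEDGED_BIT)) != 0;
}

static int engine_reset_in_progress(unsigned long flags, int engine_id)
{
	/* Engine bits start at bit 1; bit 0 is the full gpu reset. */
	return (flags >> (engine_id + 1)) & 1UL;
}
```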

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset
  2016-07-26 21:52   ` Chris Wilson
@ 2016-07-27 11:16     ` Arun Siluvery
  2016-07-27 11:41       ` Chris Wilson
  0 siblings, 1 reply; 28+ messages in thread
From: Arun Siluvery @ 2016-07-27 11:16 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala

On 26/07/2016 22:52, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:49PM +0100, Arun Siluvery wrote:
>> Now that we track reset progress using separate set of flags, update it to
>> account for engine reset as well.
>>
>> A bit corresponding to engine->id is set if a reset is in progress for
>> that engine. Bit 0 is for full gpu reset.
>>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h | 16 +++++++++++++++-
>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index 11436e7..125fafa 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1403,6 +1403,8 @@ struct i915_gpu_error {
>>   #define I915_RESET_IN_PROGRESS	0
>>   #define I915_WEDGED		(BITS_PER_LONG-1)
>>
>> +	unsigned long engine_reset_count[I915_NUM_ENGINES];
>> +
>>   	/**
>>   	 * Waitqueue to signal when a hang is detected. Used to for waiters
>>   	 * to release the struct_mutex for the reset to procede.
>> @@ -3194,9 +3196,10 @@ i915_gem_find_active_request(struct intel_engine_cs *engine);
>>   void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
>>   void i915_gem_retire_requests_ring(struct intel_engine_cs *engine);
>>
>> +/* indicates the progress of engine reset or full gpu reset */
>>   static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
>>   {
>> -	return test_bit(I915_RESET_IN_PROGRESS, &error->flags);
>> +	return unlikely(READ_ONCE(error->flags) & ~BIT(I915_WEDGED));
>>   }
>>
>>   static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
>> @@ -3214,6 +3217,17 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>>   	return READ_ONCE(error->reset_count);
>>   }
>>
>> +static inline bool i915_full_gpu_reset_in_progress(struct i915_gpu_error *error)
>> +{
>> +	return test_bit(I915_RESET_IN_PROGRESS, &error->flags);
>> +}
>> +
>> +static inline bool i915_engine_reset_in_progress(struct i915_gpu_error *error,
>> +						 struct intel_engine_cs *engine)
>> +{
>> +	return test_bit(engine->id + 1, &error->flags);
>> +}
>
> ?
>
> A totally unexplained change. If it is because you think you want to break
> waiters on struct_mutex, try again.
So you don't want error->flags to include engine reset bits?
ok, it should be possible to use engine_mask itself.

The next patch separates engine reset and full gpu reset into separate
functions; for branching purposes i915_full_gpu_reset_in_progress() is
added. Is this ok, or should we use test_bit() directly?

regards
Arun


> -Chris
>


* Re: [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset
  2016-07-27 11:16     ` Arun Siluvery
@ 2016-07-27 11:41       ` Chris Wilson
  2016-07-27 11:52         ` Arun Siluvery
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2016-07-27 11:41 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Wed, Jul 27, 2016 at 12:16:04PM +0100, Arun Siluvery wrote:
> On 26/07/2016 22:52, Chris Wilson wrote:
> >A totally unexplained change. If it is because you think you want to break
> >waiters on struct_mutex, try again.
> So you don't want error->flags to include engine reset bits?
> ok, it should be possible to use engine_mask itself.
> 
> The next patch separates engine reset and full gpu reset into separate
> functions; for branching purposes i915_full_gpu_reset_in_progress() is
> added. Is this ok, or should we use test_bit() directly?

The bit serves two functions: serialising error handling, and waking up
waiters on the struct_mutex. That second function is exposed through the
i915_reset_in_progress(), which is being altered here without explaining
how the changed semantics impacts the current users or why it is
necessary. imo we can do engine resets without struct_mutex.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery
  2016-07-26 21:51   ` Chris Wilson
@ 2016-07-27 11:49     ` Arun Siluvery
  0 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-27 11:49 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala, Tomas Elf

On 26/07/2016 22:51, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes the error
>> handler to recover from them. The error handler schedules recovery for
>> each of the hung engines. The recovery procedure is as follows:
>>   - force engine to idle: this is done by issuing a reset request
>>   - identifies the request that caused the hang and it is dropped
>>   - reset and re-init engine
>>   - restart submissions to the engine
>>
>> If engine reset fails then we fall back to a heavyweight full gpu reset
>> which resets all engines and reinitializes the complete state of HW and SW.
>>
>> Possible reasons for failure,
>>   - engine not ready for reset
>>   - if the HW and SW don't agree on the context that caused the hang
>>   - reset itself failed for some reason
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
>>   drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>   drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>   drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
>>   drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>>   5 files changed, 90 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 5c20e5d..8151aa9 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1824,17 +1824,60 @@ error:
>>    * Returns zero on successful reset or otherwise an error code.
>>    *
>>    * Procedure is fairly simple:
>> - *  - force engine to idle
>> - *  - save current state which includes head and current request
>> - *  - reset engine
>> - *  - restore saved state and resubmit context
>> + *    - force engine to idle: this is done by issuing a reset request
>> + *    - identifies the request that caused the hang and it is retired
>> + *    - reset engine
>> + *    - restart submissions to the engine
>>    */
>>   int i915_reset_engine(struct intel_engine_cs *engine)
>>   {
>> +	struct drm_i915_private *dev_priv = engine->i915;
>>   	int ret;
>>
>> -	/* FIXME: replace me with engine reset sequence */
>> -	ret = -ENODEV;
>> +	/* Ensure irq handler finishes or is cancelled. */
>> +	tasklet_kill(&engine->irq_tasklet);
>> +
>> +	i915_gem_reset_engine_status(engine);
>
> Not yet safe to be used lockless.
>
The engine is hung and intel_lrc_irq_handler() also won't be running at
this point, and we should've received all seqno updates; is it still not
safe? Do I really need struct_mutex for this? The only caller is
i915_gem_reset() and it takes it.

>> +	/*Take wake lock to prevent power saving mode */
>
> Why else would you be taking it? We know the wakelock prevents the GT
> power well from disappearing; the question is why that is important now
> across the entirety of this sequence. That's what you explain in the
> comments.
>
Yes, not useful. It is carried over from the initial patch, but I agree
I should've updated this.

>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> +	ret = intel_request_for_reset(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto out;
>> +	}
>> +
>> +	ret = intel_execlists_reset_prepare(engine);
>> +	if (ret)
>> +		goto enable_engine;
>> +
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		goto enable_engine;
>> +	}
>> +
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto out;
>> +
>> +	/* Restart submissions to the engine after reset */
>> +	intel_execlists_restart_submission(engine);
>
> Clumsy. See above as we already have an entry point that can initiate
> the hw.
>
I think this is a separate step from init_hw(): in init_hw() we
initialize the hw state and keep it ready; then we feed it submissions.

>> +enable_engine:
>> +	/*
>> +	 * we only need to enable engine if we cannot prepare engine for
>> +	 * reset or reset fails. If the reset is successful, engine gets
>> +	 * enabled automatically so we can skip this step.
>> +	 */
>> +	if (ret)
>
> Then why is the normal path coming through here?
>
>> +		intel_clear_reset_request(engine);
>> +
There is no harm in doing this even when the reset is successful, hence
it is included in the normal path. I will remove the ret check to avoid
confusion.

>> +out:
>> +	/* Wake up anything waiting on this engine's queue */
>> +	intel_engine_wakeup(engine);
>
> ? Random call with incorrect locking. What do you think you are
> achieving here because it's not what the comment implies.

Sorry, I will come back on this one; I need to better understand a few
more things.
>
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>
>>   	return ret;
>>   }
>> +/*
>> + * On gen8+ a reset request has to be issued via the reset control register
>> + * before a GPU engine can be reset in order to stop the command streamer
>> + * and idle the engine. This replaces the legacy way of stopping an engine
>> + * by writing to the stop ring bit in the MI_MODE register.
>> + */
>> +int intel_request_for_reset(struct intel_engine_cs *engine)
>
> intel_engine_reset_begin()
>
> It would be preferable if we could avoid using "request" here, as we are
> talking about resetting the hung request elsewhere in this chain.
>
The term "request" is overloaded, but here we are in fact requesting the
hw to prepare for reset and waiting for it to acknowledge.

intel_engine_reset_begin() sounds fine.
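[That request/acknowledge handshake can be modelled roughly as below;
the register and bit values are stand-ins for illustration, not verified
hardware definitions.]

```c
#include <assert.h>

/* Toy model of the gen8+ reset handshake: set the request bit in a
 * reset control register, then poll for the ready acknowledgement
 * before actually resetting the engine. */
#define RESET_CTL_REQUEST 0x1u
#define RESET_CTL_READY   0x2u

static unsigned int fake_reset_ctl; /* stands in for the MMIO register */

static void write_reset_ctl(unsigned int bits)
{
	fake_reset_ctl |= bits;
	/* Fake hardware: acknowledge the request immediately. */
	if (fake_reset_ctl & RESET_CTL_REQUEST)
		fake_reset_ctl |= RESET_CTL_READY;
}

static int engine_reset_begin(void)
{
	int retries;

	write_reset_ctl(RESET_CTL_REQUEST);
	for (retries = 0; retries < 700; retries++)
		if (fake_reset_ctl & RESET_CTL_READY)
			return 0;   /* engine idled, safe to reset */
	return -1;                  /* timed out: fall back to full reset */
}
```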

>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Engine Reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>
> ERROR?
This message is probably unnecessary, as we go for an engine reset only 
when it is available.
>
>> +		return -EINVAL;
> return -ENODEV;

ok.

>
>> +	}
>> +
>> +	return gen8_request_engine_reset(engine);
>> +}
>
> gen8_engine_reset_begin()
>
>> +/*
>> + * It is possible to back off from a previously issued reset request by simply
>> + * clearing the reset request bit in the reset control register.
>> + */
>> +int intel_clear_reset_request(struct intel_engine_cs *engine)
>
> intel_engine_reset_cancel()
>
>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>> +		return -EINVAL;
>> +	}
>> +
>> +	gen8_unrequest_engine_reset(engine);
>
> gen8_engine_reset_cancel()
>
> (only suggestions for avoiding having confusion over requests)

ok, thanks for the renaming suggestions.

regards
Arun

>
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset
  2016-07-27 11:41       ` Chris Wilson
@ 2016-07-27 11:52         ` Arun Siluvery
  0 siblings, 0 replies; 28+ messages in thread
From: Arun Siluvery @ 2016-07-27 11:52 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala

On 27/07/2016 12:41, Chris Wilson wrote:
> On Wed, Jul 27, 2016 at 12:16:04PM +0100, Arun Siluvery wrote:
>> On 26/07/2016 22:52, Chris Wilson wrote:
>>> A totally unexplained change. If it is because you think to want to break
>>> waiters on struct_mutex, try again.
>> So you don't want error->flags to include engine reset bits?
>> ok, it should be possible to use engine_mask itself.
>>
>> Next patch separates engine reset and full gpu reset in separate
>> functions, for branching purposes i915_full_gpu_reset_in_progress()
>> is added, is this ok or directly use test_bit() ?
>
> The bit serves 2 functions: serialise error handling, and waking up
> waiters on the struct_mutex. That second function is exposed through the
> i915_reset_in_progress(), which is being altered here without explaining
> how the changed semantics impacts the current users or why it is
> necessary. imo we can do engine resets without struct_mutex.
Yes, as you suggested I am not taking struct_mutex now, but adding the 
engine reset bits to error->flags was breaking the waiters. I will try 
to use engine_mask itself and keep error->flags unchanged.

regards
Arun

> -Chris
>



* Re: [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it
  2016-07-26 21:37   ` Chris Wilson
@ 2016-07-27 11:54     ` Arun Siluvery
  2016-07-27 12:27       ` Chris Wilson
  0 siblings, 1 reply; 28+ messages in thread
From: Arun Siluvery @ 2016-07-27 11:54 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala

On 26/07/2016 22:37, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:51PM +0100, Arun Siluvery wrote:
>> The current active request is the one that caused the hang so this is
>> retrieved and removed from elsp queue, otherwise we cannot submit other
>> workloads to be processed by GPU.
>>
>> A consistency check between HW and driver is performed to ensure that we
>> are dropping the correct request. Since this request doesn't get executed
>> anymore, we also need to advance the seqno to mark it as complete. Head
>> pointer is advanced to skip the offending batch so that HW resumes
>> execution other workloads. If HW and SW don't agree then we won't proceed
>> with engine reset, this is treated as an error condition and we fallback to
>> full gpu reset.
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/intel_lrc.h |   2 +
>>   2 files changed, 118 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index daf1279..8fc5a3b 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
>>   	i915_gem_context_put(ctx);
>>   }
>>
>> +static void intel_lr_context_resync(struct i915_gem_context *ctx,
>> +				    struct intel_engine_cs *engine)
>> +{
>> +	u32 head;
>> +	u32 head_addr, tail_addr;
>> +	u32 *reg_state;
>> +	struct intel_ringbuffer *ringbuf;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +
>> +	ringbuf = ctx->engine[engine->id].ringbuf;
>> +	reg_state = ctx->engine[engine->id].lrc_reg_state;
>> +
>> +	head = I915_READ_HEAD(engine);
>> +	head_addr = head & HEAD_ADDR;
>> +	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;
>
> ?
>
> We know where we want the head to be to emit the breadcrumb and complete
> the request since we can record that when constructing the request. That
> also neatly solves the riddle of how to update the hw state.

We want to skip only the MI_BATCH_BUFFER_START and continue as usual, 
so we are just using the existing info.
>
> resync?  intel_lr_context_reset_ring may be more apt, or maybe
> intel_execlists_reset_request?

It is called resync because we read the current state and update it. 
intel_execlists_reset_request() sounds better; I will change it as 
suggested. Thanks.

regards
Arun


> -Chris
>



* Re: [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it
  2016-07-27 11:54     ` Arun Siluvery
@ 2016-07-27 12:27       ` Chris Wilson
  2016-07-27 21:35         ` Resubmitting requests following the hang Chris Wilson
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2016-07-27 12:27 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx

On Wed, Jul 27, 2016 at 12:54:44PM +0100, Arun Siluvery wrote:
> On 26/07/2016 22:37, Chris Wilson wrote:
> >On Tue, Jul 26, 2016 at 05:40:51PM +0100, Arun Siluvery wrote:
> >>The current active request is the one that caused the hang so this is
> >>retrieved and removed from elsp queue, otherwise we cannot submit other
> >>workloads to be processed by GPU.
> >>
> >>A consistency check between HW and driver is performed to ensure that we
> >>are dropping the correct request. Since this request doesn't get executed
> >>anymore, we also need to advance the seqno to mark it as complete. Head
> >>pointer is advanced to skip the offending batch so that HW resumes
> >>execution other workloads. If HW and SW don't agree then we won't proceed
> >>with engine reset, this is treated as an error condition and we fallback to
> >>full gpu reset.
> >>
> >>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>---
> >>  drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
> >>  drivers/gpu/drm/i915/intel_lrc.h |   2 +
> >>  2 files changed, 118 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>index daf1279..8fc5a3b 100644
> >>--- a/drivers/gpu/drm/i915/intel_lrc.c
> >>+++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>@@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
> >>  	i915_gem_context_put(ctx);
> >>  }
> >>
> >>+static void intel_lr_context_resync(struct i915_gem_context *ctx,
> >>+				    struct intel_engine_cs *engine)
> >>+{
> >>+	u32 head;
> >>+	u32 head_addr, tail_addr;
> >>+	u32 *reg_state;
> >>+	struct intel_ringbuffer *ringbuf;
> >>+	struct drm_i915_private *dev_priv = engine->i915;
> >>+
> >>+	ringbuf = ctx->engine[engine->id].ringbuf;
> >>+	reg_state = ctx->engine[engine->id].lrc_reg_state;
> >>+
> >>+	head = I915_READ_HEAD(engine);
> >>+	head_addr = head & HEAD_ADDR;
> >>+	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;
> >
> >?
> >
> >We know where we want the head to be to emit the breadcrumb and complete
> >the request since we can record that when constructing the request. That
> >also neatly solves the riddle of how to update the hw state.
> 
> We want to skip only MI_BATCH_BUFFER_START and continue as usual so
> just using existing info.

That's exactly my point and why this approach is overkill since we
aleady know where we need to resume from.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Resubmitting requests following the hang
  2016-07-27 12:27       ` Chris Wilson
@ 2016-07-27 21:35         ` Chris Wilson
  2016-07-27 21:35           ` [PATCH 1/3] drm/i915: Record the position of the start of the request Chris Wilson
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Chris Wilson @ 2016-07-27 21:35 UTC (permalink / raw)
  To: intel-gfx

To clarify what I mean about knowing what values we need to write into
the ring following the reset, please consider these poc patches which
implement request recovery following the global reset.
-Chris



* [PATCH 1/3] drm/i915: Record the position of the start of the request
  2016-07-27 21:35         ` Resubmitting requests following the hang Chris Wilson
@ 2016-07-27 21:35           ` Chris Wilson
  2016-07-27 21:35           ` [PATCH 2/3] lrc Chris Wilson
  2016-07-27 21:36           ` [PATCH 3/3] reset-request-recovery Chris Wilson
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2016-07-27 21:35 UTC (permalink / raw)
  To: intel-gfx

Not only does it make for good documentation and debugging aide, but it
is also vital for when we want to unwind requests - such as when
throwing away an incomplete request.

v2: Rebase

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_gem_request.c | 13 +++++++++----
 drivers/gpu/drm/i915/i915_gpu_error.c   |  6 ++++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 309000af003d..aacce6625f84 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -557,6 +557,7 @@ struct drm_i915_error_state {
 		struct drm_i915_error_request {
 			long jiffies;
 			u32 seqno;
+			u32 head;
 			u32 tail;
 		} *requests;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 389a26429744..f50f25a27bd0 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -393,6 +393,13 @@ i915_gem_request_alloc(struct intel_engine_cs *engine,
 	if (ret)
 		goto err_ctx;
 
+	/* Record the position of the start of the request so that
+	 * should we detect the updated seqno part-way through the
+	 * GPU processing the request, we never over-estimate the
+	 * position of the head.
+	 */
+	req->head = req->ring->tail;
+
 	return req;
 
 err_ctx:
@@ -469,8 +476,6 @@ void __i915_add_request(struct drm_i915_gem_request *request,
 
 	trace_i915_gem_request_add(request);
 
-	request->head = request_start;
-
 	/* Whilst this request exists, batch_obj will be on the
 	 * active_list, and so will hold the active reference. Only when this
 	 * request is retired will the the batch_obj be moved onto the
@@ -490,10 +495,10 @@ void __i915_add_request(struct drm_i915_gem_request *request,
 	list_add_tail(&request->link, &engine->request_list);
 	list_add_tail(&request->ring_link, &ring->request_list);
 
-	/* Record the position of the start of the request so that
+	/* Record the position of the start of the breadcrumb so that
 	 * should we detect the updated seqno part-way through the
 	 * GPU processing the request, we never over-estimate the
-	 * position of the head.
+	 * position of the ring's HEAD.
 	 */
 	request->postfix = ring->tail;
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index cc28ad429dd8..b150147fc25b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -455,9 +455,10 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 				   dev_priv->engine[i].name,
 				   ee->num_requests);
 			for (j = 0; j < ee->num_requests; j++) {
-				err_printf(m, "  seqno 0x%08x, emitted %ld, tail 0x%08x\n",
+				err_printf(m, "  seqno 0x%08x, emitted %ld, head 0x%08x tail 0x%08x\n",
 					   ee->requests[j].seqno,
 					   ee->requests[j].jiffies,
+					   ee->requests[j].head,
 					   ee->requests[j].tail);
 			}
 		}
@@ -1205,7 +1206,8 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 			erq = &ee->requests[count++];
 			erq->seqno = request->fence.seqno;
 			erq->jiffies = request->emitted_jiffies;
-			erq->tail = request->postfix;
+			erq->head = request->head;
+			erq->tail = request->tail;
 		}
 	}
 }
-- 
2.8.1



* [PATCH 2/3] lrc
  2016-07-27 21:35         ` Resubmitting requests following the hang Chris Wilson
  2016-07-27 21:35           ` [PATCH 1/3] drm/i915: Record the position of the start of the request Chris Wilson
@ 2016-07-27 21:35           ` Chris Wilson
  2016-07-27 21:36           ` [PATCH 3/3] reset-request-recovery Chris Wilson
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2016-07-27 21:35 UTC (permalink / raw)
  To: intel-gfx

---
 drivers/gpu/drm/i915/i915_debugfs.c     |   2 +-
 drivers/gpu/drm/i915/i915_gem.c         |   6 +-
 drivers/gpu/drm/i915/i915_gem_request.h |  24 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 416 +++++++++++---------------------
 drivers/gpu/drm/i915/intel_lrc.h        |   2 -
 drivers/gpu/drm/i915/intel_ringbuffer.h |   7 +-
 6 files changed, 151 insertions(+), 306 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 920a2de95cd5..076c126ac08a 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -2181,7 +2181,7 @@ static int i915_execlists(struct seq_file *m, void *data)
 		status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(engine));
 		seq_printf(m, "\tStatus pointer: 0x%08X\n", status_pointer);
 
-		read_pointer = engine->next_context_status_buffer;
+		read_pointer = GEN8_CSB_READ_PTR(status_pointer);
 		write_pointer = GEN8_CSB_WRITE_PTR(status_pointer);
 		if (read_pointer > write_pointer)
 			write_pointer += GEN8_CSB_ENTRIES;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index bcfcc5af6b6e..f56fa8ac30bb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2462,7 +2462,11 @@ static void i915_gem_reset_engine_cleanup(struct intel_engine_cs *engine)
 		/* Ensure irq handler finishes or is cancelled. */
 		tasklet_kill(&engine->irq_tasklet);
 
-		intel_execlists_cancel_requests(engine);
+		INIT_LIST_HEAD(&engine->execlist_queue);
+		i915_gem_request_assign(&engine->execlist_port[0].request,
+					NULL);
+		i915_gem_request_assign(&engine->execlist_port[1].request,
+					NULL);
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index 3453df09befc..16dcf46bc6d0 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -83,6 +83,9 @@ struct drm_i915_gem_request {
 	/** Position in the ringbuffer of the end of the whole request */
 	u32 tail;
 
+	/** Position in the ringbuffer after the end of the whole request */
+	u32 wa_tail;
+
 	/** Preallocate space in the ringbuffer for the emitting the request */
 	u32 reserved_space;
 
@@ -119,27 +122,8 @@ struct drm_i915_gem_request {
 	/** process identifier submitting this request */
 	struct pid *pid;
 
-	/**
-	 * The ELSP only accepts two elements at a time, so we queue
-	 * context/tail pairs on a given queue (ring->execlist_queue) until the
-	 * hardware is available. The queue serves a double purpose: we also use
-	 * it to keep track of the up to 2 contexts currently in the hardware
-	 * (usually one in execution and the other queued up by the GPU): We
-	 * only remove elements from the head of the queue when the hardware
-	 * informs us that an element has been completed.
-	 *
-	 * All accesses to the queue are mediated by a spinlock
-	 * (ring->execlist_lock).
-	 */
-
-	/** Execlist link in the submission queue.*/
+	/** Link in the execlist submission queue, guarded by execlist_lock. */
 	struct list_head execlist_link;
-
-	/** Execlists no. of times this request has been sent to the ELSP */
-	int elsp_submitted;
-
-	/** Execlists context hardware id. */
-	unsigned int ctx_hw_id;
 };
 
 extern const struct fence_ops i915_fence_ops;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8ff987a89566..9be71a8a1d63 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -156,6 +156,11 @@
 #define GEN8_CTX_STATUS_COMPLETE	(1 << 4)
 #define GEN8_CTX_STATUS_LITE_RESTORE	(1 << 15)
 
+#define GEN8_CTX_STATUS_COMPLETED_MASK \
+	 (GEN8_CTX_STATUS_ACTIVE_IDLE | \
+	  GEN8_CTX_STATUS_PREEMPTED | \
+	  GEN8_CTX_STATUS_ELEMENT_SWITCH)
+
 #define CTX_LRI_HEADER_0		0x01
 #define CTX_CONTEXT_CONTROL		0x02
 #define CTX_RING_HEAD			0x04
@@ -263,12 +268,10 @@ logical_ring_init_platform_invariants(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 
-	if (IS_GEN8(dev_priv) || IS_GEN9(dev_priv))
-		engine->idle_lite_restore_wa = ~0;
-
-	engine->disable_lite_restore_wa = (IS_SKL_REVID(dev_priv, 0, SKL_REVID_B0) ||
-					IS_BXT_REVID(dev_priv, 0, BXT_REVID_A1)) &&
-					(engine->id == VCS || engine->id == VCS2);
+	engine->disable_lite_restore_wa =
+		(IS_SKL_REVID(dev_priv, 0, SKL_REVID_B0) ||
+		 IS_BXT_REVID(dev_priv, 0, BXT_REVID_A1)) &&
+		(engine->id == VCS || engine->id == VCS2);
 
 	engine->ctx_desc_template = GEN8_CTX_VALID;
 	if (IS_GEN8(dev_priv))
@@ -328,36 +331,6 @@ uint64_t intel_lr_context_descriptor(struct i915_gem_context *ctx,
 	return ctx->engine[engine->id].lrc_desc;
 }
 
-static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
-				 struct drm_i915_gem_request *rq1)
-{
-
-	struct intel_engine_cs *engine = rq0->engine;
-	struct drm_i915_private *dev_priv = rq0->i915;
-	uint64_t desc[2];
-
-	if (rq1) {
-		desc[1] = intel_lr_context_descriptor(rq1->ctx, rq1->engine);
-		rq1->elsp_submitted++;
-	} else {
-		desc[1] = 0;
-	}
-
-	desc[0] = intel_lr_context_descriptor(rq0->ctx, rq0->engine);
-	rq0->elsp_submitted++;
-
-	/* You must always write both descriptors in the order below. */
-	I915_WRITE_FW(RING_ELSP(engine), upper_32_bits(desc[1]));
-	I915_WRITE_FW(RING_ELSP(engine), lower_32_bits(desc[1]));
-
-	I915_WRITE_FW(RING_ELSP(engine), upper_32_bits(desc[0]));
-	/* The context is automatically loaded after the following */
-	I915_WRITE_FW(RING_ELSP(engine), lower_32_bits(desc[0]));
-
-	/* ELSP is a wo register, use another nearby reg for posting */
-	POSTING_READ_FW(RING_EXECLIST_STATUS_LO(engine));
-}
-
 static void
 execlists_update_context_pdps(struct i915_hw_ppgtt *ppgtt, u32 *reg_state)
 {
@@ -367,13 +340,12 @@ execlists_update_context_pdps(struct i915_hw_ppgtt *ppgtt, u32 *reg_state)
 	ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
 }
 
-static void execlists_update_context(struct drm_i915_gem_request *rq)
+static u64 execlists_update_context(struct drm_i915_gem_request *rq)
 {
-	struct intel_engine_cs *engine = rq->engine;
+	struct intel_context *ce = &rq->ctx->engine[rq->engine->id];
 	struct i915_hw_ppgtt *ppgtt = rq->ctx->ppgtt;
-	uint32_t *reg_state = rq->ctx->engine[engine->id].lrc_reg_state;
 
-	reg_state[CTX_RING_TAIL+1] = rq->tail % (rq->ring->size - 1);
+	ce->lrc_reg_state[CTX_RING_TAIL+1] = rq->tail % (rq->ring->size - 1);
 
 	/* True 32b PPGTT with dynamic page allocation: update PDP
 	 * registers and point the unallocated PDPs to scratch page.
@@ -381,32 +353,14 @@ static void execlists_update_context(struct drm_i915_gem_request *rq)
 	 * in 48-bit mode.
 	 */
 	if (ppgtt && !USES_FULL_48BIT_PPGTT(ppgtt->base.dev))
-		execlists_update_context_pdps(ppgtt, reg_state);
-}
-
-static void execlists_elsp_submit_contexts(struct drm_i915_gem_request *rq0,
-					   struct drm_i915_gem_request *rq1)
-{
-	struct drm_i915_private *dev_priv = rq0->i915;
-	unsigned int fw_domains = rq0->engine->fw_domains;
+		execlists_update_context_pdps(ppgtt, ce->lrc_reg_state);
 
-	execlists_update_context(rq0);
-
-	if (rq1)
-		execlists_update_context(rq1);
-
-	spin_lock_irq(&dev_priv->uncore.lock);
-	intel_uncore_forcewake_get__locked(dev_priv, fw_domains);
-
-	execlists_elsp_write(rq0, rq1);
-
-	intel_uncore_forcewake_put__locked(dev_priv, fw_domains);
-	spin_unlock_irq(&dev_priv->uncore.lock);
+	return ce->lrc_desc;
 }
 
-static inline void execlists_context_status_change(
-		struct drm_i915_gem_request *rq,
-		unsigned long status)
+static inline void
+execlists_context_status_change(struct drm_i915_gem_request *rq,
+				unsigned long status)
 {
 	/*
 	 * Only used when GVT-g is enabled now. When GVT-g is disabled,
@@ -418,122 +372,106 @@ static inline void execlists_context_status_change(
 	atomic_notifier_call_chain(&rq->ctx->status_notifier, status, rq);
 }
 
-static void execlists_unqueue(struct intel_engine_cs *engine)
+static void execlists_submit_ports(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
-	struct drm_i915_gem_request *cursor, *tmp;
-
-	assert_spin_locked(&engine->execlist_lock);
-
-	/*
-	 * If irqs are not active generate a warning as batches that finish
-	 * without the irqs may get lost and a GPU Hang may occur.
-	 */
-	WARN_ON(!intel_irqs_enabled(engine->i915));
-
-	/* Try to read in pairs */
-	list_for_each_entry_safe(cursor, tmp, &engine->execlist_queue,
-				 execlist_link) {
-		if (!req0) {
-			req0 = cursor;
-		} else if (req0->ctx == cursor->ctx) {
-			/* Same ctx: ignore first request, as second request
-			 * will update tail past first request's workload */
-			cursor->elsp_submitted = req0->elsp_submitted;
-			list_del(&req0->execlist_link);
-			i915_gem_request_put(req0);
-			req0 = cursor;
-		} else {
-			if (IS_ENABLED(CONFIG_DRM_I915_GVT)) {
-				/*
-				 * req0 (after merged) ctx requires single
-				 * submission, stop picking
-				 */
-				if (req0->ctx->execlists_force_single_submission)
-					break;
-				/*
-				 * req0 ctx doesn't require single submission,
-				 * but next req ctx requires, stop picking
-				 */
-				if (cursor->ctx->execlists_force_single_submission)
-					break;
-			}
-			req1 = cursor;
-			WARN_ON(req1->elsp_submitted);
-			break;
-		}
-	}
-
-	if (unlikely(!req0))
-		return;
+	struct drm_i915_private *dev_priv = engine->i915;
+	u32 *elsp = dev_priv->regs + i915_mmio_reg_offset(RING_ELSP(engine));
+	u64 desc[2];
 
-	execlists_context_status_change(req0, INTEL_CONTEXT_SCHEDULE_IN);
+	if (!engine->execlist_port[0].count)
+		execlists_context_status_change(engine->execlist_port[0].request,
+						INTEL_CONTEXT_SCHEDULE_IN);
+	desc[0] = execlists_update_context(engine->execlist_port[0].request);
+	engine->preempt_wa = engine->execlist_port[0].count++;
 
-	if (req1)
-		execlists_context_status_change(req1,
+	if (engine->execlist_port[1].request) {
+		GEM_BUG_ON(engine->execlist_port[1].count);
+		execlists_context_status_change(engine->execlist_port[1].request,
 						INTEL_CONTEXT_SCHEDULE_IN);
+		desc[1] = execlists_update_context(engine->execlist_port[1].request);
+		engine->execlist_port[1].count = 1;
+	} else
+		desc[1] = 0;
 
-	if (req0->elsp_submitted & engine->idle_lite_restore_wa) {
-		/*
-		 * WaIdleLiteRestore: make sure we never cause a lite restore
-		 * with HEAD==TAIL.
-		 *
-		 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL as we
-		 * resubmit the request. See gen8_emit_request() for where we
-		 * prepare the padding after the end of the request.
-		 */
-		req0->tail += 8;
-		req0->tail &= req0->ring->size - 1;
-	}
+	/* You must always write both descriptors in the order below. */
+	writel(upper_32_bits(desc[1]), elsp);
+	writel(lower_32_bits(desc[1]), elsp);
 
-	execlists_elsp_submit_contexts(req0, req1);
+	writel(upper_32_bits(desc[0]), elsp);
+	/* The context is automatically loaded after the following */
+	writel(lower_32_bits(desc[0]), elsp);
 }
 
-static unsigned int
-execlists_check_remove_request(struct intel_engine_cs *engine, u32 ctx_id)
+static bool merge_ctx(struct i915_gem_context *prev,
+		      struct i915_gem_context *next)
 {
-	struct drm_i915_gem_request *head_req;
+	if (prev != next)
+		return false;
 
-	assert_spin_locked(&engine->execlist_lock);
+	if (IS_ENABLED(CONFIG_DRM_I915_GVT) &&
+	    prev->execlists_force_single_submission)
+		return false;
 
-	head_req = list_first_entry_or_null(&engine->execlist_queue,
-					    struct drm_i915_gem_request,
-					    execlist_link);
+	return true;
+}
 
-	if (WARN_ON(!head_req || (head_req->ctx_hw_id != ctx_id)))
-               return 0;
+static void execlists_context_unqueue(struct intel_engine_cs *engine)
+{
+	struct drm_i915_gem_request *cursor, *last;
+	struct execlist_port *port = engine->execlist_port;
+	bool submit = false;
+
+	last = port->request;
+	if (last != NULL) {
+		/* WaIdleLiteRestore:bdw,skl
+		 * Apply the wa NOOPs to prevent ring:HEAD == req:TAIL
+		 * as we resubmit the request. See gen8_emit_request()
+		 * for where we prepare the padding after the end of the
+		 * request.
+		 */
+		last->tail = last->wa_tail;
+	}
 
-	WARN(head_req->elsp_submitted == 0, "Never submitted head request\n");
+	/* Try to read in pairs and fill both submission ports */
+	spin_lock(&engine->execlist_lock);
+	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link) {
+		if (last && !merge_ctx(cursor->ctx, last->ctx)) {
+			if (port != engine->execlist_port)
+				break;
 
-	if (--head_req->elsp_submitted > 0)
-		return 0;
+			if (IS_ENABLED(CONFIG_DRM_I915_GVT) &&
+			    cursor->ctx->execlists_force_single_submission)
+				break;
 
-	execlists_context_status_change(head_req, INTEL_CONTEXT_SCHEDULE_OUT);
+			i915_gem_request_assign(&port->request, last);
+			port++;
 
-	list_del(&head_req->execlist_link);
-	i915_gem_request_put(head_req);
+		}
+		last = cursor;
+		submit = true;
+	}
+	if (submit) {
+		i915_gem_request_assign(&port->request, last);
+		engine->execlist_queue.next = &cursor->execlist_link;
+		cursor->execlist_link.prev = &engine->execlist_queue;
+	}
+	spin_unlock(&engine->execlist_lock);
 
-	return 1;
+	if (submit)
+		execlists_submit_ports(engine);
 }
 
-static u32
-get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer,
-		   u32 *context_id)
+static bool execlists_elsp_idle(struct intel_engine_cs *engine)
 {
-	struct drm_i915_private *dev_priv = engine->i915;
-	u32 status;
-
-	read_pointer %= GEN8_CSB_ENTRIES;
-
-	status = I915_READ_FW(RING_CONTEXT_STATUS_BUF_LO(engine, read_pointer));
-
-	if (status & GEN8_CTX_STATUS_IDLE_ACTIVE)
-		return 0;
-
-	*context_id = I915_READ_FW(RING_CONTEXT_STATUS_BUF_HI(engine,
-							      read_pointer));
+	return engine->execlist_port[0].request == NULL;
+}
 
-	return status;
+static bool execlists_elsp_ready(struct intel_engine_cs *engine)
+{
+	if (engine->disable_lite_restore_wa || engine->preempt_wa)
+		return engine->execlist_port[0].request == NULL;
+	else
+		return engine->execlist_port[1].request == NULL;
 }
 
 /*
@@ -544,101 +482,61 @@ static void intel_lrc_irq_handler(unsigned long data)
 {
 	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
 	struct drm_i915_private *dev_priv = engine->i915;
-	u32 status_pointer;
-	unsigned int read_pointer, write_pointer;
-	u32 csb[GEN8_CSB_ENTRIES][2];
-	unsigned int csb_read = 0, i;
-	unsigned int submit_contexts = 0;
 
 	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
 
-	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
-
-	read_pointer = engine->next_context_status_buffer;
-	write_pointer = GEN8_CSB_WRITE_PTR(status_pointer);
-	if (read_pointer > write_pointer)
-		write_pointer += GEN8_CSB_ENTRIES;
-
-	while (read_pointer < write_pointer) {
-		if (WARN_ON_ONCE(csb_read == GEN8_CSB_ENTRIES))
-			break;
-		csb[csb_read][0] = get_context_status(engine, ++read_pointer,
-						      &csb[csb_read][1]);
-		csb_read++;
-	}
-
-	engine->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
-
-	/* Update the read pointer to the old write pointer. Manual ringbuffer
-	 * management ftw </sarcasm> */
-	I915_WRITE_FW(RING_CONTEXT_STATUS_PTR(engine),
-		      _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
-				    engine->next_context_status_buffer << 8));
-
-	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
-
-	spin_lock(&engine->execlist_lock);
-
-	for (i = 0; i < csb_read; i++) {
-		if (unlikely(csb[i][0] & GEN8_CTX_STATUS_PREEMPTED)) {
-			if (csb[i][0] & GEN8_CTX_STATUS_LITE_RESTORE) {
-				if (execlists_check_remove_request(engine, csb[i][1]))
-					WARN(1, "Lite Restored request removed from queue\n");
-			} else
-				WARN(1, "Preemption without Lite Restore\n");
+	if (!execlists_elsp_idle(engine)) {
+		u32 *ring_mmio =
+			dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine));
+		u32 *csb_mmio =
+			dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_BUF_LO(engine, 0));
+		unsigned ring, head, tail;
+
+		ring = readl(ring_mmio);
+		head = GEN8_CSB_READ_PTR(ring);
+		tail = GEN8_CSB_WRITE_PTR(ring);
+		if (tail < head)
+			tail += GEN8_CSB_ENTRIES;
+		while (head < tail) {
+			unsigned idx = ++head % GEN8_CSB_ENTRIES;
+			unsigned status = readl(&csb_mmio[2*idx]);
+
+			if (status & GEN8_CTX_STATUS_COMPLETED_MASK) {
+				GEM_BUG_ON(engine->execlist_port[0].count == 0);
+				if (--engine->execlist_port[0].count == 0) {
+					GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
+					execlists_context_status_change(engine->execlist_port[0].request,
+									INTEL_CONTEXT_SCHEDULE_OUT);
+					i915_gem_request_put(engine->execlist_port[0].request);
+					engine->execlist_port[0] = engine->execlist_port[1];
+					memset(&engine->execlist_port[1], 0,
+					       sizeof(engine->execlist_port[1]));
+					engine->preempt_wa = false;
+				}
+			}
+			GEM_BUG_ON(engine->execlist_port[0].count == 0 &&
+				   !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
 		}
 
-		if (csb[i][0] & (GEN8_CTX_STATUS_ACTIVE_IDLE |
-		    GEN8_CTX_STATUS_ELEMENT_SWITCH))
-			submit_contexts +=
-				execlists_check_remove_request(engine, csb[i][1]);
+		writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
+				     GEN8_CSB_WRITE_PTR(ring) << 8),
+		       ring_mmio);
 	}
 
-	if (submit_contexts) {
-		if (!engine->disable_lite_restore_wa ||
-		    (csb[i][0] & GEN8_CTX_STATUS_ACTIVE_IDLE))
-			execlists_unqueue(engine);
-	}
-
-	spin_unlock(&engine->execlist_lock);
+	if (execlists_elsp_ready(engine))
+		execlists_context_unqueue(engine);
 
-	if (unlikely(submit_contexts > 2))
-		DRM_ERROR("More than two context complete events?\n");
+	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
 }
 
 static void execlists_submit_request(struct drm_i915_gem_request *request)
 {
 	struct intel_engine_cs *engine = request->engine;
-	struct drm_i915_gem_request *cursor;
-	int num_elements = 0;
 
 	spin_lock_bh(&engine->execlist_lock);
-
-	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link)
-		if (++num_elements > 2)
-			break;
-
-	if (num_elements > 2) {
-		struct drm_i915_gem_request *tail_req;
-
-		tail_req = list_last_entry(&engine->execlist_queue,
-					   struct drm_i915_gem_request,
-					   execlist_link);
-
-		if (request->ctx == tail_req->ctx) {
-			WARN(tail_req->elsp_submitted != 0,
-				"More than 2 already-submitted reqs queued\n");
-			list_del(&tail_req->execlist_link);
-			i915_gem_request_put(tail_req);
-		}
-	}
-
-	i915_gem_request_get(request);
 	list_add_tail(&request->execlist_link, &engine->execlist_queue);
-	request->ctx_hw_id = request->ctx->hw_id;
-	if (num_elements == 0)
-		execlists_unqueue(engine);
-
+	if (execlists_elsp_idle(engine))
+		tasklet_hi_schedule(&engine->irq_tasklet);
 	spin_unlock_bh(&engine->execlist_lock);
 }
 
@@ -731,6 +629,7 @@ intel_logical_ring_advance(struct drm_i915_gem_request *request)
 	intel_ring_emit(ring, MI_NOOP);
 	intel_ring_emit(ring, MI_NOOP);
 	intel_ring_advance(ring);
+	request->wa_tail = request->ring->tail;
 
 	/* We keep the previous context alive until we retire the following
 	 * request. This ensures that any the context object is still pinned
@@ -743,23 +642,6 @@ intel_logical_ring_advance(struct drm_i915_gem_request *request)
 	return 0;
 }
 
-void intel_execlists_cancel_requests(struct intel_engine_cs *engine)
-{
-	struct drm_i915_gem_request *req, *tmp;
-	LIST_HEAD(cancel_list);
-
-	WARN_ON(!mutex_is_locked(&engine->i915->drm.struct_mutex));
-
-	spin_lock_bh(&engine->execlist_lock);
-	list_replace_init(&engine->execlist_queue, &cancel_list);
-	spin_unlock_bh(&engine->execlist_lock);
-
-	list_for_each_entry_safe(req, tmp, &cancel_list, execlist_link) {
-		list_del(&req->execlist_link);
-		i915_gem_request_put(req);
-	}
-}
-
 void intel_logical_ring_stop(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
@@ -1310,7 +1192,6 @@ static void lrc_init_hws(struct intel_engine_cs *engine)
 static int gen8_init_common_ring(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
-	unsigned int next_context_status_buffer_hw;
 
 	lrc_init_hws(engine);
 
@@ -1323,30 +1204,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 		   _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
 	POSTING_READ(RING_MODE_GEN7(engine));
 
-	/*
-	 * Instead of resetting the Context Status Buffer (CSB) read pointer to
-	 * zero, we need to read the write pointer from hardware and use its
-	 * value because "this register is power context save restored".
-	 * Effectively, these states have been observed:
-	 *
-	 *      | Suspend-to-idle (freeze) | Suspend-to-RAM (mem) |
-	 * BDW  | CSB regs not reset       | CSB regs reset       |
-	 * CHT  | CSB regs not reset       | CSB regs not reset   |
-	 * SKL  |         ?                |         ?            |
-	 * BXT  |         ?                |         ?            |
-	 */
-	next_context_status_buffer_hw =
-		GEN8_CSB_WRITE_PTR(I915_READ(RING_CONTEXT_STATUS_PTR(engine)));
-
-	/*
-	 * When the CSB registers are reset (also after power-up / gpu reset),
-	 * CSB write pointer is set to all 1's, which is not valid, use '5' in
-	 * this special case, so the first element read is CSB[0].
-	 */
-	if (next_context_status_buffer_hw == GEN8_CSB_PTR_MASK)
-		next_context_status_buffer_hw = (GEN8_CSB_ENTRIES - 1);
-
-	engine->next_context_status_buffer = next_context_status_buffer_hw;
 	DRM_DEBUG_DRIVER("Execlists enabled for %s\n", engine->name);
 
 	intel_engine_init_hangcheck(engine);
@@ -1735,7 +1592,6 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	}
 	intel_lr_context_unpin(dev_priv->kernel_context, engine);
 
-	engine->idle_lite_restore_wa = 0;
 	engine->disable_lite_restore_wa = false;
 	engine->ctx_desc_template = 0;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index a1d4e7f9c64c..a4b90834ecbd 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -96,6 +96,4 @@ uint64_t intel_lr_context_descriptor(struct i915_gem_context *ctx,
 int intel_sanitize_enable_execlists(struct drm_i915_private *dev_priv,
 				    int enable_execlists);
 
-void intel_execlists_cancel_requests(struct intel_engine_cs *engine);
-
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index c119e1dedbce..a5cee335881a 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -287,11 +287,14 @@ struct intel_engine_cs {
 	/* Execlists */
 	struct tasklet_struct irq_tasklet;
 	spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */
+	struct execlist_port {
+		struct drm_i915_gem_request *request;
+		unsigned count;
+	} execlist_port[2];
 	struct list_head execlist_queue;
 	unsigned int fw_domains;
-	unsigned int next_context_status_buffer;
-	unsigned int idle_lite_restore_wa;
 	bool disable_lite_restore_wa;
+	bool preempt_wa;
 	u32 ctx_desc_template;
 
 	/**
-- 
2.8.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

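The CSB drain loop in the hunk above copes with a hardware write pointer that has wrapped past the software read pointer by unwrapping the span before iterating, then consuming entries starting at head + 1. A minimal standalone model of just that pointer arithmetic (illustrative names and a 6-entry buffer per GEN8_CSB_ENTRIES; this is a sketch, not driver code):

```c
#define CSB_ENTRIES 6 /* GEN8_CSB_ENTRIES in the driver */

/* Sketch of the drain loop above: unwrap the [head, tail] span, then
 * read each entry at (index mod CSB_ENTRIES). The first entry consumed
 * is head + 1, matching the ++head in the hunk. */
static unsigned csb_drain(unsigned head, unsigned tail,
			  const unsigned *csb, unsigned *out)
{
	unsigned n = 0;

	if (tail < head)
		tail += CSB_ENTRIES; /* write pointer wrapped past read pointer */

	while (head < tail)
		out[n++] = csb[++head % CSB_ENTRIES];

	return n;
}
```

For example, with head = 4 and tail = 1 the writer has wrapped, so the three fresh entries are read from slots 5, 0 and 1 in that order; with head == tail nothing is consumed.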

* [PATCH 3/3] reset-request-recovery
  2016-07-27 21:35         ` Resubmitting requests following the hang Chris Wilson
  2016-07-27 21:35           ` [PATCH 1/3] drm/i915: Record the position of the start of the request Chris Wilson
  2016-07-27 21:35           ` [PATCH 2/3] lrc Chris Wilson
@ 2016-07-27 21:36           ` Chris Wilson
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2016-07-27 21:36 UTC (permalink / raw)
  To: intel-gfx

---
 drivers/gpu/drm/i915/i915_drv.c         |  4 +-
 drivers/gpu/drm/i915/i915_drv.h         |  2 +-
 drivers/gpu/drm/i915/i915_gem.c         | 66 +++++++++++++++++++++------------
 drivers/gpu/drm/i915/i915_gem_context.c | 16 --------
 drivers/gpu/drm/i915/intel_lrc.c        | 34 ++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.c | 32 ++++++++++------
 drivers/gpu/drm/i915/intel_ringbuffer.h |  2 +
 7 files changed, 100 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 869baa6a5196..d250dce0c8fb 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1574,7 +1574,7 @@ static int i915_drm_resume(struct drm_device *dev)
 	mutex_lock(&dev->struct_mutex);
 	if (i915_gem_init_hw(dev)) {
 		DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
-		atomic_or(I915_WEDGED, &dev_priv->gpu_error.reset_counter);
+		i915_gem_set_wedged(dev_priv);
 	}
 	mutex_unlock(&dev->struct_mutex);
 
@@ -1798,7 +1798,7 @@ int i915_reset(struct drm_i915_private *dev_priv)
 	return 0;
 
 error:
-	atomic_or(I915_WEDGED, &error->reset_counter);
+	i915_gem_set_wedged(dev_priv);
 	mutex_unlock(&dev->struct_mutex);
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index aacce6625f84..08c212a5f1a8 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3224,6 +3224,7 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 }
 
 void i915_gem_reset(struct drm_device *dev);
+void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
 int __must_check i915_gem_init(struct drm_device *dev);
 int __must_check i915_gem_init_hw(struct drm_device *dev);
@@ -3343,7 +3344,6 @@ void i915_gem_object_save_bit_17_swizzle(struct drm_i915_gem_object *obj);
 int __must_check i915_gem_context_init(struct drm_device *dev);
 void i915_gem_context_lost(struct drm_i915_private *dev_priv);
 void i915_gem_context_fini(struct drm_device *dev);
-void i915_gem_context_reset(struct drm_device *dev);
 int i915_gem_context_open(struct drm_device *dev, struct drm_file *file);
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
 int i915_switch_context(struct drm_i915_gem_request *req);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f56fa8ac30bb..1e5c62476585 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2426,23 +2426,57 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
 	return NULL;
 }
 
-static void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
+static void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
+	struct i915_gem_context *incomplete_ctx;
 	bool ring_hung;
 
+	/* Ensure irq handler finishes or is cancelled. */
+	tasklet_kill(&engine->irq_tasklet);
+
 	request = i915_gem_find_active_request(engine);
-	if (request == NULL)
+	if (!request)
 		return;
 
 	ring_hung = engine->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG;
-
 	i915_set_reset_status(request->ctx, ring_hung);
-	list_for_each_entry_continue(request, &engine->request_list, link)
+	engine->reset_hw(engine, request);
+
+	incomplete_ctx = request->ctx;
+	if (i915_gem_context_is_default(incomplete_ctx))
+		incomplete_ctx = NULL;
+	list_for_each_entry_continue(request, &engine->request_list, link) {
+		void *vaddr = request->ring->vaddr;
+		u32 head;
+
+		if (request->ctx != incomplete_ctx)
+			continue;
+
+		head = request->head;
+		if (request->postfix < head) {
+			memset(vaddr + head, 0, request->ring->size - head);
+			head = 0;
+		}
+		memset(vaddr + head, 0, request->postfix - head);
+
 		i915_set_reset_status(request->ctx, false);
+	}
+}
+
+void i915_gem_reset(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = to_i915(dev);
+	struct intel_engine_cs *engine;
+
+	for_each_engine(engine, dev_priv)
+		i915_gem_reset_engine(engine);
+
+	i915_gem_context_lost(dev_priv);
+	i915_gem_restore_fences(dev);
 }
 
-static void i915_gem_reset_engine_cleanup(struct intel_engine_cs *engine)
+static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
 {
 	struct intel_ring *ring;
 
@@ -2459,9 +2493,6 @@ static void i915_gem_reset_engine_cleanup(struct intel_engine_cs *engine)
 	 */
 
 	if (i915.enable_execlists) {
-		/* Ensure irq handler finishes or is cancelled. */
-		tasklet_kill(&engine->irq_tasklet);
-
 		INIT_LIST_HEAD(&engine->execlist_queue);
 		i915_gem_request_assign(&engine->execlist_port[0].request,
 					NULL);
@@ -2501,26 +2532,13 @@ static void i915_gem_reset_engine_cleanup(struct intel_engine_cs *engine)
 	engine->i915->gt.active_engines &= ~intel_engine_flag(engine);
 }
 
-void i915_gem_reset(struct drm_device *dev)
+void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
 {
-	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct intel_engine_cs *engine;
 
-	/*
-	 * Before we free the objects from the requests, we need to inspect
-	 * them for finding the guilty party. As the requests only borrow
-	 * their reference to the objects, the inspection must be done first.
-	 */
-	for_each_engine(engine, dev_priv)
-		i915_gem_reset_engine_status(engine);
-
 	for_each_engine(engine, dev_priv)
-		i915_gem_reset_engine_cleanup(engine);
+		i915_gem_cleanup_engine(engine);
 	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
-
-	i915_gem_context_reset(dev);
-
-	i915_gem_restore_fences(dev);
 }
 
 static void
@@ -4346,7 +4364,7 @@ int i915_gem_init(struct drm_device *dev)
 		 * for all other failure, such as an allocation failure, bail.
 		 */
 		DRM_ERROR("Failed to initialize GPU, declaring it wedged\n");
-		atomic_or(I915_WEDGED, &dev_priv->gpu_error.reset_counter);
+		i915_gem_set_wedged(dev_priv);
 		ret = 0;
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index bb72af5320b0..a5b3ba4cca4a 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -405,22 +405,6 @@ static void i915_gem_context_unpin(struct i915_gem_context *ctx,
 	}
 }
 
-void i915_gem_context_reset(struct drm_device *dev)
-{
-	struct drm_i915_private *dev_priv = to_i915(dev);
-
-	lockdep_assert_held(&dev->struct_mutex);
-
-	if (i915.enable_execlists) {
-		struct i915_gem_context *ctx;
-
-		list_for_each_entry(ctx, &dev_priv->context_list, link)
-			intel_lr_context_reset(dev_priv, ctx);
-	}
-
-	i915_gem_context_lost(dev_priv);
-}
-
 int i915_gem_context_init(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 9be71a8a1d63..e24ce2d5cf90 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1193,6 +1193,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 
+	intel_mocs_init_engine(engine);
 	lrc_init_hws(engine);
 
 	I915_WRITE_IMR(engine,
@@ -1208,7 +1209,10 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	intel_engine_init_hangcheck(engine);
 
-	return intel_mocs_init_engine(engine);
+	if (engine->execlist_port[0].request)
+		execlists_submit_ports(engine);
+
+	return 0;
 }
 
 static int gen8_init_render_ring(struct intel_engine_cs *engine)
@@ -1244,6 +1248,33 @@ static int gen9_init_render_ring(struct intel_engine_cs *engine)
 	return init_workarounds_ring(engine);
 }
 
+static void reset_common_ring(struct intel_engine_cs *engine,
+			      struct drm_i915_gem_request *request)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct i915_gem_context *ctx = request->ctx;
+	struct intel_context *ce = &ctx->engine[engine->id];
+	u32 *reg_state;
+
+	reg_state = ce->lrc_reg_state;
+	reg_state[CTX_RING_HEAD+1] = request->postfix;
+
+	request->ring->head = request->postfix;
+	request->ring->last_retired_head = -1;
+	intel_ring_update_space(request->ring);
+
+	if (request == engine->execlist_port[1].request) {
+		i915_gem_request_put(engine->execlist_port[0].request);
+		engine->execlist_port[0] = engine->execlist_port[1];
+		memset(&engine->execlist_port[1], 0,
+		       sizeof(engine->execlist_port[1]));
+	}
+
+	engine->execlist_port[0].count = 0;
+
+	I915_WRITE(RING_CONTEXT_STATUS_PTR(engine), _MASKED_FIELD(0xffff, 0));
+}
+
 static int intel_logical_ring_emit_pdps(struct drm_i915_gem_request *req)
 {
 	struct i915_hw_ppgtt *ppgtt = req->ctx->ppgtt;
@@ -1604,6 +1635,7 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 {
 	/* Default vfuncs which can be overriden by each engine. */
 	engine->init_hw = gen8_init_common_ring;
+	engine->reset_hw = reset_common_ring;
 	engine->emit_flush = gen8_emit_flush;
 	engine->emit_request = gen8_emit_request;
 	engine->submit_request = execlists_submit_request;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 54ec2faa0bf2..4730b1f178fd 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -577,34 +577,31 @@ static int init_ring_common(struct intel_engine_cs *engine)
 	if (I915_READ_HEAD(engine))
 		DRM_DEBUG("%s initialization failed [head=%08x], fudging\n",
 			  engine->name, I915_READ_HEAD(engine));
-	I915_WRITE_HEAD(engine, 0);
-	(void)I915_READ_HEAD(engine);
+	I915_WRITE_HEAD(engine, ring->head);
+	I915_WRITE_TAIL(engine, ring->tail);
+	(void)I915_READ_TAIL(engine);
 
 	I915_WRITE_CTL(engine,
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
 			| RING_VALID);
 
 	/* If the head is still not zero, the ring is dead */
-	if (wait_for((I915_READ_CTL(engine) & RING_VALID) != 0 &&
-		     I915_READ_START(engine) == i915_gem_obj_ggtt_offset(obj) &&
-		     (I915_READ_HEAD(engine) & HEAD_ADDR) == 0, 50)) {
+	if (intel_wait_for_register_fw(dev_priv, RING_CTL(engine->mmio_base),
+				       RING_VALID, RING_VALID,
+				       50)) {
 		DRM_ERROR("%s initialization failed "
-			  "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08lx]\n",
+			  "ctl %08x (valid? %d) head %08x [%08x] tail %08x [%08x] start %08x [expected %08lx]\n",
 			  engine->name,
 			  I915_READ_CTL(engine),
 			  I915_READ_CTL(engine) & RING_VALID,
-			  I915_READ_HEAD(engine), I915_READ_TAIL(engine),
+			  I915_READ_HEAD(engine), ring->head,
+			  I915_READ_TAIL(engine), ring->tail,
 			  I915_READ_START(engine),
 			  (unsigned long)i915_gem_obj_ggtt_offset(obj));
 		ret = -EIO;
 		goto out;
 	}
 
-	ring->last_retired_head = -1;
-	ring->head = I915_READ_HEAD(engine);
-	ring->tail = I915_READ_TAIL(engine) & TAIL_ADDR;
-	intel_ring_update_space(ring);
-
 	intel_engine_init_hangcheck(engine);
 
 out:
@@ -613,6 +610,16 @@ out:
 	return ret;
 }
 
+static void reset_ring_common(struct intel_engine_cs *engine,
+			      struct drm_i915_gem_request *request)
+{
+	struct intel_ring *ring = request->ring;
+
+	ring->head = request->postfix;
+	ring->last_retired_head = -1;
+	intel_ring_update_space(ring);
+}
+
 void intel_fini_pipe_control(struct intel_engine_cs *engine)
 {
 	if (engine->scratch.obj == NULL)
@@ -2767,6 +2774,7 @@ static void intel_ring_default_vfuncs(struct drm_i915_private *dev_priv,
 	intel_ring_init_semaphores(dev_priv, engine);
 
 	engine->init_hw = init_ring_common;
+	engine->reset_hw = reset_ring_common;
 
 	engine->emit_request = i9xx_emit_request;
 	if (i915.semaphores)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index a5cee335881a..092e5b27652f 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -203,6 +203,8 @@ struct intel_engine_cs {
 	void		(*irq_disable)(struct intel_engine_cs *engine);
 
 	int		(*init_hw)(struct intel_engine_cs *engine);
+	void		(*reset_hw)(struct intel_engine_cs *engine,
+				    struct drm_i915_gem_request *req);
 
 	int		(*init_context)(struct drm_i915_gem_request *req);
 
-- 
2.8.1
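The reset path in i915_gem_reset_engine() above skips each remaining request from the guilty context by zeroing its payload in the ringbuffer, splitting the clear when the payload wraps past the end of the ring. A standalone sketch of just that wrap-aware clear (illustrative names, not the driver's helpers):

```c
#include <string.h>

/* Sketch of the clear in the hunk above: zero the circular-ring span
 * [head, postfix), using two memsets when the request payload wraps
 * past the end of the buffer. */
static void clear_request_payload(unsigned char *ring, unsigned size,
				  unsigned head, unsigned postfix)
{
	if (postfix < head) {			/* payload wraps the ring */
		memset(ring + head, 0, size - head);
		head = 0;
	}
	memset(ring + head, 0, postfix - head);
}
```

With an 8-byte ring, head = 6 and postfix = 2 clears bytes 6, 7, 0 and 1 while leaving 2..5 untouched; head == postfix clears nothing.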


* ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches (rev4)
  2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
                   ` (11 preceding siblings ...)
  2016-07-26 17:11 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches Patchwork
@ 2016-07-28  5:40 ` Patchwork
  12 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2016-07-28  5:40 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: Execlist based Engine reset patches (rev4)
URL   : https://patchwork.freedesktop.org/series/10283/
State : failure

== Summary ==

Applying: drm/i915: Update i915.reset to handle engine resets
Applying: drm/i915: Separate out reset flags from the reset counter
Applying: drm/i915/tdr: Update reset_in_progress to account for engine reset
Applying: drm/i915/tdr: Modify error handler for per engine hang recovery
Applying: reset-request-recovery
fatal: sha1 information is lacking or useless (drivers/gpu/drm/i915/i915_drv.h).
error: could not build fake ancestor
Patch failed at 0005 reset-request-recovery
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


end of thread, other threads:[~2016-07-28  5:40 UTC | newest]

Thread overview: 28+ messages
2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
2016-07-26 16:40 ` [PATCH 01/11] drm/i915: Update i915.reset to handle engine resets Arun Siluvery
2016-07-26 16:40 ` [PATCH 02/11] drm/i915: Separate out reset flags from the reset counter Arun Siluvery
2016-07-26 16:40 ` [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset Arun Siluvery
2016-07-26 21:52   ` Chris Wilson
2016-07-27 11:16     ` Arun Siluvery
2016-07-27 11:41       ` Chris Wilson
2016-07-27 11:52         ` Arun Siluvery
2016-07-26 16:40 ` [PATCH 04/11] drm/i915/tdr: Modify error handler for per engine hang recovery Arun Siluvery
2016-07-26 16:40 ` [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it Arun Siluvery
2016-07-26 21:37   ` Chris Wilson
2016-07-27 11:54     ` Arun Siluvery
2016-07-27 12:27       ` Chris Wilson
2016-07-27 21:35         ` Resubmitting requests following the hang Chris Wilson
2016-07-27 21:35           ` [PATCH 1/3] drm/i915: Record the position of the start of the request Chris Wilson
2016-07-27 21:35           ` [PATCH 2/3] lrc Chris Wilson
2016-07-27 21:36           ` [PATCH 3/3] reset-request-recovery Chris Wilson
2016-07-26 16:40 ` [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset Arun Siluvery
2016-07-26 21:32   ` Chris Wilson
2016-07-26 16:40 ` [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery Arun Siluvery
2016-07-26 21:51   ` Chris Wilson
2016-07-27 11:49     ` Arun Siluvery
2016-07-26 16:40 ` [PATCH 08/11] drm/i915: Skip reset request if there is one already Arun Siluvery
2016-07-26 16:40 ` [PATCH 09/11] drm/i915/tdr: Add engine reset count to error state Arun Siluvery
2016-07-26 16:40 ` [PATCH 10/11] drm/i915/tdr: Export reset count info to debugfs Arun Siluvery
2016-07-26 16:40 ` [PATCH 11/11] drm/i915/tdr: Enable Engine reset and recovery support Arun Siluvery
2016-07-26 17:11 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches Patchwork
2016-07-28  5:40 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches (rev4) Patchwork
