public inbox for intel-gfx@lists.freedesktop.org
* [PATCH v2 1/2] drm/i915: Empty the ring before disabling
@ 2017-10-27 10:40 Chris Wilson
  2017-10-27 10:40 ` [PATCH v2 2/2] drm/i915: Abandon the reset if we fail to stop the engines Chris Wilson
  2017-10-27 11:26 ` ✗ Fi.CI.BAT: failure for series starting with [v2,1/2] drm/i915: Empty the ring before disabling Patchwork
  0 siblings, 2 replies; 5+ messages in thread
From: Chris Wilson @ 2017-10-27 10:40 UTC (permalink / raw)
  To: intel-gfx

An interesting snippet from the Sandybridge PRM:

"Although a Ring Buffer can be enabled in the non-empty state, it must
not be disabled unless it is empty. Attempting to disable a Ring Buffer
in the non-empty state is UNDEFINED."

Let's avoid this undefined behaviour when we disable the rings prior to
reset and resume.

v2: Tell HEAD to catch up to TAIL (empty ring) first, then reset both to
0 (supposedly while stopped).
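
To make the ordering concrete, here is a minimal sketch in C that runs the old and the v2 sequences against a hypothetical in-memory model of HEAD, TAIL and CTL. `struct ring_model`, `stop_ring_v1()` and `stop_ring_v2()` are illustrative names, not i915 code. The model encodes only the PRM rule quoted above: disabling a ring whose HEAD differs from its TAIL is flagged as undefined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the ring registers; not the i915 API. */
struct ring_model {
	uint32_t head;  /* RING_HEAD: next offset the CS will fetch */
	uint32_t tail;  /* RING_TAIL: end of the valid commands */
	uint32_t ctl;   /* RING_CTL: non-zero while the ring is enabled */
	bool undefined; /* set if a non-empty ring was ever disabled */
};

static bool ring_empty(const struct ring_model *r)
{
	return r->head == r->tail;
}

/* Per the PRM quote, disabling is only defined for an empty ring. */
static void ring_disable(struct ring_model *r)
{
	if (!ring_empty(r))
		r->undefined = true;
	r->ctl = 0;
}

/* Old ordering: disable first, then park HEAD and TAIL at 0. */
static bool stop_ring_v1(struct ring_model *r)
{
	ring_disable(r);
	r->head = 0;
	r->tail = 0;
	return r->head == 0;
}

/* v2 ordering: HEAD catches up to TAIL, both are parked at 0, then disable. */
static bool stop_ring_v2(struct ring_model *r)
{
	r->head = r->tail; /* the ring is now empty */
	r->head = 0;
	r->tail = 0;
	assert(ring_empty(r)); /* must be empty before it is disabled */
	ring_disable(r);
	return r->head == 0;
}
```

With HEAD at 0x40 and TAIL at 0x100, the v1 order sets `undefined`, while the v2 order only disables the ring once HEAD == TAIL.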

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
More investigatory stabs at the shards.
-Chris
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 6 +++++-
 drivers/gpu/drm/i915/intel_uncore.c     | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 05e01446b00b..47fadf8da84e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -480,10 +480,14 @@ static bool stop_ring(struct intel_engine_cs *engine)
 		}
 	}
 
-	I915_WRITE_CTL(engine, 0);
+	I915_WRITE_HEAD(engine, I915_READ_TAIL(engine));
+
 	I915_WRITE_HEAD(engine, 0);
 	I915_WRITE_TAIL(engine, 0);
 
+	/* The ring must be empty before it is disabled */
+	I915_WRITE_CTL(engine, 0);
+
 	return (I915_READ_HEAD(engine) & HEAD_ADDR) == 0;
 }
 
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 20e3c65c0999..96ee6b2754be 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1387,10 +1387,14 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
 		DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
 				 engine->name);
 
-	I915_WRITE_FW(RING_CTL(base), 0);
+	I915_WRITE_FW(RING_HEAD(base), I915_READ_FW(RING_TAIL(base)));
+
 	I915_WRITE_FW(RING_HEAD(base), 0);
 	I915_WRITE_FW(RING_TAIL(base), 0);
 
+	/* The ring must be empty before it is disabled */
+	I915_WRITE_FW(RING_CTL(base), 0);
+
 	/* Check acts as a post */
 	if (I915_READ_FW(RING_HEAD(base)) != 0)
 		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
-- 
2.15.0.rc2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v2 2/2] drm/i915: Abandon the reset if we fail to stop the engines
  2017-10-27 10:40 [PATCH v2 1/2] drm/i915: Empty the ring before disabling Chris Wilson
@ 2017-10-27 10:40 ` Chris Wilson
  2017-10-27 12:18   ` Mika Kuoppala
  2017-10-27 11:26 ` ✗ Fi.CI.BAT: failure for series starting with [v2,1/2] drm/i915: Empty the ring before disabling Patchwork
  1 sibling, 1 reply; 5+ messages in thread
From: Chris Wilson @ 2017-10-27 10:40 UTC (permalink / raw)
  To: intel-gfx

Some machines, *cough* snb *cough*, fail catastrophically if asked to
reset the GPU under certain conditions. The initial guess is that this
happens when the rings are still busy at the time of the reset request
(a pattern we've seen elsewhere, which is why we already try
i915_stop_engines() before reset). So, if i915_stop_engines() fails,
abandon the reset and leave the device wedged.
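
The control flow added here can be sketched as follows. The types and helpers (`struct engine`, `stop_engine()`, `stop_engines()`, `gpu_reset()`) are hypothetical stand-ins for `intel_engine_cs`, `gen3_stop_engine()`, `i915_stop_engines()` and `intel_gpu_reset()`. The point it illustrates is that the per-engine results are combined with `&=`, which does not short-circuit, so every engine is still stopped after one fails, and the reset is skipped with `-EIO` if any engine stayed busy.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct intel_engine_cs. */
struct engine {
	const char *name;
	bool idles_after_stop; /* simulated: does it report idle once stopped? */
	bool stopped;          /* set once stop_engine() has run on it */
};

/* Stand-in for gen3_stop_engine(): stop, then report idleness. */
static bool stop_engine(struct engine *e)
{
	e->stopped = true;
	return e->idles_after_stop;
}

/* Stand-in for i915_stop_engines(): `&=` does not short-circuit, so every
 * engine is stopped even after an earlier one has failed. */
static bool stop_engines(struct engine *engines, size_t count)
{
	bool idle = true;

	for (size_t i = 0; i < count; i++)
		idle &= stop_engine(&engines[i]);
	if (idle)
		return true;

	fprintf(stderr, "Failed to stop all engines\n");
	for (size_t i = 0; i < count; i++)
		fprintf(stderr, "  %s: %s\n", engines[i].name,
			engines[i].idles_after_stop ? "idle" : "busy");
	return false;
}

/* Stand-in for intel_gpu_reset(): only reset if every engine went idle. */
static int gpu_reset(struct engine *engines, size_t count, int *resets_issued)
{
	assert(resets_issued != NULL);

	if (!stop_engines(engines, count))
		return -EIO; /* abandon the reset; the device stays wedged */

	(*resets_issued)++; /* the real code would poke the reset hardware here */
	return 0;
}
```

Using `&&` instead of `&=` in the loop would stop stopping engines at the first failure, leaving the remaining rings running into the reset.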

v2: Only give up if not idle after emptying the ring.
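
The v2 bailout hinges on polling MI_MODE until MODE_IDLE is set. As used in the diff, intel_wait_for_register_fw(dev_priv, reg, mask, value, timeout) appears to wait until `(read(reg) & mask) == value` and to return non-zero on timeout. The sketch below reproduces only that mask/value comparison against a simulated register, counting polls instead of milliseconds. `wait_for_register()`, `struct fake_mode` and the bit position in `FAKE_MODE_IDLE` are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define FAKE_MODE_IDLE (1u << 9)

/* Simulated MI_MODE: reports idle only after `idle_after` reads. */
struct fake_mode {
	unsigned int reads;
	unsigned int idle_after;
};

static uint32_t fake_mode_read(struct fake_mode *m)
{
	return ++m->reads > m->idle_after ? FAKE_MODE_IDLE : 0;
}

/* Poll until (reg & mask) == value; 0 on success, -ETIMEDOUT otherwise.
 * Real code would delay between reads and bound the wait in time. */
static int wait_for_register(struct fake_mode *m, uint32_t mask,
			     uint32_t value, unsigned int max_polls)
{
	assert((value & ~mask) == 0); /* bits outside mask could never match */

	for (unsigned int i = 0; i < max_polls; i++) {
		if ((fake_mode_read(m) & mask) == value)
			return 0;
	}
	return -ETIMEDOUT;
}
```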

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103240
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 96ee6b2754be..f80dbff3595f 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1372,7 +1372,7 @@ int i915_reg_read_ioctl(struct drm_device *dev,
 	return ret;
 }
 
-static void gen3_stop_engine(struct intel_engine_cs *engine)
+static bool gen3_stop_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	const u32 base = engine->mmio_base;
@@ -1392,26 +1392,46 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
 	I915_WRITE_FW(RING_HEAD(base), 0);
 	I915_WRITE_FW(RING_TAIL(base), 0);
 
+	/* Check acts as a post */
+	if (intel_wait_for_register_fw(dev_priv,
+				       mode,
+				       MODE_IDLE,
+				       MODE_IDLE,
+				       1000)) {
+		DRM_DEBUG_DRIVER("%s: timed out after clearing ring\n",
+				 engine->name);
+		return false;
+	}
+
 	/* The ring must be empty before it is disabled */
 	I915_WRITE_FW(RING_CTL(base), 0);
+	POSTING_READ_FW(RING_CTL(base));
 
-	/* Check acts as a post */
-	if (I915_READ_FW(RING_HEAD(base)) != 0)
-		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
-				 engine->name);
+	return true;
 }
 
-static void i915_stop_engines(struct drm_i915_private *dev_priv,
-			      unsigned engine_mask)
+static int i915_stop_engines(struct drm_i915_private *dev_priv,
+			     unsigned engine_mask)
 {
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
+	bool idle;
 
 	if (INTEL_GEN(dev_priv) < 3)
-		return;
+		return true;
 
+	idle = true;
 	for_each_engine_masked(engine, dev_priv, engine_mask, id)
-		gen3_stop_engine(engine);
+		idle &= gen3_stop_engine(engine);
+	if (idle)
+		return true;
+
+	dev_err(dev_priv->drm.dev, "Failed to stop all engines\n");
+	for_each_engine_masked(engine, dev_priv, engine_mask, id) {
+		struct drm_printer p = drm_debug_printer(__func__);
+		intel_engine_dump(engine, &p);
+	}
+	return false;
 }
 
 static bool i915_reset_complete(struct pci_dev *pdev)
@@ -1772,7 +1792,10 @@ int intel_gpu_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
 		 *
 		 * FIXME: Wa for more modern gens needs to be validated
 		 */
-		i915_stop_engines(dev_priv, engine_mask);
+		if (!i915_stop_engines(dev_priv, engine_mask)) {
+			ret = -EIO;
+			break;
+		}
 
 		ret = -ENODEV;
 		if (reset)
-- 
2.15.0.rc2



* ✗ Fi.CI.BAT: failure for series starting with [v2,1/2] drm/i915: Empty the ring before disabling
  2017-10-27 10:40 [PATCH v2 1/2] drm/i915: Empty the ring before disabling Chris Wilson
  2017-10-27 10:40 ` [PATCH v2 2/2] drm/i915: Abandon the reset if we fail to stop the engines Chris Wilson
@ 2017-10-27 11:26 ` Patchwork
  1 sibling, 0 replies; 5+ messages in thread
From: Patchwork @ 2017-10-27 11:26 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [v2,1/2] drm/i915: Empty the ring before disabling
URL   : https://patchwork.freedesktop.org/series/32747/
State : failure

== Summary ==

Series 32747v1 series starting with [v2,1/2] drm/i915: Empty the ring before disabling
https://patchwork.freedesktop.org/api/1.0/series/32747/revisions/1/mbox/

Test chamelium:
        Subgroup dp-crc-fast:
                pass       -> FAIL       (fi-kbl-7500u) fdo#102514
Test core_auth:
        Subgroup basic-auth:
                pass       -> SKIP       (fi-ilk-650)
Test core_prop_blob:
        Subgroup basic:
                pass       -> SKIP       (fi-ilk-650)
Test debugfs_test:
        Subgroup read_all_entries:
                pass       -> SKIP       (fi-ilk-650)
Test drv_getparams_basic:
        Subgroup basic-eu-total:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-subslice-total:
                pass       -> SKIP       (fi-ilk-650)
Test drv_hangman:
        Subgroup error-state-basic:
                pass       -> SKIP       (fi-ilk-650)
Test gem_basic:
        Subgroup bad-close:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup create-close:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup create-fd-close:
                pass       -> SKIP       (fi-ilk-650)
Test gem_busy:
        Subgroup basic-busy-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-hang-default:
                pass       -> SKIP       (fi-ilk-650)
Test gem_close_race:
        Subgroup basic-process:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-threads:
                pass       -> SKIP       (fi-ilk-650)
Test gem_cpu_reloc:
        Subgroup basic:
                pass       -> SKIP       (fi-ilk-650)
Test gem_cs_tlb:
        Subgroup basic-default:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_basic:
        Subgroup basic-bsd:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-render:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup gtt-bsd:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup gtt-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup gtt-render:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup readonly-bsd:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup readonly-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup readonly-render:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_create:
        Subgroup basic:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_fence:
        Subgroup basic-busy-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wait-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-await-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup await-hang-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup nb-await-default:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-batch-kernel-default-wb:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-pro-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-prw-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-ro-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-rw-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-set-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-pro-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-prw-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-ro-before-default:
                pass       -> SKIP       (fi-ilk-650)
WARNING: Long output truncated

1db1a27d38d789ce2886db512828ae306c998bc3 drm-tip: 2017y-10m-27d-07h-23m-09s UTC integration manifest
c4198883fdfb drm/i915: Abandon the reset if we fail to stop the engines
a7c4ebfa4c41 drm/i915: Empty the ring before disabling

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6228/


* Re: [PATCH v2 2/2] drm/i915: Abandon the reset if we fail to stop the engines
  2017-10-27 10:40 ` [PATCH v2 2/2] drm/i915: Abandon the reset if we fail to stop the engines Chris Wilson
@ 2017-10-27 12:18   ` Mika Kuoppala
  2017-10-27 12:34     ` Chris Wilson
  0 siblings, 1 reply; 5+ messages in thread
From: Mika Kuoppala @ 2017-10-27 12:18 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Some machines, *cough* snb *cough*, fail catastrophically if asked to
> reset the GPU under certain conditions. The initial guess is that this
> is when the rings are still busy at the time of the reset request
> (because that's a pattern we've seen elsewhere, hence why we do try
> gen3_stop_engines() before reset) so abandon the reset and leave the
> device wedged, if gen3_stop_engines() fails.
>
> v2: Only give up if not idle after emptying the ring.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103240
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 96ee6b2754be..f80dbff3595f 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1372,7 +1372,7 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>  	return ret;
>  }
>  
> -static void gen3_stop_engine(struct intel_engine_cs *engine)
> +static bool gen3_stop_engine(struct intel_engine_cs *engine)
>  {
>  	struct drm_i915_private *dev_priv = engine->i915;
>  	const u32 base = engine->mmio_base;
> @@ -1392,26 +1392,46 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
>  	I915_WRITE_FW(RING_HEAD(base), 0);
>  	I915_WRITE_FW(RING_TAIL(base), 0);
>  
> +	/* Check acts as a post */
> +	if (intel_wait_for_register_fw(dev_priv,
> +				       mode,
> +				       MODE_IDLE,
> +				       MODE_IDLE,
> +				       1000)) {
> +		DRM_DEBUG_DRIVER("%s: timed out after clearing ring\n",
> +				 engine->name);
> +		return false;
> +	}
> +

I recall that this bailout was the reason I didn't want to
use stop_ring() from intel_ringbuffer.c.

Now, if you choose to reintroduce the bailout, I think you can
make a generic stop_engine() and get rid of the copy.

-Mika
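
Mika's suggestion could look roughly like the sketch below: one stop sequence, including the idle bailout, written against a table of register accessors so that an I915_READ/I915_WRITE caller and a _FW caller could share it. `struct reg_ops`, `generic_stop_engine()`, the `ring_reg` enum, the bit in `FAKE_MODE_IDLE` and the in-memory `fake_regs` backend are all hypothetical; none of this is existing i915 API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ring_reg { REG_HEAD, REG_TAIL, REG_CTL, REG_MODE, REG_COUNT };

#define FAKE_MODE_IDLE (1u << 9) /* assumed bit, for illustration only */

/* Hypothetical accessor table: one instance could wrap the plain
 * accessors, another the _FW variants, so both share one sequence. */
struct reg_ops {
	uint32_t (*read)(void *ctx, enum ring_reg reg);
	void (*write)(void *ctx, enum ring_reg reg, uint32_t val);
};

static bool generic_stop_engine(const struct reg_ops *ops, void *ctx,
				unsigned int max_polls)
{
	unsigned int i;

	assert(ops && ops->read && ops->write);

	/* Empty the ring, then park HEAD and TAIL at 0. */
	ops->write(ctx, REG_HEAD, ops->read(ctx, REG_TAIL));
	ops->write(ctx, REG_HEAD, 0);
	ops->write(ctx, REG_TAIL, 0);

	/* The bailout: give up if the engine never reports idle. */
	for (i = 0; i < max_polls; i++)
		if (ops->read(ctx, REG_MODE) & FAKE_MODE_IDLE)
			break;
	if (i == max_polls)
		return false;

	/* The ring is empty, so disabling it is defined. */
	ops->write(ctx, REG_CTL, 0);
	return ops->read(ctx, REG_HEAD) == 0;
}

/* In-memory register file used as the ctx for illustration. */
struct fake_regs { uint32_t r[REG_COUNT]; };

static uint32_t fake_read(void *ctx, enum ring_reg reg)
{
	return ((struct fake_regs *)ctx)->r[reg];
}

static void fake_write(void *ctx, enum ring_reg reg, uint32_t val)
{
	((struct fake_regs *)ctx)->r[reg] = val;
}

static const struct reg_ops fake_ops = { fake_read, fake_write };
```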

>  	/* The ring must be empty before it is disabled */
>  	I915_WRITE_FW(RING_CTL(base), 0);
> +	POSTING_READ_FW(RING_CTL(base));
>  
> -	/* Check acts as a post */
> -	if (I915_READ_FW(RING_HEAD(base)) != 0)
> -		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
> -				 engine->name);
> +	return true;
>  }
>  
> -static void i915_stop_engines(struct drm_i915_private *dev_priv,
> -			      unsigned engine_mask)
> +static int i915_stop_engines(struct drm_i915_private *dev_priv,
> +			     unsigned engine_mask)
>  {
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
> +	bool idle;
>  
>  	if (INTEL_GEN(dev_priv) < 3)
> -		return;
> +		return true;
>  
> +	idle = true;
>  	for_each_engine_masked(engine, dev_priv, engine_mask, id)
> -		gen3_stop_engine(engine);
> +		idle &= gen3_stop_engine(engine);
> +	if (idle)
> +		return true;
> +
> +	dev_err(dev_priv->drm.dev, "Failed to stop all engines\n");
> +	for_each_engine_masked(engine, dev_priv, engine_mask, id) {
> +		struct drm_printer p = drm_debug_printer(__func__);
> +		intel_engine_dump(engine, &p);
> +	}
> +	return false;
>  }
>  
>  static bool i915_reset_complete(struct pci_dev *pdev)
> @@ -1772,7 +1792,10 @@ int intel_gpu_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
>  		 *
>  		 * FIXME: Wa for more modern gens needs to be validated
>  		 */
> -		i915_stop_engines(dev_priv, engine_mask);
> +		if (!i915_stop_engines(dev_priv, engine_mask)) {
> +			ret = -EIO;
> +			break;
> +		}
>  
>  		ret = -ENODEV;
>  		if (reset)
> -- 
> 2.15.0.rc2


* Re: [PATCH v2 2/2] drm/i915: Abandon the reset if we fail to stop the engines
  2017-10-27 12:18   ` Mika Kuoppala
@ 2017-10-27 12:34     ` Chris Wilson
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Wilson @ 2017-10-27 12:34 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2017-10-27 13:18:44)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Some machines, *cough* snb *cough*, fail catastrophically if asked to
> > reset the GPU under certain conditions. The initial guess is that this
> > is when the rings are still busy at the time of the reset request
> > (because that's a pattern we've seen elsewhere, hence why we do try
> > gen3_stop_engines() before reset) so abandon the reset and leave the
> > device wedged, if gen3_stop_engines() fails.
> >
> > v2: Only give up if not idle after emptying the ring.
> >
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103240
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++---------
> >  1 file changed, 33 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > index 96ee6b2754be..f80dbff3595f 100644
> > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > @@ -1372,7 +1372,7 @@ int i915_reg_read_ioctl(struct drm_device *dev,
> >       return ret;
> >  }
> >  
> > -static void gen3_stop_engine(struct intel_engine_cs *engine)
> > +static bool gen3_stop_engine(struct intel_engine_cs *engine)
> >  {
> >       struct drm_i915_private *dev_priv = engine->i915;
> >       const u32 base = engine->mmio_base;
> > @@ -1392,26 +1392,46 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
> >       I915_WRITE_FW(RING_HEAD(base), 0);
> >       I915_WRITE_FW(RING_TAIL(base), 0);
> >  
> > +     /* Check acts as a post */
> > +     if (intel_wait_for_register_fw(dev_priv,
> > +                                    mode,
> > +                                    MODE_IDLE,
> > +                                    MODE_IDLE,
> > +                                    1000)) {
> > +             DRM_DEBUG_DRIVER("%s: timed out after clearing ring\n",
> > +                              engine->name);
> > +             return false;
> > +     }
> > +
> 
> I recall that this bailout was the reason I didn't want to
> use the stop_ring in intel_ringbuffer.c
> 
> Now if you choose to reintroduce the bailout, I think you can
> make a generic stop_engine and get rid of the copy.

This bailout isn't looking too hot. It certainly prevents snb from
eating itself in this situation, but it looks too general and the
collateral damage is unacceptable.

At the moment, one lead I have is that the write to RING_CTL isn't
being applied. That may indeed relate to Ville's observation that it
only takes effect when the CP hits an arbitration point. But in the
igt, we use... oh no we don't, not for the injected hang.
-Chris
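
The lead above, that the RING_CTL write may only be applied once the CP reaches an arbitration point, suggests checking the readback rather than trusting a single posting read. A minimal sketch under that assumption: `struct lazy_reg` is a made-up model in which a written value is latched only after a set number of later reads (standing in for reaching an arbitration point), and `write_and_confirm()` is a hypothetical helper, not i915 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up model: a write is held pending and only latched after
 * `reads_until_latch` further reads, standing in for the CP reaching
 * an arbitration point. Not a description of real hardware timing. */
struct lazy_reg {
	uint32_t current;  /* value visible on readback */
	uint32_t pending;  /* value written but not yet applied */
	bool has_pending;
	unsigned int reads_until_latch;
};

static void lazy_write(struct lazy_reg *r, uint32_t val, unsigned int delay)
{
	r->pending = val;
	r->has_pending = true;
	r->reads_until_latch = delay;
}

static uint32_t lazy_read(struct lazy_reg *r)
{
	if (r->has_pending) {
		if (r->reads_until_latch == 0) {
			r->current = r->pending;
			r->has_pending = false;
		} else {
			r->reads_until_latch--;
		}
	}
	return r->current;
}

/* Write, then poll the readback instead of trusting one posting read.
 * `delay` only drives the simulation. */
static bool write_and_confirm(struct lazy_reg *r, uint32_t val,
			      unsigned int delay, unsigned int max_polls)
{
	assert(max_polls > 0);

	lazy_write(r, val, delay);
	for (unsigned int i = 0; i < max_polls; i++) {
		if (lazy_read(r) == val)
			return true;
	}
	return false;
}
```

Under this model a single posting read (`max_polls == 1`) reports failure even though the write lands a few reads later, which is one way a write that "doesn't take" could look from the driver's side.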


end of thread

Thread overview: 5+ messages
2017-10-27 10:40 [PATCH v2 1/2] drm/i915: Empty the ring before disabling Chris Wilson
2017-10-27 10:40 ` [PATCH v2 2/2] drm/i915: Abandon the reset if we fail to stop the engines Chris Wilson
2017-10-27 12:18   ` Mika Kuoppala
2017-10-27 12:34     ` Chris Wilson
2017-10-27 11:26 ` ✗ Fi.CI.BAT: failure for series starting with [v2,1/2] drm/i915: Empty the ring before disabling Patchwork
