public inbox for intel-gfx@lists.freedesktop.org
* [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset
@ 2017-02-04 19:37 Chris Wilson
  2017-02-04 19:46 ` Chris Wilson
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Chris Wilson @ 2017-02-04 19:37 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Following a reset, the context and page directory registers are lost.
However, the queue of requests that we resubmit after the reset may
depend upon them - the registers are restored from a context image, but
that restore may be inhibited and may simply be absent from the request
if it was in the middle of a sequence using the same context. If we
prime the CCID/PD registers with the first request in the queue (even
for the hung request), we prevent invalid memory access for the
following requests (and continually hung engines).

Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---

This could do with going to stable but requires a few odds and ends, such
as dma_fence_set_error(). Oh well, fortunately it is not as bad as it might
seem since these registers are restored from the context - but that then
requires a mesa context to reset the GPU state (as fortunately we called
MI_SET_CONTEXT at the start of every batch!), but any other request in
the meantime will likely hang again.

(Also I left gen8/ringbuffer reset_hw as an exercise for the reader)
-Chris

---
 drivers/gpu/drm/i915/i915_gem.c         | 18 +++++++-----------
 drivers/gpu/drm/i915/intel_lrc.c        |  6 +++++-
 drivers/gpu/drm/i915/intel_ringbuffer.c | 30 +++++++++++++++++++++++++++---
 3 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fe32b1edbda2..7a16c859c875 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2735,21 +2735,17 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
 		engine->irq_seqno_barrier(engine);
 
 	request = i915_gem_find_active_request(engine);
-	if (!request)
-		return;
-
-	if (!i915_gem_reset_request(request))
-		return;
+	if (request && i915_gem_reset_request(request)) {
+		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
+				 engine->name, request->global_seqno);
 
-	DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
-			 engine->name, request->global_seqno);
+		/* If this context is now banned, skip all pending requests. */
+		if (i915_gem_context_is_banned(request->ctx))
+			engine_skip_context(request);
+	}
 
 	/* Setup the CS to resume from the breadcrumb of the hung request */
 	engine->reset_hw(engine, request);
-
-	/* If this context is now banned, skip all of its pending requests. */
-	if (i915_gem_context_is_banned(request->ctx))
-		engine_skip_context(request);
 }
 
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 44a92ea464ba..f44dd42e46b5 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1324,7 +1324,10 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct drm_i915_gem_request *request)
 {
 	struct execlist_port *port = engine->execlist_port;
-	struct intel_context *ce = &request->ctx->engine[engine->id];
+	struct intel_context *ce;
+
+	if (!request || request->fence.error != -EIO)
+		return;
 
 	/* We want a simple context + ring to execute the breadcrumb update.
 	 * We cannot rely on the context being intact across the GPU hang,
@@ -1333,6 +1336,7 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
+	ce = &request->ctx->engine[engine->id];
 	execlists_init_reg_state(ce->lrc_reg_state,
 				 request->ctx, engine, ce->ring);
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index d32cbba25d98..b30c1dc3061c 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -599,10 +599,34 @@ static int init_ring_common(struct intel_engine_cs *engine)
 static void reset_ring_common(struct intel_engine_cs *engine,
 			      struct drm_i915_gem_request *request)
 {
-	struct intel_ring *ring = request->ring;
+	if (request) {
+		struct drm_i915_private *dev_priv = request->i915;
+		struct intel_context *ce = &request->ctx->engine[engine->id];
+		struct i915_hw_ppgtt *ppgtt;
+
+		if (ce->state) {
+			I915_WRITE(CCID,
+				   i915_ggtt_offset(ce->state) |
+				   BIT(3) | BIT(2) | CCID_EN);
+		}
 
-	ring->head = request->postfix;
-	ring->last_retired_head = -1;
+		ppgtt = request->ctx->ppgtt ?: engine->i915->mm.aliasing_ppgtt;
+		if (ppgtt) {
+			u32 pd_offset = ppgtt->pd.base.ggtt_offset << 10;
+
+			I915_WRITE(RING_PP_DIR_DCLV(engine), PP_DIR_DCLV_2G);
+			I915_WRITE(RING_PP_DIR_BASE(engine), pd_offset);
+		}
+
+		if (request->fence.error == -EIO) {
+			struct intel_ring *ring = request->ring;
+
+			ring->head = request->postfix;
+			ring->last_retired_head = -1;
+		}
+	} else {
+		engine->legacy_active_context = NULL;
+	}
 }
 
 static int intel_ring_workarounds_emit(struct drm_i915_gem_request *req)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-04 19:37 [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset Chris Wilson
@ 2017-02-04 19:46 ` Chris Wilson
  2017-02-06  9:10 ` ✗ Fi.CI.BAT: failure for " Patchwork
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-02-04 19:46 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

On Sat, Feb 04, 2017 at 07:37:13PM +0000, Chris Wilson wrote:
> Following a reset, the context and page directory registers are lost.
> However, the queue of requests that we resubmit after the reset may
> depend upon them - the registers are restored from a context image, but
> that restore may be inhibited and may simply be absent from the request
> if it was in the middle of a sequence using the same context. If we
> prime the CCID/PD registers with the first request in the queue (even
> for the hung request), we prevent invalid memory access for the
> following requests (and continually hung engines).
> 
> Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
> 
> This could do with going to stable but requires a few odds and ends, such
> as dma_fence_set_error(). Oh well, fortunately it is not as bad as it might
> seem since these registers are restored from the context - but that then
> requires a mesa context to reset the GPU state (as fortunately we called
> MI_SET_CONTEXT at the start of every batch!), but any other request in
> the meantime will likely hang again.
> 
> (Also I left gen8/ringbuffer reset_hw as an exercise for the reader)

I'm also puzzled as to how this escaped igt; the fence test should have
tried to write through the aliasing ppgtt without a context restore
(i.e. into randomness) following the hang. Weird. On the positive side,
it may mean the impact isn't as large as I think it should be.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* ✗ Fi.CI.BAT: failure for drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-04 19:37 [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset Chris Wilson
  2017-02-04 19:46 ` Chris Wilson
@ 2017-02-06  9:10 ` Patchwork
  2017-02-06 15:33 ` [PATCH] " Mika Kuoppala
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-02-06  9:10 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Restore context and pd for ringbuffer submission after reset
URL   : https://patchwork.freedesktop.org/series/19110/
State : failure

== Summary ==

Series 19110v1 drm/i915: Restore context and pd for ringbuffer submission after reset
https://patchwork.freedesktop.org/api/1.0/series/19110/revisions/1/mbox/

Test gem_exec_fence:
        Subgroup await-hang-default:
                pass       -> INCOMPLETE (fi-skl-6770hq)
Test kms_force_connector_basic:
        Subgroup force-connector-state:
                pass       -> SKIP       (fi-ivb-3520m)
        Subgroup force-edid:
                pass       -> SKIP       (fi-ivb-3520m)
        Subgroup force-load-detect:
                pass       -> SKIP       (fi-ivb-3520m)
        Subgroup prune-stale-modes:
                pass       -> SKIP       (fi-ivb-3520m)

fi-bdw-5557u     total:252  pass:214  dwarn:0   dfail:0   fail:0   skip:38 
fi-bsw-n3050     total:252  pass:192  dwarn:0   dfail:0   fail:0   skip:60 
fi-bxt-j4205     total:252  pass:208  dwarn:0   dfail:0   fail:0   skip:44 
fi-bxt-t5700     total:209  pass:167  dwarn:0   dfail:0   fail:0   skip:41 
fi-byt-j1900     total:252  pass:204  dwarn:0   dfail:0   fail:0   skip:48 
fi-byt-n2820     total:252  pass:200  dwarn:0   dfail:0   fail:0   skip:52 
fi-hsw-4770      total:252  pass:211  dwarn:0   dfail:0   fail:0   skip:41 
fi-hsw-4770r     total:252  pass:211  dwarn:0   dfail:0   fail:0   skip:41 
fi-ilk-650       total:14   pass:11   dwarn:0   dfail:0   fail:0   skip:2  
fi-ivb-3520m     total:252  pass:206  dwarn:0   dfail:0   fail:0   skip:46 
fi-ivb-3770      total:252  pass:210  dwarn:0   dfail:0   fail:0   skip:42 
fi-kbl-7500u     total:252  pass:207  dwarn:0   dfail:0   fail:2   skip:43 
fi-skl-6260u     total:252  pass:215  dwarn:0   dfail:0   fail:0   skip:37 
fi-skl-6700hq    total:252  pass:210  dwarn:0   dfail:0   fail:0   skip:42 
fi-skl-6700k     total:252  pass:205  dwarn:4   dfail:0   fail:0   skip:43 
fi-skl-6770hq    total:48   pass:45   dwarn:0   dfail:0   fail:0   skip:2  
fi-snb-2520m     total:252  pass:201  dwarn:0   dfail:0   fail:0   skip:51 
fi-snb-2600      total:252  pass:200  dwarn:0   dfail:0   fail:0   skip:52 

4cd56b5749aa66055e08b407081d1dd8d5c1bb82 drm-tip: 2017y-02m-04d-19h-26m-21s UTC integration manifest
05b01b7 drm/i915: Restore context and pd for ringbuffer submission after reset

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3705/

* Re: [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-04 19:37 [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset Chris Wilson
  2017-02-04 19:46 ` Chris Wilson
  2017-02-06  9:10 ` ✗ Fi.CI.BAT: failure for " Patchwork
@ 2017-02-06 15:33 ` Mika Kuoppala
  2017-02-06 15:51   ` Chris Wilson
  2017-02-07  9:58   ` [PATCH v2] " Chris Wilson
  2017-02-07 10:17 ` ✗ Fi.CI.BAT: warning for drm/i915: Restore context and pd for ringbuffer submission after reset (rev2) Patchwork
  2017-02-07 16:52 ` ✓ Fi.CI.BAT: success for drm/i915: Restore context and pd for ringbuffer submission after reset (rev3) Patchwork
  4 siblings, 2 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-02-06 15:33 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Following a reset, the context and page directory registers are lost.
> However, the queue of requests that we resubmit after the reset may
> depend upon them - the registers are restored from a context image, but
> that restore may be inhibited and may simply be absent from the request
> if it was in the middle of a sequence using the same context. If we
> prime the CCID/PD registers with the first request in the queue (even
> for the hung request), we prevent invalid memory access for the
> following requests (and continually hung engines).
>
> Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>
> This could do with going to stable but requires a few odds and ends, such
> as dma_fence_set_error(). Oh well, fortunately it is not as bad as it might
> seem since these registers are restored from the context - but that then
> requires a mesa context to reset the GPU state (as fortunately we called
> MI_SET_CONTEXT at the start of every batch!), but any other request in
> the meantime will likely hang again.
>
> (Also I left gen8/ringbuffer reset_hw as an exercise for the reader)
> -Chris
>
> ---
>  drivers/gpu/drm/i915/i915_gem.c         | 18 +++++++-----------
>  drivers/gpu/drm/i915/intel_lrc.c        |  6 +++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.c | 30 +++++++++++++++++++++++++++---
>  3 files changed, 39 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index fe32b1edbda2..7a16c859c875 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2735,21 +2735,17 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>  		engine->irq_seqno_barrier(engine);
>  
>  	request = i915_gem_find_active_request(engine);
> -	if (!request)
> -		return;
> -
> -	if (!i915_gem_reset_request(request))
> -		return;
> +	if (request && i915_gem_reset_request(request)) {
> +		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
> +				 engine->name, request->global_seqno);
>  
> -	DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
> -			 engine->name, request->global_seqno);
> +		/* If this context is now banned, skip all pending requests. */
> +		if (i915_gem_context_is_banned(request->ctx))
> +			engine_skip_context(request);
> +	}
>  
>  	/* Setup the CS to resume from the breadcrumb of the hung request */
>  	engine->reset_hw(engine, request);
> -
> -	/* If this context is now banned, skip all of its pending requests. */
> -	if (i915_gem_context_is_banned(request->ctx))
> -		engine_skip_context(request);
>  }
>  
>  void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 44a92ea464ba..f44dd42e46b5 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1324,7 +1324,10 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  			      struct drm_i915_gem_request *request)
>  {
>  	struct execlist_port *port = engine->execlist_port;
> -	struct intel_context *ce = &request->ctx->engine[engine->id];
> +	struct intel_context *ce;
> +
> +	if (!request || request->fence.error != -EIO)
> +		return;
>  
>  	/* We want a simple context + ring to execute the breadcrumb update.
>  	 * We cannot rely on the context being intact across the GPU hang,
> @@ -1333,6 +1336,7 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  	 * future request will be after userspace has had the opportunity
>  	 * to recreate its own state.
>  	 */
> +	ce = &request->ctx->engine[engine->id];
>  	execlists_init_reg_state(ce->lrc_reg_state,
>  				 request->ctx, engine, ce->ring);
>  
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index d32cbba25d98..b30c1dc3061c 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -599,10 +599,34 @@ static int init_ring_common(struct intel_engine_cs *engine)
>  static void reset_ring_common(struct intel_engine_cs *engine,
>  			      struct drm_i915_gem_request *request)
>  {
> -	struct intel_ring *ring = request->ring;
> +	if (request) {
> +		struct drm_i915_private *dev_priv = request->i915;
> +		struct intel_context *ce = &request->ctx->engine[engine->id];
> +		struct i915_hw_ppgtt *ppgtt;
> +
> +		if (ce->state) {
> +			I915_WRITE(CCID,
> +				   i915_ggtt_offset(ce->state) |
> +				   BIT(3) | BIT(2) | CCID_EN);

Can we assume here that the hw reset has made the ringbuffer
empty (as hw head == tail == 0)?

BIT8 should be set.
Bits 3|2 should be EXTENDED_STATE_SAVE|RESTORE_ENABLE.

Does the BIT2 affect the way the context will be loaded?

-Mika

> +		}
>  
> -	ring->head = request->postfix;
> -	ring->last_retired_head = -1;
> +		ppgtt = request->ctx->ppgtt ?: engine->i915->mm.aliasing_ppgtt;
> +		if (ppgtt) {
> +			u32 pd_offset = ppgtt->pd.base.ggtt_offset << 10;
> +
> +			I915_WRITE(RING_PP_DIR_DCLV(engine), PP_DIR_DCLV_2G);
> +			I915_WRITE(RING_PP_DIR_BASE(engine), pd_offset);
> +		}
> +
> +		if (request->fence.error == -EIO) {
> +			struct intel_ring *ring = request->ring;
> +
> +			ring->head = request->postfix;
> +			ring->last_retired_head = -1;
> +		}
> +	} else {
> +		engine->legacy_active_context = NULL;
> +	}
>  }
>  
>  static int intel_ring_workarounds_emit(struct drm_i915_gem_request *req)
> -- 
> 2.11.0

* Re: [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-06 15:33 ` [PATCH] " Mika Kuoppala
@ 2017-02-06 15:51   ` Chris Wilson
  2017-02-07  9:58   ` [PATCH v2] " Chris Wilson
  1 sibling, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-02-06 15:51 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Mon, Feb 06, 2017 at 05:33:34PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >  static void reset_ring_common(struct intel_engine_cs *engine,
> >  			      struct drm_i915_gem_request *request)
> >  {
> > -	struct intel_ring *ring = request->ring;
> > +	if (request) {
> > +		struct drm_i915_private *dev_priv = request->i915;
> > +		struct intel_context *ce = &request->ctx->engine[engine->id];
> > +		struct i915_hw_ppgtt *ppgtt;
> > +
> > +		if (ce->state) {
> > +			I915_WRITE(CCID,
> > +				   i915_ggtt_offset(ce->state) |
> > +				   BIT(3) | BIT(2) | CCID_EN);
> 
> Can we assume here that the hw reset has made the ringbuffer
> empty (as hw head == tail == 0)?

Yes. This is after reset. Perhaps not so much that they are 0, but that
the engine is completely idle, which is all we need.

> BIT8 should be set.

Not according to the gen6/7 bspec I checked.

> Bits 3|2 should be EXTENDED_STATE_SAVE|RESTORE_ENABLE.

That would require touching i915_reg.h. I don't want to start a bad
habit.
 
> Does the BIT2 affect the way the context will be loaded?

Yes. It may be the next context and valid. If it is the hung ctx, do we
care? Certainly we should assume it can recover. From very quick thought
experiments, it seems more prudent to try to reload the context.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* [PATCH v2] drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-06 15:33 ` [PATCH] " Mika Kuoppala
  2017-02-06 15:51   ` Chris Wilson
@ 2017-02-07  9:58   ` Chris Wilson
  2017-02-07 15:24     ` [PATCH v4] " Chris Wilson
  1 sibling, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-02-07  9:58 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Following a reset, the context and page directory registers are lost.
However, the queue of requests that we resubmit after the reset may
depend upon them - the registers are restored from a context image, but
that restore may be inhibited and may simply be absent from the request
if it was in the middle of a sequence using the same context. If we
prime the CCID/PD registers with the first request in the queue (even
for the hung request), we prevent invalid memory access for the
following requests (and continually hung engines).

v2: Magic BIT(8), reserved for future use but still appears unused.

Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c         | 18 +++++++---------
 drivers/gpu/drm/i915/i915_reg.h         |  6 ++++--
 drivers/gpu/drm/i915/intel_lrc.c        |  6 +++++-
 drivers/gpu/drm/i915/intel_ringbuffer.c | 37 ++++++++++++++++++++++++++++++---
 4 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fe32b1edbda2..7a16c859c875 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2735,21 +2735,17 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
 		engine->irq_seqno_barrier(engine);
 
 	request = i915_gem_find_active_request(engine);
-	if (!request)
-		return;
-
-	if (!i915_gem_reset_request(request))
-		return;
+	if (request && i915_gem_reset_request(request)) {
+		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
+				 engine->name, request->global_seqno);
 
-	DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
-			 engine->name, request->global_seqno);
+		/* If this context is now banned, skip all pending requests. */
+		if (i915_gem_context_is_banned(request->ctx))
+			engine_skip_context(request);
+	}
 
 	/* Setup the CS to resume from the breadcrumb of the hung request */
 	engine->reset_hw(engine, request);
-
-	/* If this context is now banned, skip all of its pending requests. */
-	if (i915_gem_context_is_banned(request->ctx))
-		engine_skip_context(request);
 }
 
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 5bf36bdc5926..4a59091c0152 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -3323,8 +3323,10 @@ enum skl_disp_power_wells {
 /*
  * Logical Context regs
  */
-#define CCID			_MMIO(0x2180)
-#define   CCID_EN		(1<<0)
+#define CCID				_MMIO(0x2180)
+#define   CCID_EN			BIT(0)
+#define   CCID_EXTENDED_STATE_RESTORE	BIT(2)
+#define   CCID_EXTENDED_STATE_SAVE	BIT(3)
 /*
  * Notes on SNB/IVB/VLV context size:
  * - Power context is saved elsewhere (LLC or stolen)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index df8e6f74dc7e..8a63a1bd0eb6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1352,7 +1352,10 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct drm_i915_gem_request *request)
 {
 	struct execlist_port *port = engine->execlist_port;
-	struct intel_context *ce = &request->ctx->engine[engine->id];
+	struct intel_context *ce;
+
+	if (!request || request->fence.error != -EIO)
+		return;
 
 	/* We want a simple context + ring to execute the breadcrumb update.
 	 * We cannot rely on the context being intact across the GPU hang,
@@ -1361,6 +1364,7 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
+	ce = &request->ctx->engine[engine->id];
 	execlists_init_reg_state(ce->lrc_reg_state,
 				 request->ctx, engine, ce->ring);
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 383083ef2210..1c4136cc3a84 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -599,10 +599,41 @@ static int init_ring_common(struct intel_engine_cs *engine)
 static void reset_ring_common(struct intel_engine_cs *engine,
 			      struct drm_i915_gem_request *request)
 {
-	struct intel_ring *ring = request->ring;
+	if (request) {
+		struct drm_i915_private *dev_priv = request->i915;
+		struct intel_context *ce = &request->ctx->engine[engine->id];
+		struct i915_hw_ppgtt *ppgtt;
+
+		/* FIXME consider gen8 reset */
+
+		if (ce->state) {
+			I915_WRITE(CCID,
+				   i915_ggtt_offset(ce->state) |
+				   BIT(8) /* must be set! */ |
+				   CCID_EXTENDED_STATE_SAVE |
+				   CCID_EXTENDED_STATE_RESTORE |
+				   CCID_EN);
+		}
 
-	ring->head = request->postfix;
-	ring->last_retired_head = -1;
+		ppgtt = request->ctx->ppgtt ?: engine->i915->mm.aliasing_ppgtt;
+		if (ppgtt) {
+			u32 pd_offset = ppgtt->pd.base.ggtt_offset << 10;
+
+			I915_WRITE(RING_PP_DIR_DCLV(engine), PP_DIR_DCLV_2G);
+			I915_WRITE(RING_PP_DIR_BASE(engine), pd_offset);
+			ppgtt->pd_dirty_rings &= ~intel_engine_flag(engine);
+		}
+
+		/* If the rq hung, jump to its breadcrumb and skip the batch */
+		if (request->fence.error == -EIO) {
+			struct intel_ring *ring = request->ring;
+
+			ring->head = request->postfix;
+			ring->last_retired_head = -1;
+		}
+	} else {
+		engine->legacy_active_context = NULL;
+	}
 }
 
 static int intel_ring_workarounds_emit(struct drm_i915_gem_request *req)
-- 
2.11.0


* ✗ Fi.CI.BAT: warning for drm/i915: Restore context and pd for ringbuffer submission after reset (rev2)
  2017-02-04 19:37 [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset Chris Wilson
                   ` (2 preceding siblings ...)
  2017-02-06 15:33 ` [PATCH] " Mika Kuoppala
@ 2017-02-07 10:17 ` Patchwork
  2017-02-07 16:52 ` ✓ Fi.CI.BAT: success for drm/i915: Restore context and pd for ringbuffer submission after reset (rev3) Patchwork
  4 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-02-07 10:17 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Restore context and pd for ringbuffer submission after reset (rev2)
URL   : https://patchwork.freedesktop.org/series/19110/
State : warning

== Summary ==

Series 19110v2 drm/i915: Restore context and pd for ringbuffer submission after reset
https://patchwork.freedesktop.org/api/1.0/series/19110/revisions/2/mbox/

Test kms_flip:
        Subgroup basic-flip-vs-wf_vblank:
                pass       -> DMESG-WARN (fi-snb-2520m)
Test kms_pipe_crc_basic:
        Subgroup nonblocking-crc-pipe-b:
                dmesg-warn -> PASS       (fi-snb-2520m)

fi-bdw-5557u     total:252  pass:238  dwarn:0   dfail:0   fail:0   skip:14 
fi-bsw-n3050     total:252  pass:213  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:252  pass:230  dwarn:0   dfail:0   fail:0   skip:22 
fi-byt-j1900     total:252  pass:225  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:252  pass:221  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:252  pass:233  dwarn:0   dfail:0   fail:0   skip:19 
fi-hsw-4770r     total:252  pass:233  dwarn:0   dfail:0   fail:0   skip:19 
fi-ilk-650       total:252  pass:199  dwarn:0   dfail:0   fail:0   skip:53 
fi-ivb-3520m     total:252  pass:231  dwarn:0   dfail:0   fail:0   skip:21 
fi-ivb-3770      total:252  pass:231  dwarn:0   dfail:0   fail:0   skip:21 
fi-kbl-7500u     total:252  pass:229  dwarn:0   dfail:0   fail:2   skip:21 
fi-skl-6260u     total:252  pass:239  dwarn:0   dfail:0   fail:0   skip:13 
fi-skl-6700hq    total:252  pass:232  dwarn:0   dfail:0   fail:0   skip:20 
fi-skl-6700k     total:252  pass:227  dwarn:4   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:252  pass:239  dwarn:0   dfail:0   fail:0   skip:13 
fi-snb-2520m     total:252  pass:220  dwarn:1   dfail:0   fail:0   skip:31 
fi-snb-2600      total:252  pass:220  dwarn:0   dfail:0   fail:0   skip:32 

4b60513d8a00fc960a2d837fcccf6a63c861d9ec drm-tip: 2017y-02m-06d-22h-40m-55s UTC integration manifest
636cba5 drm/i915: Restore context and pd for ringbuffer submission after reset

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3718/

* [PATCH v4] drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-07  9:58   ` [PATCH v2] " Chris Wilson
@ 2017-02-07 15:24     ` Chris Wilson
  2017-02-07 15:54       ` Mika Kuoppala
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-02-07 15:24 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Following a reset, the context and page directory registers are lost.
However, the queue of requests that we resubmit after the reset may
depend upon them - the registers are restored from a context image, but
that restore may be inhibited and may simply be absent from the request
if it was in the middle of a sequence using the same context. If we
prime the CCID/PD registers with the first request in the queue (even
for the hung request), we prevent invalid memory access for the
following requests (and continually hung engines).

v2: Magic BIT(8), reserved for future use but still appears unused.
v3: Some commentary on handling innocent vs guilty requests
v4: Add a wait for PD_BASE fetch. The reload appears to be instant on my
Ivybridge, but this bit probably exists for a reason.

Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c         | 18 ++++------
 drivers/gpu/drm/i915/i915_reg.h         |  6 ++--
 drivers/gpu/drm/i915/intel_lrc.c        | 16 ++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.c | 58 +++++++++++++++++++++++++++++++--
 4 files changed, 81 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8a510c7f6828..b7632bbbafd8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2743,21 +2743,17 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
 		engine->irq_seqno_barrier(engine);
 
 	request = i915_gem_find_active_request(engine);
-	if (!request)
-		return;
-
-	if (!i915_gem_reset_request(request))
-		return;
+	if (request && i915_gem_reset_request(request)) {
+		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
+				 engine->name, request->global_seqno);
 
-	DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
-			 engine->name, request->global_seqno);
+		/* If this context is now banned, skip all pending requests. */
+		if (i915_gem_context_is_banned(request->ctx))
+			engine_skip_context(request);
+	}
 
 	/* Setup the CS to resume from the breadcrumb of the hung request */
 	engine->reset_hw(engine, request);
-
-	/* If this context is now banned, skip all of its pending requests. */
-	if (i915_gem_context_is_banned(request->ctx))
-		engine_skip_context(request);
 }
 
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 5bf36bdc5926..4a59091c0152 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -3323,8 +3323,10 @@ enum skl_disp_power_wells {
 /*
  * Logical Context regs
  */
-#define CCID			_MMIO(0x2180)
-#define   CCID_EN		(1<<0)
+#define CCID				_MMIO(0x2180)
+#define   CCID_EN			BIT(0)
+#define   CCID_EXTENDED_STATE_RESTORE	BIT(2)
+#define   CCID_EXTENDED_STATE_SAVE	BIT(3)
 /*
  * Notes on SNB/IVB/VLV context size:
  * - Power context is saved elsewhere (LLC or stolen)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index df8e6f74dc7e..e42990b56fa8 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1352,7 +1352,20 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct drm_i915_gem_request *request)
 {
 	struct execlist_port *port = engine->execlist_port;
-	struct intel_context *ce = &request->ctx->engine[engine->id];
+	struct intel_context *ce;
+
+	/* If the request was innocent, we leave the request in the ELSP
+	 * and will try to replay it on restarting. The context image may
+	 * have been corrupted by the reset, in which case we may have
+	 * to service a new GPU hang, but more likely we can continue on
+	 * without impact.
+	 *
+	 * If the request was guilty, we presume the context is corrupt
+	 * and have to at least restore the RING register in the context
+	 * image back to the expected values to skip over the guilty request.
+	 */
+	if (!request || request->fence.error != -EIO)
+		return;
 
 	/* We want a simple context + ring to execute the breadcrumb update.
 	 * We cannot rely on the context being intact across the GPU hang,
@@ -1361,6 +1374,7 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
+	ce = &request->ctx->engine[engine->id];
 	execlists_init_reg_state(ce->lrc_reg_state,
 				 request->ctx, engine, ce->ring);
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 383083ef2210..d3d1e64f2498 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -599,10 +599,62 @@ static int init_ring_common(struct intel_engine_cs *engine)
 static void reset_ring_common(struct intel_engine_cs *engine,
 			      struct drm_i915_gem_request *request)
 {
-	struct intel_ring *ring = request->ring;
+	/* Try to restore the logical GPU state to match the continuation
+	 * of the request queue. If we skip the context/PD restore, then
+	 * the next request may try to execute assuming that its context
+	 * is valid and loaded on the GPU and so may try to access invalid
+	 * memory, prompting repeated GPU hangs.
+	 *
+	 * If the request was guilty, we still restore the logical state
+	 * in case the next request requires it (e.g. the aliasing ppgtt),
+	 * but skip over the hung batch.
+	 *
+	 * If the request was innocent, we try to replay the request with
+	 * the restored context.
+	 */
+	if (request) {
+		struct drm_i915_private *dev_priv = request->i915;
+		struct intel_context *ce = &request->ctx->engine[engine->id];
+		struct i915_hw_ppgtt *ppgtt;
+
+		/* FIXME consider gen8 reset */
+
+		if (ce->state) {
+			I915_WRITE(CCID,
+				   i915_ggtt_offset(ce->state) |
+				   BIT(8) /* must be set! */ |
+				   CCID_EXTENDED_STATE_SAVE |
+				   CCID_EXTENDED_STATE_RESTORE |
+				   CCID_EN);
+		}
 
-	ring->head = request->postfix;
-	ring->last_retired_head = -1;
+		ppgtt = request->ctx->ppgtt ?: engine->i915->mm.aliasing_ppgtt;
+		if (ppgtt) {
+			u32 pd_offset = ppgtt->pd.base.ggtt_offset << 10;
+
+			I915_WRITE(RING_PP_DIR_DCLV(engine), PP_DIR_DCLV_2G);
+			I915_WRITE(RING_PP_DIR_BASE(engine), pd_offset);
+
+			/* Wait for the PD reload to complete */
+			if (intel_wait_for_register(dev_priv,
+						    RING_PP_DIR_BASE(engine),
+						    BIT(0), 0,
+						    10))
+				DRM_ERROR("Wait for reload of ppgtt page-directory timed out\n");
+
+			ppgtt->pd_dirty_rings &= ~intel_engine_flag(engine);
+		}
+
+		/* If the rq hung, jump to its breadcrumb and skip the batch */
+		if (request->fence.error == -EIO) {
+			struct intel_ring *ring = request->ring;
+
+			ring->head = request->postfix;
+			ring->last_retired_head = -1;
+		}
+	} else {
+		engine->legacy_active_context = NULL;
+	}
 }
 
 static int intel_ring_workarounds_emit(struct drm_i915_gem_request *req)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* Re: [PATCH v4] drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-07 15:24     ` [PATCH v4] " Chris Wilson
@ 2017-02-07 15:54       ` Mika Kuoppala
  2017-02-07 21:47         ` Chris Wilson
  0 siblings, 1 reply; 11+ messages in thread
From: Mika Kuoppala @ 2017-02-07 15:54 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Following a reset, the context and page directory registers are lost.
> However, the queue of requests that we resubmit after the reset may
> depend upon them - the registers are restored from a context image, but
> that restore may be inhibited and may simply be absent from the request
> if it was in the middle of a sequence using the same context. If we
> prime the CCID/PD registers with the first request in the queue (even
> for the hung request), we prevent invalid memory access for the
> following requests (and continually hung engines).
>
> v2: Magic BIT(8), reserved for future use but still appears unused.
> v3: Some commentary on handling innocent vs guilty requests
> v4: Add a wait for PD_BASE fetch. The reload appears to be instant on my
> Ivybridge, but this bit probably exists for a reason.
>
> Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

* ✓ Fi.CI.BAT: success for drm/i915: Restore context and pd for ringbuffer submission after reset (rev3)
  2017-02-04 19:37 [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset Chris Wilson
                   ` (3 preceding siblings ...)
  2017-02-07 10:17 ` ✗ Fi.CI.BAT: warning for drm/i915: Restore context and pd for ringbuffer submission after reset (rev2) Patchwork
@ 2017-02-07 16:52 ` Patchwork
  4 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-02-07 16:52 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Restore context and pd for ringbuffer submission after reset (rev3)
URL   : https://patchwork.freedesktop.org/series/19110/
State : success

== Summary ==

Series 19110v3 drm/i915: Restore context and pd for ringbuffer submission after reset
https://patchwork.freedesktop.org/api/1.0/series/19110/revisions/3/mbox/

Test gem_exec_fence:
        Subgroup await-hang-default:
                fail       -> PASS       (fi-bxt-j4205)

fi-bdw-5557u     total:252  pass:238  dwarn:0   dfail:0   fail:0   skip:14 
fi-bsw-n3050     total:252  pass:213  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:252  pass:230  dwarn:0   dfail:0   fail:0   skip:22 
fi-bxt-t5700     total:83   pass:70   dwarn:0   dfail:0   fail:0   skip:12 
fi-byt-j1900     total:252  pass:225  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:252  pass:221  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:252  pass:233  dwarn:0   dfail:0   fail:0   skip:19 
fi-hsw-4770r     total:252  pass:233  dwarn:0   dfail:0   fail:0   skip:19 
fi-ilk-650       total:252  pass:199  dwarn:0   dfail:0   fail:0   skip:53 
fi-ivb-3520m     total:252  pass:231  dwarn:0   dfail:0   fail:0   skip:21 
fi-ivb-3770      total:252  pass:231  dwarn:0   dfail:0   fail:0   skip:21 
fi-kbl-7500u     total:252  pass:229  dwarn:0   dfail:0   fail:2   skip:21 
fi-skl-6260u     total:252  pass:239  dwarn:0   dfail:0   fail:0   skip:13 
fi-skl-6700hq    total:252  pass:232  dwarn:0   dfail:0   fail:0   skip:20 
fi-skl-6700k     total:252  pass:227  dwarn:4   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:252  pass:239  dwarn:0   dfail:0   fail:0   skip:13 
fi-snb-2520m     total:252  pass:221  dwarn:0   dfail:0   fail:0   skip:31 
fi-snb-2600      total:252  pass:220  dwarn:0   dfail:0   fail:0   skip:32 

b8157792b685dfcc86fa7dc23d66d663ad6fb93f drm-tip: 2017y-02m-07d-14h-55m-05s UTC integration manifest
0ee7e6a drm/i915: Restore context and pd for ringbuffer submission after reset

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3724/

* Re: [PATCH v4] drm/i915: Restore context and pd for ringbuffer submission after reset
  2017-02-07 15:54       ` Mika Kuoppala
@ 2017-02-07 21:47         ` Chris Wilson
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-02-07 21:47 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 07, 2017 at 05:54:10PM +0200, Mika Kuoppala wrote:
> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

Thanks very much for the review, pushed as this fixes a reproducible
hang when combined with redundant MI_SET_CONTEXT elimination.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

end of thread, other threads:[~2017-02-07 21:47 UTC | newest]

Thread overview: 11+ messages
-- links below jump to the message on this page --
2017-02-04 19:37 [PATCH] drm/i915: Restore context and pd for ringbuffer submission after reset Chris Wilson
2017-02-04 19:46 ` Chris Wilson
2017-02-06  9:10 ` ✗ Fi.CI.BAT: failure for " Patchwork
2017-02-06 15:33 ` [PATCH] " Mika Kuoppala
2017-02-06 15:51   ` Chris Wilson
2017-02-07  9:58   ` [PATCH v2] " Chris Wilson
2017-02-07 15:24     ` [PATCH v4] " Chris Wilson
2017-02-07 15:54       ` Mika Kuoppala
2017-02-07 21:47         ` Chris Wilson
2017-02-07 10:17 ` ✗ Fi.CI.BAT: warning for drm/i915: Restore context and pd for ringbuffer submission after reset (rev2) Patchwork
2017-02-07 16:52 ` ✓ Fi.CI.BAT: success for drm/i915: Restore context and pd for ringbuffer submission after reset (rev3) Patchwork
