[PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang


* [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang
@ 2013-05-08 13:29 Chris Wilson
  2013-05-08 14:02 ` Daniel Vetter
  2013-05-13 13:10 ` Mika Kuoppala
  0 siblings, 2 replies; 7+ messages in thread
From: Chris Wilson @ 2013-05-08 13:29 UTC (permalink / raw)
  To: intel-gfx

There is an unlikely corner case whereby a lockless wait may not notice
a GPU hang and reset, and so continue to wait for the device to advance
beyond the chosen seqno. This of course may never happen as the waiter
may be the only user. Instead, we can explicitly advance the device
seqno to match the requests that are forcibly retired following the
hang.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 84ee1f2..b3c8abd 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
 }
 
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
-				      struct intel_ring_buffer *ring)
+				      struct intel_ring_buffer *ring,
+				      u32 seqno)
 {
+	int i;
+
 	while (!list_empty(&ring->request_list)) {
 		struct drm_i915_gem_request *request;
 
@@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 
 		i915_gem_object_move_to_inactive(obj);
 	}
+
+	intel_ring_init_seqno(ring, seqno);
+	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
+		ring->sync_seqno[i] = 0;
 }
 
 static void i915_gem_reset_fences(struct drm_device *dev)
@@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_object *obj;
 	struct intel_ring_buffer *ring;
+	u32 seqno;
 	int i;
 
+	if (i915_gem_get_seqno(dev, &seqno))
+		seqno = dev_priv->next_seqno - 1;
+
 	for_each_ring(ring, dev_priv, i)
-		i915_gem_reset_ring_lists(dev_priv, ring);
+		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
 
 	/* Move everything out of the GPU domains to ensure we do any
 	 * necessary invalidation upon reuse.
-- 
1.7.10.4
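
The race described in the commit message can be sketched in isolation. This is a minimal model, not driver code: struct model_ring, model_submit(), model_reset() and waiter_done() are hypothetical names, and seqno_passed() is only assumed to follow the wrap-safe comparison style used for seqnos. It shows a lockless waiter that polls the hardware seqno and never completes if the reset retires its request without advancing that seqno.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_PENDING 4

/* Wrap-safe comparison: true once seq1 has reached or passed seq2. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Hypothetical, heavily simplified model of one ring. */
struct model_ring {
	uint32_t hw_seqno;                   /* last seqno the hardware reported */
	uint32_t pending[MODEL_MAX_PENDING]; /* seqnos of outstanding requests */
	unsigned int npending;
};

/* Queue a request; the hardware has not executed it yet. */
static void model_submit(struct model_ring *ring, uint32_t seqno)
{
	assert(ring->npending < MODEL_MAX_PENDING);
	ring->pending[ring->npending++] = seqno;
}

/*
 * Forcibly retire every outstanding request, as happens after a hang.
 * With advance == true the hardware seqno is also moved to last_seqno,
 * which is the behaviour the patch adds.
 */
static void model_reset(struct model_ring *ring, uint32_t last_seqno,
			bool advance)
{
	ring->npending = 0;
	if (advance)
		ring->hw_seqno = last_seqno;
}

/* One poll of a lockless waiter: it looks only at the hardware seqno. */
static bool waiter_done(const struct model_ring *ring, uint32_t target)
{
	return seqno_passed(ring->hw_seqno, target);
}
```

With hw_seqno at 100 and a waiter polling for 102, model_reset(&ring, 102, false) empties the request list but waiter_done() stays false; model_reset(&ring, 102, true) lets the same poll succeed.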


* Re: [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang
  2013-05-08 13:29 [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang Chris Wilson
@ 2013-05-08 14:02 ` Daniel Vetter
  2013-05-08 14:06   ` Chris Wilson
  2013-05-13 13:10 ` Mika Kuoppala
  1 sibling, 1 reply; 7+ messages in thread
From: Daniel Vetter @ 2013-05-08 14:02 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
> There is an unlikely corner case whereby a lockless wait may not notice
> a GPU hang and reset, and so continue to wait for the device to advance
> beyond the chosen seqno. This of course may never happen as the waiter
> may be the only user. Instead, we can explicitly advance the device
> seqno to match the requests that are forcibly retired following the
> hang.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

This race is why the reset counter must always increase and can't just
flip-flop between the reset-in-progress and everything-works states.

Now if we want to unwedge on resume we need to reconsider this, but imo it
would be easier to simply remember the reset counter before we wedge the
gpu and restore that one (incremented as if the gpu reset worked). We
already assume that wedged will never collide with a real reset counter,
so this should work.
-Daniel
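
A rough sketch of the counter argument above, under stated assumptions: the encoding (odd while a reset is in progress, an all-ones sentinel for wedged) and every name here (struct reset_state, begin_reset(), finish_reset(), wedge(), unwedge(), waiter_must_restart()) are hypothetical and only illustrate the idea, they are not the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RESET_IN_PROGRESS 1u    /* hypothetical: odd value = reset running */
#define WEDGED UINT32_MAX       /* hypothetical sentinel, never a real count */

struct reset_state {
	uint32_t counter; /* only ever increases, apart from WEDGED */
	uint32_t saved;   /* value remembered when the GPU was wedged */
};

static bool reset_in_progress(uint32_t counter)
{
	return (counter & RESET_IN_PROGRESS) != 0;
}

static void begin_reset(struct reset_state *s)
{
	assert(!reset_in_progress(s->counter));
	s->counter++; /* even -> odd */
}

static void finish_reset(struct reset_state *s)
{
	assert(reset_in_progress(s->counter));
	s->counter++; /* odd -> even, and larger than before the reset */
}

/* Remember the counter so it can be restored on unwedge. */
static void wedge(struct reset_state *s)
{
	s->saved = s->counter;
	s->counter = WEDGED;
}

/* Restore the saved counter, incremented as if a reset had completed. */
static void unwedge(struct reset_state *s)
{
	assert(s->counter == WEDGED);
	s->counter = (s->saved | RESET_IN_PROGRESS) + 1;
}

/* A waiter samples the counter before sleeping and compares afterwards. */
static bool waiter_must_restart(uint32_t sampled, uint32_t now)
{
	return reset_in_progress(now) || sampled != now;
}

/* For contrast: a flag that flips back looks unchanged after the reset. */
static bool flag_waiter_must_restart(bool flag_now)
{
	return flag_now;
}
```

A waiter that sampled 2 and sleeps through a whole reset sees 4 afterwards and restarts. With the flip-flop flag it would see "false" both before and after, and so miss the reset entirely.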

> ---
>  drivers/gpu/drm/i915/i915_gem.c |   15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 84ee1f2..b3c8abd 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
>  }
>  
>  static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
> -				      struct intel_ring_buffer *ring)
> +				      struct intel_ring_buffer *ring,
> +				      u32 seqno)
>  {
> +	int i;
> +
>  	while (!list_empty(&ring->request_list)) {
>  		struct drm_i915_gem_request *request;
>  
> @@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>  
>  		i915_gem_object_move_to_inactive(obj);
>  	}
> +
> +	intel_ring_init_seqno(ring, seqno);
> +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> +		ring->sync_seqno[i] = 0;
>  }
>  
>  static void i915_gem_reset_fences(struct drm_device *dev)
> @@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_object *obj;
>  	struct intel_ring_buffer *ring;
> +	u32 seqno;
>  	int i;
>  
> +	if (i915_gem_get_seqno(dev, &seqno))
> +		seqno = dev_priv->next_seqno - 1;
> +
>  	for_each_ring(ring, dev_priv, i)
> -		i915_gem_reset_ring_lists(dev_priv, ring);
> +		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
>  
>  	/* Move everything out of the GPU domains to ensure we do any
>  	 * necessary invalidation upon reuse.
> -- 
> 1.7.10.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang
  2013-05-08 14:02 ` Daniel Vetter
@ 2013-05-08 14:06   ` Chris Wilson
  2013-05-10 15:02     ` Daniel Vetter
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2013-05-08 14:06 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Wed, May 08, 2013 at 04:02:00PM +0200, Daniel Vetter wrote:
> On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
> > There is an unlikely corner case whereby a lockless wait may not notice
> > a GPU hang and reset, and so continue to wait for the device to advance
> > beyond the chosen seqno. This of course may never happen as the waiter
> > may be the only user. Instead, we can explicitly advance the device
> > seqno to match the requests that are forcibly retired following the
> > hang.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> This race is why the reset counter must always increase and can't just
> flip-flop between the reset-in-progress and everything-works states.
> 
> Now if we want to unwedge on resume we need to reconsider this, but imo it
> would be easier to simply remember the reset counter before we wedge the
> gpu and restore that one (incremented as if the gpu reset worked). We
> already assume that wedged will never collide with a real reset counter,
> so this should work.

Agree that this is an unwedge-upon-resume issue, but my argument here is
that this leaves the hardware state consistent with what we forcibly
reset it to. From that perspective your suggestion is papering over this
here bug and this is the neat solution.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang
  2013-05-08 14:06   ` Chris Wilson
@ 2013-05-10 15:02     ` Daniel Vetter
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Vetter @ 2013-05-10 15:02 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Wed, May 8, 2013 at 4:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, May 08, 2013 at 04:02:00PM +0200, Daniel Vetter wrote:
>> On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
>> > There is an unlikely corner case whereby a lockless wait may not notice
>> > a GPU hang and reset, and so continue to wait for the device to advance
>> > beyond the chosen seqno. This of course may never happen as the waiter
>> > may be the only user. Instead, we can explicitly advance the device
>> > seqno to match the requests that are forcibly retired following the
>> > hang.
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> This race is why the reset counter must always increase and can't just
>> flip-flop between the reset-in-progress and everything-works states.
>>
>> Now if we want to unwedge on resume we need to reconsider this, but imo it
>> would be easier to simply remember the reset counter before we wedge the
>> gpu and restore that one (incremented as if the gpu reset worked). We
>> already assume that wedged will never collide with a real reset counter,
>> so this should work.
>
> Agree that this is an unwedge-upon-resume issue, but my argument here is
> that this leaves the hardware state consistent with what we forcibly
> reset it to. From that perspective your suggestion is papering over this
> here bug and this is the neat solution.

Yeah, for the reset case I agree that just continuing in the sequence
would be more resilient. I'm still a bit unsure though what to do
across suspend/resume (where we currently force-reset the sequence
numbers, too). Maybe we need the poke-y stick there, too (in the form
of kicking waiters and incrementing the reset counter).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang
  2013-05-08 13:29 [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang Chris Wilson
  2013-05-08 14:02 ` Daniel Vetter
@ 2013-05-13 13:10 ` Mika Kuoppala
  2013-05-14 10:34   ` Chris Wilson
  1 sibling, 1 reply; 7+ messages in thread
From: Mika Kuoppala @ 2013-05-13 13:10 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> There is an unlikely corner case whereby a lockless wait may not notice
> a GPU hang and reset, and so continue to wait for the device to advance
> beyond the chosen seqno. This of course may never happen as the waiter
> may be the only user. Instead, we can explicitly advance the device
> seqno to match the requests that are forcibly retired following the
> hang.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_gem.c |   15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 84ee1f2..b3c8abd 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
>  }
>  
>  static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
> -				      struct intel_ring_buffer *ring)
> +				      struct intel_ring_buffer *ring,
> +				      u32 seqno)
>  {
> +	int i;
> +
>  	while (!list_empty(&ring->request_list)) {
>  		struct drm_i915_gem_request *request;
>  
> @@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>  
>  		i915_gem_object_move_to_inactive(obj);
>  	}
> +
> +	intel_ring_init_seqno(ring, seqno);
> +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> +		ring->sync_seqno[i] = 0;
>  }

I remember pondering about resetting sync_seqno's
inside intel_ring_init_seqno(). Is there a reason
not to?

>  static void i915_gem_reset_fences(struct drm_device *dev)
> @@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_object *obj;
>  	struct intel_ring_buffer *ring;
> +	u32 seqno;
>  	int i;
>  
> +	if (i915_gem_get_seqno(dev, &seqno))
> +		seqno = dev_priv->next_seqno - 1;
> +
>  	for_each_ring(ring, dev_priv, i)
> -		i915_gem_reset_ring_lists(dev_priv, ring);
> +		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
>  
>  	/* Move everything out of the GPU domains to ensure we do any
>  	 * necessary invalidation upon reuse.
> -- 
> 1.7.10.4
>


* Re: [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang
  2013-05-13 13:10 ` Mika Kuoppala
@ 2013-05-14 10:34   ` Chris Wilson
  2013-05-14 12:31     ` Mika Kuoppala
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2013-05-14 10:34 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Mon, May 13, 2013 at 04:10:18PM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> > +
> > +	intel_ring_init_seqno(ring, seqno);
> > +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> > +		ring->sync_seqno[i] = 0;
> >  }
> 
> I remember pondering about resetting sync_seqno's
> inside intel_ring_init_seqno(). Is there a reason
> not to?

Not a strong one. Conceptually the ring->sync_seqno[] belong to the other
rings, so I felt it was clumsy for intel_ring_init_seqno() to falsely
claim ownership and reset its own sync_seqno. But I think we can
refactor away those qualms with a comment.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang
  2013-05-14 10:34   ` Chris Wilson
@ 2013-05-14 12:31     ` Mika Kuoppala
  0 siblings, 0 replies; 7+ messages in thread
From: Mika Kuoppala @ 2013-05-14 12:31 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Mon, May 13, 2013 at 04:10:18PM +0300, Mika Kuoppala wrote:
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> > +
>> > +	intel_ring_init_seqno(ring, seqno);
>> > +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
>> > +		ring->sync_seqno[i] = 0;
>> >  }
>> 
>> I remember pondering about resetting sync_seqno's
>> inside intel_ring_init_seqno(). Is there a reason
>> not to?
>
> Not a strong one. Conceptually the ring->sync_seqno[] belong to the other
> rings, so I felt it was clumsy for intel_ring_init_seqno() to falsely
> claim ownership and reset its own sync_seqno. But I think we can
> refactor away those qualms with a comment.

The existing intel_ring_init_seqno() already clumsily
resets the sync registers of other rings. As we can't
and won't initialize anything but all of the ring seqnos
at once, the existing code could be more explicit on that.

But for this patch:
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
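
One way to make the "all rings at once" point explicit is a single helper that owns both the per-ring seqno and every sync slot. The following is only a sketch under assumptions: struct sketch_ring, NUM_RINGS and init_all_seqnos() are hypothetical names for illustration and do not correspond to existing driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RINGS 3 /* assumption: three rings, each syncing to the others */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, simplified ring state. */
struct sketch_ring {
	uint32_t seqno;                     /* last seqno seen on this ring */
	uint32_t sync_seqno[NUM_RINGS - 1]; /* last seqno synced to per other ring */
};

/*
 * Initialise every ring to the same seqno and clear every sync slot in
 * one place, so no single ring has to reset state that conceptually
 * belongs to the other rings.
 */
static void init_all_seqnos(struct sketch_ring *rings, size_t count,
			    uint32_t seqno)
{
	assert(rings != NULL);

	for (size_t r = 0; r < count; r++) {
		rings[r].seqno = seqno;
		for (size_t i = 0; i < ARRAY_SIZE(rings[r].sync_seqno); i++)
			rings[r].sync_seqno[i] = 0;
	}
}
```

Calling init_all_seqnos(rings, NUM_RINGS, seqno) from the reset path would then replace both the per-ring seqno initialisation and the open-coded sync_seqno loop, at the cost of no longer being able to initialise one ring on its own.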

