All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915: dump even more into the error_state
@ 2012-02-01 21:26 Daniel Vetter
  2012-02-01 21:39 ` Chris Wilson
  2012-02-04  2:23 ` Ben Widawsky
  0 siblings, 2 replies; 7+ messages in thread
From: Daniel Vetter @ 2012-02-01 21:26 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

Chris Wilson and me have again stared at funny error states and it's
been pretty clear from the start that something was seriously amiss.
The seqnos last seen by the cpu were a few hundred behind those that
the gpu could have possibly emitted last before it died ...

Chris now tracked it down (hopefully, definit verdict's still out),
but in hindsight we'd have found the bug by simply dumping the cpu
side tracking of the ring head and tail registers.

Fix this and prevent an identical time-waster in the future.

Because the hangs always involved semaphores in one way or another,
we've tried to dump the mbox registers, but couldn't find any
inconsistencies. Still, dump them too.

Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_debugfs.c |    6 ++++++
 drivers/gpu/drm/i915/i915_drv.h     |    4 ++++
 drivers/gpu/drm/i915/i915_irq.c     |    7 +++++++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 681cbe4..e9eeacb 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -721,8 +721,14 @@ static void i915_ring_error_state(struct seq_file *m,
 	if (INTEL_INFO(dev)->gen >= 6) {
 		seq_printf(m, "  FADDR: 0x%08x\n", error->faddr[ring]);
 		seq_printf(m, "  FAULT_REG: 0x%08x\n", error->fault_reg[ring]);
+		seq_printf(m, "  SYNC_0: 0x%08x\n",
+			   error->semaphore_mboxes[ring][0]);
+		seq_printf(m, "  SYNC_1: 0x%08x\n",
+			   error->semaphore_mboxes[ring][1]);
 	}
 	seq_printf(m, "  seqno: 0x%08x\n", error->seqno[ring]);
+	seq_printf(m, "  ring->head: 0x%08x\n", error->cpu_ring_head[ring]);
+	seq_printf(m, "  ring->tail: 0x%08x\n", error->cpu_ring_tail[ring]);
 }
 
 static int i915_error_state(struct seq_file *m, void *unused)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 865de80..1eb3b9a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -159,6 +159,10 @@ struct drm_i915_error_state {
 	u32 ipehr[I915_NUM_RINGS];
 	u32 instdone[I915_NUM_RINGS];
 	u32 acthd[I915_NUM_RINGS];
+	u32 semaphore_mboxes[I915_NUM_RINGS][I915_NUM_RINGS - 1];
+	/* our own tracking of ring head and tail */
+	u32 cpu_ring_head[I915_NUM_RINGS];
+	u32 cpu_ring_tail[I915_NUM_RINGS];
 	u32 error; /* gen6+ */
 	u32 instpm[I915_NUM_RINGS];
 	u32 instps[I915_NUM_RINGS];
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8ea1ca4..cde1ce9 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -903,6 +903,10 @@ static void i915_record_ring_state(struct drm_device *dev,
 	if (INTEL_INFO(dev)->gen >= 6) {
 		error->faddr[ring->id] = I915_READ(RING_DMA_FADD(ring->mmio_base));
 		error->fault_reg[ring->id] = I915_READ(RING_FAULT_REG(ring));
+		error->semaphore_mboxes[ring->id][0]
+			= I915_READ(RING_SYNC_0(ring->mmio_base));
+		error->semaphore_mboxes[ring->id][1]
+			= I915_READ(RING_SYNC_1(ring->mmio_base));
 	}
 
 	if (INTEL_INFO(dev)->gen >= 4) {
@@ -925,6 +929,9 @@ static void i915_record_ring_state(struct drm_device *dev,
 	error->acthd[ring->id] = intel_ring_get_active_head(ring);
 	error->head[ring->id] = I915_READ_HEAD(ring);
 	error->tail[ring->id] = I915_READ_TAIL(ring);
+
+	error->cpu_ring_head[ring->id] = ring->head;
+	error->cpu_ring_tail[ring->id] = ring->tail;
 }
 
 /**
-- 
1.7.8.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/i915: dump even more into the error_state
  2012-02-01 21:26 [PATCH] drm/i915: dump even more into the error_state Daniel Vetter
@ 2012-02-01 21:39 ` Chris Wilson
  2012-02-02  0:23   ` Eugeni Dodonov
  2012-02-04  2:23 ` Ben Widawsky
  1 sibling, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2012-02-01 21:39 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

On Wed,  1 Feb 2012 22:26:45 +0100, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> Chris Wilson and me have again stared at funny error states and it's
> been pretty clear from the start that something was seriously amiss.
> The seqnos last seen by the cpu were a few hundred behind those that
> the gpu could have possibly emitted last before it died ...
> 
> Chris now tracked it down (hopefully, definit verdict's still out),
> but in hindsight we'd have found the bug by simply dumping the cpu
> side tracking of the ring head and tail registers.
> 
> Fix this and prevent an identical time-waster in the future.
> 
> Because the hangs always involved semaphores in one way or another,
> we've tried to dump the mbox registers, but couldn't find any
> inconsistencies. Still, dump them too.
> 
> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-and-wanted-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/i915: dump even more into the error_state
  2012-02-01 21:39 ` Chris Wilson
@ 2012-02-02  0:23   ` Eugeni Dodonov
  2012-02-09 14:52     ` Daniel Vetter
  0 siblings, 1 reply; 7+ messages in thread
From: Eugeni Dodonov @ 2012-02-02  0:23 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 1169 bytes --]

On Wed, Feb 1, 2012 at 19:39, Chris Wilson <chris@chris-wilson.co.uk> wrote:

> On Wed,  1 Feb 2012 22:26:45 +0100, Daniel Vetter <daniel.vetter@ffwll.ch>
> wrote:
> > Chris Wilson and me have again stared at funny error states and it's
> > been pretty clear from the start that something was seriously amiss.
> > The seqnos last seen by the cpu were a few hundred behind those that
> > the gpu could have possibly emitted last before it died ...
> >
> > Chris now tracked it down (hopefully, definit verdict's still out),
> > but in hindsight we'd have found the bug by simply dumping the cpu
> > side tracking of the ring head and tail registers.
> >
> > Fix this and prevent an identical time-waster in the future.
> >
> > Because the hangs always involved semaphores in one way or another,
> > we've tried to dump the mbox registers, but couldn't find any
> > inconsistencies. Still, dump them too.
> >
> > Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-and-wanted-by: Chris Wilson <chris@chris-wilson.co.uk>
>

Yeah!! Thanks a lot for that.

Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>

-- 
Eugeni Dodonov
<http://eugeni.dodonov.net/>

[-- Attachment #1.2: Type: text/html, Size: 1808 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/i915: dump even more into the error_state
  2012-02-01 21:26 [PATCH] drm/i915: dump even more into the error_state Daniel Vetter
  2012-02-01 21:39 ` Chris Wilson
@ 2012-02-04  2:23 ` Ben Widawsky
  2012-02-04 10:45   ` Chris Wilson
  1 sibling, 1 reply; 7+ messages in thread
From: Ben Widawsky @ 2012-02-04  2:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel Graphics Development

On Wed, Feb 01, 2012 at 10:26:45PM +0100, Daniel Vetter wrote:
> Chris Wilson and me have again stared at funny error states and it's
> been pretty clear from the start that something was seriously amiss.
> The seqnos last seen by the cpu were a few hundred behind those that
> the gpu could have possibly emitted last before it died ...
> 
> Chris now tracked it down (hopefully, definit verdict's still out),
> but in hindsight we'd have found the bug by simply dumping the cpu
> side tracking of the ring head and tail registers.
> 
> Fix this and prevent an identical time-waster in the future.
> 
> Because the hangs always involved semaphores in one way or another,
> we've tried to dump the mbox registers, but couldn't find any
> inconsistencies. Still, dump them too.

How about we only do this if semaphores are actually enabled? I think
that will eliminate potential confusion.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/i915: dump even more into the error_state
  2012-02-04  2:23 ` Ben Widawsky
@ 2012-02-04 10:45   ` Chris Wilson
  2012-02-04 19:55     ` Ben Widawsky
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2012-02-04 10:45 UTC (permalink / raw)
  To: Ben Widawsky, Daniel Vetter; +Cc: Intel Graphics Development

On Fri, 3 Feb 2012 18:23:10 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> On Wed, Feb 01, 2012 at 10:26:45PM +0100, Daniel Vetter wrote:
> > Because the hangs always involved semaphores in one way or another,
> > we've tried to dump the mbox registers, but couldn't find any
> > inconsistencies. Still, dump them too.
> 
> How about we only do this if semaphores are actually enabled? I think
> that will eliminate potential confusion.

I suggested that we add the module parameters to the error-state to
aide, though even that is not fool proof as they could change between
the driver doing an undefined op and it fouling up the GPU. Any sign of
conflicting information inside the error-state is meant to be thought
provoking. ;-)

The idea is to dump as much information as we can about the errors and
leave the complex art of interpretation up to userspace.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/i915: dump even more into the error_state
  2012-02-04 10:45   ` Chris Wilson
@ 2012-02-04 19:55     ` Ben Widawsky
  0 siblings, 0 replies; 7+ messages in thread
From: Ben Widawsky @ 2012-02-04 19:55 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, Intel Graphics Development

On Sat, Feb 04, 2012 at 10:45:50AM +0000, Chris Wilson wrote:
> On Fri, 3 Feb 2012 18:23:10 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> > On Wed, Feb 01, 2012 at 10:26:45PM +0100, Daniel Vetter wrote:
> > > Because the hangs always involved semaphores in one way or another,
> > > we've tried to dump the mbox registers, but couldn't find any
> > > inconsistencies. Still, dump them too.
> > 
> > How about we only do this if semaphores are actually enabled? I think
> > that will eliminate potential confusion.
> 
> I suggested that we add the module parameters to the error-state to
> aide, though even that is not fool proof as they could change between
> the driver doing an undefined op and it fouling up the GPU. Any sign of
> conflicting information inside the error-state is meant to be thought
> provoking. ;-)
> 
> The idea is to dump as much information as we can about the errors and
> leave the complex art of interpretation up to userspace.
> -Chris

I understand your sentiment, but I think this is one of those things
where 18 months from now someone (very likely me) will spend far too
much time chasing weird MBOX values only to realize semaphores weren't
turned on. Dumping the module parameters is good enough for me. The risk
of the situation you describe should be quite low.

Personally, I think it's confusing enough that we put the MBOX_UPDATES
in the ring even when semaphores aren't on.

~Ben

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/i915: dump even more into the error_state
  2012-02-02  0:23   ` Eugeni Dodonov
@ 2012-02-09 14:52     ` Daniel Vetter
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Vetter @ 2012-02-09 14:52 UTC (permalink / raw)
  To: Eugeni Dodonov; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Feb 01, 2012 at 10:23:05PM -0200, Eugeni Dodonov wrote:
> On Wed, Feb 1, 2012 at 19:39, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> 
> > On Wed,  1 Feb 2012 22:26:45 +0100, Daniel Vetter <daniel.vetter@ffwll.ch>
> > wrote:
> > > Chris Wilson and me have again stared at funny error states and it's
> > > been pretty clear from the start that something was seriously amiss.
> > > The seqnos last seen by the cpu were a few hundred behind those that
> > > the gpu could have possibly emitted last before it died ...
> > >
> > > Chris now tracked it down (hopefully, definit verdict's still out),
> > > but in hindsight we'd have found the bug by simply dumping the cpu
> > > side tracking of the ring head and tail registers.
> > >
> > > Fix this and prevent an identical time-waster in the future.
> > >
> > > Because the hangs always involved semaphores in one way or another,
> > > we've tried to dump the mbox registers, but couldn't find any
> > > inconsistencies. Still, dump them too.
> > >
> > > Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Reviewed-and-wanted-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Yeah!! Thanks a lot for that.
> 
> Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Queued for -next, thanks for the review.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-02-09 14:52 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-02-01 21:26 [PATCH] drm/i915: dump even more into the error_state Daniel Vetter
2012-02-01 21:39 ` Chris Wilson
2012-02-02  0:23   ` Eugeni Dodonov
2012-02-09 14:52     ` Daniel Vetter
2012-02-04  2:23 ` Ben Widawsky
2012-02-04 10:45   ` Chris Wilson
2012-02-04 19:55     ` Ben Widawsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.