public inbox for intel-gfx@lists.freedesktop.org
* [PATCH] drm/i915: Stop gathering error states for CS error interrupts
From: Daniel Vetter @ 2014-11-04 14:52 UTC
  To: Intel Graphics Development; +Cc: Daniel Vetter, Daniel Vetter

There are quite a few bug reports with error states where the error
reasons make just about no sense at all, like dying on TLBs for a
display plane that's not even there. Also, users don't really report
many bad side effects, just the error states.

Furthermore, we don't even enable these interrupts any more on gen5+
(though the handling code is still there), so this mostly concerns old
platforms.

Given all that, let's make our lives a bit easier and stop capturing
error states, in the hope that we can just ignore them. In case
that's not true and the GPU indeed dies, hangcheck should
eventually kick in. I've left some debug logging in to make this case
noticeable. The referenced bug is just an example.
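The behaviour change described above can be modelled in a few lines of user-space C. This is only a sketch: the bit positions are placeholders standing in for GT_BLT_CS_ERROR_INTERRUPT, GT_BSD_CS_ERROR_INTERRUPT and GT_RENDER_CS_MASTER_ERROR_INTERRUPT, and the two counters are hypothetical stand-ins for "capture an error state" versus "log a debug line". None of it is driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder bit positions (assumption); see i915_reg.h for the real ones. */
#define CS_ERROR_BLT    (1u << 25)
#define CS_ERROR_BSD    (1u << 15)
#define CS_ERROR_RENDER (1u << 3)
#define CS_ERROR_MASK   (CS_ERROR_BLT | CS_ERROR_BSD | CS_ERROR_RENDER)

/* Hypothetical counters standing in for the two possible reactions. */
static int error_states_captured;
static int debug_messages_logged;

/* Before the patch: any CS error bit led to a full error-state capture
 * (i915_handle_error() in the driver). */
static void gt_irq_old(uint32_t gt_iir)
{
	if (gt_iir & CS_ERROR_MASK)
		error_states_captured++;
}

/* After the patch: the same bits only produce a debug line; a GPU that
 * really died is left for hangcheck to detect. */
static void gt_irq_new(uint32_t gt_iir)
{
	if (gt_iir & CS_ERROR_MASK) {
		debug_messages_logged++;
		printf("Command parser error, gt_iir 0x%08x\n", (unsigned int)gt_iir);
	}
}
```

Bits outside the mask (for example a plain user interrupt) are ignored by both variants; only the reaction to a CS error bit differs.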

References: https://bugs.freedesktop.org/show_bug.cgi?id=82095
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 25 +++++++------------------
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 318a6a0724d0..2f78764cb215 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1319,10 +1319,8 @@ static void snb_gt_irq_handler(struct drm_device *dev,
 
 	if (gt_iir & (GT_BLT_CS_ERROR_INTERRUPT |
 		      GT_BSD_CS_ERROR_INTERRUPT |
-		      GT_RENDER_CS_MASTER_ERROR_INTERRUPT)) {
-		i915_handle_error(dev, false, "GT error interrupt 0x%08x",
-				  gt_iir);
-	}
+		      GT_RENDER_CS_MASTER_ERROR_INTERRUPT))
+		DRM_DEBUG("Command parser error, gt_iir 0x%08x", gt_iir);
 
 	if (gt_iir & GT_PARITY_ERROR(dev))
 		ivybridge_parity_error_irq_handler(dev, gt_iir);
@@ -1715,11 +1713,8 @@ static void gen6_rps_irq_handler(struct drm_i915_private *dev_priv, u32 pm_iir)
 		if (pm_iir & PM_VEBOX_USER_INTERRUPT)
 			notify_ring(dev_priv->dev, &dev_priv->ring[VECS]);
 
-		if (pm_iir & PM_VEBOX_CS_ERROR_INTERRUPT) {
-			i915_handle_error(dev_priv->dev, false,
-					  "VEBOX CS error interrupt 0x%08x",
-					  pm_iir);
-		}
+		if (pm_iir & PM_VEBOX_CS_ERROR_INTERRUPT)
+			DRM_DEBUG("Command parser error, pm_iir 0x%08x", pm_iir);
 	}
 }
 
@@ -3744,9 +3739,7 @@ static irqreturn_t i8xx_irq_handler(int irq, void *arg)
 		 */
 		spin_lock(&dev_priv->irq_lock);
 		if (iir & I915_RENDER_COMMAND_PARSER_ERROR_INTERRUPT)
-			i915_handle_error(dev, false,
-					  "Command parser error, iir 0x%08x",
-					  iir);
+			DRM_DEBUG("Command parser error, iir 0x%08x", iir);
 
 		for_each_pipe(dev_priv, pipe) {
 			int reg = PIPESTAT(pipe);
@@ -3929,9 +3922,7 @@ static irqreturn_t i915_irq_handler(int irq, void *arg)
 		 */
 		spin_lock(&dev_priv->irq_lock);
 		if (iir & I915_RENDER_COMMAND_PARSER_ERROR_INTERRUPT)
-			i915_handle_error(dev, false,
-					  "Command parser error, iir 0x%08x",
-					  iir);
+			DRM_DEBUG("Command parser error, iir 0x%08x", iir);
 
 		for_each_pipe(dev_priv, pipe) {
 			int reg = PIPESTAT(pipe);
@@ -4156,9 +4147,7 @@ static irqreturn_t i965_irq_handler(int irq, void *arg)
 		 */
 		spin_lock(&dev_priv->irq_lock);
 		if (iir & I915_RENDER_COMMAND_PARSER_ERROR_INTERRUPT)
-			i915_handle_error(dev, false,
-					  "Command parser error, iir 0x%08x",
-					  iir);
+			DRM_DEBUG("Command parser error, iir 0x%08x", iir);
 
 		for_each_pipe(dev_priv, pipe) {
 			int reg = PIPESTAT(pipe);
-- 
2.1.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

* Re: [PATCH] drm/i915: Stop gathering error states for CS error interrupts
From: Jani Nikula @ 2014-11-04 15:02 UTC
  To: Intel Graphics Development; +Cc: Daniel Vetter, Daniel Vetter

On Tue, 04 Nov 2014, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> There are quite a few bug reports with error states where the error
> reasons make just about no sense at all, like dying on TLBs for a
> display plane that's not even there. Also, users don't really report
> many bad side effects, just the error states.
>
> Furthermore, we don't even enable these interrupts any more on gen5+
> (though the handling code is still there), so this mostly concerns old
> platforms.
>
> Given all that, let's make our lives a bit easier and stop capturing
> error states, in the hope that we can just ignore them. In case
> that's not true and the GPU indeed dies, hangcheck should
> eventually kick in. I've left some debug logging in to make this case
> noticeable. The referenced bug is just an example.
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=82095
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 25 +++++++------------------
>  1 file changed, 7 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 318a6a0724d0..2f78764cb215 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1319,10 +1319,8 @@ static void snb_gt_irq_handler(struct drm_device *dev,
>  
>  	if (gt_iir & (GT_BLT_CS_ERROR_INTERRUPT |
>  		      GT_BSD_CS_ERROR_INTERRUPT |
> -		      GT_RENDER_CS_MASTER_ERROR_INTERRUPT)) {
> -		i915_handle_error(dev, false, "GT error interrupt 0x%08x",
> -				  gt_iir);
> -	}
> +		      GT_RENDER_CS_MASTER_ERROR_INTERRUPT))
> +		DRM_DEBUG("Command parser error, gt_iir 0x%08x", gt_iir);

The trailing \n is missing from all of the new DRM_DEBUG() format strings.

BR,
Jani.
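The rule behind this review comment: each DRM_DEBUG() format string should end in \n, since a kernel log line without a trailing newline is left open and may be delayed or run into the next message. The helper below is hypothetical (it is not part of the kernel or of checkpatch.pl) and simply checks a format string as text, using the string from the first hunk as posted and as it would read once fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if a printk-style format string ends in '\n'. */
static bool format_ends_with_newline(const char *fmt)
{
	size_t len = strlen(fmt);

	return len > 0 && fmt[len - 1] == '\n';
}

/* As posted, and as it should read once fixed, i.e.
 * DRM_DEBUG("Command parser error, gt_iir 0x%08x\n", gt_iir); */
static const char *const posted_fmt = "Command parser error, gt_iir 0x%08x";
static const char *const fixed_fmt  = "Command parser error, gt_iir 0x%08x\n";
```

The same check applies to the pm_iir and iir variants in the other four hunks.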

-- 
Jani Nikula, Intel Open Source Technology Center

* Re: [PATCH] drm/i915: Stop gathering error states for CS error interrupts
From: Chris Wilson @ 2014-11-05  8:35 UTC
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development

On Tue, Nov 04, 2014 at 03:52:22PM +0100, Daniel Vetter wrote:
> There are quite a few bug reports with error states where the error
> reasons make just about no sense at all, like dying on TLBs for a
> display plane that's not even there. Also, users don't really report
> many bad side effects, just the error states.
> 
> Furthermore, we don't even enable these interrupts any more on gen5+
> (though the handling code is still there), so this mostly concerns old
> platforms.
> 
> Given all that, let's make our lives a bit easier and stop capturing
> error states, in the hope that we can just ignore them. In case
> that's not true and the GPU indeed dies, hangcheck should
> eventually kick in. I've left some debug logging in to make this case
> noticeable. The referenced bug is just an example.

The problem is they can be useful. They have shown us when our modesetting
sequence has gone completely snafu, and they can also be used to detect
page faults in userspace GPU command streams (though that does require a
bit of kernel trickery). Even in the Display B case on 845g, we must
have done something to upset the hardware; we simply haven't captured
what. I am not yet convinced we want to throw all such reports away, in
case we end up ignoring a genuine failure.

How about just toning down the error message for non-fatal faults, and
discarding the earlier error state should we get a fatal fault afterwards?
-Chris
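A minimal sketch of the policy suggested above, assuming a single saved error-state slot: a non-fatal fault only logs quietly and fills the slot if it is empty, while a fatal fault discards an earlier non-fatal capture. The struct and function names are hypothetical, and keeping the first fatal capture is an assumption modelled on the driver retaining its first error state. A real version would also need to free the discarded state and respect interrupt-context locking, which is where this gets complicated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified model of one saved error-state slot. */
struct error_slot {
	bool valid;    /* a capture is currently held */
	bool fatal;    /* the held capture came from a fatal fault */
	uint32_t iir;  /* interrupt bits recorded with the capture */
};

/* Non-fatal fault: toned-down message, and capture only into an empty
 * slot so an earlier capture is never overwritten by a non-fatal one. */
static void record_nonfatal(struct error_slot *slot, uint32_t iir)
{
	printf("non-fatal CS error, iir 0x%08x\n", (unsigned int)iir);
	if (!slot->valid) {
		slot->valid = true;
		slot->fatal = false;
		slot->iir = iir;
	}
}

/* Fatal fault: replace an earlier non-fatal capture, but keep an
 * existing fatal one (assumption: the first hang is the most useful). */
static void record_fatal(struct error_slot *slot, uint32_t iir)
{
	if (slot->valid && slot->fatal)
		return;
	slot->valid = true;
	slot->fatal = true;
	slot->iir = iir;
}
```

Starting from an empty slot, a non-fatal fault followed by a fatal one leaves only the fatal capture in place.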

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH] drm/i915: Stop gathering error states for CS error interrupts
From: Daniel Vetter @ 2014-11-05  9:56 UTC
  To: Chris Wilson, Daniel Vetter, Intel Graphics Development,
	Daniel Vetter

On Wed, Nov 05, 2014 at 08:35:01AM +0000, Chris Wilson wrote:
> The problem is they can be useful. They have shown us when our modesetting
> sequence has gone completely snafu, and they can also be used to detect
> page faults in userspace GPU command streams (though that does require a
> bit of kernel trickery). Even in the Display B case on 845g, we must
> have done something to upset the hardware; we simply haven't captured
> what. I am not yet convinced we want to throw all such reports away, in
> case we end up ignoring a genuine failure.
> 
> How about just toning down the error message for non-fatal faults, and
> discarding the earlier error state should we get a fatal fault afterwards?

Hm yeah, that might work too.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: [PATCH] drm/i915: Stop gathering error states for CS error interrupts
From: Daniel Vetter @ 2014-11-24 20:57 UTC
  To: Chris Wilson, Daniel Vetter, Intel Graphics Development,
	Daniel Vetter

On Wed, Nov 05, 2014 at 10:56:06AM +0100, Daniel Vetter wrote:
> On Wed, Nov 05, 2014 at 08:35:01AM +0000, Chris Wilson wrote:
> > How about just toning down the error message for non-fatal faults, and
> > discarding the earlier error state should we get a fatal fault afterwards?
> 
> Hm yeah, that might work too.

I looked at this and it gets ugly fast. Given that we seem to have quite a
substantial false-positive rate (I found one more just by reading recent
bug spam) and haven't enabled this on gen5+, I've decided to just merge
this patch, with the missing \n added of course.

We can still inject manual error-state captures through debugfs, and wiring
this up again should be quick if it does prove useful.
-Daniel

* Re: [PATCH] drm/i915: Stop gathering error states for CS error interrupts
From: Chris Wilson @ 2014-11-24 21:42 UTC
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Mon, Nov 24, 2014 at 09:57:32PM +0100, Daniel Vetter wrote:
> I looked at this and it gets ugly fast. Given that we seem to have quite a
> substantial false-positive rate (I found one more just by reading recent
> bug spam) and haven't enabled this on gen5+, I've decided to just merge
> this patch, with the missing \n added of course.

I am not 100% convinced they are false positives; I just haven't found
the bug. In the past, they have been very reliable at detecting real
bugs.
-Chris

Thread overview: 6 messages
2014-11-04 14:52 [PATCH] drm/i915: Stop gathering error states for CS error interrupts Daniel Vetter
2014-11-04 15:02 ` Jani Nikula
2014-11-05  8:35 ` Chris Wilson
2014-11-05  9:56   ` Daniel Vetter
2014-11-24 20:57     ` Daniel Vetter
2014-11-24 21:42       ` Chris Wilson
