intel-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Prevent runaway head from denying hangcheck
@ 2016-02-19 14:09 Mika Kuoppala
  2016-02-19 14:24 ` Chris Wilson
  2016-02-19 16:40 ` ✗ Fi.CI.BAT: warning for " Patchwork
  0 siblings, 2 replies; 3+ messages in thread
From: Mika Kuoppala @ 2016-02-19 14:09 UTC (permalink / raw)
  To: intel-gfx

If we have runaway head moving out of allocated address space,
that space is mapped to point into scratch page. The content of scratch
page is is zero (MI_NOOP). This leads to actual head proceeding
unhindered towards the end of the address space and with with 64 bit
vmas it is a long walk.

We could inspect ppgtts to see if acthd is on valid space. But
that would need a lock over active vma list and we have tried very
hard to keep hangcheck lockfree.

Take note of our current global highest vma address, when objects
are added to active list. And check against this address when
hangcheck is run. This is more coarse than per ppgtt inspection,
but it should work of finding runaway head.

Testcase: igt/drv_hangman/ppgtt_walk
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c     | 2 ++
 drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
 drivers/gpu/drm/i915/i915_irq.c         | 7 +++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
 4 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 9e19cf0e7075..b6735ae32997 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 			   (long long)acthd[i]);
 		seq_printf(m, "\tmax ACTHD = 0x%08llx\n",
 			   (long long)ring->hangcheck.max_acthd);
+		seq_printf(m, "\tmax vma = 0x%08llx\n",
+			   (long long)ring->hangcheck.max_active_vma);
 		seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
 		seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f68f34606f2f..331b7a3d4206 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 	list_move_tail(&obj->ring_list[ring->id], &ring->active_list);
 	i915_gem_request_assign(&obj->last_read_req[ring->id], req);
 
+	if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma)
+		ring->hangcheck.max_active_vma =
+			vma->node.start + vma->node.size;
+
 	list_move_tail(&vma->mm_list, &vma->vm->active_list);
 }
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 25a89373df63..e59817328971 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd)
 		       sizeof(ring->hangcheck.instdone));
 
 		if (acthd > ring->hangcheck.max_acthd) {
+			u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma);
+
 			ring->hangcheck.max_acthd = acthd;
+
+			if (max_vma && acthd > max_vma)
+				return HANGCHECK_HUNG;
+
 			return HANGCHECK_ACTIVE;
 		}
 
@@ -3107,6 +3113,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 		if (ring->hangcheck.seqno == seqno) {
 			if (ring_idle(ring, seqno)) {
 				ring->hangcheck.action = HANGCHECK_IDLE;
+				ring->hangcheck.max_active_vma = 0;
 
 				if (waitqueue_active(&ring->irq_queue)) {
 					/* Issue a wake-up to catch stuck h/w. */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 566b0ae10ce0..9b61d9c6e6d1 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -89,6 +89,7 @@ enum intel_ring_hangcheck_action {
 struct intel_ring_hangcheck {
 	u64 acthd;
 	u64 max_acthd;
+	u64 max_active_vma;
 	u32 seqno;
 	int score;
 	enum intel_ring_hangcheck_action action;
-- 
2.5.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/i915: Prevent runaway head from denying hangcheck
  2016-02-19 14:09 [PATCH] drm/i915: Prevent runaway head from denying hangcheck Mika Kuoppala
@ 2016-02-19 14:24 ` Chris Wilson
  2016-02-19 16:40 ` ✗ Fi.CI.BAT: warning for " Patchwork
  1 sibling, 0 replies; 3+ messages in thread
From: Chris Wilson @ 2016-02-19 14:24 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Fri, Feb 19, 2016 at 04:09:03PM +0200, Mika Kuoppala wrote:
> If we have runaway head moving out of allocated address space,
> that space is mapped to point into scratch page. The content of scratch
> page is is zero (MI_NOOP). This leads to actual head proceeding
> unhindered towards the end of the address space and with with 64 bit
> vmas it is a long walk.
> 
> We could inspect ppgtts to see if acthd is on valid space. But
> that would need a lock over active vma list and we have tried very
> hard to keep hangcheck lockfree.
> 
> Take note of our current global highest vma address, when objects
> are added to active list. And check against this address when
> hangcheck is run. This is more coarse than per ppgtt inspection,
> but it should work of finding runaway head.
> 
> Testcase: igt/drv_hangman/ppgtt_walk
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     | 2 ++
>  drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
>  drivers/gpu/drm/i915/i915_irq.c         | 7 +++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
>  4 files changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 9e19cf0e7075..b6735ae32997 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  			   (long long)acthd[i]);
>  		seq_printf(m, "\tmax ACTHD = 0x%08llx\n",
>  			   (long long)ring->hangcheck.max_acthd);
> +		seq_printf(m, "\tmax vma = 0x%08llx\n",
> +			   (long long)ring->hangcheck.max_active_vma);
>  		seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
>  		seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index f68f34606f2f..331b7a3d4206 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma,
>  	list_move_tail(&obj->ring_list[ring->id], &ring->active_list);
>  	i915_gem_request_assign(&obj->last_read_req[ring->id], req);
>  
> +	if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma)
> +		ring->hangcheck.max_active_vma =
> +			vma->node.start + vma->node.size;

No.

> +
>  	list_move_tail(&vma->mm_list, &vma->vm->active_list);
>  }
>  
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 25a89373df63..e59817328971 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd)
>  		       sizeof(ring->hangcheck.instdone));
>  
>  		if (acthd > ring->hangcheck.max_acthd) {
> +			u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma);
> +
>  			ring->hangcheck.max_acthd = acthd;
> +
> +			if (max_vma && acthd > max_vma)
> +				return HANGCHECK_HUNG;
> +
>  			return HANGCHECK_ACTIVE;

The issue is that the active-loop detection completely nullifies the DoS
detection, what I thought I sent in Dec/Jan was a tweak to always
increment for loops:

commit 7ea8bea0738d6a2449f1819d8ee6efd402ae7a6c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 8 18:07:39 2016 +0000

    drm/i915/hangcheck: Prevent long walks across full-ppgtt
    
    With full-ppgtt, it takes the GPU an eon to traverse the entire 256PiB
    address space, so causing a loop to be detected. Under the current
    scheme, if ACTHD walks off the end of a batch buffer and into an empty
    address space, we "never" detect the hang. If we always increment the
    score as the ACTHD is progressing then we will eventually timeout (after
    ~60s without advancing onto a new batch). To counter act this, increase
    the amount we reduce the score for good batches, so that only a series
    of almost-bad batches trigger a full reset.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8e449772b1bb..4fc4fbe4b798 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2961,21 +2961,15 @@ static enum intel_engine_hangcheck_action
 head_stuck(struct intel_engine_cs *ring, u64 acthd)
 {
        if (acthd != ring->hangcheck.acthd) {
-
                /* Clear subunit states on head movement */
                memset(ring->hangcheck.instdone, 0,
                       sizeof(ring->hangcheck.instdone));
 
-               if (acthd > ring->hangcheck.max_acthd) {
-                       ring->hangcheck.max_acthd = acthd;
-                       return HANGCHECK_ACTIVE;
-               }
-
                return HANGCHECK_ACTIVE_LOOP;
        }
 
        if (!subunits_stuck(ring))
-               return HANGCHECK_ACTIVE;
+               return HANGCHECK_ACTIVE_LOOP;
 
        return HANGCHECK_HUNG;
 }
@@ -3142,7 +3136,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
                         * attempts across multiple batches.
                         */
                        if (ring->hangcheck.score > 0)
-                               ring->hangcheck.score--;
+                               ring->hangcheck.score -= HUNG;
+                       if (ring->hangcheck.score < 0)
+                               ring->hangcheck.score = 0;
 
                        /* Clear head and subunit states on seqno movement */
                        ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* ✗ Fi.CI.BAT: warning for drm/i915: Prevent runaway head from denying hangcheck
  2016-02-19 14:09 [PATCH] drm/i915: Prevent runaway head from denying hangcheck Mika Kuoppala
  2016-02-19 14:24 ` Chris Wilson
@ 2016-02-19 16:40 ` Patchwork
  1 sibling, 0 replies; 3+ messages in thread
From: Patchwork @ 2016-02-19 16:40 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Summary ==

Series 3630v1 drm/i915: Prevent runaway head from denying hangcheck
http://patchwork.freedesktop.org/api/1.0/series/3630/revisions/1/mbox/

Test kms_force_connector_basic:
        Subgroup force-load-detect:
                fail       -> DMESG-FAIL (snb-x220t)
                dmesg-fail -> FAIL       (hsw-gt2)
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-a:
                dmesg-warn -> PASS       (skl-i5k-2) UNSTABLE
        Subgroup suspend-read-crc-pipe-c:
                pass       -> DMESG-WARN (skl-i5k-2) UNSTABLE
Test pm_rpm:
        Subgroup basic-pci-d3-state:
                dmesg-warn -> PASS       (bsw-nuc-2)
                pass       -> DMESG-WARN (byt-nuc)
        Subgroup basic-rte:
                dmesg-warn -> PASS       (byt-nuc) UNSTABLE

bdw-nuci7        total:164  pass:153  dwarn:0   dfail:0   fail:0   skip:11 
bdw-ultra        total:167  pass:153  dwarn:0   dfail:0   fail:0   skip:14 
bsw-nuc-2        total:167  pass:136  dwarn:1   dfail:0   fail:0   skip:30 
byt-nuc          total:167  pass:141  dwarn:1   dfail:0   fail:0   skip:25 
hsw-gt2          total:167  pass:156  dwarn:0   dfail:0   fail:1   skip:10 
ivb-t430s        total:167  pass:152  dwarn:0   dfail:0   fail:1   skip:14 
skl-i5k-2        total:167  pass:150  dwarn:1   dfail:0   fail:0   skip:16 
snb-dellxps      total:167  pass:144  dwarn:0   dfail:0   fail:1   skip:22 
snb-x220t        total:167  pass:144  dwarn:0   dfail:1   fail:1   skip:21 

Results at /archive/results/CI_IGT_test/Patchwork_1447/

82aa2ac97f3ec18d471b89c833288b236aeb70c5 drm-intel-nightly: 2016y-02m-19d-15h-32m-39s UTC integration manifest
82e5a155f99cc04278e4bc015e0fc4d9a765b237 drm/i915: Prevent runaway head from denying hangcheck

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-02-19 16:41 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-02-19 14:09 [PATCH] drm/i915: Prevent runaway head from denying hangcheck Mika Kuoppala
2016-02-19 14:24 ` Chris Wilson
2016-02-19 16:40 ` ✗ Fi.CI.BAT: warning for " Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).