* [PATCH] drm/i915: Prevent runaway head from denying hangcheck
@ 2016-02-19 14:09 Mika Kuoppala
  2016-02-19 14:24 ` Chris Wilson
  2016-02-19 16:40 ` ✗ Fi.CI.BAT: warning for " Patchwork
  0 siblings, 2 replies; 3+ messages in thread
From: Mika Kuoppala @ 2016-02-19 14:09 UTC (permalink / raw)
To: intel-gfx

If we have a runaway head moving out of the allocated address space,
that space is mapped to point into the scratch page. The content of the
scratch page is zero (MI_NOOP). This leads to the actual head proceeding
unhindered towards the end of the address space, and with 64 bit vmas
it is a long walk.

We could inspect the ppgtts to see if acthd is on valid space. But that
would need a lock over the active vma list, and we have tried very hard
to keep hangcheck lockfree.

Instead, take note of our current global highest vma address when
objects are added to the active list, and check against this address
when hangcheck is run. This is more coarse than per-ppgtt inspection,
but it should work for finding a runaway head.

Testcase: igt/drv_hangman/ppgtt_walk
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c     | 2 ++
 drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
 drivers/gpu/drm/i915/i915_irq.c         | 7 +++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
 4 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 9e19cf0e7075..b6735ae32997 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 			   (long long)acthd[i]);
 		seq_printf(m, "\tmax ACTHD = 0x%08llx\n",
 			   (long long)ring->hangcheck.max_acthd);
+		seq_printf(m, "\tmax vma = 0x%08llx\n",
+			   (long long)ring->hangcheck.max_active_vma);
 		seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
 		seq_printf(m, "\taction = %d\n", ring->hangcheck.action);

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f68f34606f2f..331b7a3d4206 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 	list_move_tail(&obj->ring_list[ring->id], &ring->active_list);
 	i915_gem_request_assign(&obj->last_read_req[ring->id], req);

+	if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma)
+		ring->hangcheck.max_active_vma =
+			vma->node.start + vma->node.size;
+
 	list_move_tail(&vma->mm_list, &vma->vm->active_list);
 }

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 25a89373df63..e59817328971 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd)
 		       sizeof(ring->hangcheck.instdone));

 		if (acthd > ring->hangcheck.max_acthd) {
+			u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma);
+
 			ring->hangcheck.max_acthd = acthd;
+
+			if (max_vma && acthd > max_vma)
+				return HANGCHECK_HUNG;
+
 			return HANGCHECK_ACTIVE;
 		}

@@ -3107,6 +3113,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 		if (ring->hangcheck.seqno == seqno) {
 			if (ring_idle(ring, seqno)) {
 				ring->hangcheck.action = HANGCHECK_IDLE;
+				ring->hangcheck.max_active_vma = 0;

 				if (waitqueue_active(&ring->irq_queue)) {
 					/* Issue a wake-up to catch stuck h/w.
 					 */

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 566b0ae10ce0..9b61d9c6e6d1 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -89,6 +89,7 @@ enum intel_ring_hangcheck_action {
 struct intel_ring_hangcheck {
 	u64 acthd;
 	u64 max_acthd;
+	u64 max_active_vma;
 	u32 seqno;
 	int score;
 	enum intel_ring_hangcheck_action action;
--
2.5.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
* Re: [PATCH] drm/i915: Prevent runaway head from denying hangcheck
  2016-02-19 14:09 [PATCH] drm/i915: Prevent runaway head from denying hangcheck Mika Kuoppala
@ 2016-02-19 14:24 ` Chris Wilson
  2016-02-19 16:40 ` ✗ Fi.CI.BAT: warning for " Patchwork
  1 sibling, 0 replies; 3+ messages in thread
From: Chris Wilson @ 2016-02-19 14:24 UTC (permalink / raw)
To: Mika Kuoppala; +Cc: intel-gfx

On Fri, Feb 19, 2016 at 04:09:03PM +0200, Mika Kuoppala wrote:
> If we have a runaway head moving out of the allocated address space,
> that space is mapped to point into the scratch page. The content of the
> scratch page is zero (MI_NOOP). This leads to the actual head proceeding
> unhindered towards the end of the address space, and with 64 bit vmas
> it is a long walk.
>
> We could inspect the ppgtts to see if acthd is on valid space. But that
> would need a lock over the active vma list, and we have tried very hard
> to keep hangcheck lockfree.
>
> Instead, take note of our current global highest vma address when
> objects are added to the active list, and check against this address
> when hangcheck is run. This is more coarse than per-ppgtt inspection,
> but it should work for finding a runaway head.
>
> Testcase: igt/drv_hangman/ppgtt_walk
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     | 2 ++
>  drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
>  drivers/gpu/drm/i915/i915_irq.c         | 7 +++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
>  4 files changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 9e19cf0e7075..b6735ae32997 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  			   (long long)acthd[i]);
>  		seq_printf(m, "\tmax ACTHD = 0x%08llx\n",
>  			   (long long)ring->hangcheck.max_acthd);
> +		seq_printf(m, "\tmax vma = 0x%08llx\n",
> +			   (long long)ring->hangcheck.max_active_vma);
>  		seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
>  		seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index f68f34606f2f..331b7a3d4206 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma,
>  	list_move_tail(&obj->ring_list[ring->id], &ring->active_list);
>  	i915_gem_request_assign(&obj->last_read_req[ring->id], req);
>
> +	if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma)
> +		ring->hangcheck.max_active_vma =
> +			vma->node.start + vma->node.size;

No.

> +
>  	list_move_tail(&vma->mm_list, &vma->vm->active_list);
>  }
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 25a89373df63..e59817328971 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd)
>  		       sizeof(ring->hangcheck.instdone));
>
>  		if (acthd > ring->hangcheck.max_acthd) {
> +			u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma);
> +
>  			ring->hangcheck.max_acthd = acthd;
> +
> +			if (max_vma && acthd > max_vma)
> +				return HANGCHECK_HUNG;
> +
>  			return HANGCHECK_ACTIVE;

The issue is that the active-loop detection completely nullifies the
DoS detection; what I thought I sent in Dec/Jan was a tweak to always
increment for loops:

commit 7ea8bea0738d6a2449f1819d8ee6efd402ae7a6c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 8 18:07:39 2016 +0000

    drm/i915/hangcheck: Prevent long walks across full-ppgtt

    With full-ppgtt, it takes the GPU an eon to traverse the entire 256PiB
    address space, causing a loop to be detected instead. Under the
    current scheme, if ACTHD walks off the end of a batch buffer and into
    an empty address space, we "never" detect the hang. If we always
    increment the score as the ACTHD is progressing then we will
    eventually timeout (after ~60s without advancing onto a new batch).
    To counteract this, increase the amount we reduce the score for good
    batches, so that only a series of almost-bad batches triggers a full
    reset.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8e449772b1bb..4fc4fbe4b798 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2961,21 +2961,15 @@ static enum intel_engine_hangcheck_action
 head_stuck(struct intel_engine_cs *ring, u64 acthd)
 {
 	if (acthd != ring->hangcheck.acthd) {
-		/* Clear subunit states on head movement */
 		memset(ring->hangcheck.instdone, 0,
 		       sizeof(ring->hangcheck.instdone));

-		if (acthd > ring->hangcheck.max_acthd) {
-			ring->hangcheck.max_acthd = acthd;
-			return HANGCHECK_ACTIVE;
-		}
-
 		return HANGCHECK_ACTIVE_LOOP;
 	}

 	if (!subunits_stuck(ring))
-		return HANGCHECK_ACTIVE;
+		return HANGCHECK_ACTIVE_LOOP;

 	return HANGCHECK_HUNG;
 }
@@ -3142,7 +3136,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 		 * attempts across multiple batches.
 		 */
 		if (ring->hangcheck.score > 0)
-			ring->hangcheck.score--;
+			ring->hangcheck.score -= HUNG;
+		if (ring->hangcheck.score < 0)
+			ring->hangcheck.score = 0;

 		/* Clear head and subunit states on seqno movement */
 		ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;

--
Chris Wilson, Intel Open Source Technology Centre
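[Editor's note: the alternative Chris describes — treating any head movement without seqno progress as loop-suspicious so the score always accumulates, while decaying the score faster on good batches — can be sketched as a toy simulation. All names and constants below are invented for illustration and are not the kernel's actual values or tick period.]

```c
#include <stdbool.h>

#define ACTIVE_LOOP_SCORE 1  /* added each tick the head moves without completing a batch */
#define GOOD_BATCH_DECAY 10  /* subtracted when a batch completes */
#define HANG_THRESHOLD   31  /* score above which the engine is declared hung */

struct engine_score {
	int score;
};

/* One hangcheck tick: a runaway head never completes a batch, so its
 * score climbs monotonically and eventually crosses the threshold,
 * while occasional slow-but-good batches decay the score back down.
 * Returns true once the engine is considered hung. */
bool hangcheck_tick(struct engine_score *e, bool batch_completed)
{
	if (batch_completed) {
		e->score -= GOOD_BATCH_DECAY;
		if (e->score < 0)
			e->score = 0;
	} else {
		e->score += ACTIVE_LOOP_SCORE;
	}
	return e->score > HANG_THRESHOLD;
}
```

This captures the trade-off in the quoted commit message: the runaway walk is now bounded (it times out after a fixed number of ticks), at the cost that only a run of nearly-bad batches, not a single one, triggers a full reset.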
* ✗ Fi.CI.BAT: warning for drm/i915: Prevent runaway head from denying hangcheck
  2016-02-19 14:09 [PATCH] drm/i915: Prevent runaway head from denying hangcheck Mika Kuoppala
  2016-02-19 14:24 ` Chris Wilson
@ 2016-02-19 16:40 ` Patchwork
  1 sibling, 0 replies; 3+ messages in thread
From: Patchwork @ 2016-02-19 16:40 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx

== Summary ==

Series 3630v1 drm/i915: Prevent runaway head from denying hangcheck
http://patchwork.freedesktop.org/api/1.0/series/3630/revisions/1/mbox/

Test kms_force_connector_basic:
        Subgroup force-load-detect:
                fail       -> DMESG-FAIL (snb-x220t)
                dmesg-fail -> FAIL       (hsw-gt2)
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-a:
                dmesg-warn -> PASS       (skl-i5k-2) UNSTABLE
        Subgroup suspend-read-crc-pipe-c:
                pass       -> DMESG-WARN (skl-i5k-2) UNSTABLE
Test pm_rpm:
        Subgroup basic-pci-d3-state:
                dmesg-warn -> PASS       (bsw-nuc-2)
                pass       -> DMESG-WARN (byt-nuc)
        Subgroup basic-rte:
                dmesg-warn -> PASS       (byt-nuc) UNSTABLE

bdw-nuci7        total:164  pass:153  dwarn:0  dfail:0  fail:0  skip:11
bdw-ultra        total:167  pass:153  dwarn:0  dfail:0  fail:0  skip:14
bsw-nuc-2        total:167  pass:136  dwarn:1  dfail:0  fail:0  skip:30
byt-nuc          total:167  pass:141  dwarn:1  dfail:0  fail:0  skip:25
hsw-gt2          total:167  pass:156  dwarn:0  dfail:0  fail:1  skip:10
ivb-t430s        total:167  pass:152  dwarn:0  dfail:0  fail:1  skip:14
skl-i5k-2        total:167  pass:150  dwarn:1  dfail:0  fail:0  skip:16
snb-dellxps      total:167  pass:144  dwarn:0  dfail:0  fail:1  skip:22
snb-x220t        total:167  pass:144  dwarn:0  dfail:1  fail:1  skip:21

Results at /archive/results/CI_IGT_test/Patchwork_1447/

82aa2ac97f3ec18d471b89c833288b236aeb70c5 drm-intel-nightly: 2016y-02m-19d-15h-32m-39s UTC integration manifest
82e5a155f99cc04278e4bc015e0fc4d9a765b237 drm/i915: Prevent runaway head from denying hangcheck