[PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests

public inbox for intel-gfx@lists.freedesktop.org

* [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
@ 2017-09-15 13:09 Chris Wilson
  2017-09-15 13:29 ` Chris Wilson
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-15 13:09 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jari Tahvanainen

If we see the seqno stop progressing, we abandon the test for fear that
the GPU died following the reset. However, during test teardown we still
wait for the GPU to idle before continuing, but we have already
confirmed that the GPU is dead. Furthermore, since we are inside a reset
test, we have disabled the hangchecker, and so there is no safety net and
we wait indefinitely. Detect the stuck GPU and declare it wedged as a
state of emergency so we can escape.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 02e52a146ed8..913fe752f6b4 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
 		*batch++ = lower_32_bits(vma->node.start);
 	}
 	*batch++ = MI_BATCH_BUFFER_END; /* not reached */
+	wmb();
 
 	flags = 0;
 	if (INTEL_GEN(vm->i915) <= 5)
@@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
 	__i915_add_request(rq, true);
 
 	if (!wait_for_hang(&h, rq)) {
-		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		pr_err("Failed to start request %x, at %x\n",
+		       rq->fence.seqno, hws_seqno(&h, rq));
+
+		i915_reset(i915, 0);
+		i915_gem_set_wedged(i915);
+
 		err = -EIO;
 		goto out_rq;
 	}
@@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
 			__i915_add_request(rq, true);
 
 			if (!wait_for_hang(&h, prev)) {
-				pr_err("Failed to start request %x\n",
-				       prev->fence.seqno);
+				pr_err("Failed to start request %x, at %x\n",
+				       rq->fence.seqno, hws_seqno(&h, rq));
 				i915_gem_request_put(rq);
 				i915_gem_request_put(prev);
+
+				i915_reset(i915, 0);
+				i915_gem_set_wedged(i915);
+
 				err = -EIO;
 				goto fini;
 			}
@@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
 	__i915_add_request(rq, true);
 
 	if (!wait_for_hang(&h, rq)) {
-		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		pr_err("Failed to start request %x, at %x\n",
+		       rq->fence.seqno, hws_seqno(&h, rq));
+
+		i915_reset(i915, 0);
+		i915_gem_set_wedged(i915);
+
 		err = -EIO;
 		goto err_request;
 	}
@@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
 int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_global_reset), /* attempt to recover GPU first */
 		SUBTEST(igt_hang_sanitycheck),
-		SUBTEST(igt_global_reset),
 		SUBTEST(igt_reset_engine),
 		SUBTEST(igt_reset_active_engines),
 		SUBTEST(igt_wait_reset),
-- 
2.14.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
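
The escape hatch described in the commit message can be illustrated outside the kernel. The following is a minimal userspace sketch, not the driver code: struct fake_gpu, wait_for_seqno(), wait_for_idle() and run_reset_test() are hypothetical names invented for this illustration, and the poll budgets are arbitrary. It models only the wedged flag; the i915_reset() attempt that the patch makes first is left out. A bounded poll detects that the seqno has stopped advancing, the device is then marked wedged, and the teardown wait returns immediately instead of blocking on a dead GPU.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the device; not the real i915 structures. */
struct fake_gpu {
	unsigned int hw_seqno; /* last seqno the simulated hardware reported */
	bool stuck;            /* when true, hw_seqno never advances */
	bool wedged;           /* set once the test gives up on the device */
};

/* Poll up to max_polls times; true if hw_seqno reached target. */
static bool wait_for_seqno(struct fake_gpu *gpu, unsigned int target,
			   int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (gpu->hw_seqno >= target)
			return true;
		if (!gpu->stuck)
			gpu->hw_seqno++; /* simulated progress */
	}
	return gpu->hw_seqno >= target;
}

/*
 * Teardown wait. In the scenario from the commit message this wait has no
 * safety net; here a large poll budget and -ETIMEDOUT stand in for
 * "would wait forever".
 */
static int wait_for_idle(struct fake_gpu *gpu, unsigned int target)
{
	if (gpu->wedged)
		return 0; /* nothing left to wait for */
	return wait_for_seqno(gpu, target, 1000) ? 0 : -ETIMEDOUT;
}

/* Returns 0 on success, -EIO if the request never started. */
static int run_reset_test(struct fake_gpu *gpu, unsigned int target)
{
	int err = 0;

	assert(gpu != NULL);

	if (!wait_for_seqno(gpu, target, 10)) {
		fprintf(stderr, "Failed to start request %x, at %x\n",
			target, gpu->hw_seqno);
		gpu->wedged = true; /* emergency escape before teardown */
		err = -EIO;
	}

	/* Teardown always waits for idle; it must not block on a dead GPU. */
	if (wait_for_idle(gpu, target))
		return -ETIMEDOUT;

	return err;
}
```

With a stuck fake_gpu, run_reset_test() reports -EIO and leaves the device marked wedged; without the wedged assignment the same run would fall through to the teardown wait and exhaust its budget, which is the stand-in for the indefinite wait.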


* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
@ 2017-09-15 13:29 ` Chris Wilson
  2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-15 13:29 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jari Tahvanainen

Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
>                 *batch++ = lower_32_bits(vma->node.start);
>         }
>         *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> +       wmb();
>  
>         flags = 0;
>         if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
>         __i915_add_request(rq, true);
>  
>         if (!wait_for_hang(&h, rq)) {
> -               pr_err("Failed to start request %x\n", rq->fence.seqno);
> +               pr_err("Failed to start request %x, at %x\n",
> +                      rq->fence.seqno, hws_seqno(&h, rq));
> +
> +               i915_reset(i915, 0);
> +               i915_gem_set_wedged(i915);
> +
>                 err = -EIO;
>                 goto out_rq;
>         }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
>                         __i915_add_request(rq, true);
>  
>                         if (!wait_for_hang(&h, prev)) {
> -                               pr_err("Failed to start request %x\n",
> -                                      prev->fence.seqno);
> +                               pr_err("Failed to start request %x, at %x\n",
> +                                      rq->fence.seqno, hws_seqno(&h, rq));

Odd one out needs s/rq/prev/
-Chris

* ✓ Fi.CI.BAT: success for drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
  2017-09-15 13:29 ` Chris Wilson
@ 2017-09-15 13:31 ` Patchwork
  2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-09-15 13:31 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/selftests: Try to recover from a wedged GPU during reset tests
URL   : https://patchwork.freedesktop.org/series/30419/
State : success

== Summary ==

Series 30419v1 drm/i915/selftests: Try to recover from a wedged GPU during reset tests
https://patchwork.freedesktop.org/api/1.0/series/30419/revisions/1/mbox/

Test chamelium:
        Subgroup dp-crc-fast:
                pass       -> FAIL       (fi-kbl-7500u) fdo#102514
Test kms_cursor_legacy:
        Subgroup basic-busy-flip-before-cursor-atomic:
                fail       -> PASS       (fi-snb-2600) fdo#100215 +1
Test pm_rpm:
        Subgroup basic-rte:
                pass       -> DMESG-WARN (fi-cfl-s) fdo#102294

fdo#102514 https://bugs.freedesktop.org/show_bug.cgi?id=102514
fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
fdo#102294 https://bugs.freedesktop.org/show_bug.cgi?id=102294

fi-bdw-5557u     total:289  pass:268  dwarn:0   dfail:0   fail:0   skip:21  time:444s
fi-bdw-gvtdvm    total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:456s
fi-blb-e6850     total:289  pass:224  dwarn:1   dfail:0   fail:0   skip:64  time:379s
fi-bsw-n3050     total:289  pass:243  dwarn:0   dfail:0   fail:0   skip:46  time:528s
fi-bwr-2160      total:289  pass:184  dwarn:0   dfail:0   fail:0   skip:105 time:268s
fi-bxt-j4205     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:504s
fi-byt-j1900     total:289  pass:254  dwarn:1   dfail:0   fail:0   skip:34  time:504s
fi-byt-n2820     total:289  pass:250  dwarn:1   dfail:0   fail:0   skip:38  time:493s
fi-cfl-s         total:289  pass:222  dwarn:35  dfail:0   fail:0   skip:32  time:543s
fi-elk-e7500     total:289  pass:230  dwarn:0   dfail:0   fail:0   skip:59  time:414s
fi-glk-2a        total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:600s
fi-hsw-4770      total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:429s
fi-hsw-4770r     total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:408s
fi-ilk-650       total:289  pass:229  dwarn:0   dfail:0   fail:0   skip:60  time:436s
fi-ivb-3520m     total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:484s
fi-ivb-3770      total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:468s
fi-kbl-7500u     total:289  pass:263  dwarn:1   dfail:0   fail:1   skip:24  time:485s
fi-kbl-7560u     total:289  pass:270  dwarn:0   dfail:0   fail:0   skip:19  time:586s
fi-kbl-r         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:588s
fi-pnv-d510      total:289  pass:223  dwarn:1   dfail:0   fail:0   skip:65  time:553s
fi-skl-6260u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:458s
fi-skl-6700k     total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:521s
fi-skl-6770hq    total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:493s
fi-skl-gvtdvm    total:289  pass:266  dwarn:0   dfail:0   fail:0   skip:23  time:457s
fi-skl-x1585l    total:289  pass:268  dwarn:0   dfail:0   fail:0   skip:21  time:474s
fi-snb-2520m     total:289  pass:251  dwarn:0   dfail:0   fail:0   skip:38  time:570s
fi-snb-2600      total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:433s

9adc9e93d6243c82bcefd175c2d11770802de194 drm-tip: 2017y-09m-15d-11h-44m-46s UTC integration manifest
d69539e5cad5 drm/i915/selftests: Try to recover from a wedged GPU during reset tests

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5712/

* ✓ Fi.CI.IGT: success for drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
  2017-09-15 13:29 ` Chris Wilson
  2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2017-09-15 15:04 ` Patchwork
  2017-09-19 14:18 ` [PATCH] " Chris Wilson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-09-15 15:04 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/selftests: Try to recover from a wedged GPU during reset tests
URL   : https://patchwork.freedesktop.org/series/30419/
State : success

== Summary ==

Test perf:
        Subgroup polling:
                pass       -> FAIL       (shard-hsw) fdo#102252 +1

fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252

shard-hsw        total:2313 pass:1245 dwarn:0   dfail:0   fail:13  skip:1055 time:9350s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5712/shards.html

* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
                   ` (2 preceding siblings ...)
  2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
@ 2017-09-19 14:18 ` Chris Wilson
  2017-09-19 14:24   ` Tahvanainen, Jari
  2017-09-25 12:01 ` Chris Wilson
  2017-09-26 12:48 ` Mika Kuoppala
  5 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-09-19 14:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jari Tahvanainen

Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Ping?

> ---
>  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
>                 *batch++ = lower_32_bits(vma->node.start);
>         }
>         *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> +       wmb();
>  
>         flags = 0;
>         if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
>         __i915_add_request(rq, true);
>  
>         if (!wait_for_hang(&h, rq)) {
> -               pr_err("Failed to start request %x\n", rq->fence.seqno);
> +               pr_err("Failed to start request %x, at %x\n",
> +                      rq->fence.seqno, hws_seqno(&h, rq));
> +
> +               i915_reset(i915, 0);
> +               i915_gem_set_wedged(i915);
> +
>                 err = -EIO;
>                 goto out_rq;
>         }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
>                         __i915_add_request(rq, true);
>  
>                         if (!wait_for_hang(&h, prev)) {
> -                               pr_err("Failed to start request %x\n",
> -                                      prev->fence.seqno);
> +                               pr_err("Failed to start request %x, at %x\n",
> +                                      rq->fence.seqno, hws_seqno(&h, rq));
>                                 i915_gem_request_put(rq);
>                                 i915_gem_request_put(prev);
> +
> +                               i915_reset(i915, 0);
> +                               i915_gem_set_wedged(i915);
> +
>                                 err = -EIO;
>                                 goto fini;
>                         }
> @@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
>         __i915_add_request(rq, true);
>  
>         if (!wait_for_hang(&h, rq)) {
> -               pr_err("Failed to start request %x\n", rq->fence.seqno);
> +               pr_err("Failed to start request %x, at %x\n",
> +                      rq->fence.seqno, hws_seqno(&h, rq));
> +
> +               i915_reset(i915, 0);
> +               i915_gem_set_wedged(i915);
> +
>                 err = -EIO;
>                 goto err_request;
>         }
> @@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
>  int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
>  {
>         static const struct i915_subtest tests[] = {
> +               SUBTEST(igt_global_reset), /* attempt to recover GPU first */
>                 SUBTEST(igt_hang_sanitycheck),
> -               SUBTEST(igt_global_reset),
>                 SUBTEST(igt_reset_engine),
>                 SUBTEST(igt_reset_active_engines),
>                 SUBTEST(igt_wait_reset),
> -- 
> 2.14.1
> 

* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-19 14:18 ` [PATCH] " Chris Wilson
@ 2017-09-19 14:24   ` Tahvanainen, Jari
  2017-09-19 14:33     ` Chris Wilson
  0 siblings, 1 reply; 11+ messages in thread
From: Tahvanainen, Jari @ 2017-09-19 14:24 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx@lists.freedesktop.org

-----Original Message-----
From: Chris Wilson [mailto:chris@chris-wilson.co.uk] 
Sent: Tuesday, September 19, 2017 5:19 PM
To: intel-gfx@lists.freedesktop.org
Cc: Tahvanainen, Jari <jari.tahvanainen@intel.com>; Mika Kuoppala <mika.kuoppala@linux.intel.com>
Subject: Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests

Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear 
> that the GPU died following the reset. However, during test teardown 
> we still wait for the GPU to idle before continuing, but we have 
> already confirmed that the GPU is dead. Furthermore, since we are 
> inside a reset test, we have disabled the hangchecker, and so there is 
> no safety net and we wait indefinitely. Detect the stuck GPU and 
> declare it wedged as a state of emergency so we can escape.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

>Ping?

Sorry for the late answer, Chris. I tried to get in touch with you earlier through IRC.
I merged the series on top of drm-tip and ran it on HSW: the machine no longer hangs, and the subtest now reports FAIL.

(drv_selftest:6304) igt-kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file igt_kmod.c:513:
(drv_selftest:6304) igt-kmod-CRITICAL: Failed assertion: err == 0
(drv_selftest:6304) igt-kmod-CRITICAL: kselftest "i915 igt__19__live_hangcheck=1 live_selftests=-1" failed: Input/output error [5]
(drv_selftest:6304) igt-core-INFO: Stack trace:
(drv_selftest:6304) igt-core-INFO:   #0 [__igt_fail_assert+0x101]
(drv_selftest:6304) igt-core-INFO:   #1 [igt_kselftest_execute+0x296]
(drv_selftest:6304) igt-core-INFO:   #2 [igt_kselftests+0x295]
(drv_selftest:6304) igt-core-INFO:   #3 [main+0x5f]
(drv_selftest:6304) igt-core-INFO:   #4 [__libc_start_main+0xf1]
(drv_selftest:6304) igt-core-INFO:   #5 [_start+0x2a]
(drv_selftest:6304) igt-core-INFO:   #6 [<unknown>+0x2a]
****  END  ****
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [igt_kselftest_execute+0x296]
  #2 [igt_kselftests+0x295]
  #3 [main+0x5f]
  #4 [__libc_start_main+0xf1]
  #5 [_start+0x2a]
  #6 [<unknown>+0x2a]
Subtest live_hangcheck: FAIL (1.911s)


* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-19 14:24   ` Tahvanainen, Jari
@ 2017-09-19 14:33     ` Chris Wilson
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-19 14:33 UTC (permalink / raw)
  To: Tahvanainen, Jari, intel-gfx@lists.freedesktop.org

Quoting Tahvanainen, Jari (2017-09-19 15:24:22)
> -----Original Message-----
> From: Chris Wilson [mailto:chris@chris-wilson.co.uk] 
> Sent: Tuesday, September 19, 2017 5:19 PM
> To: intel-gfx@lists.freedesktop.org
> Cc: Tahvanainen, Jari <jari.tahvanainen@intel.com>; Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Subject: Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
> 
> Quoting Chris Wilson (2017-09-15 14:09:29)
> > If we see the seqno stop progressing, we abandon the test for fear 
> > that the GPU died following the reset. However, during test teardown 
> > we still wait for the GPU to idle before continuing, but we have 
> > already confirmed that the GPU is dead. Furthermore, since we are 
> > inside a reset test, we have disabled the hangchecker, and so there is 
> > no safety net and we wait indefinitely. Detect the stuck GPU and 
> > declare it wedged as a state of emergency so we can escape.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 
> >Ping?
> 
> Sorry for the late answer, Chris. I tried to get in touch with you earlier through IRC.
> I merged the series on top of drm-tip and ran it on HSW: the machine no longer hangs, and the subtest now reports FAIL.
> 
> (drv_selftest:6304) igt-kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file igt_kmod.c:513:
> (drv_selftest:6304) igt-kmod-CRITICAL: Failed assertion: err == 0
> (drv_selftest:6304) igt-kmod-CRITICAL: kselftest "i915 igt__19__live_hangcheck=1 live_selftests=-1" failed: Input/output error [5]
> (drv_selftest:6304) igt-core-INFO: Stack trace:
> (drv_selftest:6304) igt-core-INFO:   #0 [__igt_fail_assert+0x101]
> (drv_selftest:6304) igt-core-INFO:   #1 [igt_kselftest_execute+0x296]
> (drv_selftest:6304) igt-core-INFO:   #2 [igt_kselftests+0x295]
> (drv_selftest:6304) igt-core-INFO:   #3 [main+0x5f]
> (drv_selftest:6304) igt-core-INFO:   #4 [__libc_start_main+0xf1]
> (drv_selftest:6304) igt-core-INFO:   #5 [_start+0x2a]
> (drv_selftest:6304) igt-core-INFO:   #6 [<unknown>+0x2a]
> ****  END  ****
> Stack trace:
>   #0 [__igt_fail_assert+0x101]
>   #1 [igt_kselftest_execute+0x296]
>   #2 [igt_kselftests+0x295]
>   #3 [main+0x5f]
>   #4 [__libc_start_main+0xf1]
>   #5 [_start+0x2a]
>   #6 [<unknown>+0x2a]
> Subtest live_hangcheck: FAIL (1.911s)

That's what it is meant to do; stop the fail from freezing the machine.
I'll take that as a
Tested-by: Jari Tahvanainen <jari.tahvanainen@intel.com>
-Chris

* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
                   ` (3 preceding siblings ...)
  2017-09-19 14:18 ` [PATCH] " Chris Wilson
@ 2017-09-25 12:01 ` Chris Wilson
  2017-09-26 12:48 ` Mika Kuoppala
  5 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-25 12:01 UTC (permalink / raw)
  To: intel-gfx; +Cc: Jari Tahvanainen

Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Ping? We now have CI coverage of kselftests!
-Chris

> ---
>  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
>                 *batch++ = lower_32_bits(vma->node.start);
>         }
>         *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> +       wmb();
>  
>         flags = 0;
>         if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
>         __i915_add_request(rq, true);
>  
>         if (!wait_for_hang(&h, rq)) {
> -               pr_err("Failed to start request %x\n", rq->fence.seqno);
> +               pr_err("Failed to start request %x, at %x\n",
> +                      rq->fence.seqno, hws_seqno(&h, rq));
> +
> +               i915_reset(i915, 0);
> +               i915_gem_set_wedged(i915);
> +
>                 err = -EIO;
>                 goto out_rq;
>         }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
>                         __i915_add_request(rq, true);
>  
>                         if (!wait_for_hang(&h, prev)) {
> -                               pr_err("Failed to start request %x\n",
> -                                      prev->fence.seqno);
> +                               pr_err("Failed to start request %x, at %x\n",
> +                                      rq->fence.seqno, hws_seqno(&h, rq));
>                                 i915_gem_request_put(rq);
>                                 i915_gem_request_put(prev);
> +
> +                               i915_reset(i915, 0);
> +                               i915_gem_set_wedged(i915);
> +
>                                 err = -EIO;
>                                 goto fini;
>                         }
> @@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
>         __i915_add_request(rq, true);
>  
>         if (!wait_for_hang(&h, rq)) {
> -               pr_err("Failed to start request %x\n", rq->fence.seqno);
> +               pr_err("Failed to start request %x, at %x\n",
> +                      rq->fence.seqno, hws_seqno(&h, rq));
> +
> +               i915_reset(i915, 0);
> +               i915_gem_set_wedged(i915);
> +
>                 err = -EIO;
>                 goto err_request;
>         }
> @@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
>  int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
>  {
>         static const struct i915_subtest tests[] = {
> +               SUBTEST(igt_global_reset), /* attempt to recover GPU first */
>                 SUBTEST(igt_hang_sanitycheck),
> -               SUBTEST(igt_global_reset),
>                 SUBTEST(igt_reset_engine),
>                 SUBTEST(igt_reset_active_engines),
>                 SUBTEST(igt_wait_reset),
> -- 
> 2.14.1
> 

* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
                   ` (4 preceding siblings ...)
  2017-09-25 12:01 ` Chris Wilson
@ 2017-09-26 12:48 ` Mika Kuoppala
  2017-09-26 13:03   ` Chris Wilson
  5 siblings, 1 reply; 11+ messages in thread
From: Mika Kuoppala @ 2017-09-26 12:48 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Jari Tahvanainen

Chris Wilson <chris@chris-wilson.co.uk> writes:

> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
>  		*batch++ = lower_32_bits(vma->node.start);
>  	}
>  	*batch++ = MI_BATCH_BUFFER_END; /* not reached */
> +	wmb();
>

Why not the big hammer with i915_gem_chipset_flush() here?

>  	flags = 0;
>  	if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
>  	__i915_add_request(rq, true);
>  
>  	if (!wait_for_hang(&h, rq)) {
> -		pr_err("Failed to start request %x\n", rq->fence.seqno);
> +		pr_err("Failed to start request %x, at %x\n",
> +		       rq->fence.seqno, hws_seqno(&h, rq));
> +
> +		i915_reset(i915, 0);
> +		i915_gem_set_wedged(i915);
> +
>  		err = -EIO;
>  		goto out_rq;
>  	}
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
>  			__i915_add_request(rq, true);
>  
>  			if (!wait_for_hang(&h, prev)) {
> -				pr_err("Failed to start request %x\n",
> -				       prev->fence.seqno);
> +				pr_err("Failed to start request %x, at %x\n",
> +				       rq->fence.seqno, hws_seqno(&h, rq));

As you pointed out, the debug message here is for the wrong request.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

>  				i915_gem_request_put(rq);
>  				i915_gem_request_put(prev);
> +
> +				i915_reset(i915, 0);
> +				i915_gem_set_wedged(i915);
> +
>  				err = -EIO;
>  				goto fini;
>  			}
> @@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
>  	__i915_add_request(rq, true);
>  
>  	if (!wait_for_hang(&h, rq)) {
> -		pr_err("Failed to start request %x\n", rq->fence.seqno);
> +		pr_err("Failed to start request %x, at %x\n",
> +		       rq->fence.seqno, hws_seqno(&h, rq));
> +
> +		i915_reset(i915, 0);
> +		i915_gem_set_wedged(i915);
> +
>  		err = -EIO;
>  		goto err_request;
>  	}
> @@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
>  int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
>  {
>  	static const struct i915_subtest tests[] = {
> +		SUBTEST(igt_global_reset), /* attempt to recover GPU first */
>  		SUBTEST(igt_hang_sanitycheck),
> -		SUBTEST(igt_global_reset),
>  		SUBTEST(igt_reset_engine),
>  		SUBTEST(igt_reset_active_engines),
>  		SUBTEST(igt_wait_reset),
> -- 
> 2.14.1
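
On the wmb() added after the batch is written: the intent is that every store into the batch is ordered before whatever later makes the batch visible to its consumer. A rough userspace analogue, assuming C11 atomics, is a release fence between the buffer writes and the publishing store. This is only a sketch of CPU-side store ordering; it does not model device visibility or anything a heavier chipset flush might add. struct fake_ring, emit_batch(), consume_batch() and FAKE_BATCH_END are hypothetical names and values chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BATCH_WORDS 3
#define FAKE_BATCH_END 0xdeadbeefu /* placeholder, not the real opcode */

struct fake_ring {
	uint32_t batch[BATCH_WORDS]; /* plain stores, like *batch++ = ... */
	atomic_bool submitted;       /* publication flag read by the consumer */
};

static void fake_ring_init(struct fake_ring *ring)
{
	memset(ring->batch, 0, sizeof(ring->batch));
	atomic_init(&ring->submitted, false);
}

/* Write the batch, then order those writes before publishing it. */
static void emit_batch(struct fake_ring *ring, uint32_t addr)
{
	uint32_t *batch = ring->batch;

	*batch++ = 0x1; /* placeholder command word */
	*batch++ = addr;
	*batch++ = FAKE_BATCH_END;

	/* Plays the role of the barrier: batch stores may not sink below. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ring->submitted, true, memory_order_relaxed);
}

/* Copy the batch out only once it has been published. */
static bool consume_batch(struct fake_ring *ring, uint32_t out[BATCH_WORDS])
{
	if (!atomic_load_explicit(&ring->submitted, memory_order_relaxed))
		return false;

	/* Pairs with the release fence in emit_batch(). */
	atomic_thread_fence(memory_order_acquire);
	memcpy(out, ring->batch, sizeof(ring->batch));
	return true;
}
```

A consumer that observes submitted as true and then issues the acquire fence is guaranteed to read the complete batch, including the terminating word; without the release fence the publishing store could become visible before the batch contents.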

* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-26 12:48 ` Mika Kuoppala
@ 2017-09-26 13:03   ` Chris Wilson
  2017-09-26 13:36     ` Mika Kuoppala
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-09-26 13:03 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx; +Cc: Jari Tahvanainen

Quoting Mika Kuoppala (2017-09-26 13:48:17)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > If we see the seqno stop progressing, we abandon the test for fear that
> > the GPU died following the reset. However, during test teardown we still
> > wait for the GPU to idle before continuing, but we have already
> > confirmed that the GPU is dead. Furthermore, since we are inside a reset
> > test, we have disabled the hangchecker, and so there is no safety net and
> > we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> > state of emergency so we can escape.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> >  1 file changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > index 02e52a146ed8..913fe752f6b4 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> >               *batch++ = lower_32_bits(vma->node.start);
> >       }
> >       *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> > +     wmb();
> >
> 
> Why not the big hammer with i915_gem_chipset_flush() here?

It didn't cross my mind, I was just doodling :)

> 
> >       flags = 0;
> >       if (INTEL_GEN(vm->i915) <= 5)
> > @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> >       __i915_add_request(rq, true);
> >  
> >       if (!wait_for_hang(&h, rq)) {
> > -             pr_err("Failed to start request %x\n", rq->fence.seqno);
> > +             pr_err("Failed to start request %x, at %x\n",
> > +                    rq->fence.seqno, hws_seqno(&h, rq));
> > +
> > +             i915_reset(i915, 0);
> > +             i915_gem_set_wedged(i915);
> > +
> >               err = -EIO;
> >               goto out_rq;
> >       }
> > @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> >                       __i915_add_request(rq, true);
> >  
> >                       if (!wait_for_hang(&h, prev)) {
> > -                             pr_err("Failed to start request %x\n",
> > -                                    prev->fence.seqno);
> > +                             pr_err("Failed to start request %x, at %x\n",
> > +                                    rq->fence.seqno, hws_seqno(&h, rq));
> 
> As you pointed out the debug in here is for wrong request.
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Happy if I drop the wmb() for a later patch and replace it with a
chipset flush instead?
-Chris

* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
  2017-09-26 13:03   ` Chris Wilson
@ 2017-09-26 13:36     ` Mika Kuoppala
  0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-09-26 13:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Jari Tahvanainen

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2017-09-26 13:48:17)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > If we see the seqno stop progressing, we abandon the test for fear that
>> > the GPU died following the reset. However, during test teardown we still
>> > wait for the GPU to idle before continuing, but we have already
>> > confirmed that the GPU is dead. Furthermore, since we are inside a reset
>> > test, we have disabled the hangchecker, and so there is no safety net and
>> > we wait indefinitely. Detect the stuck GPU and declare it wedged as a
>> > state of emergency so we can escape.
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
>> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> > ---
>> >  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
>> >  1 file changed, 20 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > index 02e52a146ed8..913fe752f6b4 100644
>> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
>> >               *batch++ = lower_32_bits(vma->node.start);
>> >       }
>> >       *batch++ = MI_BATCH_BUFFER_END; /* not reached */
>> > +     wmb();
>> >
>> 
>> Why not the big hammer with i915_gem_chipset_flush() here?
>
> It didn't cross my mind, I was just doodling :)
>
>> 
>> >       flags = 0;
>> >       if (INTEL_GEN(vm->i915) <= 5)
>> > @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
>> >       __i915_add_request(rq, true);
>> >  
>> >       if (!wait_for_hang(&h, rq)) {
>> > -             pr_err("Failed to start request %x\n", rq->fence.seqno);
>> > +             pr_err("Failed to start request %x, at %x\n",
>> > +                    rq->fence.seqno, hws_seqno(&h, rq));
>> > +
>> > +             i915_reset(i915, 0);
>> > +             i915_gem_set_wedged(i915);
>> > +
>> >               err = -EIO;
>> >               goto out_rq;
>> >       }
>> > @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
>> >                       __i915_add_request(rq, true);
>> >  
>> >                       if (!wait_for_hang(&h, prev)) {
>> > -                             pr_err("Failed to start request %x\n",
>> > -                                    prev->fence.seqno);
>> > +                             pr_err("Failed to start request %x, at %x\n",
>> > +                                    rq->fence.seqno, hws_seqno(&h, rq));
>> 
>> As you pointed out the debug in here is for wrong request.
>> 
>> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>
> Happy if I drop the wmb() for a later patch and replace it with a
> chipset flush instead?

Will be happy.
-Mika

> -Chris

end of thread, other threads:[~2017-09-26 13:39 UTC | newest]
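
For readers following the thread, the escape hatch described in the commit message can be sketched in user space roughly as below. This is only an illustration under stated assumptions, not the i915 code: struct mock_gpu and every mock_* function are hypothetical stand-ins, the 100-poll limit is arbitrary, and the sketch models only the "declare wedged" step, not the i915_reset() attempt that precedes it in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for GPU state; not the i915 structures. */
struct mock_gpu {
	uint32_t hw_seqno;	/* last seqno the "hardware" reported */
	uint32_t step;		/* seqno advance per poll; 0 models a dead GPU */
	bool wedged;		/* set once we give up on the hardware */
};

/* One poll of the (mock) hardware status page. */
static uint32_t mock_read_seqno(struct mock_gpu *gpu)
{
	gpu->hw_seqno += gpu->step;
	return gpu->hw_seqno;
}

/* Bounded wait: did the seqno reach target within max_polls polls? */
static bool mock_wait_for_start(struct mock_gpu *gpu, uint32_t target,
				unsigned int max_polls)
{
	unsigned int i;

	assert(max_polls > 0);
	for (i = 0; i < max_polls; i++) {
		if ((int32_t)(mock_read_seqno(gpu) - target) >= 0)
			return true;
	}
	return false;
}

/*
 * Teardown wait with no timeout of its own. With a dead GPU and no
 * wedged check, this loop would never exit; the wedged flag is the
 * only way out.
 */
static int mock_wait_for_idle(struct mock_gpu *gpu, uint32_t last)
{
	for (;;) {
		if (gpu->wedged)
			return -EIO;
		if ((int32_t)(mock_read_seqno(gpu) - last) >= 0)
			return 0;
	}
}

/* Test body: on a stuck seqno, declare the GPU wedged before teardown. */
static int mock_run_test(struct mock_gpu *gpu, uint32_t target)
{
	int err = 0;
	int idle;

	if (!mock_wait_for_start(gpu, target, 100)) {
		gpu->wedged = true;	/* state of emergency */
		err = -EIO;
	}

	/* Teardown always waits for idle, even after a failure. */
	idle = mock_wait_for_idle(gpu, target);
	return err ? err : idle;
}
```

With step set to 1 the mock test completes normally; with step set to 0 it fails with -EIO and, because the wedged flag is set before teardown, the idle wait returns instead of spinning forever.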

Thread overview: 11+ messages
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
2017-09-15 13:29 ` Chris Wilson
2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
2017-09-19 14:18 ` [PATCH] " Chris Wilson
2017-09-19 14:24   ` Tahvanainen, Jari
2017-09-19 14:33     ` Chris Wilson
2017-09-25 12:01 ` Chris Wilson
2017-09-26 12:48 ` Mika Kuoppala
2017-09-26 13:03   ` Chris Wilson
2017-09-26 13:36     ` Mika Kuoppala
