* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
@ 2017-09-15 13:29 ` Chris Wilson
2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
` (4 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-15 13:29 UTC (permalink / raw)
To: intel-gfx; +Cc: Jari Tahvanainen
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> *batch++ = lower_32_bits(vma->node.start);
> }
> *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> + wmb();
>
> flags = 0;
> if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto out_rq;
> }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, prev)) {
> - pr_err("Failed to start request %x\n",
> - prev->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
Odd one out needs s/rq/prev/
-Chris
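
For readers following along, the s/rq/prev/ substitution can be illustrated with a small userspace sketch. It is not kernel code: `struct mock_request` and `format_hang_failure()` are hypothetical stand-ins, and the `hws_seqno` field merely models the value that `hws_seqno(&h, ...)` would read back. The only point is that the message should describe the request that was actually waited on (`prev` in igt_reset_queue).

```c
#include <stdio.h>

/* Hypothetical stand-in for a request; not the kernel structure. */
struct mock_request {
	unsigned int seqno;     /* models rq->fence.seqno */
	unsigned int hws_seqno; /* models what hws_seqno(&h, rq) would report */
};

/*
 * Format the failure message for the request that was waited on.
 * The caller passes the same request it handed to the wait, so the
 * log line and the wait cannot disagree.
 */
static int format_hang_failure(char *buf, size_t len,
			       const struct mock_request *waited)
{
	return snprintf(buf, len, "Failed to start request %x, at %x\n",
			waited->seqno, waited->hws_seqno);
}
```

Passing `&prev` rather than `&rq` to such a helper mirrors the intended correction to the igt_reset_queue hunk.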
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
* ✓ Fi.CI.BAT: success for drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
2017-09-15 13:29 ` Chris Wilson
@ 2017-09-15 13:31 ` Patchwork
2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-09-15 13:31 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915/selftests: Try to recover from a wedged GPU during reset tests
URL : https://patchwork.freedesktop.org/series/30419/
State : success
== Summary ==
Series 30419v1 drm/i915/selftests: Try to recover from a wedged GPU during reset tests
https://patchwork.freedesktop.org/api/1.0/series/30419/revisions/1/mbox/
Test chamelium:
Subgroup dp-crc-fast:
pass -> FAIL (fi-kbl-7500u) fdo#102514
Test kms_cursor_legacy:
Subgroup basic-busy-flip-before-cursor-atomic:
fail -> PASS (fi-snb-2600) fdo#100215 +1
Test pm_rpm:
Subgroup basic-rte:
pass -> DMESG-WARN (fi-cfl-s) fdo#102294
fdo#102514 https://bugs.freedesktop.org/show_bug.cgi?id=102514
fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
fdo#102294 https://bugs.freedesktop.org/show_bug.cgi?id=102294
fi-bdw-5557u total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:444s
fi-bdw-gvtdvm total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:456s
fi-blb-e6850 total:289 pass:224 dwarn:1 dfail:0 fail:0 skip:64 time:379s
fi-bsw-n3050 total:289 pass:243 dwarn:0 dfail:0 fail:0 skip:46 time:528s
fi-bwr-2160 total:289 pass:184 dwarn:0 dfail:0 fail:0 skip:105 time:268s
fi-bxt-j4205 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:504s
fi-byt-j1900 total:289 pass:254 dwarn:1 dfail:0 fail:0 skip:34 time:504s
fi-byt-n2820 total:289 pass:250 dwarn:1 dfail:0 fail:0 skip:38 time:493s
fi-cfl-s total:289 pass:222 dwarn:35 dfail:0 fail:0 skip:32 time:543s
fi-elk-e7500 total:289 pass:230 dwarn:0 dfail:0 fail:0 skip:59 time:414s
fi-glk-2a total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:600s
fi-hsw-4770 total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:429s
fi-hsw-4770r total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:408s
fi-ilk-650 total:289 pass:229 dwarn:0 dfail:0 fail:0 skip:60 time:436s
fi-ivb-3520m total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:484s
fi-ivb-3770 total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:468s
fi-kbl-7500u total:289 pass:263 dwarn:1 dfail:0 fail:1 skip:24 time:485s
fi-kbl-7560u total:289 pass:270 dwarn:0 dfail:0 fail:0 skip:19 time:586s
fi-kbl-r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:588s
fi-pnv-d510 total:289 pass:223 dwarn:1 dfail:0 fail:0 skip:65 time:553s
fi-skl-6260u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:458s
fi-skl-6700k total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:521s
fi-skl-6770hq total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:493s
fi-skl-gvtdvm total:289 pass:266 dwarn:0 dfail:0 fail:0 skip:23 time:457s
fi-skl-x1585l total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:474s
fi-snb-2520m total:289 pass:251 dwarn:0 dfail:0 fail:0 skip:38 time:570s
fi-snb-2600 total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:433s
9adc9e93d6243c82bcefd175c2d11770802de194 drm-tip: 2017y-09m-15d-11h-44m-46s UTC integration manifest
d69539e5cad5 drm/i915/selftests: Try to recover from a wedged GPU during reset tests
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5712/
* ✓ Fi.CI.IGT: success for drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
2017-09-15 13:29 ` Chris Wilson
2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2017-09-15 15:04 ` Patchwork
2017-09-19 14:18 ` [PATCH] " Chris Wilson
` (2 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-09-15 15:04 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915/selftests: Try to recover from a wedged GPU during reset tests
URL : https://patchwork.freedesktop.org/series/30419/
State : success
== Summary ==
Test perf:
Subgroup polling:
pass -> FAIL (shard-hsw) fdo#102252 +1
fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252
shard-hsw total:2313 pass:1245 dwarn:0 dfail:0 fail:13 skip:1055 time:9350s
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5712/shards.html
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
` (2 preceding siblings ...)
2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
@ 2017-09-19 14:18 ` Chris Wilson
2017-09-19 14:24 ` Tahvanainen, Jari
2017-09-25 12:01 ` Chris Wilson
2017-09-26 12:48 ` Mika Kuoppala
5 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-09-19 14:18 UTC (permalink / raw)
To: intel-gfx; +Cc: Jari Tahvanainen
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Ping?
> ---
> drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> *batch++ = lower_32_bits(vma->node.start);
> }
> *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> + wmb();
>
> flags = 0;
> if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto out_rq;
> }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, prev)) {
> - pr_err("Failed to start request %x\n",
> - prev->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> i915_gem_request_put(rq);
> i915_gem_request_put(prev);
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto fini;
> }
> @@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto err_request;
> }
> @@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
> int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
> {
> static const struct i915_subtest tests[] = {
> + SUBTEST(igt_global_reset), /* attempt to recover GPU first */
> SUBTEST(igt_hang_sanitycheck),
> - SUBTEST(igt_global_reset),
> SUBTEST(igt_reset_engine),
> SUBTEST(igt_reset_active_engines),
> SUBTEST(igt_wait_reset),
> --
> 2.14.1
>
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-19 14:18 ` [PATCH] " Chris Wilson
@ 2017-09-19 14:24 ` Tahvanainen, Jari
2017-09-19 14:33 ` Chris Wilson
0 siblings, 1 reply; 11+ messages in thread
From: Tahvanainen, Jari @ 2017-09-19 14:24 UTC (permalink / raw)
To: Chris Wilson, intel-gfx@lists.freedesktop.org
-----Original Message-----
From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
Sent: Tuesday, September 19, 2017 5:19 PM
To: intel-gfx@lists.freedesktop.org
Cc: Tahvanainen, Jari <jari.tahvanainen@intel.com>; Mika Kuoppala <mika.kuoppala@linux.intel.com>
Subject: Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear
> that the GPU died following the reset. However, during test teardown
> we still wait for the GPU to idle before continuing, but we have
> already confirmed that the GPU is dead. Furthermore, since we are
> inside a reset test, we have disabled the hangchecker, and so there is
> no safety net and we wait indefinitely. Detect the stuck GPU and
> declare it wedged as a state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>Ping?
Sorry, Chris, for the late answer. I tried to get in touch with you earlier through IRC.
I merged the series on top of drm-tip and ran it on HSW: no hang anymore, but the test now ends in FAIL.
(drv_selftest:6304) igt-kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file igt_kmod.c:513:
(drv_selftest:6304) igt-kmod-CRITICAL: Failed assertion: err == 0
(drv_selftest:6304) igt-kmod-CRITICAL: kselftest "i915 igt__19__live_hangcheck=1 live_selftests=-1" failed: Input/output error [5]
(drv_selftest:6304) igt-core-INFO: Stack trace:
(drv_selftest:6304) igt-core-INFO: #0 [__igt_fail_assert+0x101]
(drv_selftest:6304) igt-core-INFO: #1 [igt_kselftest_execute+0x296]
(drv_selftest:6304) igt-core-INFO: #2 [igt_kselftests+0x295]
(drv_selftest:6304) igt-core-INFO: #3 [main+0x5f]
(drv_selftest:6304) igt-core-INFO: #4 [__libc_start_main+0xf1]
(drv_selftest:6304) igt-core-INFO: #5 [_start+0x2a]
(drv_selftest:6304) igt-core-INFO: #6 [<unknown>+0x2a]
**** END ****
Stack trace:
#0 [__igt_fail_assert+0x101]
#1 [igt_kselftest_execute+0x296]
#2 [igt_kselftests+0x295]
#3 [main+0x5f]
#4 [__libc_start_main+0xf1]
#5 [_start+0x2a]
#6 [<unknown>+0x2a]
Subtest live_hangcheck: FAIL (1.911s)
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-19 14:24 ` Tahvanainen, Jari
@ 2017-09-19 14:33 ` Chris Wilson
0 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-19 14:33 UTC (permalink / raw)
To: Tahvanainen, Jari, intel-gfx@lists.freedesktop.org
Quoting Tahvanainen, Jari (2017-09-19 15:24:22)
> -----Original Message-----
> From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> Sent: Tuesday, September 19, 2017 5:19 PM
> To: intel-gfx@lists.freedesktop.org
> Cc: Tahvanainen, Jari <jari.tahvanainen@intel.com>; Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Subject: Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
>
> Quoting Chris Wilson (2017-09-15 14:09:29)
> > If we see the seqno stop progressing, we abandon the test for fear
> > that the GPU died following the reset. However, during test teardown
> > we still wait for the GPU to idle before continuing, but we have
> > already confirmed that the GPU is dead. Furthermore, since we are
> > inside a reset test, we have disabled the hangchecker, and so there is
> > no safety net and we wait indefinitely. Detect the stuck GPU and
> > declare it wedged as a state of emergency so we can escape.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>
> >Ping?
>
> Sorry, Chris, for the late answer. I tried to get in touch with you earlier through IRC.
> I merged the series on top of drm-tip and ran it on HSW: no hang anymore, but the test now ends in FAIL.
>
> (drv_selftest:6304) igt-kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file igt_kmod.c:513:
> (drv_selftest:6304) igt-kmod-CRITICAL: Failed assertion: err == 0
> (drv_selftest:6304) igt-kmod-CRITICAL: kselftest "i915 igt__19__live_hangcheck=1 live_selftests=-1" failed: Input/output error [5]
> (drv_selftest:6304) igt-core-INFO: Stack trace:
> (drv_selftest:6304) igt-core-INFO: #0 [__igt_fail_assert+0x101]
> (drv_selftest:6304) igt-core-INFO: #1 [igt_kselftest_execute+0x296]
> (drv_selftest:6304) igt-core-INFO: #2 [igt_kselftests+0x295]
> (drv_selftest:6304) igt-core-INFO: #3 [main+0x5f]
> (drv_selftest:6304) igt-core-INFO: #4 [__libc_start_main+0xf1]
> (drv_selftest:6304) igt-core-INFO: #5 [_start+0x2a]
> (drv_selftest:6304) igt-core-INFO: #6 [<unknown>+0x2a]
> **** END ****
> Stack trace:
> #0 [__igt_fail_assert+0x101]
> #1 [igt_kselftest_execute+0x296]
> #2 [igt_kselftests+0x295]
> #3 [main+0x5f]
> #4 [__libc_start_main+0xf1]
> #5 [_start+0x2a]
> #6 [<unknown>+0x2a]
> Subtest live_hangcheck: FAIL (1.911s)
That's what it is meant to do: stop the failure from freezing the machine.
I'll take that as a
Tested-by: Jari Tahvanainen <jari.tahvanainen@intel.com>
-Chris
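
The behaviour described above (report -EIO instead of hanging forever) can be sketched in userspace C. This is only a model under stated assumptions: `struct mock_gpu`, `mock_wait_for_hang()`, `mock_reset_subtest()` and `mock_teardown_would_block()` are hypothetical names, the poll limit of 10 is arbitrary, and setting `wedged` merely stands in for what `i915_gem_set_wedged()` does in the patch. Seqno wraparound is ignored for brevity.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the GPU state the selftest cares about. */
struct mock_gpu {
	unsigned int seqno;            /* last seqno the "hardware" reported */
	unsigned int advance_per_poll; /* 0 models a dead GPU */
	bool wedged;                   /* set once we give up on the GPU */
};

/* Bounded wait: poll a fixed number of times instead of waiting forever. */
static bool mock_wait_for_hang(struct mock_gpu *gpu, unsigned int target,
			       int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (gpu->seqno >= target)
			return true;
		gpu->seqno += gpu->advance_per_poll;
	}
	return gpu->seqno >= target;
}

/* If the seqno never progresses, declare the GPU wedged and fail with -EIO. */
static int mock_reset_subtest(struct mock_gpu *gpu, unsigned int target)
{
	if (!mock_wait_for_hang(gpu, target, 10)) {
		gpu->wedged = true; /* stands in for i915_gem_set_wedged() */
		return -EIO;
	}
	return 0;
}

/* Teardown only blocks on an unfinished GPU that has not been wedged. */
static bool mock_teardown_would_block(const struct mock_gpu *gpu,
				      unsigned int target)
{
	return !gpu->wedged && gpu->seqno < target;
}
```

With a dead GPU (`advance_per_poll == 0`), the subtest in this model returns -EIO and teardown no longer blocks, which corresponds to the "Input/output error [5]" FAIL in the log above rather than a frozen machine.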
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
` (3 preceding siblings ...)
2017-09-19 14:18 ` [PATCH] " Chris Wilson
@ 2017-09-25 12:01 ` Chris Wilson
2017-09-26 12:48 ` Mika Kuoppala
5 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-25 12:01 UTC (permalink / raw)
To: intel-gfx; +Cc: Jari Tahvanainen
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Ping? We now have CI coverage of kselftests!
-Chris
> ---
> drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> *batch++ = lower_32_bits(vma->node.start);
> }
> *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> + wmb();
>
> flags = 0;
> if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto out_rq;
> }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, prev)) {
> - pr_err("Failed to start request %x\n",
> - prev->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> i915_gem_request_put(rq);
> i915_gem_request_put(prev);
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto fini;
> }
> @@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto err_request;
> }
> @@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
> int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
> {
> static const struct i915_subtest tests[] = {
> + SUBTEST(igt_global_reset), /* attempt to recover GPU first */
> SUBTEST(igt_hang_sanitycheck),
> - SUBTEST(igt_global_reset),
> SUBTEST(igt_reset_engine),
> SUBTEST(igt_reset_active_engines),
> SUBTEST(igt_wait_reset),
> --
> 2.14.1
>
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
` (4 preceding siblings ...)
2017-09-25 12:01 ` Chris Wilson
@ 2017-09-26 12:48 ` Mika Kuoppala
2017-09-26 13:03 ` Chris Wilson
5 siblings, 1 reply; 11+ messages in thread
From: Mika Kuoppala @ 2017-09-26 12:48 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Jari Tahvanainen
Chris Wilson <chris@chris-wilson.co.uk> writes:
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> *batch++ = lower_32_bits(vma->node.start);
> }
> *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> + wmb();
>
Why not the big hammer with i915_gem_chipset_flush() here?
> flags = 0;
> if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto out_rq;
> }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, prev)) {
> - pr_err("Failed to start request %x\n",
> - prev->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
As you pointed out, the debug message here is for the wrong request.
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> i915_gem_request_put(rq);
> i915_gem_request_put(prev);
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto fini;
> }
> @@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto err_request;
> }
> @@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
> int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
> {
> static const struct i915_subtest tests[] = {
> + SUBTEST(igt_global_reset), /* attempt to recover GPU first */
> SUBTEST(igt_hang_sanitycheck),
> - SUBTEST(igt_global_reset),
> SUBTEST(igt_reset_engine),
> SUBTEST(igt_reset_active_engines),
> SUBTEST(igt_wait_reset),
> --
> 2.14.1
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-26 12:48 ` Mika Kuoppala
@ 2017-09-26 13:03 ` Chris Wilson
2017-09-26 13:36 ` Mika Kuoppala
0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-09-26 13:03 UTC (permalink / raw)
To: Mika Kuoppala, intel-gfx; +Cc: Jari Tahvanainen
Quoting Mika Kuoppala (2017-09-26 13:48:17)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
> > If we see the seqno stop progressing, we abandon the test for fear that
> > the GPU died following the reset. However, during test teardown we still
> > wait for the GPU to idle before continuing, but we have already
> > confirmed that the GPU is dead. Furthermore, since we are inside a reset
> > test, we have disabled the hangchecker, and so there is no safety net and
> > we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> > state of emergency so we can escape.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> > 1 file changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > index 02e52a146ed8..913fe752f6b4 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> > *batch++ = lower_32_bits(vma->node.start);
> > }
> > *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> > + wmb();
> >
>
> Why not the big hammer with i915_gem_chipset_flush() here?
It didn't cross my mind, I was just doodling :)
>
> > flags = 0;
> > if (INTEL_GEN(vm->i915) <= 5)
> > @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> > __i915_add_request(rq, true);
> >
> > if (!wait_for_hang(&h, rq)) {
> > - pr_err("Failed to start request %x\n", rq->fence.seqno);
> > + pr_err("Failed to start request %x, at %x\n",
> > + rq->fence.seqno, hws_seqno(&h, rq));
> > +
> > + i915_reset(i915, 0);
> > + i915_gem_set_wedged(i915);
> > +
> > err = -EIO;
> > goto out_rq;
> > }
> > @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> > __i915_add_request(rq, true);
> >
> > if (!wait_for_hang(&h, prev)) {
> > - pr_err("Failed to start request %x\n",
> > - prev->fence.seqno);
> > + pr_err("Failed to start request %x, at %x\n",
> > + rq->fence.seqno, hws_seqno(&h, rq));
>
> As you pointed out, the debug message here is for the wrong request.
>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Happy if I drop the wmb() for a later patch and replace it with a
chipset flush instead?
-Chris
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-26 13:03 ` Chris Wilson
@ 2017-09-26 13:36 ` Mika Kuoppala
0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-09-26 13:36 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: Jari Tahvanainen
Chris Wilson <chris@chris-wilson.co.uk> writes:
> Quoting Mika Kuoppala (2017-09-26 13:48:17)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>>
>> > If we see the seqno stop progressing, we abandon the test for fear that
>> > the GPU died following the reset. However, during test teardown we still
>> > wait for the GPU to idle before continuing, but we have already
>> > confirmed that the GPU is dead. Furthermore, since we are inside a reset
>> > test, we have disabled the hangchecker, and so there is no safety net and
>> > we wait indefinitely. Detect the stuck GPU and declare it wedged as a
>> > state of emergency so we can escape.
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
>> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> > ---
>> > drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
>> > 1 file changed, 20 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > index 02e52a146ed8..913fe752f6b4 100644
>> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
>> > *batch++ = lower_32_bits(vma->node.start);
>> > }
>> > *batch++ = MI_BATCH_BUFFER_END; /* not reached */
>> > + wmb();
>> >
>>
>> Why not the big hammer with i915_gem_chipset_flush() here?
>
> It didn't cross my mind, I was just doodling :)
>
>>
>> > flags = 0;
>> > if (INTEL_GEN(vm->i915) <= 5)
>> > @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
>> > __i915_add_request(rq, true);
>> >
>> > if (!wait_for_hang(&h, rq)) {
>> > - pr_err("Failed to start request %x\n", rq->fence.seqno);
>> > + pr_err("Failed to start request %x, at %x\n",
>> > + rq->fence.seqno, hws_seqno(&h, rq));
>> > +
>> > + i915_reset(i915, 0);
>> > + i915_gem_set_wedged(i915);
>> > +
>> > err = -EIO;
>> > goto out_rq;
>> > }
>> > @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
>> > __i915_add_request(rq, true);
>> >
>> > if (!wait_for_hang(&h, prev)) {
>> > - pr_err("Failed to start request %x\n",
>> > - prev->fence.seqno);
>> > + pr_err("Failed to start request %x, at %x\n",
>> > + rq->fence.seqno, hws_seqno(&h, rq));
>>
>> As you pointed out, the debug message here is for the wrong request.
>>
>> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>
> Happy if I drop the wmb() for a later patch and replace it with a
> chipset flush instead?
Will be happy.
-Mika
> -Chris