* [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts
@ 2020-12-04 19:50 Janusz Krzysztofik
2020-12-07 13:09 ` Petri Latvala
2021-01-06 9:41 ` Chris Wilson
0 siblings, 2 replies; 7+ messages in thread
From: Janusz Krzysztofik @ 2020-12-04 19:50 UTC (permalink / raw)
To: igt-dev; +Cc: intel-gfx, Chris Wilson
We may still be interested in results of a test even if it has tainted
the kernel. On the other hand, we need to kill the test on taint if no
other means of killing it on a jam is active.
If abort on both kernel taint or a timeout is requested, decrease all
potential timeouts significantly while the taint is detected instead of
aborting immediately. However, report the taint as the reason of the
abort if a timeout decreased by the taint expires.
v2: Fix missing show_kernel_task_state() lost on rebase conflict
resolution (Chris - thanks!)
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
runner/executor.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/runner/executor.c b/runner/executor.c
index 1688ae41d..faf272d85 100644
--- a/runner/executor.c
+++ b/runner/executor.c
@@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings,
double time_since_kill,
size_t disk_usage)
{
+ int decrease = 1;
+
if (killed) {
/*
* Timeout after being killed is a hardcoded amount
@@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings,
}
/*
- * If we're configured to care about taints, kill the
- * test if there's a taint.
+ * If we're configured to care about taints,
+ * decrease timeouts in use if there's a taint,
+ * or kill the test if no timeouts have been requested.
*/
if (settings->abort_mask & ABORT_TAINT &&
- is_tainted(taints))
- return "Killing the test because the kernel is tainted.\n";
+ is_tainted(taints)) {
+ /* list of timeouts that may postpone immediate kill on taint */
+ if (settings->per_test_timeout || settings->inactivity_timeout)
+ decrease = 10;
+ else
+ return "Killing the test because the kernel is tainted.\n";
+ }
if (settings->per_test_timeout != 0 &&
- time_since_subtest > settings->per_test_timeout)
+ time_since_subtest > settings->per_test_timeout / decrease) {
+ if (decrease > 1)
+ return "Killing the test because the kernel is tainted.\n";
return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n");
+ }
if (settings->inactivity_timeout != 0 &&
- time_since_activity > settings->inactivity_timeout)
+ time_since_activity > settings->inactivity_timeout / decrease ) {
+ if (decrease > 1)
+ return "Killing the test because the kernel is tainted.\n";
return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n");
+ }
if (disk_usage_limit_exceeded(settings, disk_usage))
return "Disk usage limit exceeded.\n";
--
2.21.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts
2020-12-04 19:50 [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts Janusz Krzysztofik
@ 2020-12-07 13:09 ` Petri Latvala
2020-12-07 13:18 ` Janusz Krzysztofik
2021-01-06 9:41 ` Chris Wilson
1 sibling, 1 reply; 7+ messages in thread
From: Petri Latvala @ 2020-12-07 13:09 UTC (permalink / raw)
To: Janusz Krzysztofik; +Cc: igt-dev, intel-gfx, Chris Wilson
On Fri, Dec 04, 2020 at 08:50:07PM +0100, Janusz Krzysztofik wrote:
> We may still be interested in results of a test even if it has tainted
> the kernel. On the other hand, we need to kill the test on taint if no
> other means of killing it on a jam is active.
>
> If abort on both kernel taint or a timeout is requested, decrease all
> potential timeouts significantly while the taint is detected instead of
> aborting immediately. However, report the taint as the reason of the
> abort if a timeout decreased by the taint expires.
>
> v2: Fix missing show_kernel_task_state() lost on rebase conflict
> resolution (Chris - thanks!)
>
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
The effects of this is that we sometimes now get more logs from a test
at the cost of it not directly showing up as an incomplete. We would
still get the igt@runner@aborted result for it so overall we still
catch tainting cases.
Chris's comments have been clarified off-list not to mean directly
opposing this patch, so
Reviewed-by: Petri Latvala <petri.latvala@intel.com>
> ---
> runner/executor.c | 26 ++++++++++++++++++++------
> 1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/runner/executor.c b/runner/executor.c
> index 1688ae41d..faf272d85 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings,
> double time_since_kill,
> size_t disk_usage)
> {
> + int decrease = 1;
> +
> if (killed) {
> /*
> * Timeout after being killed is a hardcoded amount
> @@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings,
> }
>
> /*
> - * If we're configured to care about taints, kill the
> - * test if there's a taint.
> + * If we're configured to care about taints,
> + * decrease timeouts in use if there's a taint,
> + * or kill the test if no timeouts have been requested.
> */
> if (settings->abort_mask & ABORT_TAINT &&
> - is_tainted(taints))
> - return "Killing the test because the kernel is tainted.\n";
> + is_tainted(taints)) {
> + /* list of timeouts that may postpone immediate kill on taint */
> + if (settings->per_test_timeout || settings->inactivity_timeout)
> + decrease = 10;
> + else
> + return "Killing the test because the kernel is tainted.\n";
> + }
>
> if (settings->per_test_timeout != 0 &&
> - time_since_subtest > settings->per_test_timeout)
> + time_since_subtest > settings->per_test_timeout / decrease) {
> + if (decrease > 1)
> + return "Killing the test because the kernel is tainted.\n";
> return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n");
> + }
>
> if (settings->inactivity_timeout != 0 &&
> - time_since_activity > settings->inactivity_timeout)
> + time_since_activity > settings->inactivity_timeout / decrease ) {
> + if (decrease > 1)
> + return "Killing the test because the kernel is tainted.\n";
> return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n");
> + }
>
> if (disk_usage_limit_exceeded(settings, disk_usage))
> return "Disk usage limit exceeded.\n";
> --
> 2.21.1
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts
2020-12-07 13:09 ` Petri Latvala
@ 2020-12-07 13:18 ` Janusz Krzysztofik
0 siblings, 0 replies; 7+ messages in thread
From: Janusz Krzysztofik @ 2020-12-07 13:18 UTC (permalink / raw)
To: Petri Latvala; +Cc: igt-dev, intel-gfx, Chris Wilson
On Mon, 2020-12-07 at 15:09 +0200, Petri Latvala wrote:
> On Fri, Dec 04, 2020 at 08:50:07PM +0100, Janusz Krzysztofik wrote:
> > We may still be interested in results of a test even if it has tainted
> > the kernel. On the other hand, we need to kill the test on taint if no
> > other means of killing it on a jam is active.
> >
> > If abort on both kernel taint or a timeout is requested, decrease all
> > potential timeouts significantly while the taint is detected instead of
> > aborting immediately. However, report the taint as the reason of the
> > abort if a timeout decreased by the taint expires.
> >
> > v2: Fix missing show_kernel_task_state() lost on rebase conflict
> > resolution (Chris - thanks!)
> >
> > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
>
> The effects of this is that we sometimes now get more logs from a test
> at the cost of it not directly showing up as an incomplete. We would
> still get the igt@runner@aborted result for it so overall we still
> catch tainting cases.
>
> Chris's comments have been clarified off-list not to mean directly
> opposing this patch, so
>
>
> Reviewed-by: Petri Latvala <petri.latvala@intel.com>
Thanks, pushed.
Janusz
>
>
>
> > ---
> > runner/executor.c | 26 ++++++++++++++++++++------
> > 1 file changed, 20 insertions(+), 6 deletions(-)
> >
> > diff --git a/runner/executor.c b/runner/executor.c
> > index 1688ae41d..faf272d85 100644
> > --- a/runner/executor.c
> > +++ b/runner/executor.c
> > @@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings,
> > double time_since_kill,
> > size_t disk_usage)
> > {
> > + int decrease = 1;
> > +
> > if (killed) {
> > /*
> > * Timeout after being killed is a hardcoded amount
> > @@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings,
> > }
> >
> > /*
> > - * If we're configured to care about taints, kill the
> > - * test if there's a taint.
> > + * If we're configured to care about taints,
> > + * decrease timeouts in use if there's a taint,
> > + * or kill the test if no timeouts have been requested.
> > */
> > if (settings->abort_mask & ABORT_TAINT &&
> > - is_tainted(taints))
> > - return "Killing the test because the kernel is tainted.\n";
> > + is_tainted(taints)) {
> > + /* list of timeouts that may postpone immediate kill on taint */
> > + if (settings->per_test_timeout || settings->inactivity_timeout)
> > + decrease = 10;
> > + else
> > + return "Killing the test because the kernel is tainted.\n";
> > + }
> >
> > if (settings->per_test_timeout != 0 &&
> > - time_since_subtest > settings->per_test_timeout)
> > + time_since_subtest > settings->per_test_timeout / decrease) {
> > + if (decrease > 1)
> > + return "Killing the test because the kernel is tainted.\n";
> > return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n");
> > + }
> >
> > if (settings->inactivity_timeout != 0 &&
> > - time_since_activity > settings->inactivity_timeout)
> > + time_since_activity > settings->inactivity_timeout / decrease ) {
> > + if (decrease > 1)
> > + return "Killing the test because the kernel is tainted.\n";
> > return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n");
> > + }
> >
> > if (disk_usage_limit_exceeded(settings, disk_usage))
> > return "Disk usage limit exceeded.\n";
> > --
> > 2.21.1
> >
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts
2020-12-04 19:50 [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts Janusz Krzysztofik
2020-12-07 13:09 ` Petri Latvala
@ 2021-01-06 9:41 ` Chris Wilson
2021-01-07 9:40 ` Petri Latvala
1 sibling, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2021-01-06 9:41 UTC (permalink / raw)
To: Janusz Krzysztofik, igt-dev; +Cc: intel-gfx
Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> We may still be interested in results of a test even if it has tainted
> the kernel. On the other hand, we need to kill the test on taint if no
> other means of killing it on a jam is active.
>
> If abort on both kernel taint or a timeout is requested, decrease all
> potential timeouts significantly while the taint is detected instead of
> aborting immediately. However, report the taint as the reason of the
> abort if a timeout decreased by the taint expires.
This has the nasty side effect of not stopping the test run after a
kernel taint. Instead the next test inherits the tainted condition from
the previous test and usually ends up being declared incomplete.
False positives are frustrating.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts
2021-01-06 9:41 ` Chris Wilson
@ 2021-01-07 9:40 ` Petri Latvala
2021-01-07 9:49 ` Chris Wilson
0 siblings, 1 reply; 7+ messages in thread
From: Petri Latvala @ 2021-01-07 9:40 UTC (permalink / raw)
To: Chris Wilson; +Cc: igt-dev, intel-gfx
On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote:
> Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> > We may still be interested in results of a test even if it has tainted
> > the kernel. On the other hand, we need to kill the test on taint if no
> > other means of killing it on a jam is active.
> >
> > If abort on both kernel taint or a timeout is requested, decrease all
> > potential timeouts significantly while the taint is detected instead of
> > aborting immediately. However, report the taint as the reason of the
> > abort if a timeout decreased by the taint expires.
>
> This has the nasty side effect of not stopping the test run after a
> kernel taint. Instead the next test inherits the tainted condition from
> the previous test and usually ends up being declared incomplete.
>
> False positives are frustrating.
> -Chris
Do you have a link to a test run where this happened? This patch
didn't change the between-tests abort checks.
--
Petri Latvala
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts
2021-01-07 9:40 ` Petri Latvala
@ 2021-01-07 9:49 ` Chris Wilson
2021-01-07 10:06 ` [Intel-gfx] [igt-dev] " Petri Latvala
0 siblings, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2021-01-07 9:49 UTC (permalink / raw)
To: Petri Latvala; +Cc: igt-dev, intel-gfx
Quoting Petri Latvala (2021-01-07 09:40:02)
> On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote:
> > Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> > > We may still be interested in results of a test even if it has tainted
> > > the kernel. On the other hand, we need to kill the test on taint if no
> > > other means of killing it on a jam is active.
> > >
> > > If abort on both kernel taint or a timeout is requested, decrease all
> > > potential timeouts significantly while the taint is detected instead of
> > > aborting immediately. However, report the taint as the reason of the
> > > abort if a timeout decreased by the taint expires.
> >
> > This has the nasty side effect of not stopping the test run after a
> > kernel taint. Instead the next test inherits the tainted condition from
> > the previous test and usually ends up being declared incomplete.
> >
> > False positives are frustrating.
> > -Chris
>
>
> Do you have a link to a test run where this happened? This patch
> didn't change the between-tests abort checks.
The taint is from the warnings in the penultimate test:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/igt_runner20.txt
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts
2021-01-07 9:49 ` Chris Wilson
@ 2021-01-07 10:06 ` Petri Latvala
0 siblings, 0 replies; 7+ messages in thread
From: Petri Latvala @ 2021-01-07 10:06 UTC (permalink / raw)
To: Chris Wilson; +Cc: igt-dev, intel-gfx
On Thu, Jan 07, 2021 at 09:49:23AM +0000, Chris Wilson wrote:
> Quoting Petri Latvala (2021-01-07 09:40:02)
> > On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote:
> > > Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> > > > We may still be interested in results of a test even if it has tainted
> > > > the kernel. On the other hand, we need to kill the test on taint if no
> > > > other means of killing it on a jam is active.
> > > >
> > > > If abort on both kernel taint or a timeout is requested, decrease all
> > > > potential timeouts significantly while the taint is detected instead of
> > > > aborting immediately. However, report the taint as the reason of the
> > > > abort if a timeout decreased by the taint expires.
> > >
> > > This has the nasty side effect of not stopping the test run after a
> > > kernel taint. Instead the next test inherits the tainted condition from
> > > the previous test and usually ends up being declared incomplete.
> > >
> > > False positives are frustrating.
> > > -Chris
> >
> >
> > Do you have a link to a test run where this happened? This patch
> > didn't change the between-tests abort checks.
>
> The taint is from the warnings in the penultimate test:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/igt_runner20.txt
Ah, I see.
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/dmesg20.txt
This is the tainting WARN I presume:
<4>[ 917.575173] Memory manager not clean during takedown.
<4>[ 917.575272] WARNING: CPU: 2 PID: 7 at drivers/gpu/drm/drm_mm.c:999 drm_mm_takedown+0x51/0x100
That happens after @gem, before @evict.
In other words, this is all in the same exec() of i915_selftest
--run-subtest live. Incorrect _dynamic_ subtest gets blamed.
Getting the killing right here is a bit tricky, possibly doable. Or
rather, getting the blame right is doable, killing will be inherently
racy...
--
Petri Latvala
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-01-07 10:06 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-12-04 19:50 [Intel-gfx] [PATCH i-g-t v2] runner: Don't kill a test on taint if watching timeouts Janusz Krzysztofik
2020-12-07 13:09 ` Petri Latvala
2020-12-07 13:18 ` Janusz Krzysztofik
2021-01-06 9:41 ` Chris Wilson
2021-01-07 9:40 ` Petri Latvala
2021-01-07 9:49 ` Chris Wilson
2021-01-07 10:06 ` [Intel-gfx] [igt-dev] " Petri Latvala
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).