Inscrutable CI jobs (avocado & Travis s390 check-tcg)

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* Inscrutable CI jobs (avocado & Travis s390 check-tcg)
@ 2022-09-22 19:04 Stefan Hajnoczi
  2022-09-23  7:28 ` Daniel P. Berrangé
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Hajnoczi @ 2022-09-22 19:04 UTC (permalink / raw)
  To: thuth, Philippe Mathieu-Daudé, Cleber Rosa,
	Wainer dos Santos Moschetta, Beraldo Leal
  Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]

QEMU's avocado and Travis s390x check-tcg CI jobs fail often and I don't
know why. I think it's due to timeouts but maybe there is something
buried in the logs that I missed.

I waste time skimming through logs when merging qemu.git pull requests
and electricity is wasted on tests that don't produce useful pass/fail
output.

Here are two recent examples:
https://gitlab.com/qemu-project/qemu/-/jobs/3070754718
https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/583629583

If there are real test failures then the test output needs to be
improved so people can identify failures.

If the tests are timing out then they need to be split up and/or reduced
in duration. BTW, if it's a timeout, why are we using an internal
timeout instead of letting CI mark the job as timed out?

Any other ideas for improving these CI jobs?

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
  2022-09-22 19:04 Inscrutable CI jobs (avocado & Travis s390 check-tcg) Stefan Hajnoczi
@ 2022-09-23  7:28 ` Daniel P. Berrangé
  2022-09-23  7:47   ` Thomas Huth
  2022-09-29 13:01   ` Stefan Hajnoczi
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel P. Berrangé @ 2022-09-23  7:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: thuth, Philippe Mathieu-Daudé, Cleber Rosa,
	Wainer dos Santos Moschetta, Beraldo Leal, qemu-devel

On Thu, Sep 22, 2022 at 03:04:12PM -0400, Stefan Hajnoczi wrote:
> QEMU's avocado and Travis s390x check-tcg CI jobs fail often and I don't
> know why. I think it's due to timeouts but maybe there is something
> buried in the logs that I missed.
> 
> I waste time skimming through logs when merging qemu.git pull requests
> and electricity is wasted on tests that don't produce useful pass/fail
> output.
> 
> Here are two recent examples:
> https://gitlab.com/qemu-project/qemu/-/jobs/3070754718
> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/583629583
> 
> If there are real test failures then the test output needs to be
> improved so people can identify failures.
> 
> If the tests are timing out then they need to be split up and/or reduced
> in duration. BTW, if it's a timeout, why are we using an internal
> timeout instead of letting CI mark the job as timed out?
> 
> Any other ideas for improving these CI jobs?

The avocado job there does show the errors, but the summary at the
end leaves something to be desired. At first glance it looked like
everything passed because it says "ERROR 0" and that's what caught
my eye. Took a long time to notice the 'INTERRUPT 5' bit is actually
just an error state too.  I don't understand why it has to have so
many different ways of saying the same thing:

  RESULTS    : PASS 14 | ERROR 0 | FAIL 0 | SKIP 37 | WARN 0 | INTERRUPT 5 | CANCEL 136


  "ERROR", "FAIL" and "INTERRUPT" are all just the same thing

  "SKIP" and "CANCEL" are just the same thing

I'm sure there was some reason for these different terms, but IMHO they
are actively unhelpful.

For example I see no justiable reason for the choice of SKIP vs CANCEL
in these two messages:

 (173/192) tests/avocado/virtiofs_submounts.py:VirtiofsSubmountsTest.test_pre_launch_set_up:  SKIP: sudo -n required, but "sudo -n true" failed: [Errno 2] No such file or directory: 'sudo'

 (183/192) tests/avocado/x86_cpu_model_versions.py:X86CPUModelAliases.test_4_1_alias:  CANCEL: No QEMU binary defined or found in the build tree (0.00 s)

It would be clearer to understand the summary as:

 RESULTS: PASS 14 | ERROR 5 | SKIP 173 | WARN 0

I'd also like to see it repeat the error messages for the failed
tests at the end, so you don't have to search back up through the
huge log to find them.


On the TCG tests we see

imeout --foreground 90  /home/travis/build/qemu-project/qemu/build/qemu-s390x  noexec >  noexec.out

make[1]: *** [../Makefile.target:158: run-noexec] Error 1

make[1]: Leaving directory '/home/travis/build/qemu-project/qemu/build/tests/tcg/s390x-linux-user'

make: *** [/home/travis/build/qemu-project/qemu/tests/Makefile.include:60: run-tcg-tests-s390x-linux-user] Error 2


I presume that indicates the 'noexec' test failed, but we have zero
info. Perhaps noexec.out contains the useful info ? We don't know as
we can't access the file.  If a test system is storing results in a
file that needs to be cat'd to stderr on failure, for it to be
visible in CI.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
  2022-09-23  7:28 ` Daniel P. Berrangé
@ 2022-09-23  7:47   ` Thomas Huth
  2022-09-23 10:08     ` Alex Bennée
  2022-09-29 13:01   ` Stefan Hajnoczi
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Huth @ 2022-09-23  7:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, Stefan Hajnoczi, Alex Bennée
  Cc: Philippe Mathieu-Daudé, Cleber Rosa,
	Wainer dos Santos Moschetta, Beraldo Leal, qemu-devel,
	Ilya Leoshkevich

On 23/09/2022 09.28, Daniel P. Berrangé wrote:
> On Thu, Sep 22, 2022 at 03:04:12PM -0400, Stefan Hajnoczi wrote:
>> QEMU's avocado and Travis s390x check-tcg CI jobs fail often and I don't
>> know why. I think it's due to timeouts but maybe there is something
>> buried in the logs that I missed.
>>
>> I waste time skimming through logs when merging qemu.git pull requests
>> and electricity is wasted on tests that don't produce useful pass/fail
>> output.
>>
>> Here are two recent examples:
>> https://gitlab.com/qemu-project/qemu/-/jobs/3070754718
>> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/583629583
>>
>> If there are real test failures then the test output needs to be
>> improved so people can identify failures.
>>
>> If the tests are timing out then they need to be split up and/or reduced
>> in duration. BTW, if it's a timeout, why are we using an internal
>> timeout instead of letting CI mark the job as timed out?
>>
>> Any other ideas for improving these CI jobs?
> 
> The avocado job there does show the errors, but the summary at the
> end leaves something to be desired. At first glance it looked like
> everything passed because it says "ERROR 0" and that's what caught
> my eye. Took a long time to notice the 'INTERRUPT 5' bit is actually
> just an error state too.  I don't understand why it has to have so
> many different ways of saying the same thing:
> 
>    RESULTS    : PASS 14 | ERROR 0 | FAIL 0 | SKIP 37 | WARN 0 | INTERRUPT 5 | CANCEL 136
> 
> 
>    "ERROR", "FAIL" and "INTERRUPT" are all just the same thing
> 
>    "SKIP" and "CANCEL" are just the same thing
> 
> I'm sure there was some reason for these different terms, but IMHO they
> are actively unhelpful.
> 
> For example I see no justiable reason for the choice of SKIP vs CANCEL
> in these two messages:
> 
>   (173/192) tests/avocado/virtiofs_submounts.py:VirtiofsSubmountsTest.test_pre_launch_set_up:  SKIP: sudo -n required, but "sudo -n true" failed: [Errno 2] No such file or directory: 'sudo'
> 
>   (183/192) tests/avocado/x86_cpu_model_versions.py:X86CPUModelAliases.test_4_1_alias:  CANCEL: No QEMU binary defined or found in the build tree (0.00 s)
> 
> It would be clearer to understand the summary as:
> 
>   RESULTS: PASS 14 | ERROR 5 | SKIP 173 | WARN 0
> 
> I'd also like to see it repeat the error messages for the failed
> tests at the end, so you don't have to search back up through the
> huge log to find them.
> 
> 
> On the TCG tests we see
> 
> imeout --foreground 90  /home/travis/build/qemu-project/qemu/build/qemu-s390x  noexec >  noexec.out
> 
> make[1]: *** [../Makefile.target:158: run-noexec] Error 1
> 
> make[1]: Leaving directory '/home/travis/build/qemu-project/qemu/build/tests/tcg/s390x-linux-user'
> 
> make: *** [/home/travis/build/qemu-project/qemu/tests/Makefile.include:60: run-tcg-tests-s390x-linux-user] Error 2
> 
> 
> I presume that indicates the 'noexec' test failed, but we have zero
> info.

I think this is the bug that will be fixed by Ilya's patch here:

  https://lists.gnu.org/archive/html/qemu-devel/2022-09/msg02756.html

But I agree, it is unfortunate that the output is not available. Looking at 
this on my s390x box:

$ cat tests/tcg/s390x-linux-user/noexec.out
[ RUN      ] fallthrough
[       OK ]
[ RUN      ] jump
[  FAILED  ] unexpected SEGV

so there is an indication of what's going wrong in there indeed.

Alex, would it be possible to change the tcg test harness to dump the .out 
file of failing tests?

  Thomas



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
  2022-09-23  7:47   ` Thomas Huth
@ 2022-09-23 10:08     ` Alex Bennée
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Bennée @ 2022-09-23 10:08 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Daniel P. Berrangé, Stefan Hajnoczi,
	Philippe Mathieu-Daudé, Cleber Rosa,
	Wainer dos Santos Moschetta, Beraldo Leal, qemu-devel,
	Ilya Leoshkevich


Thomas Huth <thuth@redhat.com> writes:

> On 23/09/2022 09.28, Daniel P. Berrangé wrote:
>> On Thu, Sep 22, 2022 at 03:04:12PM -0400, Stefan Hajnoczi wrote:
>>> QEMU's avocado and Travis s390x check-tcg CI jobs fail often and I don't
>>> know why. I think it's due to timeouts but maybe there is something
>>> buried in the logs that I missed.
>>>
>>> I waste time skimming through logs when merging qemu.git pull requests
>>> and electricity is wasted on tests that don't produce useful pass/fail
>>> output.
>>>
>>> Here are two recent examples:
>>> https://gitlab.com/qemu-project/qemu/-/jobs/3070754718
>>> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/583629583
>>>
>>> If there are real test failures then the test output needs to be
>>> improved so people can identify failures.
>>>
>>> If the tests are timing out then they need to be split up and/or reduced
>>> in duration. BTW, if it's a timeout, why are we using an internal
>>> timeout instead of letting CI mark the job as timed out?
>>>
>>> Any other ideas for improving these CI jobs?
>> The avocado job there does show the errors, but the summary at the
>> end leaves something to be desired. At first glance it looked like
>> everything passed because it says "ERROR 0" and that's what caught
>> my eye. Took a long time to notice the 'INTERRUPT 5' bit is actually
>> just an error state too.  I don't understand why it has to have so
>> many different ways of saying the same thing:
>>    RESULTS    : PASS 14 | ERROR 0 | FAIL 0 | SKIP 37 | WARN 0 |
>> INTERRUPT 5 | CANCEL 136
>>    "ERROR", "FAIL" and "INTERRUPT" are all just the same thing
>>    "SKIP" and "CANCEL" are just the same thing
>> I'm sure there was some reason for these different terms, but IMHO
>> they
>> are actively unhelpful.
>> For example I see no justiable reason for the choice of SKIP vs
>> CANCEL
>> in these two messages:
>>   (173/192)
>> tests/avocado/virtiofs_submounts.py:VirtiofsSubmountsTest.test_pre_launch_set_up:
>> SKIP: sudo -n required, but "sudo -n true" failed: [Errno 2] No such
>> file or directory: 'sudo'
>>   (183/192)
>> tests/avocado/x86_cpu_model_versions.py:X86CPUModelAliases.test_4_1_alias:
>> CANCEL: No QEMU binary defined or found in the build tree (0.00 s)
>> It would be clearer to understand the summary as:
>>   RESULTS: PASS 14 | ERROR 5 | SKIP 173 | WARN 0
>> I'd also like to see it repeat the error messages for the failed
>> tests at the end, so you don't have to search back up through the
>> huge log to find them.
>> On the TCG tests we see
>> imeout --foreground 90
>> /home/travis/build/qemu-project/qemu/build/qemu-s390x  noexec >
>> noexec.out
>> make[1]: *** [../Makefile.target:158: run-noexec] Error 1
>> make[1]: Leaving directory
>> '/home/travis/build/qemu-project/qemu/build/tests/tcg/s390x-linux-user'
>> make: ***
>> [/home/travis/build/qemu-project/qemu/tests/Makefile.include:60:
>> run-tcg-tests-s390x-linux-user] Error 2
>> I presume that indicates the 'noexec' test failed, but we have zero
>> info.
>
> I think this is the bug that will be fixed by Ilya's patch here:
>
>  https://lists.gnu.org/archive/html/qemu-devel/2022-09/msg02756.html
>
> But I agree, it is unfortunate that the output is not available.
> Looking at this on my s390x box:
>
> $ cat tests/tcg/s390x-linux-user/noexec.out
> [ RUN      ] fallthrough
> [       OK ]
> [ RUN      ] jump
> [  FAILED  ] unexpected SEGV
>
> so there is an indication of what's going wrong in there indeed.
>
> Alex, would it be possible to change the tcg test harness to dump the
> .out file of failing tests?

Yes I think so, either by tweaking the run-% rules in
tests/tcg/Makefile.target to handle a failed call or possibly expanding
the run-test macro itself.

>
>  Thomas


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
  2022-09-23  7:28 ` Daniel P. Berrangé
  2022-09-23  7:47   ` Thomas Huth
@ 2022-09-29 13:01   ` Stefan Hajnoczi
  2022-09-29 13:08     ` Daniel P. Berrangé
  2022-09-29 13:16     ` Thomas Huth
  1 sibling, 2 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2022-09-29 13:01 UTC (permalink / raw)
  To: Daniel P. Berrangé, Philippe Mathieu-Daudé
  Cc: Stefan Hajnoczi, thuth, Cleber Rosa, Wainer dos Santos Moschetta,
	Beraldo Leal, qemu-devel

I still don't know what the failures from the Avocado tests are:
https://gitlab.com/qemu-project/qemu/-/jobs/3100466535
https://gitlab.com/qemu-project/qemu/-/jobs/3100466546
https://gitlab.com/qemu-project/qemu/-/jobs/3100466539
https://gitlab.com/qemu-project/qemu/-/jobs/3100466548
https://gitlab.com/qemu-project/qemu/-/jobs/3100466537

There are lots of logs and I don't know what to look for.

Any ideas?

Stefan


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
  2022-09-29 13:01   ` Stefan Hajnoczi
@ 2022-09-29 13:08     ` Daniel P. Berrangé
  2022-09-29 13:16     ` Thomas Huth
  1 sibling, 0 replies; 8+ messages in thread
From: Daniel P. Berrangé @ 2022-09-29 13:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Philippe Mathieu-Daudé, Stefan Hajnoczi, thuth, Cleber Rosa,
	Wainer dos Santos Moschetta, Beraldo Leal, qemu-devel

On Thu, Sep 29, 2022 at 09:01:21AM -0400, Stefan Hajnoczi wrote:
> I still don't know what the failures from the Avocado tests are:
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466535
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466546
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466539
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466548
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466537
> 
> There are lots of logs and I don't know what to look for.

All of those show  'INTERRUPT: <non-zero-count>', and looking at
which tests report status INTERRUPT, shows that they are all
hitting the 2 minute timeout.

It is a different set of tests failing in each job, but that's
because in most of the other jobs  the failing tests are marked
CANCELLED due to not building the required QEMU binary.

The very few tests which do PASS are all very fast, not near
the 2 minute timeout.

Overall this looks pretty doomed.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
  2022-09-29 13:01   ` Stefan Hajnoczi
  2022-09-29 13:08     ` Daniel P. Berrangé
@ 2022-09-29 13:16     ` Thomas Huth
  2022-09-29 13:24       ` Peter Maydell
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Huth @ 2022-09-29 13:16 UTC (permalink / raw)
  To: Stefan Hajnoczi, Daniel P. Berrangé,
	Philippe Mathieu-Daudé
  Cc: Stefan Hajnoczi, Cleber Rosa, Wainer dos Santos Moschetta,
	Beraldo Leal, qemu-devel, Peter Maydell, qemu-arm

On 29/09/2022 15.01, Stefan Hajnoczi wrote:
> I still don't know what the failures from the Avocado tests are:
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466535
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466546
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466539
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466548
> https://gitlab.com/qemu-project/qemu/-/jobs/3100466537
> 
> There are lots of logs and I don't know what to look for.
> 
> Any ideas?

These are likely the ones I reported here:

https://lists.nongnu.org/archive/html/qemu-arm/2022-09/msg00234.html

Should be fixed by Peter's patches here:

https://lore.kernel.org/qemu-devel/20220923123412.1214041-1-peter.maydell@linaro.org/T/

  HTH,
   Thomas



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Inscrutable CI jobs (avocado & Travis s390 check-tcg)
  2022-09-29 13:16     ` Thomas Huth
@ 2022-09-29 13:24       ` Peter Maydell
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Maydell @ 2022-09-29 13:24 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Stefan Hajnoczi, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Stefan Hajnoczi, Cleber Rosa,
	Wainer dos Santos Moschetta, Beraldo Leal, qemu-devel, qemu-arm

On Thu, 29 Sept 2022 at 14:16, Thomas Huth <thuth@redhat.com> wrote:
>
> On 29/09/2022 15.01, Stefan Hajnoczi wrote:
> > I still don't know what the failures from the Avocado tests are:
> > https://gitlab.com/qemu-project/qemu/-/jobs/3100466535
> > https://gitlab.com/qemu-project/qemu/-/jobs/3100466546
> > https://gitlab.com/qemu-project/qemu/-/jobs/3100466539
> > https://gitlab.com/qemu-project/qemu/-/jobs/3100466548
> > https://gitlab.com/qemu-project/qemu/-/jobs/3100466537
> >
> > There are lots of logs and I don't know what to look for.
> >
> > Any ideas?
>
> These are likely the ones I reported here:
>
> https://lists.nongnu.org/archive/html/qemu-arm/2022-09/msg00234.html
>
> Should be fixed by Peter's patches here:
>
> https://lore.kernel.org/qemu-devel/20220923123412.1214041-1-peter.maydell@linaro.org/T/

Incidentally, the reason that regression slipped through (which
was my fault) is because I also find Avocado failures
inscrutable -- I do run 'make check-avocado' but I tend to
assume that random timeouts and so on are just flaky tests
rather than actual problems.

In particular in this case QEMU printed an error message
and exited, but the Avocado framework completely failed
to notice this or bring it to anybody's attention -- it
just ends up timing out, which makes the problem much much
less obvious.

-- PMM


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-09-29 15:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-09-22 19:04 Inscrutable CI jobs (avocado & Travis s390 check-tcg) Stefan Hajnoczi
2022-09-23  7:28 ` Daniel P. Berrangé
2022-09-23  7:47   ` Thomas Huth
2022-09-23 10:08     ` Alex Bennée
2022-09-29 13:01   ` Stefan Hajnoczi
2022-09-29 13:08     ` Daniel P. Berrangé
2022-09-29 13:16     ` Thomas Huth
2022-09-29 13:24       ` Peter Maydell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).