Re: Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)

From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Stefan Weil <sw@weilnetz.de>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	Peter Maydell <peter.maydell@linaro.org>
Subject: Re: Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)
Date: Wed, 24 Apr 2024 18:09:07 +0100
Message-ID: <Zik8s6_iNM8u0SZ6@redhat.com>
In-Reply-To: <50ee3a92-1bb5-4113-8558-281e78b0c2e3@weilnetz.de>

On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil wrote:
> On 20.04.24 at 22:25, Stefan Weil wrote:
> > On 16.04.24 at 14:17, Stefan Weil wrote:
> > > On 16.04.24 at 14:10, Peter Maydell wrote:
> > > 
> > > > The cross-i686-tci job is flaky again, with persistent intermittent
> > > > failures due to jobs timing out.
> > > [...]
> > > > Some of these timeouts are very high -- no test should be taking
> > > > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > > > to me that there's some kind of intermittent deadlock going on.
> > > > 
> > > > Can somebody who cares about TCI investigate, please, and track
> > > > down whatever this is?
> > > 
> > > I'll have a look.
> > 
> > Short summary:
> > 
> > The "persistent intermittent failures due to jobs timing out" are not
> > related to TCI: they also occur if the same tests are run with the
> > normal TCG. I suggest that the CI tests should run single threaded.
> 
> Hi Paolo,
> 
> I need help from someone who knows the CI and the build and test framework
> better.
> 
> Peter reported intermittent timeouts for the cross-i686-tci job, causing it
> to fail. I can reproduce such timeouts locally, but noticed that they are
> not limited to TCI. The GitLab CI also shows other examples, such as this
> job:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
> 
> I think the timeouts are caused by running too many parallel processes
> during testing.
> 
> The CI uses parallel builds:
> 
> make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

Note that this command runs both the compile and test phases of
the job. Overcommitting CPUs for the compile phase is a good
idea, since it keeps CPUs busy while another process is waiting on
I/O, and is almost always safe to do.

Overcommitting CPUs for the test phase is less helpful and
can cause a variety of problems as you say.
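
As a rough sketch of what separating the two phases could look like,
the script below derives one job count for compiling and another for
testing. COMPILE_JOBS and TEST_JOBS are hypothetical names for this
illustration and are not part of QEMU's CI configuration. It only
prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: COMPILE_JOBS and TEST_JOBS are hypothetical names,
# not variables from QEMU's CI configuration.
# nproc comes from GNU coreutils; fall back to 1 if it is missing.
NCPU=$(nproc 2>/dev/null || echo 1)

COMPILE_JOBS=$((NCPU + 1))   # overcommit by one to hide I/O waits
TEST_JOBS=$NCPU              # no overcommit while tests are running

# Dry run: print the two invocations instead of executing them.
echo "compile: make -j$COMPILE_JOBS all check-build"
echo "test:    make -j$TEST_JOBS \$MAKE_CHECK_ARGS"
```

Whether TEST_JOBS should equal the CPU count or be lower still is
exactly the open question in this thread, so treat that value as a
placeholder.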

> 
> It looks like `nproc` returns 8, and make runs with 9 threads.
> `meson test` uses the same value to run 9 test processes in parallel:
> 
> /builds/qemu-project/qemu/build/pyvenv/bin/meson test  --no-rebuild -t 1
> --num-processes 9 --print-errorlogs
> 
> Since the host can only handle 8 parallel threads, 9 threads might already
> cause some tests to run non-deterministically.

In contributor forks, GitLab CI will be using the public shared
runners. These are Google Cloud VMs, which only have 2 vCPUs.

In the primary QEMU repo, we have a custom runner registered
that uses Azure based VMs. I'm not sure offhand what resources we
have configured for them.

The important thing there is that what you see for CI speed in
your fork repo is not necessarily a match for CI speed in the QEMU
upstream repo.

> 
> But if some of the individual tests also use multithreading (according to my
> tests they do so with at least 3 or 4 threads), things get even worse. Then
> there are up to 4 * 9 = 36 threads competing to run on the available 8
> cores.
> 
> In this scenario timeouts are expected and can occur randomly.
> 
> In my tests setting --num-processes to a lower value not only avoided
> timeouts but also reduced the processing overhead without increasing the
> runtime.
> 
> Could we run all tests with `--num-processes 1`?

The question is what impact that has on the overall job execution
time. A lot of our jobs are already quite long, which is bad for
the turnaround time of CI testing. Reliable CI is arguably
the #1 priority though, otherwise developers cease trusting it.
We need to find the balance between avoiding timeouts and having
the shortest practical job time. The TCI job you show above came
out at 22 minutes, which is not our worst job, so there is some
scope for allowing it to run longer with less parallelism.
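
To make that trade-off concrete, here is a small sketch of the
worst-case arithmetic from the quoted report. The figures of 8 CPUs
and up to 4 threads per test are taken from that report; they are
estimates, not measurements of the upstream runners, and the function
name is made up for this example.

```shell
#!/bin/sh
# Figures from the quoted report: 8 CPUs, up to 4 threads per test.
# These are estimates, not measurements of the CI runners.
CPUS=8
THREADS_PER_TEST=4

# Worst-case number of runnable threads for a given --num-processes.
worst_case_threads() {
    echo $(( $1 * THREADS_PER_TEST ))
}

for procs in 9 4 2 1; do
    total=$(worst_case_threads "$procs")
    if [ "$total" -gt "$CPUS" ]; then
        state="oversubscribed"
    else
        state="fits"
    fi
    echo "--num-processes $procs: up to $total threads on $CPUS CPUs ($state)"
done
```

Under those assumptions there are values between 9 and 1 that no
longer oversubscribe the host, so going fully serial may not be
required to avoid the contention.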

Timeouts for individual tests are a relatively recent change to
QEMU in:

  commit 4156325cd3c205ce77b82de2c48127ccacddaf5b
  Author: Daniel P. Berrangé <berrange@redhat.com>
  Date:   Fri Dec 15 08:03:57 2023 +0100

    mtest2make: stop disabling meson test timeouts

Read the full commit message of that for the background rationale,
but especially this paragraph:

    The main risk of this change is that the individual test timeouts might
    be too short to allow completion in high load scenarios. Thus, there is
    likely to be some short term pain where we have to bump the timeouts for
    certain tests to make them reliable enough. The preceding few patches
    raised the timeouts for all failures that were immediately apparent
    in GitLab CI.

which highlights the problem you're looking at.

The expectation was that we would need to bump the timeouts for various
tests until we get to the point where they reliably run in GitLab CI,
in both forks and upstream.
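
For local experiments, a blunter alternative to bumping individual
tests is to scale every timeout at once. The sketch below assembles a
"meson test" command line like the one quoted earlier, where -t is
meson's timeout multiplier. TIMEOUT_MULT and TEST_PROCS are
hypothetical knobs for this illustration, and the command is printed
rather than run so that it works outside a build tree.

```shell
#!/bin/sh
# TIMEOUT_MULT and TEST_PROCS are hypothetical knobs for this sketch,
# not settings that exist in QEMU's CI configuration.
TIMEOUT_MULT=${TIMEOUT_MULT:-2}
TEST_PROCS=${TEST_PROCS:-4}

# Build the command line; $1 = timeout multiplier, $2 = process count.
meson_test_cmd() {
    echo "meson test --no-rebuild -t $1 --num-processes $2 --print-errorlogs"
}

# Dry run: show what would be executed.
meson_test_cmd "$TIMEOUT_MULT" "$TEST_PROCS"
```

This is only useful for finding out how much headroom a test needs;
the per-test timeouts are still what would have to change upstream.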

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
