* cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
From: Peter Maydell @ 2024-04-16 12:10 UTC (permalink / raw)
To: QEMU Developers; +Cc: Stefan Weil, Richard Henderson
The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
https://gitlab.com/qemu-project/qemu/-/issues/2285 has the details
with links to 8 CI jobs in the last week or so with timeouts, typically
something like:
 16/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp           TIMEOUT 240.17s  killed by signal 15 SIGTERM
 73/258 qemu:qtest+qtest-ppc / qtest-ppc/boot-serial-test           TIMEOUT 360.14s  killed by signal 15 SIGTERM
 78/258 qemu:qtest+qtest-i386 / qtest-i386/ide-test                 TIMEOUT 60.09s   killed by signal 15 SIGTERM
253/258 qemu:softfloat+softfloat-ops / fp-test-mul                  TIMEOUT 30.11s   killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div                  TIMEOUT 30.25s   killed by signal 15 SIGTERM
255/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test     TIMEOUT 480.23s  killed by signal 15 SIGTERM
257/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test   TIMEOUT 610.10s  killed by signal 15 SIGTERM
but not always those exact tests. This isn't the first time
this CI job for TCI has been flaky either.
Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.
Can somebody who cares about TCI investigate, please, and track
down whatever this is?
(My alternate suggestion is that we mark TCI as deprecated in 9.1
and drop it entirely, if nobody cares enough about it...)
thanks
-- PMM
* Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
From: Stefan Weil via @ 2024-04-16 12:17 UTC (permalink / raw)
To: Peter Maydell, QEMU Developers; +Cc: Richard Henderson
Am 16.04.24 um 14:10 schrieb Peter Maydell:
> The cross-i686-tci job is flaky again, with persistent intermittent
> failures due to jobs timing out.
[...]
> Some of these timeouts are very high -- no test should be taking
> 10 minutes, even given TCI and a slowish CI runner -- which suggests
> to me that there's some kind of intermittent deadlock going on.
>
> Can somebody who cares about TCI investigate, please, and track
> down whatever this is?
I'll have a look.
Regards
Stefan
* Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
From: Stefan Weil via @ 2024-04-20 20:25 UTC (permalink / raw)
To: Peter Maydell, QEMU Developers; +Cc: Richard Henderson
Am 16.04.24 um 14:17 schrieb Stefan Weil:
> Am 16.04.24 um 14:10 schrieb Peter Maydell:
>
>> The cross-i686-tci job is flaky again, with persistent intermittent
>> failures due to jobs timing out.
> [...]
>> Some of these timeouts are very high -- no test should be taking
>> 10 minutes, even given TCI and a slowish CI runner -- which suggests
>> to me that there's some kind of intermittent deadlock going on.
>>
>> Can somebody who cares about TCI investigate, please, and track
>> down whatever this is?
>
> I'll have a look.
Short summary:
The "persistent intermittent failures due to jobs timing out" are not
related to TCI: they also occur if the same tests are run with the
normal TCG. I suggest that the CI tests should run single-threaded.
But let's have a look at the details of my results.
I ran `time make test` in different scenarios on the rather old and
not particularly fast VM which I typically use for QEMU builds. I did
not restrict the tests to selected architectures as is done in the
QEMU CI jobs, so I had more tests, all of which passed:
Ok: 848
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 68
Timeout: 0
---
1st test with normal TCG
`nohup time ../configure --enable-modules --disable-werror && nohup time make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata
633476maxresident)k
1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps
---
2nd test with TCI
`nohup time ../configure --enable-modules --disable-werror --enable-tcg-interpreter && nohup time make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata
633012maxresident)k
32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps
---
So the total test time with TCI was 26:50.51 minutes while for the
normal test it was 21:49.78 minutes.
These 10 individual tests had the longest durations:
1st test with normal TCG
94/916 qtest-arm / qtest-arm/qom-test 373.41s
99/916 qtest-aarch64 / qtest-aarch64/qom-test 398.43s
100/916 qtest-i386 / qtest-i386/bios-tables-test 188.06s
103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 228.33s
106/916 qtest-aarch64 / qtest-aarch64/migration-test 201.15s
119/916 qtest-i386 / qtest-i386/migration-test 253.58s
126/916 qtest-x86_64 / qtest-x86_64/migration-test 266.66s
143/916 qtest-arm / qtest-arm/test-hmp 101.72s
144/916 qtest-aarch64 / qtest-aarch64/test-hmp 113.10s
163/916 qtest-arm / qtest-arm/aspeed_smc-test 256.92s
2nd test with TCI
68/916 qtest-arm / qtest-arm/qom-test 375.35s
82/916 qtest-aarch64 / qtest-aarch64/qom-test 403.50s
93/916 qtest-i386 / qtest-i386/bios-tables-test 192.22s
99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test 379.92s
100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 240.19s
103/916 qtest-aarch64 / qtest-aarch64/migration-test 223.49s
106/916 qtest-ppc64 / qtest-ppc64/pxe-test 418.42s
113/916 qtest-i386 / qtest-i386/migration-test 284.96s
118/916 qtest-arm / qtest-arm/aspeed_smc-test 271.10s
119/916 qtest-x86_64 / qtest-x86_64/migration-test 287.36s
---
Summary:
TCI is not much slower than the normal TCG. Surprisingly, it was even
faster for tests 99 and 103. For other tests, like test 106, TCI was
only about half as fast as the normal TCG, but overall it is not
"factors" slower. A total test time of 26:50 minutes is also not bad
compared with the 21:49 minutes of the normal TCG.
No single test (including subtests) with TCI exceeded 10 minutes; the
longest one stayed well below that limit at 418 seconds.
---
The tests above ran on x86_64, and I could not reproduce any timeouts.
The GitLab CI tests run on i686 and use different configure options.
Therefore I ran additional tests with 32-bit builds in a chroot
environment (Debian GNU/Linux bullseye, i386) with the original
configure options. As expected, that reduced the number of tests
to 250. All tests passed with `make test`:
3rd test with normal TCG
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata
609180maxresident)k
28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps
4th test with TCI
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to /root/qemu/build/meson-logs/testlog.txt
make[1]: Leaving directory '/root/qemu/build'
1401.64user 483.55system 9:03.25elapsed 347%CPU (0avgtext+0avgdata
609244maxresident)k
24inputs+4690040outputs (70405major+7972385minor)pagefaults 0swaps
---
Summary:
Again TCI is not much slower than the normal TCG. The total test time
for all 250 tests is below 10 minutes, even with TCI!
---
Could it be that the timeouts in GitLab CI are caused by misconfigured
parallelism? Those tests are not started with `make test`, which would
run single-threaded, but with meson and the argument
`--num-processes 9`. I tested that with the normal TCG on my VM, which
can only run 4 threads simultaneously:
5th test with TCG
pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Summary of Failures:
254/258 qemu:softfloat+softfloat-ops / fp-test-mul   TIMEOUT 30.12s  killed by signal 15 SIGTERM
Ok: 249
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 1
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
Repeating the test, it fails again, this time with 2 timeouts:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Summary of Failures:
253/258 qemu:softfloat+softfloat-ops / fp-test-mul   TIMEOUT 30.14s  killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div   TIMEOUT 30.18s  killed by signal 15 SIGTERM
Ok: 248
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 2
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 7m51.102s
user 14m0.470s
sys 7m58.427s
Now I reduced the number of threads to 4:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 4 --print-errorlogs
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 6m42.648s
user 13m52.271s
sys 7m35.643s
---
Summary:
I could reproduce persistent intermittent failures due to jobs timing
out without TCI when I tried to run the tests with more threads than my
machine supports. This result is expected because not all single tests
can not run 100% of the time. They are interrupted because of
scheduling, and it's normal that some (random) tests will have a longer
duration. Some will even raise a timeout.
The final test shows that adjusting the number of threads to the hosts
capabilities fixes the problem. But it also shows that 4 threads don't
accelerate the whole test job compared to a single thread. Using
multithreading obviously wastes a lot of user and sys CPU time.
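For illustration, a minimal sketch of what I mean (the same meson
invocation as above, but with the process count derived from the host
instead of a fixed value; the exact invocation used by the CI may differ):

  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes $(nproc) --print-errorlogs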
Regards
Stefan
* Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)
From: Stefan Weil via @ 2024-04-24 16:27 UTC (permalink / raw)
To: QEMU Developers, Paolo Bonzini; +Cc: Richard Henderson, Peter Maydell
Am 20.04.24 um 22:25 schrieb Stefan Weil:
> Am 16.04.24 um 14:17 schrieb Stefan Weil:
>> Am 16.04.24 um 14:10 schrieb Peter Maydell:
>>
>>> The cross-i686-tci job is flaky again, with persistent intermittent
>>> failures due to jobs timing out.
>> [...]
>>> Some of these timeouts are very high -- no test should be taking
>>> 10 minutes, even given TCI and a slowish CI runner -- which suggests
>>> to me that there's some kind of intermittent deadlock going on.
>>>
>>> Can somebody who cares about TCI investigate, please, and track
>>> down whatever this is?
>>
>> I'll have a look.
>
> Short summary:
>
> The "persistent intermittent failures due to jobs timing out" are not
> related to TCI: they also occur if the same tests are run with the
> normal TCG. I suggest that the CI tests should run single threaded.
Hi Paolo,
I need help from someone who knows the CI and the build and test
framework better.
Peter reported intermittent timeouts for the cross-i686-tci job, causing
it to fail. I can reproduce such timeouts locally, but noticed that they
are not limited to TCI. The GitLab CI also shows other examples, such as
this job:
https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
I think the timeouts are caused by running too many parallel processes
during testing.
The CI uses parallel builds:
make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
It looks like `nproc` returns 8, and make runs with 9 threads.
`meson test` uses the same value to run 9 test processes in parallel:
/builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Since the host can only handle 8 parallel threads, 9 threads might
already cause some tests to run non-deterministically.
But if some of the individual tests also use multithreading (according
to my tests they do so with at least 3 or 4 threads), things get even
worse. Then there are up to 4 * 9 = 36 threads competing to run on the
available 8 cores.
In this scenario timeouts are expected and can occur randomly.
In my tests setting --num-processes to a lower value not only avoided
timeouts but also reduced the processing overhead without increasing the
runtime.
Could we run all tests with `--num-processes 1`?
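For illustration, a sketch of what the job could do instead (keeping the
parallel compile, but forcing the test phase to a single process; this is
only an idea, not a tested change to the CI scripts):

  # compile phase: overcommitting the CPUs is fine
  make -j$(expr $(nproc) + 1) all check-build
  # test phase: run the meson tests one at a time
  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 1 --print-errorlogs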
Thanks,
Stefan
* Re: Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)
From: Daniel P. Berrangé @ 2024-04-24 17:09 UTC (permalink / raw)
To: Stefan Weil
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell
On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> Am 20.04.24 um 22:25 schrieb Stefan Weil:
> > Am 16.04.24 um 14:17 schrieb Stefan Weil:
> > > Am 16.04.24 um 14:10 schrieb Peter Maydell:
> > >
> > > > The cross-i686-tci job is flaky again, with persistent intermittent
> > > > failures due to jobs timing out.
> > > [...]
> > > > Some of these timeouts are very high -- no test should be taking
> > > > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > > > to me that there's some kind of intermittent deadlock going on.
> > > >
> > > > Can somebody who cares about TCI investigate, please, and track
> > > > down whatever this is?
> > >
> > > I'll have a look.
> >
> > Short summary:
> >
> > The "persistent intermittent failures due to jobs timing out" are not
> > related to TCI: they also occur if the same tests are run with the
> > normal TCG. I suggest that the CI tests should run single threaded.
>
> Hi Paolo,
>
> I need help from someone who knows the CI and the build and test framework
> better.
>
> Peter reported intermittent timeouts for the cross-i686-tci job, causing it
> to fail. I can reproduce such timeouts locally, but noticed that they are
> not limited to TCI. The GitLab CI also shows other examples, such as this
> job:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
>
> I think the timeouts are caused by running too many parallel processes
> during testing.
>
> The CI uses parallel builds:
>
> make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
Note that this command runs both the compile and test phases of
the job. Overcommitting CPUs for the compile phase is a good
idea to keep CPUs busy while another process is waiting on
I/O, and is almost always safe to do.
Overcommitting CPUs for the test phase is less helpful and
can cause a variety of problems, as you say.
>
> It looks like `nproc` returns 8, and make runs with 9 threads.
> `meson test` uses the same value to run 9 test processes in parallel:
>
> /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1
> --num-processes 9 --print-errorlogs
>
> Since the host can only handle 8 parallel threads, 9 threads might already
> cause some tests to run non-deterministically.
In contributor forks, GitLab CI will be using the public shared
runners. These are Google Cloud VMs, which only have 2 vCPUs.
In the primary QEMU repo, we have a custom runner registered
that uses Azure-based VMs. I'm not sure offhand what resources
we have configured for them.
The important thing there is that the CI speed you see in your
fork repo is not necessarily a match for the CI speed in the
upstream QEMU repo.
>
> But if some of the individual tests also use multithreading (according to my
> tests they do so with at least 3 or 4 threads), things get even worse. Then
> there are up to 4 * 9 = 36 threads competing to run on the available 8
> cores.
>
> In this scenario timeouts are expected and can occur randomly.
>
> In my tests setting --num-processes to a lower value not only avoided
> timeouts but also reduced the processing overhead without increasing the
> runtime.
>
> Could we run all tests with `--num-processes 1`?
The question is what impact that has on the overall job execution
time. A lot of our jobs are already quite long, which is bad for
the turnaround time of CI testing. Reliable CI, though, is arguably
the #1 priority, otherwise developers cease trusting it.
We need to find the balance between avoiding timeouts and having
the shortest practical job time. The TCI job you show above came
out at 22 minutes, which is not our worst job, so there is some
scope for allowing it to run longer with less parallelism.
Timeouts for individual tests are a relatively recent change to
QEMU in:
commit 4156325cd3c205ce77b82de2c48127ccacddaf5b
Author: Daniel P. Berrangé <berrange@redhat.com>
Date: Fri Dec 15 08:03:57 2023 +0100
mtest2make: stop disabling meson test timeouts
Read the full commit message of that for the background rationale,
but especially this paragraph:
The main risk of this change is that the individual test timeouts might
be too short to allow completion in high load scenarios. Thus, there is
likely to be some short term pain where we have to bump the timeouts for
certain tests to make them reliable enough. The preceeding few patches
raised the timeouts for all failures that were immediately apparent
in GitLab CI.
which highlights the problem you're looking at.
The expectation was that we would need to bump the timeouts for various
tests until we get to the point where they run reliably in GitLab CI,
in both forks and upstream.
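For context, the `-t 1` in the meson invocations quoted above is the
timeout multiplier (`--timeout-multiplier`); before that commit the tests
were effectively run with timeouts disabled. When reproducing locally, a
quick workaround while individual test timeouts are still being tuned is
to raise the multiplier, e.g. (illustrative only, not a proposed CI change):

  pyvenv/bin/meson test --no-rebuild -t 3 --num-processes 9 --print-errorlogs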
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Timeouts in CI jobs
From: Stefan Weil via @ 2024-04-24 18:10 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell
Am 24.04.24 um 19:09 schrieb Daniel P. Berrangé:
> On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
>> I think the timeouts are caused by running too many parallel processes
>> during testing.
>>
>> The CI uses parallel builds:
>>
>> make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
> Note that command is running both the compile and test phases of
> the job. Overcommitting CPUs for the compile phase is a good
> idea to keep CPUs busy while another process is waiting on
> I/O, and is almost always safe todo.
Thank you for your answer.
Overcommitting for the build is safe, but in my experience the positive
effect is typically very small on modern hosts with fast disk I/O and
large buffer caches.
And there is also a negative impact because it causes additional
scheduling and process switches.
Therefore I am not so sure that overcommitting is a good idea,
especially not on cloud servers where the jobs run in VMs.
> Overcommitting CPUs for the test phase is less helpful and
> can cause a variety of problems as you say.
>
>> It looks like `nproc` returns 8, and make runs with 9 threads.
>> `meson test` uses the same value to run 9 test processes in parallel:
>>
>> /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1
>> --num-processes 9 --print-errorlogs
>>
>> Since the host can only handle 8 parallel threads, 9 threads might already
>> cause some tests to run non-deterministically.
> In contributor forks, gitlab CI will be using the public shared
> runners. These are Google Cloud VMs, which only have 2 vCPUs.
>
> In the primary QEMU repo, we have a customer runner registered
> that uses Azure based VMs. Not sure on the resources we have
> configured for them offhand.
I was talking about the primary QEMU repo.
> The important thing there is that what you see for CI speed in
> your fork repo is not neccessarily a match for CI speed in QEMU
> upstream repo.
I did not run tests in my GitLab fork because I still have to figure out
how to do that.
In my initial answer to Peter's mail I had described my tests and the
test environment in detail.
My test environment was an older (= slow) VM with 4 cores. I tested with
different values for --num-processes. As expected, higher values raised
the number of timeouts. And the most interesting result was that
`--num-processes 1` avoided timeouts, used less CPU time, and did not
increase the duration.
>> In my tests setting --num-processes to a lower value not only avoided
>> timeouts but also reduced the processing overhead without increasing the
>> runtime.
>>
>> Could we run all tests with `--num-processes 1`?
> The question is what impact that has on the overall job execution
> time. A lot of our jobs are already quite long, which is bad for
> the turnaround time of CI testing. Reliable CI though is arguably
> the #1 priority though, otherwise developers cease trusting it.
> We need to find the balance between avoiding timeouts, while having
> the shortest practical job time. The TCI job you show above came
> out at 22 minutes, which is not our worst job, so there is some
> scope for allowing it to run longer with less parallelism.
The TCI job finishes in less than 7 minutes in my test runs with
less parallelism.
Obviously some tests already do their own multithreading, while
others may run single-threaded. So maybe we need different values
for `--num-processes` depending on the number of threads which the
individual tests use?
Regards,
Stefan
* Re: Timeouts in CI jobs
From: Daniel P. Berrangé @ 2024-04-25 13:27 UTC (permalink / raw)
To: Stefan Weil
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell
On Wed, Apr 24, 2024 at 08:10:19PM +0200, Stefan Weil wrote:
> Am 24.04.24 um 19:09 schrieb Daniel P. Berrangé:
>
> > On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> > > I think the timeouts are caused by running too many parallel processes
> > > during testing.
> > >
> > > The CI uses parallel builds:
> > >
> > > make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
> > Note that command is running both the compile and test phases of
> > the job. Overcommitting CPUs for the compile phase is a good
> > idea to keep CPUs busy while another process is waiting on
> > I/O, and is almost always safe todo.
>
>
> Thank you for your answer.
>
> Overcommitting for the build is safe, but in my experience the positive
> effect is typically very small on modern hosts with fast disk I/O and large
> buffer caches.
That's fine on typical developer machines, but the shared runners in
GitLab are fairly resource-constrained by comparison, and resources
are often under contention from other VMs in their infrastructure.
> And there is also a negative impact because this requires scheduling with
> process switches.
>
> Therefore I am not so sure that overcommitting is a good idea, especially
> not on cloud servers where the jobs are running in VMs.
As a point of reference, 'ninja' defaults to '$nproc + 2'.
> >
> > In the primary QEMU repo, we have a customer runner registered
> > that uses Azure based VMs. Not sure on the resources we have
> > configured for them offhand.
>
> I was talking about the primary QEMU.
>
> > The important thing there is that what you see for CI speed in
> > your fork repo is not neccessarily a match for CI speed in QEMU
> > upstream repo.
>
> I did not run tests in my GitLab fork because I still have to figure out how
> to do that.
It is quite simple:
  git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
  git push gitlab -o QEMU_CI=2
This immediately runs all pipeline jobs. Use QEMU_CI=1 to not start
any jobs; that lets you manually start the subset you are interested
in checking.
> My test environment was an older (= slow) VM with 4 cores. I tested with
> different values for --num-processes. As expected higher values raised the
> number of timeouts. And the most interesting result was that
> `--num-processes 1` avoided timeouts, used less CPU time and did not
> increase the duration.
>
> > > In my tests setting --num-processes to a lower value not only avoided
> > > timeouts but also reduced the processing overhead without increasing the
> > > runtime.
> > >
> > > Could we run all tests with `--num-processes 1`?
> > The question is what impact that has on the overall job execution
> > time. A lot of our jobs are already quite long, which is bad for
> > the turnaround time of CI testing. Reliable CI though is arguably
> > the #1 priority though, otherwise developers cease trusting it.
> > We need to find the balance between avoiding timeouts, while having
> > the shortest practical job time. The TCI job you show above came
> > out at 22 minutes, which is not our worst job, so there is some
> > scope for allowing it to run longer with less parallelism.
>
> The TCI job terminates after less than 7 minutes in my test runs with less
> parallelism.
>
> Obviously there are tests which already do their own multithreading, and
> maybe other tests run single threaded. So maybe we need different values for
> `--num-processes` depending on the number of threads which the single tests
> use?
QEMU has different test suites too. The unit tests are likely safe
to run fully in parallel, but the block I/O tests and qtests are likely
to benefit from serialization, since they all spawn many QEMU processes
as children that will consume multiple CPUs, so we probably don't need
to run the actual test suite in parallel to max out the CPUs. This still
needs testing under GitLab CI to prove the theory.
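As a rough sketch of that idea (the suite names here are illustrative and
would need checking against what meson actually reports for QEMU):

  # unit tests: safe to run fully parallel
  pyvenv/bin/meson test --no-rebuild -t 1 --suite unit --num-processes $(nproc) --print-errorlogs
  # qtests: each test already spawns QEMU children, so run the tests serially
  pyvenv/bin/meson test --no-rebuild -t 1 --suite qtest --num-processes 1 --print-errorlogs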
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Timeouts in CI jobs
From: Daniel P. Berrangé @ 2024-04-25 13:30 UTC (permalink / raw)
To: Stefan Weil, QEMU Developers, Paolo Bonzini, Richard Henderson,
Peter Maydell
On Thu, Apr 25, 2024 at 02:27:17PM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 24, 2024 at 08:10:19PM +0200, Stefan Weil wrote:
> >
> > I did not run tests in my GitLab fork because I still have to figure out how
> > to do that.
>
> It is quite simple:
>
> git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
> git push gitlab -o QEMU_CI=2
Sorry, my mistake: the second line should be
git push gitlab -o ci.variable=QEMU_CI=2
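i.e. the full sequence for a fork would be (with <yourusername> replaced
appropriately):

  git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
  git push gitlab -o ci.variable=QEMU_CI=2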
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|