* cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
From: Peter Maydell @ 2024-04-16 12:10 UTC (permalink / raw)
To: QEMU Developers; +Cc: Stefan Weil, Richard Henderson
The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
https://gitlab.com/qemu-project/qemu/-/issues/2285 has the details
with links to 8 CI jobs in the last week or so with timeouts, typically
something like:
 16/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp           TIMEOUT 240.17s  killed by signal 15 SIGTERM
 73/258 qemu:qtest+qtest-ppc / qtest-ppc/boot-serial-test           TIMEOUT 360.14s  killed by signal 15 SIGTERM
 78/258 qemu:qtest+qtest-i386 / qtest-i386/ide-test                 TIMEOUT 60.09s   killed by signal 15 SIGTERM
253/258 qemu:softfloat+softfloat-ops / fp-test-mul                  TIMEOUT 30.11s   killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div                  TIMEOUT 30.25s   killed by signal 15 SIGTERM
255/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test     TIMEOUT 480.23s  killed by signal 15 SIGTERM
257/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test   TIMEOUT 610.10s  killed by signal 15 SIGTERM
but not always those exact tests. This isn't the first time
this CI job for TCI has been flaky either.
Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.
Can somebody who cares about TCI investigate, please, and track
down whatever this is?
(My alternate suggestion is that we mark TCI as deprecated in 9.1
and drop it entirely, if nobody cares enough about it...)
thanks
-- PMM
* Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
From: Stefan Weil via @ 2024-04-16 12:17 UTC (permalink / raw)
To: Peter Maydell, QEMU Developers; +Cc: Richard Henderson
Am 16.04.24 um 14:10 schrieb Peter Maydell:
> The cross-i686-tci job is flaky again, with persistent intermittent
> failures due to jobs timing out.
[...]
> Some of these timeouts are very high -- no test should be taking
> 10 minutes, even given TCI and a slowish CI runner -- which suggests
> to me that there's some kind of intermittent deadlock going on.
>
> Can somebody who cares about TCI investigate, please, and track
> down whatever this is?
I'll have a look.
Regards
Stefan
* Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
From: Stefan Weil via @ 2024-04-20 20:25 UTC (permalink / raw)
To: Peter Maydell, QEMU Developers; +Cc: Richard Henderson
Am 16.04.24 um 14:17 schrieb Stefan Weil:
> Am 16.04.24 um 14:10 schrieb Peter Maydell:
>
>> The cross-i686-tci job is flaky again, with persistent intermittent
>> failures due to jobs timing out.
> [...]
>> Some of these timeouts are very high -- no test should be taking
>> 10 minutes, even given TCI and a slowish CI runner -- which suggests
>> to me that there's some kind of intermittent deadlock going on.
>>
>> Can somebody who cares about TCI investigate, please, and track
>> down whatever this is?
>
> I'll have a look.
Short summary:
The "persistent intermittent failures due to jobs timing out" are not
related to TCI: they also occur if the same tests are run with the
normal TCG. I suggest that the CI tests should run single-threaded.
But let's have a look at the details of my results.
I ran `time make test` in different scenarios on the rather old and
not particularly fast VM which I typically use for QEMU builds. I did
not restrict the tests to selected architectures as is done in the
QEMU CI jobs, so I had more tests, all of which passed:
Ok: 848
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 68
Timeout: 0
---
1st test with normal TCG
`nohup time ../configure --enable-modules --disable-werror && nohup time make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata
633476maxresident)k
1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps
---
2nd test with TCI
`nohup time ../configure --enable-modules --disable-werror --enable-tcg-interpreter && nohup time make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata
633012maxresident)k
32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps
---
So the total test time with TCI was 26:50.51 minutes while for the
normal test it was 21:49.78 minutes.
These 10 individual tests had the longest durations:
1st test with normal TCG
94/916 qtest-arm / qtest-arm/qom-test 373.41s
99/916 qtest-aarch64 / qtest-aarch64/qom-test 398.43s
100/916 qtest-i386 / qtest-i386/bios-tables-test 188.06s
103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 228.33s
106/916 qtest-aarch64 / qtest-aarch64/migration-test 201.15s
119/916 qtest-i386 / qtest-i386/migration-test 253.58s
126/916 qtest-x86_64 / qtest-x86_64/migration-test 266.66s
143/916 qtest-arm / qtest-arm/test-hmp 101.72s
144/916 qtest-aarch64 / qtest-aarch64/test-hmp 113.10s
163/916 qtest-arm / qtest-arm/aspeed_smc-test 256.92s
2nd test with TCI
68/916 qtest-arm / qtest-arm/qom-test 375.35s
82/916 qtest-aarch64 / qtest-aarch64/qom-test 403.50s
93/916 qtest-i386 / qtest-i386/bios-tables-test 192.22s
99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test 379.92s
100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 240.19s
103/916 qtest-aarch64 / qtest-aarch64/migration-test 223.49s
106/916 qtest-ppc64 / qtest-ppc64/pxe-test 418.42s
113/916 qtest-i386 / qtest-i386/migration-test 284.96s
118/916 qtest-arm / qtest-arm/aspeed_smc-test 271.10s
119/916 qtest-x86_64 / qtest-x86_64/migration-test 287.36s
---
Summary:
TCI is not much slower than the normal TCG. Surprisingly, it was even
faster for tests 99 and 103. For other tests, like test 106, TCI was
only about half as fast as the normal TCG, but overall it is not
"factors" slower. A total test time of 26:50 minutes is also not bad
compared with the 21:49 minutes of the normal TCG.
No single test (including subtests) with TCI exceeded 10 minutes; the
longest one stayed well below that limit at 418 seconds.
---
The tests above ran on x86_64, and I could not reproduce any timeouts.
The GitLab CI tests run on i686 and use different configure options.
Therefore I ran additional tests with 32-bit builds in a chroot
environment (Debian GNU/Linux bullseye, i386) with the original
configure options. As expected, that reduced the number of tests
to 250. All tests passed with `make test`:
3rd test with normal TCG
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata
609180maxresident)k
28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps
4th test with TCI
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to /root/qemu/build/meson-logs/testlog.txt
make[1]: Leaving directory '/root/qemu/build'
1401.64user 483.55system 9:03.25elapsed 347%CPU (0avgtext+0avgdata
609244maxresident)k
24inputs+4690040outputs (70405major+7972385minor)pagefaults 0swaps
---
Summary:
Again TCI is not much slower than the normal TCG. The total test time
for all 250 tests is below 10 minutes, even with TCI!
---
Could it be that the timeouts in GitLab CI are caused by misconfigured
parallelism? Those tests are not started with `make test`, which would
run single-threaded, but with meson and the argument
`--num-processes 9`. I tested that with the normal TCG on my VM, which
can only run 4 threads simultaneously:
5th test with TCG
pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Summary of Failures:
254/258 qemu:softfloat+softfloat-ops / fp-test-mul   TIMEOUT 30.12s  killed by signal 15 SIGTERM
Ok: 249
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 1
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
Repeating the test, it fails again, this time with 2 timeouts:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Summary of Failures:
253/258 qemu:softfloat+softfloat-ops / fp-test-mul   TIMEOUT 30.14s  killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div   TIMEOUT 30.18s  killed by signal 15 SIGTERM
Ok: 248
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 2
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 7m51.102s
user 14m0.470s
sys 7m58.427s
Now I reduced the number of threads to 4:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 4 --print-errorlogs
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 6m42.648s
user 13m52.271s
sys 7m35.643s
---
Summary:
I could reproduce persistent intermittent failures due to jobs timing
out without TCI when I tried to run the tests with more threads than my
machine supports. This result is expected because not all single tests
can not run 100% of the time. They are interrupted because of
scheduling, and it's normal that some (random) tests will have a longer
duration. Some will even raise a timeout.
The final test shows that adjusting the number of threads to the hosts
capabilities fixes the problem. But it also shows that 4 threads don't
accelerate the whole test job compared to a single thread. Using
multithreading obviously wastes a lot of user and sys CPU time.
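For illustration, a minimal sketch of what I mean (the same meson
invocation as above, but with the process count derived from the host
instead of a fixed value; the exact invocation used by the CI may differ):

  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes $(nproc) --print-errorlogs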
Regards
Stefan
* Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)
From: Stefan Weil via @ 2024-04-24 16:27 UTC (permalink / raw)
To: QEMU Developers, Paolo Bonzini; +Cc: Richard Henderson, Peter Maydell
Am 20.04.24 um 22:25 schrieb Stefan Weil:
> Am 16.04.24 um 14:17 schrieb Stefan Weil:
>> Am 16.04.24 um 14:10 schrieb Peter Maydell:
>>
>>> The cross-i686-tci job is flaky again, with persistent intermittent
>>> failures due to jobs timing out.
>> [...]
>>> Some of these timeouts are very high -- no test should be taking
>>> 10 minutes, even given TCI and a slowish CI runner -- which suggests
>>> to me that there's some kind of intermittent deadlock going on.
>>>
>>> Can somebody who cares about TCI investigate, please, and track
>>> down whatever this is?
>>
>> I'll have a look.
>
> Short summary:
>
> The "persistent intermittent failures due to jobs timing out" are not
> related to TCI: they also occur if the same tests are run with the
> normal TCG. I suggest that the CI tests should run single threaded.
Hi Paolo,
I need help from someone who knows the CI and the build and test
framework better.
Peter reported intermittent timeouts for the cross-i686-tci job, causing
it to fail. I can reproduce such timeouts locally, but noticed that they
are not limited to TCI. The GitLab CI also shows other examples, such as
this job:
https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
I think the timeouts are caused by running too many parallel processes
during testing.
The CI uses parallel builds:
make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
It looks like `nproc` returns 8, and make runs with 9 threads.
`meson test` uses the same value to run 9 test processes in parallel:
/builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Since the host can only handle 8 parallel threads, 9 threads might
already cause some tests to run non-deterministically.
But if some of the individual tests also use multithreading (according
to my tests they do so with at least 3 or 4 threads), things get even
worse. Then there are up to 4 * 9 = 36 threads competing to run on the
available 8 cores.
In this scenario timeouts are expected and can occur randomly.
In my tests setting --num-processes to a lower value not only avoided
timeouts but also reduced the processing overhead without increasing the
runtime.
Could we run all tests with `--num-processes 1`?
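For illustration, a sketch of what the job could do instead (keeping the
parallel compile, but forcing the test phase to a single process; this is
only an idea, not a tested change to the CI scripts):

  # compile phase: overcommitting the CPUs is fine
  make -j$(expr $(nproc) + 1) all check-build
  # test phase: run the meson tests one at a time
  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 1 --print-errorlogs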
Thanks,
Stefan
* Re: Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)
From: Daniel P. Berrangé @ 2024-04-24 17:09 UTC (permalink / raw)
To: Stefan Weil
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell
On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> Am 20.04.24 um 22:25 schrieb Stefan Weil:
> > Am 16.04.24 um 14:17 schrieb Stefan Weil:
> > > Am 16.04.24 um 14:10 schrieb Peter Maydell:
> > >
> > > > The cross-i686-tci job is flaky again, with persistent intermittent
> > > > failures due to jobs timing out.
> > > [...]
> > > > Some of these timeouts are very high -- no test should be taking
> > > > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > > > to me that there's some kind of intermittent deadlock going on.
> > > >
> > > > Can somebody who cares about TCI investigate, please, and track
> > > > down whatever this is?
> > >
> > > I'll have a look.
> >
> > Short summary:
> >
> > The "persistent intermittent failures due to jobs timing out" are not
> > related to TCI: they also occur if the same tests are run with the
> > normal TCG. I suggest that the CI tests should run single threaded.
>
> Hi Paolo,
>
> I need help from someone who knows the CI and the build and test framework
> better.
>
> Peter reported intermittent timeouts for the cross-i686-tci job, causing it
> to fail. I can reproduce such timeouts locally, but noticed that they are
> not limited to TCI. The GitLab CI also shows other examples, such as this
> job:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
>
> I think the timeouts are caused by running too many parallel processes
> during testing.
>
> The CI uses parallel builds:
>
> make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
Note that this command runs both the compile and test phases of
the job. Overcommitting CPUs for the compile phase is a good
idea to keep CPUs busy while another process is waiting on
I/O, and is almost always safe to do.
Overcommitting CPUs for the test phase is less helpful and
can cause a variety of problems, as you say.
>
> It looks like `nproc` returns 8, and make runs with 9 threads.
> `meson test` uses the same value to run 9 test processes in parallel:
>
> /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1
> --num-processes 9 --print-errorlogs
>
> Since the host can only handle 8 parallel threads, 9 threads might already
> cause some tests to run non-deterministically.
In contributor forks, GitLab CI will be using the public shared
runners. These are Google Cloud VMs, which only have 2 vCPUs.
In the primary QEMU repo, we have a custom runner registered
that uses Azure-based VMs. I'm not sure offhand what resources
we have configured for them.
The important thing there is that the CI speed you see in your
fork repo is not necessarily a match for the CI speed in the
upstream QEMU repo.
>
> But if some of the individual tests also use multithreading (according to my
> tests they do so with at least 3 or 4 threads), things get even worse. Then
> there are up to 4 * 9 = 36 threads competing to run on the available 8
> cores.
>
> In this scenario timeouts are expected and can occur randomly.
>
> In my tests setting --num-processes to a lower value not only avoided
> timeouts but also reduced the processing overhead without increasing the
> runtime.
>
> Could we run all tests with `--num-processes 1`?
The question is what impact that has on the overall job execution
time. A lot of our jobs are already quite long, which is bad for
the turnaround time of CI testing. Reliable CI, though, is arguably
the #1 priority, otherwise developers cease trusting it.
We need to find the balance between avoiding timeouts and having
the shortest practical job time. The TCI job you show above came
out at 22 minutes, which is not our worst job, so there is some
scope for allowing it to run longer with less parallelism.
Timeouts for individual tests are a relatively recent change to
QEMU in:
commit 4156325cd3c205ce77b82de2c48127ccacddaf5b
Author: Daniel P. Berrangé <berrange@redhat.com>
Date: Fri Dec 15 08:03:57 2023 +0100
mtest2make: stop disabling meson test timeouts
Read the full commit message of that for the background rationale,
but especially this paragraph:
The main risk of this change is that the individual test timeouts might
be too short to allow completion in high load scenarios. Thus, there is
likely to be some short term pain where we have to bump the timeouts for
certain tests to make them reliable enough. The preceeding few patches
raised the timeouts for all failures that were immediately apparent
in GitLab CI.
which highlights the problem you're looking at.
The expectation was that we would need to bump the timeouts for various
tests until we get to the point where they run reliably in GitLab CI,
in both forks and upstream.
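For context, the `-t 1` in the meson invocations quoted above is the
timeout multiplier (`--timeout-multiplier`); before that commit the tests
were effectively run with timeouts disabled. When reproducing locally, a
quick workaround while individual test timeouts are still being tuned is
to raise the multiplier, e.g. (illustrative only, not a proposed CI change):

  pyvenv/bin/meson test --no-rebuild -t 3 --num-processes 9 --print-errorlogs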
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Timeouts in CI jobs
From: Stefan Weil via @ 2024-04-24 18:10 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell
Am 24.04.24 um 19:09 schrieb Daniel P. Berrangé:
> On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
>> I think the timeouts are caused by running too many parallel processes
>> during testing.
>>
>> The CI uses parallel builds:
>>
>> make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
> Note that command is running both the compile and test phases of
> the job. Overcommitting CPUs for the compile phase is a good
> idea to keep CPUs busy while another process is waiting on
> I/O, and is almost always safe todo.
Thank you for your answer.
Overcommitting for the build is safe, but in my experience the positive
effect is typically very small on modern hosts with fast disk I/O and
large buffer caches.
And there is also a negative impact because it causes additional
scheduling and process switches.
Therefore I am not so sure that overcommitting is a good idea,
especially not on cloud servers where the jobs run in VMs.
> Overcommitting CPUs for the test phase is less helpful and
> can cause a variety of problems as you say.
>
>> It looks like `nproc` returns 8, and make runs with 9 threads.
>> `meson test` uses the same value to run 9 test processes in parallel:
>>
>> /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1
>> --num-processes 9 --print-errorlogs
>>
>> Since the host can only handle 8 parallel threads, 9 threads might already
>> cause some tests to run non-deterministically.
> In contributor forks, gitlab CI will be using the public shared
> runners. These are Google Cloud VMs, which only have 2 vCPUs.
>
> In the primary QEMU repo, we have a customer runner registered
> that uses Azure based VMs. Not sure on the resources we have
> configured for them offhand.
I was talking about the primary QEMU repo.
> The important thing there is that what you see for CI speed in
> your fork repo is not neccessarily a match for CI speed in QEMU
> upstream repo.
I did not run tests in my GitLab fork because I still have to figure out
how to do that.
In my initial answer to Peter's mail I had described my tests and the
test environment in detail.
My test environment was an older (= slow) VM with 4 cores. I tested with
different values for --num-processes. As expected, higher values raised
the number of timeouts. And the most interesting result was that
`--num-processes 1` avoided timeouts, used less CPU time, and did not
increase the duration.
>> In my tests setting --num-processes to a lower value not only avoided
>> timeouts but also reduced the processing overhead without increasing the
>> runtime.
>>
>> Could we run all tests with `--num-processes 1`?
> The question is what impact that has on the overall job execution
> time. A lot of our jobs are already quite long, which is bad for
> the turnaround time of CI testing. Reliable CI though is arguably
> the #1 priority though, otherwise developers cease trusting it.
> We need to find the balance between avoiding timeouts, while having
> the shortest practical job time. The TCI job you show above came
> out at 22 minutes, which is not our worst job, so there is some
> scope for allowing it to run longer with less parallelism.
The TCI job finishes in less than 7 minutes in my test runs with
less parallelism.
Obviously some tests already do their own multithreading, while
others may run single-threaded. So maybe we need different values
for `--num-processes` depending on the number of threads which the
individual tests use?
Regards,
Stefan
* Re: Timeouts in CI jobs
From: Daniel P. Berrangé @ 2024-04-25 13:27 UTC (permalink / raw)
To: Stefan Weil
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell
On Wed, Apr 24, 2024 at 08:10:19PM +0200, Stefan Weil wrote:
> Am 24.04.24 um 19:09 schrieb Daniel P. Berrangé:
>
> > On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> > > I think the timeouts are caused by running too many parallel processes
> > > during testing.
> > >
> > > The CI uses parallel builds:
> > >
> > > make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
> > Note that command is running both the compile and test phases of
> > the job. Overcommitting CPUs for the compile phase is a good
> > idea to keep CPUs busy while another process is waiting on
> > I/O, and is almost always safe todo.
>
>
> Thank you for your answer.
>
> Overcommitting for the build is safe, but in my experience the positive
> effect is typically very small on modern hosts with fast disk I/O and large
> buffer caches.
That's fine on typical developer machines, but the shared runners in
GitLab are fairly resource-constrained by comparison, and resources
are often under contention from other VMs in their infrastructure.
> And there is also a negative impact because this requires scheduling with
> process switches.
>
> Therefore I am not so sure that overcommitting is a good idea, especially
> not on cloud servers where the jobs are running in VMs.
As a point of reference, 'ninja' defaults to '$nproc + 2'.
> >
> > In the primary QEMU repo, we have a customer runner registered
> > that uses Azure based VMs. Not sure on the resources we have
> > configured for them offhand.
>
> I was talking about the primary QEMU.
>
> > The important thing there is that what you see for CI speed in
> > your fork repo is not neccessarily a match for CI speed in QEMU
> > upstream repo.
>
> I did not run tests in my GitLab fork because I still have to figure out how
> to do that.
It is quite simple:
  git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
  git push gitlab -o QEMU_CI=2
This immediately runs all pipeline jobs. Use QEMU_CI=1 to not start
any jobs; that lets you manually start the subset you are interested
in checking.
> My test environment was an older (= slow) VM with 4 cores. I tested with
> different values for --num-processes. As expected higher values raised the
> number of timeouts. And the most interesting result was that
> `--num-processes 1` avoided timeouts, used less CPU time and did not
> increase the duration.
>
> > > In my tests setting --num-processes to a lower value not only avoided
> > > timeouts but also reduced the processing overhead without increasing the
> > > runtime.
> > >
> > > Could we run all tests with `--num-processes 1`?
> > The question is what impact that has on the overall job execution
> > time. A lot of our jobs are already quite long, which is bad for
> > the turnaround time of CI testing. Reliable CI though is arguably
> > the #1 priority though, otherwise developers cease trusting it.
> > We need to find the balance between avoiding timeouts, while having
> > the shortest practical job time. The TCI job you show above came
> > out at 22 minutes, which is not our worst job, so there is some
> > scope for allowing it to run longer with less parallelism.
>
> The TCI job terminates after less than 7 minutes in my test runs with less
> parallelism.
>
> Obviously there are tests which already do their own multithreading, and
> maybe other tests run single threaded. So maybe we need different values for
> `--num-processes` depending on the number of threads which the single tests
> use?
QEMU has different test suites too. The unit tests are likely safe
to run fully in parallel, but the block I/O tests and qtests are likely
to benefit from serialization, since they all spawn many QEMU processes
as children that will consume multiple CPUs, so we probably don't need
to run the actual test suite in parallel to max out the CPUs. This still
needs testing under GitLab CI to prove the theory.
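As a rough sketch of that idea (the suite names here are illustrative and
would need checking against what meson actually reports for QEMU):

  # unit tests: safe to run fully parallel
  pyvenv/bin/meson test --no-rebuild -t 1 --suite unit --num-processes $(nproc) --print-errorlogs
  # qtests: each test already spawns QEMU children, so run the tests serially
  pyvenv/bin/meson test --no-rebuild -t 1 --suite qtest --num-processes 1 --print-errorlogs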
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Timeouts in CI jobs
From: Daniel P. Berrangé @ 2024-04-25 13:30 UTC (permalink / raw)
To: Stefan Weil, QEMU Developers, Paolo Bonzini, Richard Henderson,
Peter Maydell
On Thu, Apr 25, 2024 at 02:27:17PM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 24, 2024 at 08:10:19PM +0200, Stefan Weil wrote:
> >
> > I did not run tests in my GitLab fork because I still have to figure out how
> > to do that.
>
> It is quite simple:
>
> git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
> git push gitlab -o QEMU_CI=2
Sorry, my mistake: the second line should be
git push gitlab -o ci.variable=QEMU_CI=2
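i.e. the full sequence for a fork would be (with <yourusername> replaced
appropriately):

  git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
  git push gitlab -o ci.variable=QEMU_CI=2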
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|