* cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?

From: Peter Maydell @ 2024-04-16 12:10 UTC
To: QEMU Developers; +Cc: Stefan Weil, Richard Henderson

The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
https://gitlab.com/qemu-project/qemu/-/issues/2285 has the details,
with links to 8 CI jobs in the last week or so with timeouts,
typically something like:

 16/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp           TIMEOUT 240.17s  killed by signal 15 SIGTERM
 73/258 qemu:qtest+qtest-ppc / qtest-ppc/boot-serial-test           TIMEOUT 360.14s  killed by signal 15 SIGTERM
 78/258 qemu:qtest+qtest-i386 / qtest-i386/ide-test                 TIMEOUT  60.09s  killed by signal 15 SIGTERM
253/258 qemu:softfloat+softfloat-ops / fp-test-mul                  TIMEOUT  30.11s  killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div                  TIMEOUT  30.25s  killed by signal 15 SIGTERM
255/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test     TIMEOUT 480.23s  killed by signal 15 SIGTERM
257/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test   TIMEOUT 610.10s  killed by signal 15 SIGTERM

but not always those exact tests. This isn't the first time this CI
job for TCI has been flaky, either.

Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track down
whatever this is?

(My alternate suggestion is that we mark TCI as deprecated in 9.1 and
drop it entirely, if nobody cares enough about it...)

thanks
-- PMM
* Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?

From: Stefan Weil via @ 2024-04-16 12:17 UTC
To: Peter Maydell, QEMU Developers; +Cc: Richard Henderson

On 16.04.24 at 14:10, Peter Maydell wrote:

> The cross-i686-tci job is flaky again, with persistent intermittent
> failures due to jobs timing out.
[...]
> Some of these timeouts are very high -- no test should be taking
> 10 minutes, even given TCI and a slowish CI runner -- which suggests
> to me that there's some kind of intermittent deadlock going on.
>
> Can somebody who cares about TCI investigate, please, and track
> down whatever this is?

I'll have a look.

Regards
Stefan
* Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?

From: Stefan Weil via @ 2024-04-20 20:25 UTC
To: Peter Maydell, QEMU Developers; +Cc: Richard Henderson

On 16.04.24 at 14:17, Stefan Weil wrote:
> On 16.04.24 at 14:10, Peter Maydell wrote:
>
>> The cross-i686-tci job is flaky again, with persistent intermittent
>> failures due to jobs timing out.
> [...]
>> Some of these timeouts are very high -- no test should be taking
>> 10 minutes, even given TCI and a slowish CI runner -- which suggests
>> to me that there's some kind of intermittent deadlock going on.
>>
>> Can somebody who cares about TCI investigate, please, and track
>> down whatever this is?
>
> I'll have a look.

Short summary:

The "persistent intermittent failures due to jobs timing out" are not
related to TCI: they also occur if the same tests are run with the
normal TCG. I suggest that the CI tests should run single threaded.

But let's look at the details of my results.

I ran `time make test` in different scenarios on the rather old and not
very fast VM which I typically use for QEMU builds. I did not restrict
the tests to selected architectures as is done in the QEMU CI jobs, so
more tests were run, and all of them passed:

  Ok:                848
  Expected Fail:       0
  Fail:                0
  Unexpected Pass:     0
  Skipped:            68
  Timeout:             0

---

1st test with normal TCG

  nohup time ../configure --enable-modules --disable-werror &&
    nohup time make -j4 && nohup time make test

[...]
Full log written to /home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata 633476maxresident)k
1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps

---

2nd test with TCI

  nohup time ../configure --enable-modules --disable-werror --enable-tcg-interpreter &&
    nohup time make -j4 && nohup time make test

[...]
Full log written to /home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata 633012maxresident)k
32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps

---

So the total test time with TCI was 26:50.51 minutes, while for the
normal test it was 21:49.78 minutes.
These 10 individual tests had the longest duration:

1st test with normal TCG

 94/916 qtest-arm / qtest-arm/qom-test                 373.41s
 99/916 qtest-aarch64 / qtest-aarch64/qom-test         398.43s
100/916 qtest-i386 / qtest-i386/bios-tables-test       188.06s
103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test   228.33s
106/916 qtest-aarch64 / qtest-aarch64/migration-test   201.15s
119/916 qtest-i386 / qtest-i386/migration-test         253.58s
126/916 qtest-x86_64 / qtest-x86_64/migration-test     266.66s
143/916 qtest-arm / qtest-arm/test-hmp                 101.72s
144/916 qtest-aarch64 / qtest-aarch64/test-hmp         113.10s
163/916 qtest-arm / qtest-arm/aspeed_smc-test          256.92s

2nd test with TCI

 68/916 qtest-arm / qtest-arm/qom-test                   375.35s
 82/916 qtest-aarch64 / qtest-aarch64/qom-test           403.50s
 93/916 qtest-i386 / qtest-i386/bios-tables-test         192.22s
 99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test   379.92s
100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test     240.19s
103/916 qtest-aarch64 / qtest-aarch64/migration-test     223.49s
106/916 qtest-ppc64 / qtest-ppc64/pxe-test               418.42s
113/916 qtest-i386 / qtest-i386/migration-test           284.96s
118/916 qtest-arm / qtest-arm/aspeed_smc-test            271.10s
119/916 qtest-x86_64 / qtest-x86_64/migration-test       287.36s

---

Summary: TCI is not much slower than the normal TCG. Surprisingly it
was even faster for tests 99 and 103. For other tests like test 106
TCI was about half as fast as normal TCG, but overall it is not slower
by large factors. A total test time of 26:50 minutes is also not so bad
compared with the 21:49 minutes of the normal TCG. No single test
(including subtests) with TCI exceeded 10 minutes; the longest one was
well below that margin at 418 seconds.

---

The tests above were running on x86_64, and I could not reproduce
timeouts. The GitLab CI tests run on i686 and use different configure
options, so I ran additional tests with 32-bit builds in a chroot
environment (Debian GNU/Linux bullseye i386) with the original
configure options. As expected that reduced the number of tests to 250.
All tests passed with `make test`:

3rd test with normal TCG

  Ok:                250
  Expected Fail:       0
  Fail:                0
  Unexpected Pass:     0
  Skipped:             8
  Timeout:             0

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata 609180maxresident)k
28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps

4th test with TCI

  Ok:                250
  Expected Fail:       0
  Fail:                0
  Unexpected Pass:     0
  Skipped:             8
  Timeout:             0

Full log written to /root/qemu/build/meson-logs/testlog.txt
make[1]: Leaving directory '/root/qemu/build'
1401.64user 483.55system 9:03.25elapsed 347%CPU (0avgtext+0avgdata 609244maxresident)k
24inputs+4690040outputs (70405major+7972385minor)pagefaults 0swaps

---

Summary: Again TCI is not much slower than the normal TCG. The total
test time for all 250 tests is below 10 minutes, even with TCI!

---

Could it be that the timeouts in GitLab CI are caused by wrongly
configured multithreading? Those tests are not started with
`make test`, which would run single threaded, but with meson and an
argument `--num-processes 9`.
I tested that with the normal TCG on my VM, which can only run 4
threads simultaneously:

5th test with TCG

  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs

Summary of Failures:

254/258 qemu:softfloat+softfloat-ops / fp-test-mul   TIMEOUT  30.12s  killed by signal 15 SIGTERM

  Ok:                249
  Expected Fail:       0
  Fail:                0
  Unexpected Pass:     0
  Skipped:             8
  Timeout:             1

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt

Repeating that test fails again, this time with 2 timeouts:

  time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs

Summary of Failures:

253/258 qemu:softfloat+softfloat-ops / fp-test-mul   TIMEOUT  30.14s  killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div   TIMEOUT  30.18s  killed by signal 15 SIGTERM

  Ok:                248
  Expected Fail:       0
  Fail:                0
  Unexpected Pass:     0
  Skipped:             8
  Timeout:             2

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt

real    7m51.102s
user    14m0.470s
sys     7m58.427s

Now I reduced the number of test processes to 4:

  time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 4 --print-errorlogs

  Ok:                250
  Expected Fail:       0
  Fail:                0
  Unexpected Pass:     0
  Skipped:             8
  Timeout:             0

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt

real    6m42.648s
user    13m52.271s
sys     7m35.643s

---

Summary: I could reproduce persistent intermittent failures due to jobs
timing out without TCI when I ran the tests with more parallel test
processes than my machine supports. This result is expected because the
individual tests cannot run 100% of the time: they are interrupted by
scheduling, so it is normal that some (random) tests take longer, and
some will even hit a timeout.

The final test shows that matching the number of test processes to the
host's capabilities fixes the problem. But it also shows that 4
processes don't speed up the whole test job compared to a single
thread; the extra parallelism obviously just wastes a lot of user and
sys CPU time.

Regards
Stefan
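A quick way to double-check that the timing-out tests are healthy on
their own is to rerun just those tests serially -- a sketch only; the
test names are taken from the failure summaries above:

  # rerun only the two softfloat tests that timed out, one at a time
  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 1 --print-errorlogs fp-test-mul fp-test-div

If they pass comfortably within their timeouts when run alone, the
failures above point at scheduling pressure rather than at the tests
themselves.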
* Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)

From: Stefan Weil via @ 2024-04-24 16:27 UTC
To: QEMU Developers, Paolo Bonzini; +Cc: Richard Henderson, Peter Maydell

On 20.04.24 at 22:25, Stefan Weil wrote:
> On 16.04.24 at 14:17, Stefan Weil wrote:
>> On 16.04.24 at 14:10, Peter Maydell wrote:
>>
>>> The cross-i686-tci job is flaky again, with persistent intermittent
>>> failures due to jobs timing out.
>> [...]
>>> Some of these timeouts are very high -- no test should be taking
>>> 10 minutes, even given TCI and a slowish CI runner -- which suggests
>>> to me that there's some kind of intermittent deadlock going on.
>>>
>>> Can somebody who cares about TCI investigate, please, and track
>>> down whatever this is?
>>
>> I'll have a look.
>
> Short summary:
>
> The "persistent intermittent failures due to jobs timing out" are not
> related to TCI: they also occur if the same tests are run with the
> normal TCG. I suggest that the CI tests should run single threaded.

Hi Paolo,

I need help from someone who knows the CI and the build and test
framework better.

Peter reported intermittent timeouts for the cross-i686-tci job,
causing it to fail. I can reproduce such timeouts locally, but noticed
that they are not limited to TCI. The GitLab CI also shows other
examples, such as this job:

https://gitlab.com/qemu-project/qemu/-/jobs/6700955287

I think the timeouts are caused by running too many parallel processes
during testing. The CI uses parallel builds:

  make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

It looks like `nproc` returns 8, and make runs with 9 threads.
`meson test` uses the same value to run 9 test processes in parallel:

  /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs

Since the host can only handle 8 parallel threads, 9 test processes
might already cause some tests to run non-deterministically. But if
some of the individual tests also use multithreading (according to my
tests they do so with at least 3 or 4 threads), things get even worse:
then there are up to 4 * 9 = 36 threads competing for the available 8
cores. In this scenario timeouts are expected and can occur randomly.

In my tests, setting --num-processes to a lower value not only avoided
timeouts but also reduced the processing overhead without increasing
the runtime.

Could we run all tests with `--num-processes 1`?

Thanks,
Stefan
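A rough sketch of the kind of adjustment being asked about here -- this
is not the project's actual CI script; only the two quoted commands
above come from the real job, and the split below is purely
illustrative:

  # compile phase: overcommitting CPUs is fine while other processes wait on I/O
  make -j$(expr $(nproc) + 1) all check-build
  # test phase: cap meson's parallelism at the real CPU count (or even at 1)
  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes "$(nproc)" --print-errorlogs

The point of the split is that build parallelism and test parallelism
do not have to share the same value.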
* Re: Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)

From: Daniel P. Berrangé @ 2024-04-24 17:09 UTC
To: Stefan Weil
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell

On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> On 20.04.24 at 22:25, Stefan Weil wrote:
> > On 16.04.24 at 14:17, Stefan Weil wrote:
> > > On 16.04.24 at 14:10, Peter Maydell wrote:
> > >
> > > > The cross-i686-tci job is flaky again, with persistent intermittent
> > > > failures due to jobs timing out.
> > > [...]
> > > > Some of these timeouts are very high -- no test should be taking
> > > > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > > > to me that there's some kind of intermittent deadlock going on.
> > > >
> > > > Can somebody who cares about TCI investigate, please, and track
> > > > down whatever this is?
> > >
> > > I'll have a look.
> >
> > Short summary:
> >
> > The "persistent intermittent failures due to jobs timing out" are not
> > related to TCI: they also occur if the same tests are run with the
> > normal TCG. I suggest that the CI tests should run single threaded.
>
> Hi Paolo,
>
> I need help from someone who knows the CI and the build and test
> framework better.
>
> Peter reported intermittent timeouts for the cross-i686-tci job,
> causing it to fail. I can reproduce such timeouts locally, but noticed
> that they are not limited to TCI. The GitLab CI also shows other
> examples, such as this job:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
>
> I think the timeouts are caused by running too many parallel processes
> during testing. The CI uses parallel builds:
>
>   make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

Note that command is running both the compile and test phases of the
job. Overcommitting CPUs for the compile phase is a good idea to keep
CPUs busy while another process is waiting on I/O, and is almost always
safe to do.

Overcommitting CPUs for the test phase is less helpful and can cause a
variety of problems, as you say.

> It looks like `nproc` returns 8, and make runs with 9 threads.
> `meson test` uses the same value to run 9 test processes in parallel:
>
>   /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
>
> Since the host can only handle 8 parallel threads, 9 test processes
> might already cause some tests to run non-deterministically.

In contributor forks, GitLab CI will be using the public shared
runners. These are Google Cloud VMs, which only have 2 vCPUs.

In the primary QEMU repo, we have a custom runner registered that uses
Azure-based VMs. Not sure on the resources we have configured for them
offhand.

The important thing there is that what you see for CI speed in your
fork repo is not necessarily a match for CI speed in the QEMU upstream
repo.

> But if some of the individual tests also use multithreading (according
> to my tests they do so with at least 3 or 4 threads), things get even
> worse: then there are up to 4 * 9 = 36 threads competing for the
> available 8 cores.
>
> In this scenario timeouts are expected and can occur randomly.
>
> In my tests, setting --num-processes to a lower value not only avoided
> timeouts but also reduced the processing overhead without increasing
> the runtime.
>
> Could we run all tests with `--num-processes 1`?

The question is what impact that has on the overall job execution time.
A lot of our jobs are already quite long, which is bad for the
turnaround time of CI testing. Reliable CI, though, is arguably the #1
priority, otherwise developers cease trusting it. We need to find the
balance between avoiding timeouts and having the shortest practical job
time. The TCI job you show above came out at 22 minutes, which is not
our worst job, so there is some scope for allowing it to run longer
with less parallelism.

Timeouts for individual tests are a relatively recent change to QEMU,
introduced in:

  commit 4156325cd3c205ce77b82de2c48127ccacddaf5b
  Author: Daniel P. Berrangé <berrange@redhat.com>
  Date:   Fri Dec 15 08:03:57 2023 +0100

      mtest2make: stop disabling meson test timeouts

Read the full commit message for the background rationale, but
especially this paragraph:

  The main risk of this change is that the individual test timeouts
  might be too short to allow completion in high load scenarios. Thus,
  there is likely to be some short term pain where we have to bump the
  timeouts for certain tests to make them reliable enough. The
  preceeding few patches raised the timeouts for all failures that were
  immediately apparent in GitLab CI.

which highlights the problem you're looking at. The expectation was
that we would need to bump the timeouts for various tests until we get
to the point where they reliably run in GitLab CI, in both forks and
upstream.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
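For local reproduction on a loaded or slow machine there is also a
middle ground between bumping individual test timeouts and reducing
parallelism: meson's timeout multiplier. A sketch, not a suggestion for
the CI configuration itself (the invocations quoted in this thread
already pass `-t 1`):

  # multiply every per-test timeout by 3 while keeping parallelism at the CPU count
  pyvenv/bin/meson test --no-rebuild -t 3 --num-processes "$(nproc)" --print-errorlogs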
* Re: Timeouts in CI jobs

From: Stefan Weil via @ 2024-04-24 18:10 UTC
To: Daniel P. Berrangé
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell

On 24.04.24 at 19:09, Daniel P. Berrangé wrote:
> On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
>> I think the timeouts are caused by running too many parallel processes
>> during testing.
>>
>> The CI uses parallel builds:
>>
>>   make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
>
> Note that command is running both the compile and test phases of
> the job. Overcommitting CPUs for the compile phase is a good
> idea to keep CPUs busy while another process is waiting on
> I/O, and is almost always safe to do.

Thank you for your answer.

Overcommitting for the build is safe, but in my experience the positive
effect is typically very small on modern hosts with fast disk I/O and
large buffer caches. There is also a negative impact, because
overcommitting requires scheduling with process switches. Therefore I
am not so sure that overcommitting is a good idea, especially not on
cloud servers where the jobs run in VMs.

> Overcommitting CPUs for the test phase is less helpful and
> can cause a variety of problems, as you say.
>
>> It looks like `nproc` returns 8, and make runs with 9 threads.
>> `meson test` uses the same value to run 9 test processes in parallel:
>>
>>   /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
>>
>> Since the host can only handle 8 parallel threads, 9 test processes
>> might already cause some tests to run non-deterministically.
>
> In contributor forks, GitLab CI will be using the public shared
> runners. These are Google Cloud VMs, which only have 2 vCPUs.
>
> In the primary QEMU repo, we have a custom runner registered
> that uses Azure-based VMs. Not sure on the resources we have
> configured for them offhand.

I was talking about the primary QEMU repo.

> The important thing there is that what you see for CI speed in
> your fork repo is not necessarily a match for CI speed in the QEMU
> upstream repo.

I did not run tests in my GitLab fork because I still have to figure
out how to do that.

In my initial answer to Peter's mail I described my tests and the test
environment in detail. My test environment was an older (= slow) VM
with 4 cores. I tested with different values for --num-processes. As
expected, higher values raised the number of timeouts. The most
interesting result was that `--num-processes 1` avoided timeouts, used
less CPU time, and did not increase the duration.

>> In my tests, setting --num-processes to a lower value not only avoided
>> timeouts but also reduced the processing overhead without increasing
>> the runtime.
>>
>> Could we run all tests with `--num-processes 1`?
>
> The question is what impact that has on the overall job execution
> time. A lot of our jobs are already quite long, which is bad for
> the turnaround time of CI testing. Reliable CI, though, is arguably
> the #1 priority, otherwise developers cease trusting it. We need to
> find the balance between avoiding timeouts and having the shortest
> practical job time.
> The TCI job you show above came out at 22 minutes, which is not our
> worst job, so there is some scope for allowing it to run longer with
> less parallelism.

The TCI job finishes in less than 7 minutes in my test runs with less
parallelism.

Obviously some tests already do their own multithreading, while other
tests may run single threaded. So maybe we need different values for
`--num-processes` depending on the number of threads which the
individual tests use?

Regards,
Stefan
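One rough way to check how many threads an individual test actually
uses is to run it serially and count the threads of the QEMU instance
it spawns -- a sketch; the test name is taken from the logs earlier in
the thread, and the binary name qemu-system-aarch64 is an assumption
about what that particular test launches:

  # in the build directory: run one qtest on its own ...
  pyvenv/bin/meson test --no-rebuild --num-processes 1 qtest-aarch64/qom-test &
  # ... and watch the thread count of the QEMU process it starts
  watch -n1 'ps -L -C qemu-system-aarch64 --no-headers | wc -l'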
* Re: Timeouts in CI jobs

From: Daniel P. Berrangé @ 2024-04-25 13:27 UTC
To: Stefan Weil
Cc: QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell

On Wed, Apr 24, 2024 at 08:10:19PM +0200, Stefan Weil wrote:
> On 24.04.24 at 19:09, Daniel P. Berrangé wrote:
> > On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> > > I think the timeouts are caused by running too many parallel
> > > processes during testing.
> > >
> > > The CI uses parallel builds:
> > >
> > >   make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
> >
> > Note that command is running both the compile and test phases of
> > the job. Overcommitting CPUs for the compile phase is a good
> > idea to keep CPUs busy while another process is waiting on
> > I/O, and is almost always safe to do.
>
> Thank you for your answer.
>
> Overcommitting for the build is safe, but in my experience the positive
> effect is typically very small on modern hosts with fast disk I/O and
> large buffer caches.

Fine with typical developer machines, but the shared runners in GitLab
are fairly resource constrained by comparison, and resources are often
under contention from other VMs in their infra.

> There is also a negative impact, because overcommitting requires
> scheduling with process switches. Therefore I am not so sure that
> overcommitting is a good idea, especially not on cloud servers where
> the jobs run in VMs.

As a point of reference, 'ninja' defaults to '$nproc + 2'.

> > In the primary QEMU repo, we have a custom runner registered
> > that uses Azure-based VMs. Not sure on the resources we have
> > configured for them offhand.
>
> I was talking about the primary QEMU repo.
>
> > The important thing there is that what you see for CI speed in
> > your fork repo is not necessarily a match for CI speed in the QEMU
> > upstream repo.
>
> I did not run tests in my GitLab fork because I still have to figure
> out how to do that.

It is quite simple:

  git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
  git push gitlab -o QEMU_CI=2

This immediately runs all pipeline jobs. Use QEMU_CI=1 to not start any
jobs, which lets you manually start the subset you are interested in
checking.

> My test environment was an older (= slow) VM with 4 cores. I tested
> with different values for --num-processes. As expected, higher values
> raised the number of timeouts. The most interesting result was that
> `--num-processes 1` avoided timeouts, used less CPU time, and did not
> increase the duration.
>
> > > In my tests, setting --num-processes to a lower value not only
> > > avoided timeouts but also reduced the processing overhead without
> > > increasing the runtime.
> > >
> > > Could we run all tests with `--num-processes 1`?
> >
> > The question is what impact that has on the overall job execution
> > time. A lot of our jobs are already quite long, which is bad for
> > the turnaround time of CI testing. Reliable CI, though, is arguably
> > the #1 priority, otherwise developers cease trusting it. We need to
> > find the balance between avoiding timeouts and having the shortest
> > practical job time. The TCI job you show above came out at 22
> > minutes, which is not our worst job, so there is some scope for
> > allowing it to run longer with less parallelism.
>
> The TCI job finishes in less than 7 minutes in my test runs with less
> parallelism.
>
> Obviously some tests already do their own multithreading, while other
> tests may run single threaded. So maybe we need different values for
> `--num-processes` depending on the number of threads which the
> individual tests use?

QEMU has different test suites too. The unit tests are likely safe to
run fully parallel, but the block I/O tests and qtests are likely to
benefit from serialization, since they all spawn many QEMU processes as
children that will consume multiple CPUs, so we probably don't need to
run the actual test suite in parallel to max out the CPUs.

Still needs testing under GitLab CI to prove the theory.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
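A sketch of the split Daniel describes, using suite names as they
appear in the test logs earlier in the thread -- the exact suite list
and process counts are illustrative, not a tested configuration:

  # unit-style suites: cheap single processes, safe to run in parallel
  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes "$(nproc)" --suite softfloat
  # qtests spawn whole QEMU instances themselves, so run that suite serially
  pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 1 --suite qtest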
* Re: Timeouts in CI jobs

From: Daniel P. Berrangé @ 2024-04-25 13:30 UTC
To: Stefan Weil, QEMU Developers, Paolo Bonzini, Richard Henderson, Peter Maydell

On Thu, Apr 25, 2024 at 02:27:17PM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 24, 2024 at 08:10:19PM +0200, Stefan Weil wrote:
> >
> > I did not run tests in my GitLab fork because I still have to figure
> > out how to do that.
>
> It is quite simple:
>
>   git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
>   git push gitlab -o QEMU_CI=2

Sorry, my mistake: the second line should be

  git push gitlab -o ci.variable=QEMU_CI=2

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
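Putting the two messages together, the corrected sequence for running
CI in a fork looks like this (a sketch; <yourusername> stands for the
GitLab account name):

  git remote add gitlab ssh://git@gitlab.com/<yourusername>/qemu
  # start all pipeline jobs immediately
  git push gitlab -o ci.variable=QEMU_CI=2
  # or create the pipeline without starting jobs, to pick a subset manually
  git push gitlab -o ci.variable=QEMU_CI=1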