* Migration tests are very slow in the CI
From: Thomas Huth @ 2022-08-08 11:57 UTC (permalink / raw)
To: QEMU Developers, Dr. David Alan Gilbert, Juan Quintela
Cc: Peter Xu, Peter Maydell, Richard Henderson
Hi!
Seems like we're getting more timeouts in the CI pipelines since commit
2649a72555e ("Allow test to run without uffd") enabled the migration tests
in more scenarios.
For example:
https://gitlab.com/qemu-project/qemu/-/jobs/2821578332#L49
You can see that the migration-test ran for more than 20 minutes for each
target (x86 and aarch64)! I think that's way too much by default.
I had a check whether there is one subtest taking a lot of time, but it
rather seems like each of the migration tests is taking 40 to 50 seconds in
the CI:
https://gitlab.com/thuth/qemu/-/jobs/2825365836#L44
Given the fact that we're running more than 30 migration tests, this quickly
sums up to 20 minutes and more.
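As a quick sanity check of that estimate, here is a minimal sketch. The test count (30, a lower bound since the suite has "more than 30") and the 40 to 50 second range are taken from the figures above; the helper name is made up for illustration.

```python
def total_minutes(num_tests, seconds_per_test):
    """Wall-clock minutes if the tests run one after another."""
    return num_tests * seconds_per_test / 60

# Figures from the CI log discussed above: at least 30 tests, 40-50 s each.
low = total_minutes(30, 40)
high = total_minutes(30, 50)
print(f"at least {low:.0f} to {high:.0f} minutes per target")
```

Since this runs once per target (x86 and aarch64), the per-job cost is roughly double that.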
Could we maybe focus on running only the most important migration tests in
quick mode, and only run the full suite under an "if (g_test_slow())" statement?
Thomas
* Re: Migration tests are very slow in the CI
From: Daniel P. Berrangé @ 2022-08-08 12:14 UTC (permalink / raw)
To: Thomas Huth
Cc: QEMU Developers, Dr. David Alan Gilbert, Juan Quintela, Peter Xu,
Peter Maydell, Richard Henderson
On Mon, Aug 08, 2022 at 01:57:17PM +0200, Thomas Huth wrote:
>
> Hi!
>
> Seems like we're getting more timeouts in the CI pipelines since commit
> 2649a72555e ("Allow test to run without uffd") enabled the migration tests
> in more scenarios.
>
> For example:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/2821578332#L49
>
> You can see that the migration-test ran for more than 20 minutes for each
> target (x86 and aarch64)! I think that's way too much by default.
Definitely too much.
> I had a check whether there is one subtest taking a lot of time, but it
> rather seems like each of the migration tests is taking 40 to 50 seconds in
> the CI:
>
> https://gitlab.com/thuth/qemu/-/jobs/2825365836#L44
Normally with CI we expect a constant slowdown factor, eg x2.
I expect with migration though, we're triggering behaviour whereby
the guest workload is generating dirty pages quicker than we can
migrate them over localhost. The balance in this can quickly tip
to create an exponential slowdown.
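To illustrate the tipping point, here is a toy model, not QEMU's actual algorithm. It assumes each precopy pass sends whatever is currently dirty, and that the guest dirties memory at a fixed rate while the pass is in flight. All names, sizes and rates are hypothetical.

```python
def precopy_seconds(ram_mb, bandwidth_mb_s, dirty_mb_s,
                    threshold_mb=1.0, max_passes=1000):
    """Return the modelled time to converge, or None if it never does."""
    remaining = float(ram_mb)
    elapsed = 0.0
    for _ in range(max_passes):
        if remaining <= threshold_mb:
            return elapsed          # small enough for the final stop-and-copy
        pass_time = remaining / bandwidth_mb_s
        elapsed += pass_time
        # Memory dirtied during this pass must be sent in the next one,
        # but it can never exceed the guest's total RAM.
        remaining = min(float(ram_mb), dirty_mb_s * pass_time)
    return None

for dirty in (500, 900, 990, 1200):
    print(dirty, precopy_seconds(100, 1000, dirty))
```

In this model the run time grows sharply as the dirty rate approaches the available bandwidth, and the migration never converges once the dirty rate exceeds it. This is why a modest loss of throughput on a busy runner can cost far more than a constant factor.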
> Given the fact that we're running more than 30 migration tests, this quickly
> sums up to 20 minutes and more.
>
> Could we maybe focus on running only the most important migration tests in
> quick mode, and only run the full suite under an "if (g_test_slow())"
> statement?
The GitLab shared runners in particular, I think, are going to impact the
migration tests, given that the runners are overcommitted, preemptible
instances.
If we want reliability we may need to restrict it to just do migration
qtests on the private runners, since we have predictable compute
resource available on those.
I'm not sure if 'g_test_slow' gives us enough granularity though, as
if we enable that, it'll impact the whole test suite, not just
migration tests.
Not sure of the best answer here for how to toggle it.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Migration tests are very slow in the CI
From: Thomas Huth @ 2022-08-08 12:43 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: QEMU Developers, Dr. David Alan Gilbert, Juan Quintela, Peter Xu,
Peter Maydell, Richard Henderson
On 08/08/2022 14.14, Daniel P. Berrangé wrote:
> On Mon, Aug 08, 2022 at 01:57:17PM +0200, Thomas Huth wrote:
>>
>> Hi!
>>
>> Seems like we're getting more timeouts in the CI pipelines since commit
>> 2649a72555e ("Allow test to run without uffd") enabled the migration tests
>> in more scenarios.
>>
>> For example:
>>
>> https://gitlab.com/qemu-project/qemu/-/jobs/2821578332#L49
>>
>> You can see that the migration-test ran for more than 20 minutes for each
>> target (x86 and aarch64)! I think that's way too much by default.
>
> Definitely too much.
>
>> I had a check whether there is one subtest taking a lot of time, but it
>> rather seems like each of the migration tests is taking 40 to 50 seconds in
>> the CI:
>>
>> https://gitlab.com/thuth/qemu/-/jobs/2825365836#L44
>
> Normally with CI we expect a constant slowdown factor, eg x2.
>
> I expect with migration though, we're triggering behaviour whereby
> the guest workload is generating dirty pages quicker than we can
> migrate them over localhost. The balance in this can quickly tip
> to create an exponential slowdown.
If I run the aarch64 migration-test on my otherwise idle x86 laptop, it
already takes ca. 460 seconds to finish, which is IMHO also too much
for a normal "make check" run (without SPEED=slow).
> I'm not sure if 'g_test_slow' gives us enough granularity though, as
> if we enable that, it'll impact the whole test suite, not just
> migration tests.
We could also check for the GITLAB_CI environment variable, just like we
already do it in some of the avocado-based tests ... but given the fact that
the migration test is already very slow on my normal x86 laptop, I think I'd
prefer if we added some checks with g_test_slow() in there ...
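For reference, the environment-variable approach could look roughly like the following in a Python-based test. This is a minimal sketch using the standard unittest module. GITLAB_CI is a variable that GitLab sets in CI jobs, while the helper and test names are hypothetical and not taken from the QEMU tree.

```python
import os
import unittest

def running_on_gitlab_ci(environ=os.environ):
    """True when the GITLAB_CI variable is present and non-empty."""
    return bool(environ.get("GITLAB_CI"))

class MigrationSketch(unittest.TestCase):
    # Hypothetical placeholder test; skipped only on GitLab runners.
    @unittest.skipIf(running_on_gitlab_ci(), "too slow on shared GitLab runners")
    def test_full_migration(self):
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```

Note that such a check only helps in the CI; it would not shorten a local "make check" run, which is the reason for preferring g_test_slow() here.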
Are there any tests in migration-test.c that are rather redundant and could
be easily skipped in quick mode?
Thomas
* Re: Migration tests are very slow in the CI
From: Daniel P. Berrangé @ 2022-08-08 12:58 UTC (permalink / raw)
To: Thomas Huth
Cc: QEMU Developers, Dr. David Alan Gilbert, Juan Quintela, Peter Xu,
Peter Maydell, Richard Henderson
On Mon, Aug 08, 2022 at 02:43:49PM +0200, Thomas Huth wrote:
> On 08/08/2022 14.14, Daniel P. Berrangé wrote:
> > On Mon, Aug 08, 2022 at 01:57:17PM +0200, Thomas Huth wrote:
> > >
> > > Hi!
> > >
> > > Seems like we're getting more timeouts in the CI pipelines since commit
> > > 2649a72555e ("Allow test to run without uffd") enabled the migration tests
> > > in more scenarios.
> > >
> > > For example:
> > >
> > > https://gitlab.com/qemu-project/qemu/-/jobs/2821578332#L49
> > >
> > > You can see that the migration-test ran for more than 20 minutes for each
> > > target (x86 and aarch64)! I think that's way too much by default.
> >
> > Definitely too much.
> >
> > > I had a check whether there is one subtest taking a lot of time, but it
> > > rather seems like each of the migration tests is taking 40 to 50 seconds in
> > > the CI:
> > >
> > > https://gitlab.com/thuth/qemu/-/jobs/2825365836#L44
> >
> > Normally with CI we expect a constant slowdown factor, eg x2.
> >
> > I expect with migration though, we're triggering behaviour whereby
> > the guest workload is generating dirty pages quicker than we can
> > migrate them over localhost. The balance in this can quickly tip
> > to create an exponential slowdown.
>
> If I run the aarch64 migration-test on my otherwise idle x86 laptop, it
> already takes ca. 460 seconds to finish, which is IMHO also too much
> for a normal "make check" run (without SPEED=slow).
>
> > I'm not sure if 'g_test_slow' gives us enough granularity though, as
> > if we enable that, it'll impact the whole test suite, not just
> > migration tests.
>
> We could also check for the GITLAB_CI environment variable, just like we
> already do it in some of the avocado-based tests ... but given the fact that
> the migration test is already very slow on my normal x86 laptop, I think I'd
> prefer if we added some checks with g_test_slow() in there ...
>
> Are there any tests in migration-test.c that are rather redundant and could
> be easily skipped in quick mode?
The trouble with migration is that there are a lot of subtle permutations
that interact in weird ways, so we've got a lot of test scenarios, including
many with TLS:
/x86_64/migration/bad_dest
/x86_64/migration/fd_proto
/x86_64/migration/validate_uuid
/x86_64/migration/validate_uuid_error
/x86_64/migration/validate_uuid_src_not_set
/x86_64/migration/validate_uuid_dst_not_set
/x86_64/migration/auto_converge
/x86_64/migration/dirty_ring
/x86_64/migration/vcpu_dirty_limit
/x86_64/migration/postcopy/unix
/x86_64/migration/postcopy/plain
/x86_64/migration/postcopy/recovery/plain
/x86_64/migration/postcopy/recovery/tls/psk
/x86_64/migration/postcopy/preempt/plain
/x86_64/migration/postcopy/preempt/recovery/plain
/x86_64/migration/postcopy/preempt/recovery/tls/psk
/x86_64/migration/postcopy/preempt/tls/psk
/x86_64/migration/postcopy/tls/psk
/x86_64/migration/precopy/unix/plain
/x86_64/migration/precopy/unix/xbzrle
/x86_64/migration/precopy/unix/tls/psk
/x86_64/migration/precopy/unix/tls/x509/default-host
/x86_64/migration/precopy/unix/tls/x509/override-host
/x86_64/migration/precopy/tcp/plain
/x86_64/migration/precopy/tcp/tls/psk/match
/x86_64/migration/precopy/tcp/tls/psk/mismatch
/x86_64/migration/precopy/tcp/tls/x509/default-host
/x86_64/migration/precopy/tcp/tls/x509/override-host
/x86_64/migration/precopy/tcp/tls/x509/mismatch-host
/x86_64/migration/precopy/tcp/tls/x509/friendly-client
/x86_64/migration/precopy/tcp/tls/x509/hostile-client
/x86_64/migration/precopy/tcp/tls/x509/allow-anon-client
/x86_64/migration/precopy/tcp/tls/x509/reject-anon-client
/x86_64/migration/multifd/tcp/plain/none
/x86_64/migration/multifd/tcp/plain/cancel
/x86_64/migration/multifd/tcp/plain/zlib
/x86_64/migration/multifd/tcp/plain/zstd
/x86_64/migration/multifd/tcp/tls/psk/match
/x86_64/migration/multifd/tcp/tls/psk/mismatch
/x86_64/migration/multifd/tcp/tls/x509/default-host
/x86_64/migration/multifd/tcp/tls/x509/override-host
/x86_64/migration/multifd/tcp/tls/x509/mismatch-host
/x86_64/migration/multifd/tcp/tls/x509/allow-anon-client
/x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
Each takes about 4 seconds, except for the xbzrle, autoconverge and
vcpu-dirty-rate tests which take 8-12 seconds.
We could short-circuit most of the tls tests, because 90% of what
they're validating is the initial connection setup phase. We don't
really need to run the full migration to completion, we can just
abort once we're running. Just keep 3 doing the full migration
to completion - one precopy, one postcopy and one multifd.
That'd cut most of the TLS tests from 4 seconds to 0.5 seconds.
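A rough estimate of the saving, as a sketch: the list above has 23 entries with "/tls/" in their path, and the 4 second and 0.5 second figures are the local timings quoted here, not CI timings. It assumes, purely for illustration, that the three tests kept at full length are all TLS tests.

```python
TLS_TESTS = 23               # entries in the list above containing "/tls/"
FULL_RUN_SECONDS = 4.0       # a TLS test that migrates to completion
SHORT_CIRCUIT_SECONDS = 0.5  # a TLS test aborted once migration is running
KEPT_FULL = 3                # assumption: one precopy, one postcopy, one multifd

def tls_seconds(kept_full):
    """Total time for the TLS tests when kept_full of them run to completion."""
    short = TLS_TESTS - kept_full
    return kept_full * FULL_RUN_SECONDS + short * SHORT_CIRCUIT_SECONDS

before = tls_seconds(TLS_TESTS)
after = tls_seconds(KEPT_FULL)
print(f"TLS tests: {before:.0f}s before, {after:.0f}s after")
```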
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Migration tests are very slow in the CI
From: Dr. David Alan Gilbert @ 2022-08-08 17:00 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Thomas Huth, QEMU Developers, Juan Quintela, Peter Xu,
Peter Maydell, Richard Henderson
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Aug 08, 2022 at 02:43:49PM +0200, Thomas Huth wrote:
> > On 08/08/2022 14.14, Daniel P. Berrangé wrote:
> > > On Mon, Aug 08, 2022 at 01:57:17PM +0200, Thomas Huth wrote:
> > > >
> > > > Hi!
> > > >
> > > > Seems like we're getting more timeouts in the CI pipelines since commit
> > > > 2649a72555e ("Allow test to run without uffd") enabled the migration tests
> > > > in more scenarios.
> > > >
> > > > For example:
> > > >
> > > > https://gitlab.com/qemu-project/qemu/-/jobs/2821578332#L49
> > > >
> > > > You can see that the migration-test ran for more than 20 minutes for each
> > > > target (x86 and aarch64)! I think that's way too much by default.
> > >
> > > Definitely too much.
> > >
> > > > I had a check whether there is one subtest taking a lot of time, but it
> > > > rather seems like each of the migration tests is taking 40 to 50 seconds in
> > > > the CI:
> > > >
> > > > https://gitlab.com/thuth/qemu/-/jobs/2825365836#L44
> > >
> > > Normally with CI we expect a constant slowdown factor, eg x2.
> > >
> > > I expect with migration though, we're triggering behaviour whereby
> > > the guest workload is generating dirty pages quicker than we can
> > > migrate them over localhost. The balance in this can quickly tip
> > > to create an exponential slowdown.
> >
> > If I run the aarch64 migration-test on my otherwise idle x86 laptop, it
> > already takes ca. 460 seconds to finish, which is IMHO also too much
> > for a normal "make check" run (without SPEED=slow).
> >
> > > I'm not sure if 'g_test_slow' gives us enough granularity though, as
> > > if we enable that, it'll impact the whole test suite, not just
> > > migration tests.
> >
> > We could also check for the GITLAB_CI environment variable, just like we
> > already do it in some of the avocado-based tests ... but given the fact that
> > the migration test is already very slow on my normal x86 laptop, I think I'd
> > prefer if we added some checks with g_test_slow() in there ...
> >
> > Are there any tests in migration-test.c that are rather redundant and could
> > be easily skipped in quick mode?
>
> The trouble with migration is that there are a lot of subtle permutations
> that interact in weird ways, so we've got a lot of test scenarios, including
> many with TLS:
>
> /x86_64/migration/bad_dest
> /x86_64/migration/fd_proto
> /x86_64/migration/validate_uuid
> /x86_64/migration/validate_uuid_error
> /x86_64/migration/validate_uuid_src_not_set
> /x86_64/migration/validate_uuid_dst_not_set
> /x86_64/migration/auto_converge
> /x86_64/migration/dirty_ring
> /x86_64/migration/vcpu_dirty_limit
> /x86_64/migration/postcopy/unix
> /x86_64/migration/postcopy/plain
> /x86_64/migration/postcopy/recovery/plain
> /x86_64/migration/postcopy/recovery/tls/psk
> /x86_64/migration/postcopy/preempt/plain
> /x86_64/migration/postcopy/preempt/recovery/plain
> /x86_64/migration/postcopy/preempt/recovery/tls/psk
> /x86_64/migration/postcopy/preempt/tls/psk
> /x86_64/migration/postcopy/tls/psk
> /x86_64/migration/precopy/unix/plain
> /x86_64/migration/precopy/unix/xbzrle
> /x86_64/migration/precopy/unix/tls/psk
> /x86_64/migration/precopy/unix/tls/x509/default-host
> /x86_64/migration/precopy/unix/tls/x509/override-host
> /x86_64/migration/precopy/tcp/plain
> /x86_64/migration/precopy/tcp/tls/psk/match
> /x86_64/migration/precopy/tcp/tls/psk/mismatch
> /x86_64/migration/precopy/tcp/tls/x509/default-host
> /x86_64/migration/precopy/tcp/tls/x509/override-host
> /x86_64/migration/precopy/tcp/tls/x509/mismatch-host
> /x86_64/migration/precopy/tcp/tls/x509/friendly-client
> /x86_64/migration/precopy/tcp/tls/x509/hostile-client
> /x86_64/migration/precopy/tcp/tls/x509/allow-anon-client
> /x86_64/migration/precopy/tcp/tls/x509/reject-anon-client
> /x86_64/migration/multifd/tcp/plain/none
> /x86_64/migration/multifd/tcp/plain/cancel
> /x86_64/migration/multifd/tcp/plain/zlib
> /x86_64/migration/multifd/tcp/plain/zstd
> /x86_64/migration/multifd/tcp/tls/psk/match
> /x86_64/migration/multifd/tcp/tls/psk/mismatch
> /x86_64/migration/multifd/tcp/tls/x509/default-host
> /x86_64/migration/multifd/tcp/tls/x509/override-host
> /x86_64/migration/multifd/tcp/tls/x509/mismatch-host
> /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client
> /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
>
> Each takes about 4 seconds, except for the xbzrle, autoconverge and
> vcpu-dirty-rate tests which take 8-12 seconds.
>
> We could short-circuit most of the tls tests, because 90% of what
> they're validating is the initial connection setup phase. We don't
> really need to run the full migration to completion, we can just
> abort once we're running. Just keep 3 doing the full migration
> to completion - one precopy, one postcopy and one multifd.
I'd rather we combined some tests than cut stuff off; I was about to
suggest doing zlib with some of the TLS tests, but then that wouldn't have
found the recent zlib one!
Dave
> That'd cut most of the TLS tests from 4 seconds to 0.5 seconds.
>
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK