* [PATCH 0/2] tests/qtest: make migration-test faster
  From: Daniel P. Berrangé @ 2023-04-18 13:30 UTC
  To: qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
      Daniel P. Berrangé

This makes migration-test faster by observing that most of the pre-copy
tests don't need to be doing a live migration. They get sufficient code
coverage with the guest CPUs paused.

On my machine this cuts the overall execution time of migration-test by
50%, from 15 minutes down to 8 minutes, without sacrificing any
noticeable code coverage.

This is still quite slow though. The following are the test timings,
sorted from slowest to fastest (seconds):

  /x86_64/migration/auto_converge                              68.85
  /x86_64/migration/precopy/unix/xbzrle                        68.29
  /x86_64/migration/postcopy/preempt/tls/psk                   36.57
  /x86_64/migration/dirty_ring                                 35.58
  /x86_64/migration/precopy/unix/plain                         35.56
  /x86_64/migration/postcopy/preempt/plain                     34.71
  /x86_64/migration/postcopy/recovery/plain                    34.56
  /x86_64/migration/postcopy/tls/psk                           34.45
  /x86_64/migration/postcopy/preempt/recovery/tls/psk          33.99
  /x86_64/migration/postcopy/preempt/recovery/plain            33.99
  /x86_64/migration/postcopy/plain                             33.78
  /x86_64/migration/postcopy/recovery/tls/psk                  33.30
  /x86_64/migration/multifd/tcp/plain/none                     21.12
  /x86_64/migration/vcpu_dirty_limit                           12.28
  /x86_64/migration/multifd/tcp/tls/x509/default-host           2.95
  /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client      2.68
  /x86_64/migration/multifd/tcp/tls/x509/override-host          2.51
  /x86_64/migration/precopy/tcp/tls/x509/default-host           1.52
  /x86_64/migration/precopy/unix/tls/x509/override-host         1.49
  /x86_64/migration/precopy/unix/tls/psk                        1.48
  /x86_64/migration/precopy/tcp/tls/psk/match                   1.47
  /x86_64/migration/multifd/tcp/tls/psk/match                   1.35
  /x86_64/migration/precopy/tcp/tls/x509/allow-anon-client      1.32
  /x86_64/migration/precopy/tcp/tls/x509/override-host          1.27
  /x86_64/migration/precopy/tcp/tls/x509/friendly-client        1.27
  /x86_64/migration/multifd/tcp/plain/zlib                      1.08
  /x86_64/migration/precopy/tcp/plain                           1.05
  /x86_64/migration/fd_proto                                    1.04
  /x86_64/migration/multifd/tcp/tls/psk/mismatch                1.00
  /x86_64/migration/multifd/tcp/plain/zstd                      0.98
  /x86_64/migration/precopy/tcp/tls/x509/hostile-client         0.85
  /x86_64/migration/multifd/tcp/tls/x509/mismatch-host          0.79
  /x86_64/migration/precopy/tcp/tls/x509/mismatch-host          0.75
  /x86_64/migration/precopy/unix/tls/x509/default-host          0.74
  /x86_64/migration/precopy/tcp/tls/x509/reject-anon-client     0.71
  /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client     0.68
  /x86_64/migration/precopy/tcp/tls/psk/mismatch                0.63
  /x86_64/migration/validate_uuid_src_not_set                   0.59
  /x86_64/migration/validate_uuid                               0.54
  /x86_64/migration/validate_uuid_dst_not_set                   0.53
  /x86_64/migration/bad_dest                                    0.41
  /x86_64/migration/validate_uuid_error                         0.31

The auto-converge and xbzrle tests are at the top because they both
inherently *need* to do multiple iterations in order to exercise the
desired code paths. The post-copy tests are all up there because we
always do one iteration of pre-copy before switching to post-copy, and
we need to ensure that we don't complete before getting to the
post-copy bit.

I think we can optimize the post-copy tests though. Only one of the
post-copy tests really needs to go through a full pre-copy iteration to
get good code coverage. For the other post-copy tests we can change to
let them copy 10 MB of data in pre-copy mode and then switch to
post-copy.

Also, in commit

  commit 1bfc8dde505f1e6a92697c52aa9b09e81b54c78f
  Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Date:   Mon Mar 6 15:26:12 2023 +0000

    tests/migration: Tweek auto converge limits check

we cut the bandwidth by a factor of 10 to ensure reliability. I think
that was perhaps too aggressive. If we bump it back up to, say,
10 MB/sec, that is still well below the original 30 MB/sec, perhaps
enough to keep reliability, while cutting the time of the other tests
by 70%.

Daniel P. Berrangé (2):
  tests/qtest: capture RESUME events during migration
  tests/qtest: make more migration pre-copy scenarios run non-live

 tests/qtest/migration-helpers.c | 12 +++++++++---
 tests/qtest/migration-helpers.h |  1 +
 tests/qtest/migration-test.c    | 34 +++++++++++++++++++++++++++-------
 3 files changed, 37 insertions(+), 10 deletions(-)

-- 
2.40.0

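
For reference, the bandwidth cap discussed above is just the standard
migration "max-bandwidth" parameter (in bytes per second), so bumping it
to 10 MB/sec would be a one-line change in the test. The snippet below is
a rough sketch only, assuming the migrate_set_parameter_int() helper that
migration-test.c already provides; it is not code from this series and
the call site is not shown:

    /*
     * Sketch only (not part of this series): raise the auto-converge
     * test's bandwidth cap back up to 10 MB/sec.  Assumes the
     * migrate_set_parameter_int() helper from migration-test.c;
     * "max-bandwidth" is expressed in bytes/sec.
     */
    static void example_set_test_bandwidth(QTestState *from)
    {
        migrate_set_parameter_int(from, "max-bandwidth", 10 * 1000 * 1000);
    }
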
* [PATCH 1/2] tests/qtest: capture RESUME events during migration
  From: Daniel P. Berrangé @ 2023-04-18 13:30 UTC
  To: qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
      Daniel P. Berrangé

When running migration tests we monitor for a STOP event so we can skip
redundant waits. This will be needed for the RESUME event too shortly.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-helpers.c | 12 +++++++++---
 tests/qtest/migration-helpers.h |  1 +
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index f6f3c6680f..61396335cc 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -24,14 +24,20 @@
 #define MIGRATION_STATUS_WAIT_TIMEOUT 120
 
 bool got_stop;
+bool got_resume;
 
-static void check_stop_event(QTestState *who)
+static void check_events(QTestState *who)
 {
     QDict *event = qtest_qmp_event_ref(who, "STOP");
     if (event) {
         got_stop = true;
         qobject_unref(event);
     }
+    event = qtest_qmp_event_ref(who, "RESUME");
+    if (event) {
+        got_resume = true;
+        qobject_unref(event);
+    }
 }
 
 #ifndef _WIN32
@@ -48,7 +54,7 @@ QDict *wait_command_fd(QTestState *who, int fd, const char *command, ...)
     va_end(ap);
 
     resp = qtest_qmp_receive(who);
-    check_stop_event(who);
+    check_events(who);
 
     g_assert(!qdict_haskey(resp, "error"));
     g_assert(qdict_haskey(resp, "return"));
@@ -73,7 +79,7 @@ QDict *wait_command(QTestState *who, const char *command, ...)
     resp = qtest_vqmp(who, command, ap);
     va_end(ap);
 
-    check_stop_event(who);
+    check_events(who);
 
     g_assert(!qdict_haskey(resp, "error"));
     g_assert(qdict_haskey(resp, "return"));
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index a188b62787..726a66cfc1 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -16,6 +16,7 @@
 #include "libqtest.h"
 
 extern bool got_stop;
+extern bool got_resume;
 
 #ifndef _WIN32
 G_GNUC_PRINTF(3, 4)
-- 
2.40.0

* Re: [PATCH 1/2] tests/qtest: capture RESUME events during migration
  From: Juan Quintela @ 2023-04-20 11:32 UTC
  To: Daniel P. Berrangé
  Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

Daniel P. Berrangé <berrange@redhat.com> wrote:
> When running migration tests we monitor for a STOP event so we can skip
> redundant waits. This will be needed for the RESUME event too shortly.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

I am waiting for you to check the problem that Lukas detected, but this
part of the patch is "obviously" correct.

Famous last words.

* Re: [PATCH 1/2] tests/qtest: capture RESUME events during migration
  From: Daniel P. Berrangé @ 2023-04-20 11:37 UTC
  To: Juan Quintela
  Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

On Thu, Apr 20, 2023 at 01:32:37PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > When running migration tests we monitor for a STOP event so we can skip
> > redundant waits. This will be needed for the RESUME event too shortly.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
>
> I am waiting for you to check the problem that Lukas detected, but this
> part of the patch is "obviously" correct.
>
> Famous last words.

Actually it has a small flaw - I don't set 'got_resume = false' at the
start of each test :-(

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

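
A minimal way to address that flaw would be to clear both event flags
from the common per-test setup path. The sketch below is illustrative
only; the helper name and the idea of calling it from the shared test
start code in migration-test.c are assumptions, not part of the posted
series:

    /*
     * Sketch only: clear stale migration event flags before each test.
     * The function name and its call site are assumptions; only the
     * got_stop/got_resume globals come from the patch above.
     */
    static void migrate_reset_events(void)
    {
        got_stop = false;
        got_resume = false;
    }
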
* [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  From: Daniel P. Berrangé @ 2023-04-18 13:31 UTC
  To: qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
      Daniel P. Berrangé

There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergence and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At a 3 mbps bandwidth limit the first iteration takes a
very long time (~30 seconds).

While it is important to test the migration passes and convergence
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.

To optimize the time taken, switch most of the test scenarios to run
non-live (ie guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.

For test coverage the following scenarios are unchanged:

 * Precopy with UNIX sockets
 * Precopy with UNIX sockets and dirty ring tracking
 * Precopy with XBZRLE
 * Precopy with multifd

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 3b615b0da9..cdc9635f0b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -574,6 +574,9 @@ typedef struct {
     /* Optional: set number of migration passes to wait for */
     unsigned int iterations;
 
+    /* Whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1329,7 +1332,11 @@ static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
+    if (args->live) {
+        migrate_ensure_non_converge(from);
+    } else {
+        migrate_ensure_converge(from);
+    }
 
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
@@ -1357,16 +1364,20 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        if (args->iterations) {
-            while (args->iterations--) {
+        if (args->live) {
+            if (args->iterations) {
+                while (args->iterations--) {
+                    wait_for_migration_pass(from);
+                }
+            } else {
                 wait_for_migration_pass(from);
             }
+
+            migrate_ensure_converge(from);
         } else {
-            wait_for_migration_pass(from);
+            qtest_qmp_discard_response(from, "{ 'execute' : 'stop'}");
         }
 
-        migrate_ensure_converge(from);
-
         /* We do this first, as it has a timeout to stop us
          * hanging forever if migration didn't converge */
         wait_for_migration_complete(from);
@@ -1375,7 +1386,12 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_qmp_eventwait(from, "STOP");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!args->live) {
+            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
+        }
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1393,6 +1409,7 @@ static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1408,6 +1425,7 @@ static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1519,6 +1537,7 @@ static void test_precopy_unix_xbzrle(void)
         .start_hook = test_migrate_xbzrle_start,
 
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1919,6 +1938,7 @@ static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+        .live = true,
     };
     test_precopy_common(&args);
 }
-- 
2.40.0

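
For readers unfamiliar with the convergence toggles used in the hunks
above: they amount to adjusting two migration parameters. The sketch
below is a rough illustration only; the real helpers live in
migration-test.c and the exact values there may differ. It assumes the
migrate_set_parameter_int() helper from that file ("max-bandwidth" is
bytes/sec, "downtime-limit" is milliseconds):

    /* Illustrative sketch only; not the implementation in the tree. */
    static void example_ensure_non_converge(QTestState *who)
    {
        /* Low bandwidth plus a tiny downtime limit: migration cannot
         * complete while the guest keeps dirtying memory. */
        migrate_set_parameter_int(who, "max-bandwidth", 3 * 1000 * 1000);
        migrate_set_parameter_int(who, "downtime-limit", 1);
    }

    static void example_ensure_converge(QTestState *who)
    {
        /* Generous bandwidth and downtime limit: completes quickly. */
        migrate_set_parameter_int(who, "max-bandwidth", 1 * 1000 * 1000 * 1000);
        migrate_set_parameter_int(who, "downtime-limit", 30 * 1000);
    }
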
* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  From: Fabiano Rosas @ 2023-04-18 19:52 UTC
  To: Daniel P. Berrangé, qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
      Daniel P. Berrangé

Daniel P. Berrangé <berrange@redhat.com> writes:

> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergence and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At a 3 mbps bandwidth limit the first iteration takes a
> very long time (~30 seconds).
>
> While it is important to test the migration passes and convergence
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize the time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged:
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

...

> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }

Hi Daniel,

On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:

  ../configure --target-list=aarch64-softmmu --enable-gnutls

  ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

(gdb) bt
#0  0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
#1  0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
#2  qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
#3  0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
#4  qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
#5  0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
#6  0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
#7  0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
#8  0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
...
#15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
#16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
#17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642

* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  From: Daniel P. Berrangé @ 2023-04-19 17:14 UTC
  To: Fabiano Rosas
  Cc: qemu-devel, Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier

On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergence and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At a 3 mbps bandwidth limit the first iteration takes a
> > very long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergence
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize the time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged:
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>
> ...
>
> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
>
> Hi Daniel,
>
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
>
>   ../configure --target-list=aarch64-softmmu --enable-gnutls
>
>   ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match
>
> (gdb) bt
> #0  0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
> #1  0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
> #2  qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
> #3  0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
> #4  qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
> #5  0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #6  0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
> #7  0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
> #8  0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
> ...
> #15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
> #16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
> #17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642

Urgh, ok, there must be an unexpected race condition wrt events in my
change. Thanks for the stack trace, I'll investigate.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  From: Daniel P. Berrangé @ 2023-04-21 17:20 UTC
  To: Fabiano Rosas
  Cc: qemu-devel, Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier

On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergence and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At a 3 mbps bandwidth limit the first iteration takes a
> > very long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergence
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize the time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged:
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>
> ...
>
> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
>
> Hi Daniel,
>
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
>
>   ../configure --target-list=aarch64-softmmu --enable-gnutls
>
>   ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

I never came to a satisfactory understanding of why this problem hits
you. I've just sent out a new version of this series, which has quite a
few differences, so possibly I've fixed it by luck.

So if you have time, I'd appreciate any testing you can try on

  https://lists.gnu.org/archive/html/qemu-devel/2023-04/msg03688.html

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  From: Juan Quintela @ 2023-04-20 12:59 UTC
  To: Daniel P. Berrangé
  Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

Daniel P. Berrangé <berrange@redhat.com> wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergence and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At a 3 mbps bandwidth limit the first iteration takes a
> very long time (~30 seconds).
>
> While it is important to test the migration passes and convergence
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize the time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged:
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd

Just for completeness: the other test that is still slow is
/migration/vcpu_dirty_limit.

> -    migrate_ensure_non_converge(from);
> +    if (args->live) {
> +        migrate_ensure_non_converge(from);
> +    } else {
> +        migrate_ensure_converge(from);
> +    }

Looks ... weird?

But the only way that I can think of improving it is to pass args to
migrate_ensure_*() and that is a different kind of weird.

>     } else {
> -        if (args->iterations) {
> -            while (args->iterations--) {
> +        if (args->live) {
> +            if (args->iterations) {
> +                while (args->iterations--) {
> +                    wait_for_migration_pass(from);
> +                }
> +            } else {
>                 wait_for_migration_pass(from);
>             }
> +
> +            migrate_ensure_converge(from);

I think we should change iterations to be 1 when we create args, but
otherwise, treat 0 as 1 and change it to something along the lines of:

    if (args->live) {
        while (args->iterations-- >= 0) {
            wait_for_migration_pass(from);
        }
        migrate_ensure_converge(from);

What do you think?

> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }
>
>         wait_for_serial("dest_serial");
>     }

I was looking at the "culprit" of Lukas' problem, and it is not directly
obvious. I see that when we expect one event, we just drop any event
that we are not interested in. I don't know if that is the proper
behaviour or if that is what is affecting this test.

Later, Juan.

* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  From: Daniel P. Berrangé @ 2023-04-20 15:58 UTC
  To: Juan Quintela
  Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

On Thu, Apr 20, 2023 at 02:59:00PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergence and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At a 3 mbps bandwidth limit the first iteration takes a
> > very long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergence
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize the time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged:
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
>
> Just for completeness: the other test that is still slow is
> /migration/vcpu_dirty_limit.
>
> > -    migrate_ensure_non_converge(from);
> > +    if (args->live) {
> > +        migrate_ensure_non_converge(from);
> > +    } else {
> > +        migrate_ensure_converge(from);
> > +    }
>
> Looks ... weird?
> But the only way that I can think of improving it is to pass args to
> migrate_ensure_*() and that is a different kind of weird.

I'm going to change this a little anyway. Currently, for the non-live
case, I start the migration and then stop the CPUs. I want to reverse
that order, as we should have the CPUs paused before even starting the
migration, to ensure we don't have any re-dirtied pages at all.

> >     } else {
> > -        if (args->iterations) {
> > -            while (args->iterations--) {
> > +        if (args->live) {
> > +            if (args->iterations) {
> > +                while (args->iterations--) {
> > +                    wait_for_migration_pass(from);
> > +                }
> > +            } else {
> >                 wait_for_migration_pass(from);
> >             }
> > +
> > +            migrate_ensure_converge(from);
>
> I think we should change iterations to be 1 when we create args, but
> otherwise, treat 0 as 1 and change it to something along the lines of:
>
>     if (args->live) {
>         while (args->iterations-- >= 0) {
>             wait_for_migration_pass(from);
>         }
>         migrate_ensure_converge(from);
>
> What do you think?

I think in retrospect 'iterations' was overkill, as we only use the
values 0 (implicitly 1) or 2. IOW, we could just use a 'bool multipass'
to distinguish the two cases.

> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> >
> >         wait_for_serial("dest_serial");
> >     }
>
> I was looking at the "culprit" of Lukas' problem, and it is not directly
> obvious. I see that when we expect one event, we just drop any event
> that we are not interested in. I don't know if that is the proper
> behaviour or if that is what is affecting this test.

I've not successfully reproduced it yet, nor figured out a real
scenario where it could plausibly happen. I'm looking to add more
debugging to help us out.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

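
To make the 'bool multipass' suggestion concrete, a sketch of that part
of the flow follows. It is illustrative only: 'multipass' is a
hypothetical field, example_wait_live_passes() is a hypothetical wrapper,
and neither appears in the posted series; the other names match those
already used in migration-test.c.

    /*
     * Sketch only: replacing the 'iterations' counter with a boolean.
     * 'multipass' and this wrapper function are hypothetical; the
     * helpers it calls come from migration-test.c.
     */
    static void example_wait_live_passes(QTestState *from, MigrateCommon *args)
    {
        /* Always let at least one full pre-copy pass complete while
         * convergence is blocked ... */
        wait_for_migration_pass(from);
        if (args->multipass) {
            /* ... and a second one for tests like xbzrle that need it. */
            wait_for_migration_pass(from);
        }
        migrate_ensure_converge(from);
    }
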