* [PATCH 0/2] tests/qtest: make migration-test faster
@ 2023-04-18 13:30 Daniel P. Berrangé
  2023-04-18 13:30 ` [PATCH 1/2] tests/qtest: capture RESUME events during migration Daniel P. Berrangé
  2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
  0 siblings, 2 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-18 13:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
	Daniel P. Berrangé

This makes migration-test faster by observing that most of the pre-copy
tests don't need to perform a live migration. They get sufficient code
coverage with the guest CPUs paused.

On my machine this cuts the overall execution time of migration-test
roughly in half, from 15 minutes down to 8 minutes, without sacrificing
any noticeable code coverage.

This is still quite slow though.

The following are the per-test timings in seconds, sorted from slowest
to fastest:

/x86_64/migration/auto_converge  68.85
/x86_64/migration/precopy/unix/xbzrle  68.29
/x86_64/migration/postcopy/preempt/tls/psk  36.57
/x86_64/migration/dirty_ring  35.58
/x86_64/migration/precopy/unix/plain  35.56
/x86_64/migration/postcopy/preempt/plain  34.71
/x86_64/migration/postcopy/recovery/plain  34.56
/x86_64/migration/postcopy/tls/psk  34.45
/x86_64/migration/postcopy/preempt/recovery/tls/psk  33.99
/x86_64/migration/postcopy/preempt/recovery/plain  33.99
/x86_64/migration/postcopy/plain  33.78
/x86_64/migration/postcopy/recovery/tls/psk  33.30
/x86_64/migration/multifd/tcp/plain/none  21.12
/x86_64/migration/vcpu_dirty_limit  12.28
/x86_64/migration/multifd/tcp/tls/x509/default-host  2.95
/x86_64/migration/multifd/tcp/tls/x509/allow-anon-client  2.68
/x86_64/migration/multifd/tcp/tls/x509/override-host  2.51
/x86_64/migration/precopy/tcp/tls/x509/default-host  1.52
/x86_64/migration/precopy/unix/tls/x509/override-host  1.49
/x86_64/migration/precopy/unix/tls/psk  1.48
/x86_64/migration/precopy/tcp/tls/psk/match  1.47
/x86_64/migration/multifd/tcp/tls/psk/match  1.35
/x86_64/migration/precopy/tcp/tls/x509/allow-anon-client  1.32
/x86_64/migration/precopy/tcp/tls/x509/override-host  1.27
/x86_64/migration/precopy/tcp/tls/x509/friendly-client  1.27
/x86_64/migration/multifd/tcp/plain/zlib  1.08
/x86_64/migration/precopy/tcp/plain  1.05
/x86_64/migration/fd_proto  1.04
/x86_64/migration/multifd/tcp/tls/psk/mismatch  1.00
/x86_64/migration/multifd/tcp/plain/zstd  0.98
/x86_64/migration/precopy/tcp/tls/x509/hostile-client  0.85
/x86_64/migration/multifd/tcp/tls/x509/mismatch-host  0.79
/x86_64/migration/precopy/tcp/tls/x509/mismatch-host  0.75
/x86_64/migration/precopy/unix/tls/x509/default-host  0.74
/x86_64/migration/precopy/tcp/tls/x509/reject-anon-client  0.71
/x86_64/migration/multifd/tcp/tls/x509/reject-anon-client  0.68
/x86_64/migration/precopy/tcp/tls/psk/mismatch  0.63
/x86_64/migration/validate_uuid_src_not_set  0.59
/x86_64/migration/validate_uuid  0.54
/x86_64/migration/validate_uuid_dst_not_set  0.53
/x86_64/migration/bad_dest  0.41
/x86_64/migration/validate_uuid_error  0.31

The auto-converge and xbzrle tests are at the top because they both
inherently *need* to do multiple iterations in order to exercise the
desired code paths.

The post-copy tests are all up there because we always do one iteration
of pre-copy before switching to post-copy and we need to ensure that we
don't complete before getting to the post-copy bit.

I think we can optimize the post-copy tests though. Only one of the
post-copy tests really needs to go through a full pre-copy iteration
to get good code coverage. For the other post-copy tests we could
instead copy ~10 MB of data in pre-copy mode and then switch to
post-copy.
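
Roughly what I have in mind, as an illustrative sketch only (the helper
name and the 10 MB threshold are made up for the example; it polls
'query-migrate' through the existing wait_command() helper, which IIRC
hands back the 'return' dict):

    static void wait_for_precopy_transferred(QTestState *from, uint64_t bytes)
    {
        uint64_t transferred = 0;

        /* Poll until pre-copy has moved at least 'bytes' of RAM */
        do {
            QDict *rsp = wait_command(from, "{ 'execute': 'query-migrate' }");
            QDict *ram = qdict_get_qdict(rsp, "ram");

            if (ram) {
                transferred = qdict_get_try_int(ram, "transferred", 0);
            }
            qobject_unref(rsp);
            g_usleep(10 * 1000);
        } while (transferred < bytes);
    }

    /* ...then in those post-copy tests, instead of a full pre-copy pass: */
    wait_for_precopy_transferred(from, 10 * 1024 * 1024);
    qobject_unref(wait_command(from, "{ 'execute': 'migrate-start-postcopy' }"));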

Also in

  commit 1bfc8dde505f1e6a92697c52aa9b09e81b54c78f
  Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Date:   Mon Mar 6 15:26:12 2023 +0000

    tests/migration: Tweek auto converge limits check

we cut the bandwidth by a factor of 10 to ensure reliability. I think
that was perhaps too aggressive. If we bump it back up to, say, 10 MB/sec,
that is still well below the original 30 MB/sec, perhaps enough to give
us reliability, while cutting the time of the other affected tests by ~70%.
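
Concretely, something like this wherever the 3 MB/sec cap is currently
set (a sketch only: 10 MB/sec is just the value I'm suggesting above,
assuming the migrate_set_parameter_int() helper in migration-test.c):

    /* 10 MB/sec: still well below the original 30 MB/sec cap */
    migrate_set_parameter_int(from, "max-bandwidth", 10 * 1000 * 1000);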

Daniel P. Berrangé (2):
  tests/qtest: capture RESUME events during migration
  tests/qtest: make more migration pre-copy scenarios run non-live

 tests/qtest/migration-helpers.c | 12 +++++++++---
 tests/qtest/migration-helpers.h |  1 +
 tests/qtest/migration-test.c    | 34 ++++++++++++++++++++++++++-------
 3 files changed, 37 insertions(+), 10 deletions(-)

-- 
2.40.0




* [PATCH 1/2] tests/qtest: capture RESUME events during migration
  2023-04-18 13:30 [PATCH 0/2] tests/qtest: make migration-test faster Daniel P. Berrangé
@ 2023-04-18 13:30 ` Daniel P. Berrangé
  2023-04-20 11:32   ` Juan Quintela
  2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
  1 sibling, 1 reply; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-18 13:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
	Daniel P. Berrangé

When running migration tests we monitor for a STOP event so we can skip
redundant waits. This will be needed for the RESUME event too shortly.
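
The intended usage on the test side mirrors the existing STOP pattern,
i.e. something like this (sketch; the next patch adds the actual RESUME
user):

    /* Only block on the event if a helper didn't already consume it */
    if (!got_resume) {
        qtest_qmp_eventwait(to, "RESUME");
    }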

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-helpers.c | 12 +++++++++---
 tests/qtest/migration-helpers.h |  1 +
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index f6f3c6680f..61396335cc 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -24,14 +24,20 @@
 #define MIGRATION_STATUS_WAIT_TIMEOUT 120
 
 bool got_stop;
+bool got_resume;
 
-static void check_stop_event(QTestState *who)
+static void check_events(QTestState *who)
 {
     QDict *event = qtest_qmp_event_ref(who, "STOP");
     if (event) {
         got_stop = true;
         qobject_unref(event);
     }
+    event = qtest_qmp_event_ref(who, "RESUME");
+    if (event) {
+        got_resume = true;
+        qobject_unref(event);
+    }
 }
 
 #ifndef _WIN32
@@ -48,7 +54,7 @@ QDict *wait_command_fd(QTestState *who, int fd, const char *command, ...)
     va_end(ap);
 
     resp = qtest_qmp_receive(who);
-    check_stop_event(who);
+    check_events(who);
 
     g_assert(!qdict_haskey(resp, "error"));
     g_assert(qdict_haskey(resp, "return"));
@@ -73,7 +79,7 @@ QDict *wait_command(QTestState *who, const char *command, ...)
     resp = qtest_vqmp(who, command, ap);
     va_end(ap);
 
-    check_stop_event(who);
+    check_events(who);
 
     g_assert(!qdict_haskey(resp, "error"));
     g_assert(qdict_haskey(resp, "return"));
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index a188b62787..726a66cfc1 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -16,6 +16,7 @@
 #include "libqtest.h"
 
 extern bool got_stop;
+extern bool got_resume;
 
 #ifndef _WIN32
 G_GNUC_PRINTF(3, 4)
-- 
2.40.0




* [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  2023-04-18 13:30 [PATCH 0/2] tests/qtest: make migration-test faster Daniel P. Berrangé
  2023-04-18 13:30 ` [PATCH 1/2] tests/qtest: capture RESUME events during migration Daniel P. Berrangé
@ 2023-04-18 13:31 ` Daniel P. Berrangé
  2023-04-18 19:52   ` Fabiano Rosas
  2023-04-20 12:59   ` Juan Quintela
  1 sibling, 2 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-18 13:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
	Daniel P. Berrangé

There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergence and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At a 3 MB/sec bandwidth limit the first iteration takes a
very long time (~30 seconds).

While it is important to test the migration passes and convergence
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.

To optimize the time taken, switch most of the test scenarios to run
non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.

For test coverage the following scenarios are unchanged

 * Precopy with UNIX sockets
 * Precopy with UNIX sockets and dirty ring tracking
 * Precopy with XBZRLE
 * Precopy with multifd

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 3b615b0da9..cdc9635f0b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -574,6 +574,9 @@ typedef struct {
     /* Optional: set number of migration passes to wait for */
     unsigned int iterations;
 
+    /* Whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1329,7 +1332,11 @@ static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
+    if (args->live) {
+        migrate_ensure_non_converge(from);
+    } else {
+        migrate_ensure_converge(from);
+    }
 
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
@@ -1357,16 +1364,20 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        if (args->iterations) {
-            while (args->iterations--) {
+        if (args->live) {
+            if (args->iterations) {
+                while (args->iterations--) {
+                    wait_for_migration_pass(from);
+                }
+            } else {
                 wait_for_migration_pass(from);
             }
+
+            migrate_ensure_converge(from);
         } else {
-            wait_for_migration_pass(from);
+            qtest_qmp_discard_response(from, "{ 'execute' : 'stop'}");
         }
 
-        migrate_ensure_converge(from);
-
         /* We do this first, as it has a timeout to stop us
          * hanging forever if migration didn't converge */
         wait_for_migration_complete(from);
@@ -1375,7 +1386,12 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_qmp_eventwait(from, "STOP");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!args->live) {
+            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
+        }
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1393,6 +1409,7 @@ static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1408,6 +1425,7 @@ static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1519,6 +1537,7 @@ static void test_precopy_unix_xbzrle(void)
         .start_hook = test_migrate_xbzrle_start,
 
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1919,6 +1938,7 @@ static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+        .live = true,
     };
     test_precopy_common(&args);
 }
-- 
2.40.0




* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
@ 2023-04-18 19:52   ` Fabiano Rosas
  2023-04-19 17:14     ` Daniel P. Berrangé
  2023-04-21 17:20     ` Daniel P. Berrangé
  2023-04-20 12:59   ` Juan Quintela
  1 sibling, 2 replies; 10+ messages in thread
From: Fabiano Rosas @ 2023-04-18 19:52 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
	Daniel P. Berrangé

Daniel P. Berrangé <berrange@redhat.com> writes:

> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergance and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergance
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

...

> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }

Hi Daniel,

On an aarch64 host I'm sometimes (~30% of runs) seeing a hang here on a TLS test:

../configure --target-list=aarch64-softmmu --enable-gnutls

... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

(gdb) bt
#0  0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
#1  0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
#2  qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
#3  0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
#4  qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
#5  0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
#6  0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
#7  0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
#8  0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
...
#15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
#16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
#17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642



* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  2023-04-18 19:52   ` Fabiano Rosas
@ 2023-04-19 17:14     ` Daniel P. Berrangé
  2023-04-21 17:20     ` Daniel P. Berrangé
  1 sibling, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-19 17:14 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Juan Quintela, Thomas Huth, Paolo Bonzini,
	Laurent Vivier

On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> ...
> 
> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> 
> Hi Daniel,
> 
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
> 
> ../configure --target-list=aarch64-softmmu --enable-gnutls
> 
> ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match
> 
> (gdb) bt
> #0  0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
> #1  0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
> #2  qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
> #3  0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
> #4  qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
> #5  0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #6  0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
> #7  0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
> #8  0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
> ...
> #15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
> #16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
> #17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642

Urgh, ok, there must be an unexpected race condition wrt events in my
change. Thanks for the stack trace, I'll investigate.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH 1/2] tests/qtest: capture RESUME events during migration
  2023-04-18 13:30 ` [PATCH 1/2] tests/qtest: capture RESUME events during migration Daniel P. Berrangé
@ 2023-04-20 11:32   ` Juan Quintela
  2023-04-20 11:37     ` Daniel P. Berrangé
  0 siblings, 1 reply; 10+ messages in thread
From: Juan Quintela @ 2023-04-20 11:32 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

Daniel P. Berrangé <berrange@redhat.com> wrote:
> When running migration tests we monitor for a STOP event so we can skip
> redundant waits. This will be needed for the RESUME event too shortly.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>


I am waiting for you to check the problem that Lukas detected, but this
part of the patch is "obviously" correct.

Famous last words.




* Re: [PATCH 1/2] tests/qtest: capture RESUME events during migration
  2023-04-20 11:32   ` Juan Quintela
@ 2023-04-20 11:37     ` Daniel P. Berrangé
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-20 11:37 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

On Thu, Apr 20, 2023 at 01:32:37PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > When running migration tests we monitor for a STOP event so we can skip
> > redundant waits. This will be needed for the RESUME event too shortly.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> 
> I am waiting for you to check the problem than Lukas detected, but this
> part of the patch is "obviously" correct.
>
> Famous last words.

Actually it has a small flaw - I don't set  'got_resume = false' at the
start of each test :-(
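
i.e. something along these lines is missing (sketch):

    /* e.g. in test_migrate_start(), so a stale event captured during a
     * previous test can't satisfy the next wait */
    got_stop = false;
    got_resume = false;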

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
  2023-04-18 19:52   ` Fabiano Rosas
@ 2023-04-20 12:59   ` Juan Quintela
  2023-04-20 15:58     ` Daniel P. Berrangé
  1 sibling, 1 reply; 10+ messages in thread
From: Juan Quintela @ 2023-04-20 12:59 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

Daniel P. Berrangé <berrange@redhat.com> wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergance and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergance
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd

Just for completeness: the other test that is still slow is
/migration/vcpu_dirty_limit.

> -    migrate_ensure_non_converge(from);
> +    if (args->live) {
> +        migrate_ensure_non_converge(from);
> +    } else {
> +        migrate_ensure_converge(from);
> +    }

Looks ... weird?
But the only way that I can think of improving it is to pass args to
migrate_ensure_*() and that is a different kind of weird.

>      } else {
> -        if (args->iterations) {
> -            while (args->iterations--) {
> +        if (args->live) {
> +            if (args->iterations) {
> +                while (args->iterations--) {
> +                    wait_for_migration_pass(from);
> +                }
> +            } else {
>                  wait_for_migration_pass(from);
>              }
> +
> +            migrate_ensure_converge(from);

I think we should change iterations to be 1 when we create args, but
otherwise treat 0 as 1, and change it to something along the lines of:

        if (args->live) {
            while (args->iterations-- >= 0) {
                wait_for_migration_pass(from);
            }
            migrate_ensure_converge(from);

What do you think?
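
A variant that keeps 'iterations' unsigned and still treats 0 as a
single pass could be (sketch, untested):

        if (args->live) {
            /* treat iterations == 0 (unset) as a single pass */
            unsigned int passes = MAX(args->iterations, 1);

            while (passes--) {
                wait_for_migration_pass(from);
            }
            migrate_ensure_converge(from);
        }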


> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }
>  
>          wait_for_serial("dest_serial");
>      }

I was looking at the "culprit" of Lukas' problem, and it is not directly
obvious.  I see that when we expect one event, we just drop any event
that we are not interested in.  I don't know if that is the proper
behaviour or if that is what is affecting this test.

Later, Juan.




* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  2023-04-20 12:59   ` Juan Quintela
@ 2023-04-20 15:58     ` Daniel P. Berrangé
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-20 15:58 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier

On Thu, Apr 20, 2023 at 02:59:00PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> 
> Just for completeness: the other test that is still slow is
> /migration/vcpu_dirty_limit.
> 
> > -    migrate_ensure_non_converge(from);
> > +    if (args->live) {
> > +        migrate_ensure_non_converge(from);
> > +    } else {
> > +        migrate_ensure_converge(from);
> > +    }
> 
> Looks ... weird?
> But the only way that I can think of improving it is to pass args to
> migrate_ensure_*() and that is a different kind of weird.

I'm going to change this a little anyway. Currently for the non-live
case, I start the migration and then  stop the CPUs. I want to reverse
that order, as we should have CPUs paused before even starting the
migration to ensure we don't have any re-dirtied pages at all.
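
i.e. roughly this ordering (sketch):

    if (!args->live) {
        /*
         * Pause the guest before the migration starts, so no pages can
         * be re-dirtied after the first sync of the dirty bitmap.
         */
        qtest_qmp_discard_response(from, "{ 'execute' : 'stop'}");
    }

    /* ...and only then issue the 'migrate' command as before... */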

> 
> >      } else {
> > -        if (args->iterations) {
> > -            while (args->iterations--) {
> > +        if (args->live) {
> > +            if (args->iterations) {
> > +                while (args->iterations--) {
> > +                    wait_for_migration_pass(from);
> > +                }
> > +            } else {
> >                  wait_for_migration_pass(from);
> >              }
> > +
> > +            migrate_ensure_converge(from);
> 
> I think we should change iterations to be 1 when we create args, but
> otherwise, treat 0 as 1 and change it to something in the lines of:
> 
>         if (args->live) {
>             while (args->iterations-- >= 0) {
>                 wait_for_migration_pass(from);
>             }
>             migrate_ensure_converge(from);
> 
> What do you think?

I think in retrospect 'iterations' was overkill as we only use the
values 0 (implicitly 1) or 2. IOW we could just use a 'bool multipass'
to distinguish the two cases.

> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> >  
> >          wait_for_serial("dest_serial");
> >      }
> 
> I was looking at the "culprit" of Lukas problem, and it is not directly
> obvious.  I see that when we expect one event, we just drop any event
> that we are not interested in.  I don't know if that is the proper
> behaviour or if that is what affecting this test.

I've not successfully reproduced it yet, nor figured out a real
scenario where it could plausibly happen. I'm looking to add more
debug to help us out.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
  2023-04-18 19:52   ` Fabiano Rosas
  2023-04-19 17:14     ` Daniel P. Berrangé
@ 2023-04-21 17:20     ` Daniel P. Berrangé
  1 sibling, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-21 17:20 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Juan Quintela, Thomas Huth, Paolo Bonzini,
	Laurent Vivier

On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> ...
> 
> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> 
> Hi Daniel,
> 
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
> 
> ../configure --target-list=aarch64-softmmu --enable-gnutls
> 
> ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

I never came to a satisfactory understanding of why this problem hits
you. I've just sent out a new version of this series, which has quite
a few differences, so possibly I've fixed it by luck.

So if you have time, I'd appreciate any testing you can try on

  https://lists.gnu.org/archive/html/qemu-devel/2023-04/msg03688.html


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



