* [PATCH 0/2] tests/qtest: make migration-test faster
@ 2023-04-18 13:30 Daniel P. Berrangé
2023-04-18 13:30 ` [PATCH 1/2] tests/qtest: capture RESUME events during migration Daniel P. Berrangé
2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
0 siblings, 2 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-18 13:30 UTC (permalink / raw)
To: qemu-devel
Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
Daniel P. Berrangé
This makes migration-test faster by observing that most of the pre-copy
tests don't need to be doing a live migration. They get sufficient code
coverage with the guest CPUs paused.
On my machine this cuts the overall execution time of migration-test
roughly in half, from 15 minutes down to 8 minutes, without sacrificing
any noticeable code coverage.
This is still quite slow though.
The following are the test timings in seconds, sorted from slowest to fastest:
/x86_64/migration/auto_converge 68.85
/x86_64/migration/precopy/unix/xbzrle 68.29
/x86_64/migration/postcopy/preempt/tls/psk 36.57
/x86_64/migration/dirty_ring 35.58
/x86_64/migration/precopy/unix/plain 35.56
/x86_64/migration/postcopy/preempt/plain 34.71
/x86_64/migration/postcopy/recovery/plain 34.56
/x86_64/migration/postcopy/tls/psk 34.45
/x86_64/migration/postcopy/preempt/recovery/tls/psk 33.99
/x86_64/migration/postcopy/preempt/recovery/plain 33.99
/x86_64/migration/postcopy/plain 33.78
/x86_64/migration/postcopy/recovery/tls/psk 33.30
/x86_64/migration/multifd/tcp/plain/none 21.12
/x86_64/migration/vcpu_dirty_limit 12.28
/x86_64/migration/multifd/tcp/tls/x509/default-host 2.95
/x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.68
/x86_64/migration/multifd/tcp/tls/x509/override-host 2.51
/x86_64/migration/precopy/tcp/tls/x509/default-host 1.52
/x86_64/migration/precopy/unix/tls/x509/override-host 1.49
/x86_64/migration/precopy/unix/tls/psk 1.48
/x86_64/migration/precopy/tcp/tls/psk/match 1.47
/x86_64/migration/multifd/tcp/tls/psk/match 1.35
/x86_64/migration/precopy/tcp/tls/x509/allow-anon-client 1.32
/x86_64/migration/precopy/tcp/tls/x509/override-host 1.27
/x86_64/migration/precopy/tcp/tls/x509/friendly-client 1.27
/x86_64/migration/multifd/tcp/plain/zlib 1.08
/x86_64/migration/precopy/tcp/plain 1.05
/x86_64/migration/fd_proto 1.04
/x86_64/migration/multifd/tcp/tls/psk/mismatch 1.00
/x86_64/migration/multifd/tcp/plain/zstd 0.98
/x86_64/migration/precopy/tcp/tls/x509/hostile-client 0.85
/x86_64/migration/multifd/tcp/tls/x509/mismatch-host 0.79
/x86_64/migration/precopy/tcp/tls/x509/mismatch-host 0.75
/x86_64/migration/precopy/unix/tls/x509/default-host 0.74
/x86_64/migration/precopy/tcp/tls/x509/reject-anon-client 0.71
/x86_64/migration/multifd/tcp/tls/x509/reject-anon-client 0.68
/x86_64/migration/precopy/tcp/tls/psk/mismatch 0.63
/x86_64/migration/validate_uuid_src_not_set 0.59
/x86_64/migration/validate_uuid 0.54
/x86_64/migration/validate_uuid_dst_not_set 0.53
/x86_64/migration/bad_dest 0.41
/x86_64/migration/validate_uuid_error 0.31
The auto-converge and xbzrle tests are at the top because they both
inherently *need* to do multiple iterations in order to exercise the
desired code paths.
The post-copy tests are all up there because we always do one iteration
of pre-copy before switching to post-copy and we need to ensure that we
don't complete before getting to the post-copy bit.
I think we can optimize the post-copy bit though. Only one of the
post-copy tests really needs to go through a full pre-copy iteration
to get good code coverage. The other post-copy tests can be changed
to copy 10 MB of data in pre-copy mode and then switch to
post-copy.
Also in
commit 1bfc8dde505f1e6a92697c52aa9b09e81b54c78f
Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
Date: Mon Mar 6 15:26:12 2023 +0000
tests/migration: Tweek auto converge limits check
we cut the bandwidth by a factor of 10 to ensure reliability. I think
that was perhaps too aggressive. If we bump it back up to, say, 10 MB/sec,
that is still better than the original 30 MB/sec, perhaps enough to give
us reliability, while cutting the time of the other tests by 70%.
Daniel P. Berrangé (2):
tests/qtest: capture RESUME events during migration
tests/qtest: make more migration pre-copy scenarios run non-live
tests/qtest/migration-helpers.c | 12 +++++++++---
tests/qtest/migration-helpers.h | 1 +
tests/qtest/migration-test.c | 34 ++++++++++++++++++++++++++-------
3 files changed, 37 insertions(+), 10 deletions(-)
--
2.40.0
* [PATCH 1/2] tests/qtest: capture RESUME events during migration
2023-04-18 13:30 [PATCH 0/2] tests/qtest: make migration-test faster Daniel P. Berrangé
@ 2023-04-18 13:30 ` Daniel P. Berrangé
2023-04-20 11:32 ` Juan Quintela
2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
1 sibling, 1 reply; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-18 13:30 UTC (permalink / raw)
To: qemu-devel
Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
Daniel P. Berrangé
When running migration tests we monitor for a STOP event so that we can
skip redundant waits. We will shortly need the same for the RESUME event.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
tests/qtest/migration-helpers.c | 12 +++++++++---
tests/qtest/migration-helpers.h | 1 +
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index f6f3c6680f..61396335cc 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -24,14 +24,20 @@
#define MIGRATION_STATUS_WAIT_TIMEOUT 120
bool got_stop;
+bool got_resume;
-static void check_stop_event(QTestState *who)
+static void check_events(QTestState *who)
{
QDict *event = qtest_qmp_event_ref(who, "STOP");
if (event) {
got_stop = true;
qobject_unref(event);
}
+ event = qtest_qmp_event_ref(who, "RESUME");
+ if (event) {
+ got_resume = true;
+ qobject_unref(event);
+ }
}
#ifndef _WIN32
@@ -48,7 +54,7 @@ QDict *wait_command_fd(QTestState *who, int fd, const char *command, ...)
va_end(ap);
resp = qtest_qmp_receive(who);
- check_stop_event(who);
+ check_events(who);
g_assert(!qdict_haskey(resp, "error"));
g_assert(qdict_haskey(resp, "return"));
@@ -73,7 +79,7 @@ QDict *wait_command(QTestState *who, const char *command, ...)
resp = qtest_vqmp(who, command, ap);
va_end(ap);
- check_stop_event(who);
+ check_events(who);
g_assert(!qdict_haskey(resp, "error"));
g_assert(qdict_haskey(resp, "return"));
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index a188b62787..726a66cfc1 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -16,6 +16,7 @@
#include "libqtest.h"
extern bool got_stop;
+extern bool got_resume;
#ifndef _WIN32
G_GNUC_PRINTF(3, 4)
--
2.40.0
* [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
2023-04-18 13:30 [PATCH 0/2] tests/qtest: make migration-test faster Daniel P. Berrangé
2023-04-18 13:30 ` [PATCH 1/2] tests/qtest: capture RESUME events during migration Daniel P. Berrangé
@ 2023-04-18 13:31 ` Daniel P. Berrangé
2023-04-18 19:52 ` Fabiano Rosas
2023-04-20 12:59 ` Juan Quintela
1 sibling, 2 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-18 13:31 UTC (permalink / raw)
To: qemu-devel
Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
Daniel P. Berrangé
There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergence and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At the 3 mbps bandwidth limit the first iteration takes a
very long time (~30 seconds).
While it is important to test the migration passes and convergence
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.
To optimize the time taken, switch most of the test scenarios to run
non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.
For test coverage, the following scenarios are unchanged:
* Precopy with UNIX sockets
* Precopy with UNIX sockets and dirty ring tracking
* Precopy with XBZRLE
* Precopy with multifd
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
tests/qtest/migration-test.c | 34 +++++++++++++++++++++++++++-------
1 file changed, 27 insertions(+), 7 deletions(-)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 3b615b0da9..cdc9635f0b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -574,6 +574,9 @@ typedef struct {
/* Optional: set number of migration passes to wait for */
unsigned int iterations;
+ /* Whether the guest CPUs should be running during migration */
+ bool live;
+
/* Postcopy specific fields */
void *postcopy_data;
bool postcopy_preempt;
@@ -1329,7 +1332,11 @@ static void test_precopy_common(MigrateCommon *args)
return;
}
- migrate_ensure_non_converge(from);
+ if (args->live) {
+ migrate_ensure_non_converge(from);
+ } else {
+ migrate_ensure_converge(from);
+ }
if (args->start_hook) {
data_hook = args->start_hook(from, to);
@@ -1357,16 +1364,20 @@ static void test_precopy_common(MigrateCommon *args)
qtest_set_expected_status(to, EXIT_FAILURE);
}
} else {
- if (args->iterations) {
- while (args->iterations--) {
+ if (args->live) {
+ if (args->iterations) {
+ while (args->iterations--) {
+ wait_for_migration_pass(from);
+ }
+ } else {
wait_for_migration_pass(from);
}
+
+ migrate_ensure_converge(from);
} else {
- wait_for_migration_pass(from);
+ qtest_qmp_discard_response(from, "{ 'execute' : 'stop'}");
}
- migrate_ensure_converge(from);
-
/* We do this first, as it has a timeout to stop us
* hanging forever if migration didn't converge */
wait_for_migration_complete(from);
@@ -1375,7 +1386,12 @@ static void test_precopy_common(MigrateCommon *args)
qtest_qmp_eventwait(from, "STOP");
}
- qtest_qmp_eventwait(to, "RESUME");
+ if (!args->live) {
+ qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
+ }
+ if (!got_resume) {
+ qtest_qmp_eventwait(to, "RESUME");
+ }
wait_for_serial("dest_serial");
}
@@ -1393,6 +1409,7 @@ static void test_precopy_unix_plain(void)
MigrateCommon args = {
.listen_uri = uri,
.connect_uri = uri,
+ .live = true,
};
test_precopy_common(&args);
@@ -1408,6 +1425,7 @@ static void test_precopy_unix_dirty_ring(void)
},
.listen_uri = uri,
.connect_uri = uri,
+ .live = true,
};
test_precopy_common(&args);
@@ -1519,6 +1537,7 @@ static void test_precopy_unix_xbzrle(void)
.start_hook = test_migrate_xbzrle_start,
.iterations = 2,
+ .live = true,
};
test_precopy_common(&args);
@@ -1919,6 +1938,7 @@ static void test_multifd_tcp_none(void)
MigrateCommon args = {
.listen_uri = "defer",
.start_hook = test_migrate_precopy_tcp_multifd_start,
+ .live = true,
};
test_precopy_common(&args);
}
--
2.40.0
* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
@ 2023-04-18 19:52 ` Fabiano Rosas
2023-04-19 17:14 ` Daniel P. Berrangé
2023-04-21 17:20 ` Daniel P. Berrangé
2023-04-20 12:59 ` Juan Quintela
1 sibling, 2 replies; 10+ messages in thread
From: Fabiano Rosas @ 2023-04-18 19:52 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Juan Quintela, Thomas Huth, Paolo Bonzini, Laurent Vivier,
Daniel P. Berrangé
Daniel P. Berrangé <berrange@redhat.com> writes:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergance and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergance
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged
>
> * Precopy with UNIX sockets
> * Precopy with UNIX sockets and dirty ring tracking
> * Precopy with XBZRLE
> * Precopy with multifd
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
...
> - qtest_qmp_eventwait(to, "RESUME");
> + if (!args->live) {
> + qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> + }
> + if (!got_resume) {
> + qtest_qmp_eventwait(to, "RESUME");
> + }
Hi Daniel,
On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
../configure --target-list=aarch64-softmmu --enable-gnutls
... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match
(gdb) bt
#0 0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
#1 0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
#2 qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
#3 0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
#4 qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
#5 0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
#6 0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
#7 0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
#8 0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
...
#15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
#16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
#17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642
* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
2023-04-18 19:52 ` Fabiano Rosas
@ 2023-04-19 17:14 ` Daniel P. Berrangé
2023-04-21 17:20 ` Daniel P. Berrangé
1 sibling, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-19 17:14 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Juan Quintela, Thomas Huth, Paolo Bonzini,
Laurent Vivier
On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged
> >
> > * Precopy with UNIX sockets
> > * Precopy with UNIX sockets and dirty ring tracking
> > * Precopy with XBZRLE
> > * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>
> ...
>
> > - qtest_qmp_eventwait(to, "RESUME");
> > + if (!args->live) {
> > + qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > + }
> > + if (!got_resume) {
> > + qtest_qmp_eventwait(to, "RESUME");
> > + }
>
> Hi Daniel,
>
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
>
> ../configure --target-list=aarch64-softmmu --enable-gnutls
>
> ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match
>
> (gdb) bt
> #0 0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
> #1 0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
> #2 qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
> #3 0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
> #4 qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
> #5 0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #6 0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
> #7 0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
> #8 0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
> ...
> #15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
> #16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
> #17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642
Urgh, ok, there must be an unexpected race condition wrt events in my
change. Thanks for the stack trace, I'll investigate.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH 1/2] tests/qtest: capture RESUME events during migration
2023-04-18 13:30 ` [PATCH 1/2] tests/qtest: capture RESUME events during migration Daniel P. Berrangé
@ 2023-04-20 11:32 ` Juan Quintela
2023-04-20 11:37 ` Daniel P. Berrangé
0 siblings, 1 reply; 10+ messages in thread
From: Juan Quintela @ 2023-04-20 11:32 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier
Daniel P. Berrangé <berrange@redhat.com> wrote:
> When running migration tests we monitor for a STOP event so we can skip
> redundant waits. This will be needed for the RESUME event too shortly.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
I am waiting for you to check the problem that Lukas detected, but this
part of the patch is "obviously" correct.
Famous last words.
* Re: [PATCH 1/2] tests/qtest: capture RESUME events during migration
2023-04-20 11:32 ` Juan Quintela
@ 2023-04-20 11:37 ` Daniel P. Berrangé
0 siblings, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-20 11:37 UTC (permalink / raw)
To: Juan Quintela; +Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier
On Thu, Apr 20, 2023 at 01:32:37PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > When running migration tests we monitor for a STOP event so we can skip
> > redundant waits. This will be needed for the RESUME event too shortly.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
>
>
> I am waiting for you to check the problem than Lukas detected, but this
> part of the patch is "obviously" correct.
>
> Famous last words.
Actually it has a small flaw - I don't set 'got_resume = false' at the
start of each test :-(
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
2023-04-18 13:31 ` [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live Daniel P. Berrangé
2023-04-18 19:52 ` Fabiano Rosas
@ 2023-04-20 12:59 ` Juan Quintela
2023-04-20 15:58 ` Daniel P. Berrangé
1 sibling, 1 reply; 10+ messages in thread
From: Juan Quintela @ 2023-04-20 12:59 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier
Daniel P. Berrangé <berrange@redhat.com> wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergance and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergance
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged
>
> * Precopy with UNIX sockets
> * Precopy with UNIX sockets and dirty ring tracking
> * Precopy with XBZRLE
> * Precopy with multifd
Just for completeness: the other test that is still slow is
/migration/vcpu_dirty_limit.
> - migrate_ensure_non_converge(from);
> + if (args->live) {
> + migrate_ensure_non_converge(from);
> + } else {
> + migrate_ensure_converge(from);
> + }
Looks ... weird?
But the only way that I can think of improving it is to pass args to
migrate_ensure_*() and that is a different kind of weird.
> } else {
> - if (args->iterations) {
> - while (args->iterations--) {
> + if (args->live) {
> + if (args->iterations) {
> + while (args->iterations--) {
> + wait_for_migration_pass(from);
> + }
> + } else {
> wait_for_migration_pass(from);
> }
> +
> + migrate_ensure_converge(from);
I think we should change iterations to be 1 when we create args, or
otherwise treat 0 as 1, and change it to something along the lines of:
if (args->live) {
while (args->iterations-- >= 0) {
wait_for_migration_pass(from);
}
migrate_ensure_converge(from);
What do you think?
> - qtest_qmp_eventwait(to, "RESUME");
> + if (!args->live) {
> + qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> + }
> + if (!got_resume) {
> + qtest_qmp_eventwait(to, "RESUME");
> + }
>
> wait_for_serial("dest_serial");
> }
I was looking at the "culprit" of Lukas's problem, and it is not directly
obvious. I see that when we expect one event, we just drop any event
that we are not interested in. I don't know if that is the proper
behaviour or if that is what is affecting this test.
Later, Juan.
* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
2023-04-20 12:59 ` Juan Quintela
@ 2023-04-20 15:58 ` Daniel P. Berrangé
0 siblings, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-20 15:58 UTC (permalink / raw)
To: Juan Quintela; +Cc: qemu-devel, Thomas Huth, Paolo Bonzini, Laurent Vivier
On Thu, Apr 20, 2023 at 02:59:00PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged
> >
> > * Precopy with UNIX sockets
> > * Precopy with UNIX sockets and dirty ring tracking
> > * Precopy with XBZRLE
> > * Precopy with multifd
>
> Just for completeness: the other test that is still slow is
> /migration/vcpu_dirty_limit.
>
> > - migrate_ensure_non_converge(from);
> > + if (args->live) {
> > + migrate_ensure_non_converge(from);
> > + } else {
> > + migrate_ensure_converge(from);
> > + }
>
> Looks ... weird?
> But the only way that I can think of improving it is to pass args to
> migrate_ensure_*() and that is a different kind of weird.
I'm going to change this a little anyway. Currently for the non-live
case, I start the migration and then stop the CPUs. I want to reverse
that order, as we should have CPUs paused before even starting the
migration to ensure we don't have any re-dirtied pages at all.
>
> > } else {
> > - if (args->iterations) {
> > - while (args->iterations--) {
> > + if (args->live) {
> > + if (args->iterations) {
> > + while (args->iterations--) {
> > + wait_for_migration_pass(from);
> > + }
> > + } else {
> > wait_for_migration_pass(from);
> > }
> > +
> > + migrate_ensure_converge(from);
>
> I think we should change iterations to be 1 when we create args, but
> otherwise, treat 0 as 1 and change it to something in the lines of:
>
> if (args->live) {
> while (args->iterations-- >= 0) {
> wait_for_migration_pass(from);
> }
> migrate_ensure_converge(from);
>
> What do you think?
I think in retrospect 'iterations' was overkill as we only use the
values 0 (implicitly 1) or 2. IOW we could just use a 'bool multipass'
to distinguish the two cases.
> > - qtest_qmp_eventwait(to, "RESUME");
> > + if (!args->live) {
> > + qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > + }
> > + if (!got_resume) {
> > + qtest_qmp_eventwait(to, "RESUME");
> > + }
> >
> > wait_for_serial("dest_serial");
> > }
>
> I was looking at the "culprit" of Lukas problem, and it is not directly
> obvious. I see that when we expect one event, we just drop any event
> that we are not interested in. I don't know if that is the proper
> behaviour or if that is what affecting this test.
I've not successfully reproduced it yet, nor figured out a real
scenario where it could plausibly happen. I'm looking to add more
debug to help us out.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH 2/2] tests/qtest: make more migration pre-copy scenarios run non-live
2023-04-18 19:52 ` Fabiano Rosas
2023-04-19 17:14 ` Daniel P. Berrangé
@ 2023-04-21 17:20 ` Daniel P. Berrangé
1 sibling, 0 replies; 10+ messages in thread
From: Daniel P. Berrangé @ 2023-04-21 17:20 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Juan Quintela, Thomas Huth, Paolo Bonzini,
Laurent Vivier
On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged
> >
> > * Precopy with UNIX sockets
> > * Precopy with UNIX sockets and dirty ring tracking
> > * Precopy with XBZRLE
> > * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>
> ...
>
> > - qtest_qmp_eventwait(to, "RESUME");
> > + if (!args->live) {
> > + qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > + }
> > + if (!got_resume) {
> > + qtest_qmp_eventwait(to, "RESUME");
> > + }
>
> Hi Daniel,
>
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
>
> ../configure --target-list=aarch64-softmmu --enable-gnutls
>
> ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match
I never came to a satisfactory understanding of why this problem hits
you. I've just sent out a new version of this series, which has quite
a few differences, so possibly I've fixed it by luck.
So if you have time, I'd appreciate any testing you can try on
https://lists.gnu.org/archive/html/qemu-devel/2023-04/msg03688.html
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|