qemu-devel.nongnu.org archive mirror
From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Fabiano Rosas <farosas@suse.de>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	Thomas Huth <thuth@redhat.com>, John Snow <jsnow@redhat.com>,
	Li Zhijian <lizhijian@fujitsu.com>,
	Juan Quintela <quintela@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Zhang Chen <chen.zhang@intel.com>,
	Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH v2 4/6] tests/qtest: make more migration pre-copy scenarios run non-live
Date: Wed, 31 May 2023 13:15:39 +0100
Message-ID: <ZHc6a+7881ExE0D/@redhat.com>
In-Reply-To: <ZHDzVe5rOSW2CX96@redhat.com>

On Fri, May 26, 2023 at 06:58:45PM +0100, Daniel P. Berrangé wrote:
> On Mon, Apr 24, 2023 at 06:01:36PM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > There are 27 pre-copy live migration scenarios being tested. In all of
> > > these we force non-convergence and run for one iteration, then let it
> > > converge and wait for completion during the second (or following)
> > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > > long time (~30 seconds).
> > >
> > > While it is important to test the migration passes and convergence
> > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > > TLS migration scenarios in particular are merely exercising different
> > > code paths during connection establishment.
> > >
> > > To optimize time taken, switch most of the test scenarios to run
> > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > > a massive speed up for most of the test scenarios.
> > >
> > > For test coverage the following scenarios are unchanged
> > >
> > >  * Precopy with UNIX sockets
> > >  * Precopy with UNIX sockets and dirty ring tracking
> > >  * Precopy with XBZRLE
> > >  * Precopy with multifd
> > >
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > >  tests/qtest/migration-test.c | 60 ++++++++++++++++++++++++++++++------
> > >  1 file changed, 50 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > > index 6492ffa7fe..40d0f75480 100644
> > > --- a/tests/qtest/migration-test.c
> > > +++ b/tests/qtest/migration-test.c
> > > @@ -568,6 +568,9 @@ typedef struct {
> > >          MIG_TEST_FAIL_DEST_QUIT_ERR,
> > >      } result;
> > >  
> > > +    /* Whether the guest CPUs should be running during migration */
> > > +    bool live;
> > > +
> > >      /* Postcopy specific fields */
> > >      void *postcopy_data;
> > >      bool postcopy_preempt;
> > > @@ -1324,8 +1327,6 @@ static void test_precopy_common(MigrateCommon *args)
> > >          return;
> > >      }
> > >  
> > > -    migrate_ensure_non_converge(from);
> > > -
> > >      if (args->start_hook) {
> > >          data_hook = args->start_hook(from, to);
> > >      }
> > > @@ -1335,6 +1336,31 @@ static void test_precopy_common(MigrateCommon *args)
> > >          wait_for_serial("src_serial");
> > >      }
> > >  
> > > +    if (args->live) {
> > > +        /*
> > > +         * Testing live migration, we want to ensure that some
> > > +         * memory is re-dirtied after being transferred, so that
> > > +         * we exercise logic for dirty page handling. We achieve
> > > +         * this with a ridiculosly low bandwidth that guarantees
> > > +         * non-convergance.
> > > +         */
> > > +        migrate_ensure_non_converge(from);
> > > +    } else {
> > > +        /*
> > > +         * Testing non-live migration, we allow it to run at
> > > +         * full speed to ensure short test case duration.
> > > +         * For tests expected to fail, we don't need to
> > > +         * change anything.
> > > +         */
> > > +        if (args->result == MIG_TEST_SUCCEED) {
> > > +            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
> > > +            if (!got_stop) {
> > > +                qtest_qmp_eventwait(from, "STOP");
> > > +            }
> > > +            migrate_ensure_converge(from);
> > > +        }
> > > +    }
> > > +
> > >      if (!args->connect_uri) {
> > >          g_autofree char *local_connect_uri =
> > >              migrate_get_socket_address(to, "socket-address");
> > > @@ -1352,19 +1378,29 @@ static void test_precopy_common(MigrateCommon *args)
> > >              qtest_set_expected_status(to, EXIT_FAILURE);
> > >          }
> > >      } else {
> > > -        wait_for_migration_pass(from);
> > > +        if (args->live) {
> > > +            wait_for_migration_pass(from);
> > >  
> > > -        migrate_ensure_converge(from);
> > > +            migrate_ensure_converge(from);
> > >  
> > > -        /* We do this first, as it has a timeout to stop us
> > > -         * hanging forever if migration didn't converge */
> > > -        wait_for_migration_complete(from);
> > > +            /*
> > > +             * We do this first, as it has a timeout to stop us
> > > +             * hanging forever if migration didn't converge
> > > +             */
> > > +            wait_for_migration_complete(from);
> > > +
> > > +            if (!got_stop) {
> > > +                qtest_qmp_eventwait(from, "STOP");
> > > +            }
> > > +        } else {
> > > +            wait_for_migration_complete(from);
> > >  
> > > -        if (!got_stop) {
> > > -            qtest_qmp_eventwait(from, "STOP");
> > > +            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
> > 
> > I retested and the problem still persists. The issue is with this wait +
> > cont sequence:
> > 
> > wait_for_migration_complete(from);
> > qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
> > 
> > We wait for the source to finish but by the time qmp_cont executes, the
> > dst is still INMIGRATE, autostart gets set and I never see the RESUME
> > event.
> 
> This is ultimately caused by the broken logic in the previous
> patch 3 that looked for RESUME. Looking for the STOP would
> discard all non-STOP events, which includes the RESUME event
> we were just about to look for. I've had to completely change
> the event handling in migration-helpers and libqtest to fix this.

Actually, no, it is not. The broken logic does not help, but the root
cause is indeed the race condition that Fabiano points out.

We are issuing the 'cont' before the target QEMU has finished reading
data from the source. The solution is actually quite simple: we must
call 'query-migrate' on the dst to check its status, i.e. the code needs
to be:

 wait_for_migration_complete(from);
 wait_for_migration_complete(to);
 qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");

This matches what libvirt does, and libvirt has a comment saying
it is not permitted to issue 'cont' before 'query-migrate' on
the dst indicates completion.
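
To illustrate the ordering problem outside of QEMU, here is a minimal
toy model in C. It is only a sketch: ToyDest and every toy_* function
are hypothetical names invented for this example, not libqtest or QEMU
APIs, and the "chunks" counter merely stands in for data the dst has
not yet read from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical toy model of the destination QEMU; not a libqtest type. */
typedef struct {
    int chunks_left;   /* migration data not yet read from the source */
    bool autostart;    /* set when 'cont' arrives while still incoming */
    bool resumed;      /* true once 'cont' started the guest directly */
} ToyDest;

/* Toy 'query-migrate': every poll lets the destination read one chunk. */
static const char *toy_query_migrate(ToyDest *d)
{
    if (d->chunks_left > 0) {
        d->chunks_left--;
        return "active";
    }
    return "completed";
}

/* Toy 'cont': only starts the guest if the incoming side has finished. */
static void toy_cont(ToyDest *d)
{
    if (d->chunks_left > 0) {
        d->autostart = true;   /* start is deferred, nothing resumes now */
    } else {
        d->resumed = true;
    }
}

/* Poll the destination until it reports completion (bounded for safety). */
static void toy_wait_for_migration_complete(ToyDest *d)
{
    int polls = 0;

    while (strcmp(toy_query_migrate(d), "completed") != 0) {
        polls++;
        assert(polls < 1000);
    }
}

/* Returns true if 'cont' resumed the guest directly. */
static bool toy_cont_resumes(bool wait_on_dst)
{
    ToyDest dst = { .chunks_left = 3, .autostart = false, .resumed = false };

    if (wait_on_dst) {
        toy_wait_for_migration_complete(&dst);
    }
    toy_cont(&dst);
    return dst.resumed;
}
```

In this model toy_cont_resumes(false) returns false, because 'cont'
arrives while data is still outstanding and only sets autostart, whereas
toy_cont_resumes(true) returns true, because the dst is polled to
completion first.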

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



