qemu-devel.nongnu.org archive mirror
From: Peter Xu <peterx@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	QEMU Developers <qemu-devel@nongnu.org>
Subject: Re: recent flakiness (intermittent hangs) of migration-test
Date: Fri, 30 Oct 2020 09:53:50 -0400	[thread overview]
Message-ID: <20201030135350.GA588069@xz-x1> (raw)
In-Reply-To: <CAFEAcA_ZEHHdKO+82G_e16uZkSbtH-Rjn701Nc0e7pOfk3OUoA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2043 bytes --]

On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > Peter, is it possible that you enable QTEST_LOG=1 in your future migration-test
> > testcase and try to capture the stderr?  With the help of commit a47295014d
> > ("migration-test: Only hide error if !QTEST_LOG", 2020-10-26), the test should
> > be able to dump quite some helpful information to further identify the issue.
> 
> Here's the result of running just the migration test with
> QTEST_LOG=1:
> https://people.linaro.org/~peter.maydell/migration.log
> It's 300MB because when the test hangs one of the processes
> is apparently in a polling state and continues to send status
> queries.
> 
> My impression is that the test is OK on an unloaded machine but
> more likely to fail if the box is doing other things at the
> same time. Alternatively it might be a 'parallel make check' bug.

Thanks for collecting that, Peter.

I'm copy-pasting the important information here (reordered and indented to
make things clearer):

...
{"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
{"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}, "id": "recover-cmd"}
{"execute": "query-migrate"}
...
{"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
qemu-system-x86_64: ram_save_queue_pages no previous block
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
{"return": {}}
{"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}}
...

The problem is probably a misuse of last_rb on the destination node.  While
looking into it, I also found a race.  So I guess I should fix both...

Peter, would it be easy to try applying the two patches I attached, to see
whether the test hang is resolved?  Dave, feel free to give early comments on
the two fixes before I post them to the list.

Thanks!

-- 
Peter Xu

[-- Attachment #2: 0001-migration-Unify-reset-of-last_rb-on-destination-node.patch --]
[-- Type: text/plain, Size: 2851 bytes --]

From b9af559df89ae9c2eed506ebdd811356ea799f63 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 Oct 2020 09:37:44 -0400
Subject: [PATCH 1/2] migration: Unify reset of last_rb on destination node
 when recover

When postcopy recovery happens, we need to reset last_rb after each successful
return of postcopy_pause_fault_thread(), because that means the postcopy
migration has just been resumed.

Unify this reset into a single place: right before we kick the fault thread
again, when we receive the MIG_CMD_POSTCOPY_RESUME command from the source.

This is actually more than a cleanup: the main thread on the destination can
now call migrate_send_rp_req_pages_pending() too, so the fault thread is no
longer the only user of last_rb.  Moving the reset earlier allows the first
call to migrate_send_rp_req_pages_pending() to see the reset value even when
it is called from the main thread.

(NOTE: this is not the real fix for 0c26781c09 mentioned below; the Fixes tag
 just marks that whoever picks up 0c26781c09 had better take this patch too.
 The real fix comes in the next patch.)

Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 2 --
 migration/savevm.c       | 6 ++++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d3bb3a744b..d99842eb1b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -903,7 +903,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
              * the channel is rebuilt.
              */
             if (postcopy_pause_fault_thread(mis)) {
-                mis->last_rb = NULL;
                 /* Continue to read the userfaultfd */
             } else {
                 error_report("%s: paused but don't allow to continue",
@@ -985,7 +984,6 @@ retry:
                 /* May be network failure, try to wait for recovery */
                 if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
                     /* We got reconnected somehow, try to continue */
-                    mis->last_rb = NULL;
                     goto retry;
                 } else {
                     /* This is a unavoidable fault */
diff --git a/migration/savevm.c b/migration/savevm.c
index 21ccba9fb3..e8834991ec 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2061,6 +2061,12 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
         return 0;
     }
 
+    /*
+     * Reset the last_rb before we resend any page req to source again, since
+     * the source should have it reset already.
+     */
+    mis->last_rb = NULL;
+
     /*
      * This means source VM is ready to resume the postcopy migration.
      * It's time to switch state and release the fault thread to
-- 
2.26.2


[-- Attachment #3: 0002-migration-Postpone-the-kick-of-the-fault-thread-afte.patch --]
[-- Type: text/plain, Size: 2176 bytes --]

From 2a28241de49d27d7b0968f1d1b137aea153fc3b4 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 Oct 2020 09:44:32 -0400
Subject: [PATCH 2/2] migration: Postpone the kick of the fault thread after
 recover

The new migrate_send_rp_req_pages_pending() call should greatly improve
destination responsiveness, because it resyncs faulted addresses after
postcopy recovery.  However, it is also the first place that initiates a page
request from the main thread.

One thing was overlooked: migrate_send_rp_message_req_pages() is not designed
to be thread-safe.  So if we wake the fault thread before the main thread has
synced all the faulted pages, the two can race.

Postpone the wakeup until after the sync of faulted addresses.

Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index e8834991ec..5f937a2762 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2069,12 +2069,9 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
 
     /*
      * This means source VM is ready to resume the postcopy migration.
-     * It's time to switch state and release the fault thread to
-     * continue service page faults.
      */
     migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
                       MIGRATION_STATUS_POSTCOPY_ACTIVE);
-    qemu_sem_post(&mis->postcopy_pause_sem_fault);
 
     trace_loadvm_postcopy_handle_resume();
 
@@ -2095,6 +2092,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
      */
     migrate_send_rp_req_pages_pending(mis);
 
+    /*
+     * It's time to switch state and release the fault thread to continue
+     * service page faults.  Note that this should be explicitly after the
+     * above call to migrate_send_rp_req_pages_pending().  In short:
+     * migrate_send_rp_message_req_pages() is not thread safe, yet.
+     */
+    qemu_sem_post(&mis->postcopy_pause_sem_fault);
+
     return 0;
 }
 
-- 
2.26.2


Thread overview: 10+ messages
2020-10-29 17:20 recent flakiness (intermittent hangs) of migration-test Peter Maydell
2020-10-29 17:41 ` Dr. David Alan Gilbert
2020-10-29 18:55   ` Peter Maydell
2020-10-29 19:34     ` Dr. David Alan Gilbert
2020-10-29 20:28       ` Peter Xu
2020-10-30 11:48         ` Peter Maydell
2020-10-30 13:53           ` Peter Xu [this message]
2020-11-02 13:55             ` Philippe Mathieu-Daudé
2020-11-02 14:19               ` Christian Schoenebeck
2020-11-02 15:14                 ` Peter Xu
