qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Peter Maydell <peter.maydell@linaro.org>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
	qemu-devel@nongnu.org, "Alex Bennée" <alex.bennee@linaro.org>,
	"Thomas Huth" <thuth@redhat.com>,
	"Juan Quintela" <quintela@redhat.com>
Subject: Re: [PATCH] tests/qtest/migration-test: Disable migration/multifd/tcp/plain/cancel
Date: Tue, 14 Mar 2023 12:46:34 -0400	[thread overview]
Message-ID: <ZBCk6rMT5wmxwIuZ@x1n> (raw)
In-Reply-To: <ZBBIaX+cZD5Ud2wQ@work-vm>

[-- Attachment #1: Type: text/plain, Size: 1944 bytes --]

On Tue, Mar 14, 2023 at 10:11:53AM +0000, Dr. David Alan Gilbert wrote:
> OK, I think I kind of see what's happening here, one for Peter Xu.
> If I'm right it's a race something like:
>   a) The test harness tells the source it wants to enter postcopy
>   b) The harness then waits for the source to stop
>   c) ... and the dest to start 
> 
>   It's blocked on one of b&c but can't tell which
> 
>   d) The main thread in the dest is waiting for the postcopy recovery fd
>     to be opened
>   e) But I think the source is still trying to send normal precopy RAM
>     and perhaps hasn't got around yet to opening that socket yet????
>   f) But I think the dest isn't reading from the main channel at that
>     point because of (d)

I think this analysis is spot on.  Thanks Dave!

Src qemu does this with below order:

        1. setup preempt channel
        1.1. connect()  --> this is done in another thread
        1.2. sem_wait(postcopy_qemufile_src_sem) --> make sure it's created
        2. prepare postcopy package (LISTEN, non-iterable states, ping-3, RUN)
        3. send the package

So logically the sequence is guaranteed so that when LISTEN cmd is
processed, we should have connect()ed already.

But I think there's one thing missing on dest.. since the accept() on the
dest node should be run in the main thread, meanwhile the LISTEN cmd is
also processed on the main thread, even if the listening socket is trying
to kick the main thread to do the accept() (so the connection has
established) it won't be able to kick the final accept() as main thread is
waiting in the semaphore.  That caused a deadlock.

A simple fix I can think of is moving the wait channel operation outside
the main thread, e.g. to the preempt thread.

I've attached that simple fix.  Peter, is it easy to verify it?  I'm not
sure the reproducability, fine by me too if it's easier to just disable
preempt tests for 8.0 release.

Thanks,

-- 
Peter Xu

[-- Attachment #2: 0001-migration-Wait-on-preempt-channel-in-preempt-thread.patch --]
[-- Type: text/plain, Size: 2191 bytes --]

From 92f2f90d2eb270ca158479bfd9a5a855ec7ddf4d Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 14 Mar 2023 12:24:02 -0400
Subject: [PATCH] migration: Wait on preempt channel in preempt thread

QEMU main thread will wait until dest preempt channel established during
processing the LISTEN command (within the whole postcopy PACKAGED data), by
waiting on the semaphore postcopy_qemufile_dst_done.

That's racy, because it's possible that the dest QEMU main thread hasn't
yet accept()ed the new connection when processing the LISTEN event.  The
sem_wait() will yield the main thread without being able to run anything
else including the accept() of the new socket, which can cause deadlock
within the main thread.

To avoid the race, move the "wait channel" from main thread to the preempt
thread right at the start.

Reported-by: Peter Maydell <peter.maydell@linaro.org>
Fixes: 5655aab079 ("migration: Postpone postcopy preempt channel to be after main")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index f54f44d899..41c0713650 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1197,11 +1197,6 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
     }
 
     if (migrate_postcopy_preempt()) {
-        /*
-         * The preempt channel is established in asynchronous way.  Wait
-         * for its completion.
-         */
-        qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
         /*
          * This thread needs to be created after the temp pages because
          * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
@@ -1668,6 +1663,12 @@ void *postcopy_preempt_thread(void *opaque)
 
     qemu_sem_post(&mis->thread_sync_sem);
 
+    /*
+     * The preempt channel is established in asynchronous way.  Wait
+     * for its completion.
+     */
+    qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
+
     /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */
     qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
     while (1) {
-- 
2.39.1


  parent reply	other threads:[~2023-03-14 16:47 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 17:22 [PATCH] tests/qtest/migration-test: Disable migration/multifd/tcp/plain/cancel Peter Maydell
2023-03-02 17:34 ` Daniel P. Berrangé
2023-03-03  9:10   ` Juan Quintela
2023-03-03  9:12     ` Daniel P. Berrangé
2023-03-03 11:18     ` Peter Maydell
2023-03-03 11:28       ` Thomas Huth
2023-03-03 11:43         ` Peter Maydell
2023-03-03 12:05         ` Peter Maydell
2023-03-06 13:08           ` Thomas Huth
2023-03-06 13:44             ` Dr. David Alan Gilbert
2023-03-06 14:00               ` Daniel P. Berrangé
2023-03-06 14:09                 ` Dr. David Alan Gilbert
2023-03-06 15:17                 ` Dr. David Alan Gilbert
2023-03-02 17:37 ` Dr. David Alan Gilbert
2023-03-02 22:25   ` Philippe Mathieu-Daudé
2023-03-03  7:43 ` Thomas Huth
2023-03-03  9:08 ` Juan Quintela
2023-03-04 15:39 ` Peter Maydell
2023-03-07  9:53   ` Peter Maydell
2023-03-12 14:06     ` Peter Maydell
2023-03-12 17:46       ` Peter Maydell
2023-03-14 10:11         ` Dr. David Alan Gilbert
2023-03-14 12:46           ` Peter Maydell
2023-03-14 13:05             ` Daniel P. Berrangé
2023-03-14 13:13             ` Dr. David Alan Gilbert
2023-03-14 16:46           ` Peter Xu [this message]
2023-03-14 17:48             ` Daniel P. Berrangé
2023-03-14 19:31             ` Peter Maydell
2023-03-14 20:51               ` Peter Xu
2023-03-22 20:15     ` Peter Maydell
2023-04-03 19:16       ` Peter Maydell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZBCk6rMT5wmxwIuZ@x1n \
    --to=peterx@redhat.com \
    --cc=alex.bennee@linaro.org \
    --cc=dgilbert@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).