From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
"Peter Maydell" <peter.maydell@linaro.org>,
qemu-devel@nongnu.org, "Alex Bennée" <alex.bennee@linaro.org>,
"Thomas Huth" <thuth@redhat.com>,
"Juan Quintela" <quintela@redhat.com>
Subject: Re: [PATCH] tests/qtest/migration-test: Disable migration/multifd/tcp/plain/cancel
Date: Tue, 14 Mar 2023 17:48:04 +0000 [thread overview]
Message-ID: <ZBCzVF2FjU1vsy+b@redhat.com> (raw)
In-Reply-To: <ZBCk6rMT5wmxwIuZ@x1n>
On Tue, Mar 14, 2023 at 12:46:34PM -0400, Peter Xu wrote:
> On Tue, Mar 14, 2023 at 10:11:53AM +0000, Dr. David Alan Gilbert wrote:
> > OK, I think I kind of see what's happening here, one for Peter Xu.
> > If I'm right it's a race something like:
> > a) The test harness tells the source it wants to enter postcopy
> > b) The harness then waits for the source to stop
> > c) ... and the dest to start
> >
> > It's blocked on one of b&c but can't tell which
> >
> > d) The main thread in the dest is waiting for the postcopy recovery fd
> > to be opened
> > e) But I think the source is still trying to send normal precopy RAM
> > and perhaps hasn't got around yet to opening that socket yet????
> > f) But I think the dest isn't reading from the main channel at that
> > point because of (d)
>
> I think this analysis is spot on. Thanks Dave!
>
> Src qemu does this with below order:
>
> 1. setup preempt channel
> 1.1. connect() --> this is done in another thread
> 1.2. sem_wait(postcopy_qemufile_src_sem) --> make sure it's created
> 2. prepare postcopy package (LISTEN, non-iterable states, ping-3, RUN)
> 3. send the package
>
> So logically the sequence is guaranteed so that when LISTEN cmd is
> processed, we should have connect()ed already.
>
> But I think there's one thing missing on dest.. since the accept() on the
> dest node should be run in the main thread, meanwhile the LISTEN cmd is
> also processed on the main thread, even if the listening socket is trying
> to kick the main thread to do the accept() (so the connection has
> established) it won't be able to kick the final accept() as main thread is
> waiting in the semaphore. That caused a deadlock.
>
> A simple fix I can think of is moving the wait channel operation outside
> the main thread, e.g. to the preempt thread.
>
> I've attached that simple fix. Peter, is it easy to verify it? I'm not
> sure the reproducability, fine by me too if it's easier to just disable
> preempt tests for 8.0 release.
>
> Thanks,
>
> --
> Peter Xu
> From 92f2f90d2eb270ca158479bfd9a5a855ec7ddf4d Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 14 Mar 2023 12:24:02 -0400
> Subject: [PATCH] migration: Wait on preempt channel in preempt thread
>
> QEMU main thread will wait until dest preempt channel established during
> processing the LISTEN command (within the whole postcopy PACKAGED data), by
> waiting on the semaphore postcopy_qemufile_dst_done.
>
> That's racy, because it's possible that the dest QEMU main thread hasn't
> yet accept()ed the new connection when processing the LISTEN event. The
> sem_wait() will yield the main thread without being able to run anything
> else including the accept() of the new socket, which can cause deadlock
> within the main thread.
>
> To avoid the race, move the "wait channel" from main thread to the preempt
> thread right at the start.
>
> Reported-by: Peter Maydell <peter.maydell@linaro.org>
> Fixes: 5655aab079 ("migration: Postpone postcopy preempt channel to be after main")
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/postcopy-ram.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
This description of the bug & proposed fixed matches what I could
infer as the flaw from the stack trace's Peter provided.
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index f54f44d899..41c0713650 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -1197,11 +1197,6 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
> }
>
> if (migrate_postcopy_preempt()) {
> - /*
> - * The preempt channel is established in asynchronous way. Wait
> - * for its completion.
> - */
> - qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
> /*
> * This thread needs to be created after the temp pages because
> * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
> @@ -1668,6 +1663,12 @@ void *postcopy_preempt_thread(void *opaque)
>
> qemu_sem_post(&mis->thread_sync_sem);
>
> + /*
> + * The preempt channel is established in asynchronous way. Wait
> + * for its completion.
> + */
> + qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
> +
> /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */
> qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
> while (1) {
> --
> 2.39.1
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
next prev parent reply other threads:[~2023-03-14 17:48 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-02 17:22 [PATCH] tests/qtest/migration-test: Disable migration/multifd/tcp/plain/cancel Peter Maydell
2023-03-02 17:34 ` Daniel P. Berrangé
2023-03-03 9:10 ` Juan Quintela
2023-03-03 9:12 ` Daniel P. Berrangé
2023-03-03 11:18 ` Peter Maydell
2023-03-03 11:28 ` Thomas Huth
2023-03-03 11:43 ` Peter Maydell
2023-03-03 12:05 ` Peter Maydell
2023-03-06 13:08 ` Thomas Huth
2023-03-06 13:44 ` Dr. David Alan Gilbert
2023-03-06 14:00 ` Daniel P. Berrangé
2023-03-06 14:09 ` Dr. David Alan Gilbert
2023-03-06 15:17 ` Dr. David Alan Gilbert
2023-03-02 17:37 ` Dr. David Alan Gilbert
2023-03-02 22:25 ` Philippe Mathieu-Daudé
2023-03-03 7:43 ` Thomas Huth
2023-03-03 9:08 ` Juan Quintela
2023-03-04 15:39 ` Peter Maydell
2023-03-07 9:53 ` Peter Maydell
2023-03-12 14:06 ` Peter Maydell
2023-03-12 17:46 ` Peter Maydell
2023-03-14 10:11 ` Dr. David Alan Gilbert
2023-03-14 12:46 ` Peter Maydell
2023-03-14 13:05 ` Daniel P. Berrangé
2023-03-14 13:13 ` Dr. David Alan Gilbert
2023-03-14 16:46 ` Peter Xu
2023-03-14 17:48 ` Daniel P. Berrangé [this message]
2023-03-14 19:31 ` Peter Maydell
2023-03-14 20:51 ` Peter Xu
2023-03-22 20:15 ` Peter Maydell
2023-04-03 19:16 ` Peter Maydell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZBCzVF2FjU1vsy+b@redhat.com \
--to=berrange@redhat.com \
--cc=alex.bennee@linaro.org \
--cc=dgilbert@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=thuth@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).