From: Peter Xu <peterx@redhat.com>
To: Prasad Pandit <ppandit@redhat.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
Prasad Pandit <pjp@fedoraproject.org>
Subject: Re: [PATCH 5/5] migration: enable multifd and postcopy together
Date: Tue, 5 Nov 2024 11:55:26 -0500 [thread overview]
Message-ID: <ZypN_icZsyS_8BPK@x1n> (raw)
In-Reply-To: <CAE8KmOyz9gWfKZmVkXLvj-NMo_=QB42sZhTyVGLqkOjU1GCXfg@mail.gmail.com>
On Tue, Nov 05, 2024 at 05:24:55PM +0530, Prasad Pandit wrote:
> On Mon, 4 Nov 2024 at 23:19, Peter Xu <peterx@redhat.com> wrote:
> > Precopy keeps sending data even during postcopy, that's the background
> > stream (with/without preempt feature enabled). May need some amendment
> > when repost here.
>
> * Okay.
>
> > > + if (channel == CH_POSTCOPY) {
> > > + return false;
> > > + }
> >
> > Please see my other reply, I think here it should never be POSTCOPY
> > channel, because postcopy (aka, preempt) channel is only created after the
> > main channel.. So I wonder whether this "if" will hit at all.
>
> * It does. migration_ioc_process_incoming() is called for each
> incoming connection. And every time it calls
> migration_should_start_incoming() to check if it should proceed with
> the migration or should wait for further connections, because multifd
> does not start until all its connections are established.
>
> * Similarly when the Postcopy channel is initiated towards the end of
> Precopy migration, migration_should_start_incoming() gets called, and
> returns 'false' because we don't want to start a new incoming
> migration at that time. Earlier it was receiving boolean value
> 'default_channel' as parameter, which was set to false while accepting
> 'postcopy' connection via => default_channel = !mis->from_src_file;
Hmm yes, it will happen, but should only happen after the main channel.
>
> > > +
> > > /* Multifd doesn't start unless all channels are established */
> > > if (migrate_multifd()) {
> > > - return migration_has_all_channels();
> > > - }
> > > -
> > > - /* Preempt channel only starts when the main channel is created */
> > > - if (migrate_postcopy_preempt()) {
> > > - return main_channel;
> > > + return multifd_recv_all_channels_created();
> >
> > I think this is incorrect.. We should also need to check main channel is
> > established before start incoming. The old code uses
> > migration_has_all_channels() which checks that.
>
> * Okay, will try to fix it better. But calling
> migration_has_all_channels() after checking migrate_multifd() does not
> seem/read right.
>
> > I don't yet see how you manage the multifd threads, etc, on both sides. Or
> > any logic to make sure multifd will properly flush the pages before
> > postcopy starts. IOW, any guarantee that all the pages will only be
> > installed using UFFDIO_COPY as long as vcpu starts on dest. Any comments?
>
> * There are no changes related to that. As this series only tries to
> enable the multifd and postcopy options together.
In short, qemu_savevm_state_complete_precopy_iterable() is skipped in
postcopy. I don't yet see anywhere multifd flushes pages. We need to
flush multifd pages before vcpu starts on dest.
As we discussed off-list, we can add an assertion in multifd recv threads
to make sure it won't touch any guest page during active postcopy.
Maybe something like:
diff --git a/migration/multifd.c b/migration/multifd.c
index 4374e14a96..32425137bd 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1182,6 +1182,13 @@ static void *multifd_recv_thread(void *opaque)
}
if (has_data) {
+ /*
+ * Now we're going to receive some guest pages into iov
+ * buffers. If it's during postcopy, it means vcpu can be
+ * running, so this can corrupt memory when modified
+ * concurrently by multifd threads!
+ */
+ assert(!migration_in_postcopy());
ret = multifd_recv_state->ops->recv(p, &local_err);
if (ret != 0) {
break;
>
> > The most complicated part of this work would be testing, to make sure it
> > works in all previous cases, and that's majorly why we disabled it before:
> > it was because it randomly crashes, but not always; fundamentally it's
> > because when multifd was designed there wasn't enough consideration on
> > working together with postcopy. We didn't clearly know what's missing at
> > that time.
>
> * Right. I have tested this series with the following migration
> commands to confirm that migration works as expected and there were no
> crash(es) observed.
> ===
> [Precopy]
> # virsh migrate --verbose --live --auto-converge f39vm
> qemu+ssh://<destination-host-name>/system
>
> [Precopy + Multifd]
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 02 \
> f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 04 \
> f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 08 \
> f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 16 \
> f39vm qemu+ssh://<destination-host-name>/system
>
> [Postcopy]
> # virsh migrate --verbose --live --auto-converge \
> --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
I'd suggest we try interrupt postcopy with migrate-pause, then go with
postcopy recover. I wonder how the current series work if the network
failure will fail all the multifd iochannels, and how reconnect works.
>
> [Postcopy + Multifd]
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 02 \
> --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 04 \
> --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 08 \
> --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 16 \
> --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> ===
>
> > So we would definitely need to add test cases, just like whoever adds new
> > features to migration, to make sure at least it works for existing multifd
> > / postcopy test cases, but when both features enabled.
> ...
> > It will add quite a few tests to run here, but I don't see a good way
> > otherwise when we want to enable the two features, because it is indeed a
> > challenge to enable the two major features together here.
> >
>
> * Thank you for the hints about the tests, will surely look into them
> and try to learn about how to add new test cases.
Thanks,
--
Peter Xu
prev parent reply other threads:[~2024-11-05 16:56 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-29 15:09 [PATCH 0/5] Allow to enable multifd and postcopy migration together Prasad Pandit
2024-10-29 15:09 ` [PATCH 1/5] migration/multifd: move macros to multifd header Prasad Pandit
2024-10-29 15:09 ` [PATCH 2/5] migration/postcopy: magic value for postcopy channel Prasad Pandit
[not found] ` <ZyTnBwpOwXcHGGPJ@x1n>
2024-11-04 12:32 ` Prasad Pandit
2024-11-04 17:18 ` Peter Xu
2024-11-05 11:19 ` Prasad Pandit
2024-11-05 13:00 ` Peter Xu
2024-11-06 12:19 ` Prasad Pandit
2024-11-06 13:11 ` Fabiano Rosas
2024-11-07 12:05 ` Prasad Pandit
2024-11-07 12:11 ` Fabiano Rosas
2024-11-07 12:33 ` Daniel P. Berrangé
2024-11-07 16:17 ` Peter Xu
2024-11-07 16:57 ` Daniel P. Berrangé
2024-11-07 17:45 ` Peter Xu
2024-11-08 12:37 ` Prasad Pandit
2024-11-08 13:25 ` Fabiano Rosas
2024-11-06 16:00 ` Peter Xu
2024-11-07 11:52 ` Prasad Pandit
2024-11-07 15:56 ` Peter Xu
2024-10-29 15:09 ` [PATCH 3/5] migration: remove multifd check with postcopy Prasad Pandit
[not found] ` <ZyTnWYyHlrJUYQRB@x1n>
2024-11-04 12:23 ` Prasad Pandit
2024-11-04 16:52 ` Peter Xu
2024-11-05 9:50 ` Prasad Pandit
2024-10-29 15:09 ` [PATCH 4/5] migration: refactor ram_save_target_page functions Prasad Pandit
[not found] ` <ZyToBbvfWkIZ_40W@x1n>
2024-11-04 11:56 ` Prasad Pandit
2024-11-04 17:00 ` Peter Xu
2024-11-05 10:01 ` Prasad Pandit
2024-11-05 13:01 ` Peter Xu
2024-10-29 15:09 ` [PATCH 5/5] migration: enable multifd and postcopy together Prasad Pandit
2024-11-04 17:48 ` Peter Xu
2024-11-05 11:54 ` Prasad Pandit
2024-11-05 16:55 ` Peter Xu [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZypN_icZsyS_8BPK@x1n \
--to=peterx@redhat.com \
--cc=farosas@suse.de \
--cc=pjp@fedoraproject.org \
--cc=ppandit@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).