From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org,
"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
"Cédric Le Goater" <clg@redhat.com>
Subject: Re: [PATCH 1/2] migration: Add some documentation for multifd
Date: Mon, 10 Mar 2025 16:27:57 -0300
Message-ID: <87ecz4adoi.fsf@suse.de>
In-Reply-To: <Z88DmvrNrW5Q1n7y@x1.local>

Peter Xu <peterx@redhat.com> writes:
> On Mon, Mar 10, 2025 at 11:24:15AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
>> >> > I never tried vsock, would it be used in any use case?
>> >> >
>> >>
>> >> I don't know, I'm going by what's in the code.
>> >>
>> >> > It seems to have been introduced by accident in 72a8192e225cea, but I'm not sure.
>> >> > Maybe there's something I missed.
>> >>
>> >> The code has always had some variation of:
>> >>
>> >> static bool transport_supports_multi_channels(const char *uri)
>> >> {
>> >>     return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
>> >>            strstart(uri, "vsock:", NULL);
>> >> }
>> >>
>> >> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
>> >> only").
>> >
>> > Looks like a copy-paste error.. I found it, should be 9ba3b2baa1
>> > ("migration: add vsock as data channel support").
>> >
>> > https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d097509bc@huawei.com/
>> > https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
>> > https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
>> >
>>
>> Great, thanks for finding those. I'll add it to my little "lore"
>> folder. This kind of information is always good to know.
>>
>> > So if I read it right.. the VM in this case is inside a container or
>> > something, then it talks to an "agent" on a PCIe device which understands
>> > virtio-vsock protocol. So maybe vsock just performs better than other ways
>> > to dig that tunnel for the container.
>> >
>> > In that case, mentioning vsock is at least ok.
>> >
>> > [...]
>> >
>> >> >> +After the channels have been put into a wait state by the sync
>> >> >> +functions, the client code may continue to transmit additional data by
>> >> >> +issuing ``multifd_send()`` once again.
>> >> >> +
>> >> >> +Note:
>> >> >> +
>> >> >> +- the RAM migration does, effectively, a global synchronization by
>> >> >> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> >> >> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>> >> >
>> >> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>> >> >
>> >>
>> >> Eh.. big compatibility mess. I'd rather not mention it.
>> >
>> > It's not strictly a compatibility mess. IIUC, it was originally designed
>> > to always work with EOS. I think at that time Juan was still focused on
>> > making it work rather than on perf tuning, but then we found it can be a
>> > major perf issue if we flush too soon. Then if we flush it once per round,
>> > it may not always pair with an EOS. That's why we needed a new message.
>> >
>>
>> Being fully honest, at the time I got the impression the situation was
>> "random person inside RH decided to measure performance of random thing
>> and upstream maintainer felt pressure to push a fix".
>>
>> Whether that was the case or not, it doesn't matter now, but we can't
>> deny that this _has_ generated some headache, just look at how many
>> issues arose from the introduction of that flag.
>
> I might be the "random person" here. I remember I raised this question to
> Juan on why we need to flush for each iteration.
>
It's a good question indeed. It was the right call to address it, I just
wish we had come up with a more straightforward solution to it.

> I also remember we did perf test, we can redo it. But we can discuss the
> design first.
>
> To me, this is a fairly important question to ask. Fundamentally, the very
> initial question is why we need periodic flush and sync at all. It's
> because we want to make sure new versions of pages land later than old
> versions.
>
> Note that we can achieve that in other ways too. E.g., if we only enqueue a
> page to a specific multifd thread (e.g. page_index % n_multifd_threads),
> then it'll guarantee the ordering without flush and sync, because new and old
> versions of the same page will only go via the same channel, which
> naturally preserves the time ordering of the packets.

Right.

> But that at least has the risk of not being able to fully leverage the
> bandwidth, e.g., the worst case is the guest has dirty pages that are
> accidentally always hashed to the same channel; consider a program that
> keeps dirtying every 32K on a 4K page-size system with 8 channels. Or
> something like that.
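
Just to illustrate for the archives, here's a stand-alone sketch of that
hashing idea (hypothetical, not how the current code works), including
your 32K-stride worst case where every dirty page lands on channel 0:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Pin each page to one channel so old and new versions of the same page
 * always travel over the same ordered stream, which would make an
 * explicit flush/sync unnecessary. */
static unsigned channel_for_page(uint64_t page_index, unsigned n_channels)
{
    return (unsigned)(page_index % n_channels);
}

int main(void)
{
    /* Guest dirties every 32K on a 4K page size with 8 channels:
     * page indices 0, 8, 16, ... all map to channel 0. */
    for (uint64_t addr = 0; addr < 8 * 32 * 1024; addr += 32 * 1024) {
        uint64_t page_index = addr / 4096;
        printf("page %" PRIu64 " -> channel %u\n",
               page_index, channel_for_page(page_index, 8));
    }
    return 0;
}
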
>
> Not documenting the EOS part is ok too from that pov, because it's also
> confusing why we need to flush per EOS. Per-round is more understandable from
> that POV, because we want to make sure the new version lands later, and
> the version bump for pages only happens per round, not per iteration.
>
>>
>> > But hey, you're writing a doc that helps everyone. You deserve to decide
>> > whether you like to mention it or not on this one. :)
>>
>> My rant aside, I really want to avoid any readers having to think too
>> much about this flush thing. We're already seeing some confusion when
>> discussing it with Prasad in the other thread. The code itself and the
>> git log are more reliable to explain the compat situation IMO.
>>
>> >
>> > IIRC we updated our compat rule so we maintain each machine type for only 6
>> > years. It means the whole per-iteration + EOS stuff can be removed in 3.5
>> > years or so - we did that work in July 2022. So it isn't that important
>> > to mention either, indeed.
>> >
>>
>> Yep, that as well.
>>
>> >>
>> >> > Maybe we should also add a sentence on the relationship of
>> >> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
>> >> > in that they should always be sent together, and only if so would it
>> >> > provide ordering of multifd messages and what happens in the main migration
>> >> > thread.
>> >> >
>> >>
>> >> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
>> >> for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
>> >> driven by the source machine via the flags that are put on the
>> >> stream. IOW, this is a RAM migration design, not a multifd design. The
>> >> multifd design is (could be, we decide) that once sync packets are sent,
>> >> _something_ must do the following:
>> >>
>> >> for (i = 0; i < thread_count; i++) {
>> >>     trace_multifd_recv_sync_main_wait(i);
>> >>     qemu_sem_wait(&multifd_recv_state->sem_sync);
>> >> }
>> >>
>> >> ... which is already part of multifd_recv_sync_main(), but that just
>> >> _happens to be_ called by ram.c when it sees the
>> >> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
>> >> design requirement. The ram.c code could for instance do the sync when
>> >> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
>> >
>> > I still think it should be done in RAM code only. One major goal (if not
>> > the only goal..) is that it wants to order different versions of pages, and
>> > that's a concern of the RAM module only, not of migration in general.
>> >
>> > From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
>> > work for migration in general.
>> >
>> > That said, I agree we do violate it from time to time; for example, we have a
>> > bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which is under
>> > QEMU_VM_COMMAND. But IIUC that was either kind of ancient (so we need to
>> > stick with them now.. postcopy has been there for 10 years) or it needs some
>> > ping-pong messages, in which case QEMU_VM_COMMAND is the easiest.. IMHO we
>> > should still try to stick with the layering if possible.
>>
>> All good points, but I was talking about something else:
>>
>> I was just throwing out an example of how it could be done differently to
>> make the point clear that the recv_sync has nothing to do with the ram
>> flag; that's just an implementation detail. I was thinking specifically
>> about the multifd+postcopy work where we might need syncs but there are
>> no RAM_SAVE_FLAGs there.
>>
>> We don't actually _need_ to sync with the migration thread on the
>> destination like that. The client could send control information in its
>> opaque packet (instead of in the migration thread) or in a completely
>> separate channel if it wanted. That sync is also not necessary if there
>> is no dependency around the data being transferred (i.e. mapped-ram just
>> takes data from the file and writes to guest memory).
>
> Mapped-ram is definitely different.
>
> For sockets, IIUC we do rely on the messages on the multifd channels _and_
> the message on the main channel.
>
> So I may not have fully gotten your point above, but..

My point is just a theoretical one. We _could_ make this work with a
different mechanism. And that's why I'm being careful about what to
document, that's all.

> See how it more or less implements a remote-memory-barrier kind of thing
> _with_ the main channel message:
>
>   main channel          multifd channel 1      multifd channel 2
>   ------------          -----------------      -----------------
>                         send page P v1
> +----------------------------------------------------------------+
> | RAM_SAVE_FLAG_MULTIFD_FLUSH                                     |
> |                       MULTIFD_FLAG_SYNC      MULTIFD_FLAG_SYNC  |
> +----------------------------------------------------------------+
>                                                send page P v2
>
This is a nice way of diagramming that!
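
On the receive side, that box is essentially the sem_sync wait loop
quoted above: each channel posts once when it sees MULTIFD_FLAG_SYNC,
and whoever drives the sync waits for all of them before anything newer
is allowed through. A stand-alone model of just that barrier (my sketch,
not the actual QEMU code):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N_CHANNELS 4

static sem_t sem_sync;

/* Each channel thread posts sem_sync when it receives a packet
 * carrying MULTIFD_FLAG_SYNC. */
static void *recv_channel_thread(void *arg)
{
    long id = (long)arg;

    printf("channel %ld: got sync packet\n", id);
    sem_post(&sem_sync);
    return NULL;
}

int main(void)
{
    pthread_t th[N_CHANNELS];

    sem_init(&sem_sync, 0, 0);
    for (long i = 0; i < N_CHANNELS; i++) {
        pthread_create(&th[i], NULL, recv_channel_thread, (void *)i);
    }
    /* The "something" that must wait for every channel before any
     * later (newer) page data may be applied. */
    for (int i = 0; i < N_CHANNELS; i++) {
        sem_wait(&sem_sync);
    }
    printf("all channels synced\n");
    for (int i = 0; i < N_CHANNELS; i++) {
        pthread_join(th[i], NULL);
    }
    return 0;
}
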
> Then v1 and v2 of the page P are ordered.
>
> If without the message on the main channel:
>
>   main channel          multifd channel 1      multifd channel 2
>   ------------          -----------------      -----------------
>                         send page P v1
>                         MULTIFD_FLAG_SYNC
>                                                MULTIFD_FLAG_SYNC
>                                                send page P v2
>
> Then I don't see what protects against a reordered arrival of the messages, like:
>
>   main channel          multifd channel 1      multifd channel 2
>   ------------          -----------------      -----------------
>                                                MULTIFD_FLAG_SYNC
>                                                send page P v2
>                         send page P v1
>                         MULTIFD_FLAG_SYNC
>
That's all fine. As long as the recv part doesn't see them out of
order. I'll try to write some code to confirm so I don't waste too much
of your time.