From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org,
"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
"Cédric Le Goater" <clg@redhat.com>
Subject: Re: [PATCH 1/2] migration: Add some documentation for multifd
Date: Mon, 10 Mar 2025 16:27:57 -0300 [thread overview]
Message-ID: <87ecz4adoi.fsf@suse.de> (raw)
In-Reply-To: <Z88DmvrNrW5Q1n7y@x1.local>
Peter Xu <peterx@redhat.com> writes:
> On Mon, Mar 10, 2025 at 11:24:15AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
>> >> > I never tried vsock, would it be used in any use case?
>> >> >
>> >>
>> >> I don't know, I'm going by what's in the code.
>> >>
>> >> > It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
>> >> > Maybe there's something I missed.
>> >>
>> >> The code was always had some variation of:
>> >>
>> >> static bool transport_supports_multi_channels(SocketAddress *saddr)
>> >> {
>> >> return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
>> >> strstart(uri, "vsock:", NULL);
>> >> }
>> >>
>> >> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
>> >> only").
>> >
>> > Looks like a copy-paste error.. I found it, should be 9ba3b2baa1
>> > ("migration: add vsock as data channel support").
>> >
>> > https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d097509bc@huawei.com/
>> > https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
>> > https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
>> >
>>
>> Great, thanks for finding those. I'll add it to my little "lore"
>> folder. This kind of information is always good to know.
>>
>> > So if I read it right.. the VM in this case is inside a container or
>> > something, then it talks to an "agent" on a PCIe device which understands
>> > virtio-vsock protocol. So maybe vsock just performs better than other ways
>> > to dig that tunnel for the container.
>> >
>> > In that case, mentioning vsock is at least ok.
>> >
>> > [...]
>> >
>> >> >> +After the channels have been put into a wait state by the sync
>> >> >> +functions, the client code may continue to transmit additional data by
>> >> >> +issuing ``multifd_send()`` once again.
>> >> >> +
>> >> >> +Note:
>> >> >> +
>> >> >> +- the RAM migration does, effectively, a global synchronization by
>> >> >> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> >> >> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>> >> >
>> >> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>> >> >
>> >>
>> >> Eh.. big compatibility mess. I rather not mention it.
>> >
>> > It's not strictly a compatibility mess. IIUC, it was used to be designed
>> > to always work with EOS. I think at that time Juan was still focused on
>> > making it work and not whole perf tunings, but then we found it can be a
>> > major perf issue if we flush too soon. Then if we flush it once per round,
>> > it may not always pair with a EOS. That's why we needed a new message.
>> >
>>
>> Being fully honest, at the time I got the impression the situation was
>> "random person inside RH decided to measure performance of random thing
>> and upstream maintainer felt pressure to push a fix".
>>
>> Whether that was the case or not, it doesn't matter now, but we can't
>> deny that this _has_ generated some headache, just look at how many
>> issues arose from the introduction of that flag.
>
> I might be the "random person" here. I remember I raised this question to
> Juan on why we need to flush for each iteration.
>
It's a good question indeed. It was the right call to address it, I just
wish we had come up with a more straight-forward solution to it.
> I also remember we did perf test, we can redo it. But we can discuss the
> design first.
>
> To me, this is a fairly important question to ask. Fundamentally, the very
> initial question is why do we need periodic flush and sync at all. It's
> because we want to make sure new version of pages to land later than old
> versions.
>
> Note that we can achieve that in other ways too. E.g., if we only enqueue a
> page to a specific multifd thread (e.g. page_index % n_multifd_threads),
> then it'll guarantee the ordering without flush and sync, because new / old
> version for the same page will only go via the same channel, which
> guarantees ordering of packets in time order naturally.
Right.
> But that at least has risk of not being able to fully leverage the
> bandwidth, e.g., worst case is the guest has dirty pages that are
> accidentally always hashed to the same channel; consider a program
> keeps dirtying every 32K on a 4K psize system with 8 channels. Or
> something like that.
>
> Not documenting EOS part is ok too from that pov, because it's confusing
> too on why we need to flush per EOS. Per-round is more understandable from
> that POV, because we want to make sure new version lands later, and
> versioning boost for pages only happen per-round, not per-iteration.
>
>>
>> > But hey, you're writting a doc that helps everyone. You deserve to decide
>> > whether you like to mention it or not on this one. :)
>>
>> My rant aside, I really want to avoid any readers having to think too
>> much about this flush thing. We're already seeing some confusion when
>> discussing it with Prasad in the other thread. The code itself and the
>> git log are more reliable to explain the compat situation IMO.
>>
>> >
>> > IIRC we updated our compat rule so we maintain each machine type for only 6
>> > years. It means the whole per-iteration + EOS stuff can be removed in 3.5
>> > years or so - we did that work in July 2022. So it isn't that important
>> > either to mention indeed.
>> >
>>
>> Yep, that as well.
>>
>> >>
>> >> > Maybe we should also add a sentence on the relationship of
>> >> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
>> >> > in that they should always be sent together, and only if so would it
>> >> > provide ordering of multifd messages and what happens in the main migration
>> >> > thread.
>> >> >
>> >>
>> >> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
>> >> for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
>> >> driven by the source machine by the flags that are put on the
>> >> stream. IOW, this is a RAM migration design, not a multifd design. The
>> >> multifd design is (could be, we decide) that once sync packets are sent,
>> >> _something_ must do the following:
>> >>
>> >> for (i = 0; i < thread_count; i++) {
>> >> trace_multifd_recv_sync_main_wait(i);
>> >> qemu_sem_wait(&multifd_recv_state->sem_sync);
>> >> }
>> >>
>> >> ... which is already part of multifd_recv_sync_main(), but that just
>> >> _happens to be_ called by ram.c when it sees the
>> >> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
>> >> design requirement. The ram.c code could for instance do the sync when
>> >> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
>> >
>> > I still think it should be done in RAM code only. One major goal (if not
>> > the only goal..) is it wants to order different versions of pages and
>> > that's only what the RAM module is about, not migration in general.
>> >
>> > From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
>> > work for migration in general.
>> >
>> > Said so, I agree we do violate it from time to time, for example, we have a
>> > bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which is under
>> > QEMU_VM_COMMAND. But IIUC that was either kind of ancient (so we need to
>> > stick with them now.. postcopy was there for 10 years) or it needs some
>> > ping-pong messages in which case QEMU_VM_COMMAND is the easiest.. IMHO we
>> > should still try to stick with the layering if possible.
>>
>> All good points, but I was talking of something else:
>>
>> I was just throwing an example of how it could be done differently to
>> make the point clear that the recv_sync has nothing to do with the ram
>> flag, that's just implementation detail. I was thinking specifically
>> about the multifd+postcopy work where we might need syncs but there is
>> no RAM_FLAGS there.
>>
>> We don't actually _need_ to sync with the migration thread on the
>> destination like that. The client could send control information in it's
>> opaque packet (instead of in the migration thread) or in a completely
>> separate channel if it wanted. That sync is also not necessary if there
>> is no dependency around the data being transferred (i.e. mapped-ram just
>> takes data form the file and writes to guest memory)
>
> Mapped-ram is definitely different.
>
> For sockets, IIUC we do rely on the messages on the multifd channels _and_
> the message on the main channel.
>
> So I may not have fully get your points above, but..
My point is just a theoretical one. We _could_ make this work with a
different mechanism. And that's why I'm being careful in what to
document, that's all.
> See how it more or
> less implemented a remote memory barrier kind of thing _with_ the main
> channel message:
>
> main channel multifd channel 1 multifd channel 2
> ------------ ----------------- -----------------
> send page P v1
> +------------------------------------------------------------------+
> | RAM_SAVE_FLAG_MULTIFD_FLUSH |
> | MULTIFD_FLAG_SYNC MULTIFD_FLAG_SYNC |
> +------------------------------------------------------------------+
> send page P v2
>
This is a nice way of diagramming that!
> Then v1 and v2 of the page P are ordered.
>
> If without the message on the main channel:
>
> main channel multifd channel 1 multifd channel 2
> ------------ ----------------- -----------------
> send page P v1
> MULTIFD_FLAG_SYNC
> MULTIFD_FLAG_SYNC
> send page P v2
>
> Then I don't see what protects reorder of arrival of messages like:
>
> main channel multifd channel 1 multifd channel 2
> ------------ ----------------- -----------------
> MULTIFD_FLAG_SYNC
> send page P v2
> send page P v1
> MULTIFD_FLAG_SYNC
>
That's all fine. As long as the recv part doesn't see them out of
order. I'll try to write some code to confirm so I don't waste too much
of your time.
next prev parent reply other threads:[~2025-03-10 19:28 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-07 13:42 [PATCH 0/2] migration: multifd documentation Fabiano Rosas
2025-03-07 13:42 ` [PATCH 1/2] migration: Add some documentation for multifd Fabiano Rosas
2025-03-07 17:27 ` Peter Xu
2025-03-07 19:06 ` Fabiano Rosas
2025-03-07 22:15 ` Peter Xu
2025-03-10 14:24 ` Fabiano Rosas
2025-03-10 15:22 ` Peter Xu
2025-03-10 19:27 ` Fabiano Rosas [this message]
2025-03-20 12:06 ` Prasad Pandit
2025-03-20 13:38 ` Fabiano Rosas
2025-03-20 11:50 ` Prasad Pandit
2025-03-20 14:45 ` Fabiano Rosas
2025-03-20 15:56 ` Peter Xu
2025-03-20 17:12 ` Fabiano Rosas
2025-03-21 10:47 ` Prasad Pandit
2025-03-21 14:04 ` Fabiano Rosas
2025-03-24 11:14 ` Prasad Pandit
2025-03-07 13:42 ` [PATCH 2/2] migration: Move compression docs under multifd Fabiano Rosas
2025-03-07 17:28 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87ecz4adoi.fsf@suse.de \
--to=farosas@suse.de \
--cc=clg@redhat.com \
--cc=mail@maciej.szmigiero.name \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.