From: Peter Xu <peterx@redhat.com>
To: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
Cc: "Fabiano Rosas" <farosas@suse.de>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Cédric Le Goater" <clg@redhat.com>,
"Eric Blake" <eblake@redhat.com>,
"Markus Armbruster" <armbru@redhat.com>,
"Daniel P . Berrangé" <berrange@redhat.com>,
"Avihai Horon" <avihaih@nvidia.com>,
"Joao Martins" <joao.m.martins@oracle.com>,
qemu-devel@nongnu.org
Subject: Re: [PATCH v1 00/13] Multifd 🔀 device state transfer support with VFIO consumer
Date: Sun, 23 Jun 2024 16:27:11 -0400
Message-ID: <ZniFH14DT6ycjbrL@x1n>
In-Reply-To: <cover.1718717584.git.maciej.szmigiero@oracle.com>
On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> This is an updated v1 patch series of the RFC (v0) series located here:
> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/
OK, I took some hours thinking about this today, and here are some high-level
comments for this series. I'll start with the ones most relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.
https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
1. Multifd device state support
===============================
As Fabiano suggested in his RFC post, we may need one more layer of
abstraction to represent VFIO's need for multifd to send arbitrary buffers
to the wire. This is about more than just "how to pass the device state
buffer to the sender threads".
So far, MultiFDMethods is only about RAM. If you pull the latest master
branch, Fabiano has just merged two more RAM compressors built on top of the
MultiFDMethods model. However, they're still all about RAM. I think it's
better to keep it that way, so maybe MultiFDMethods should some day be
renamed MultiFDRamMethods.
multifd_send_fill_packet() may only be suitable for RAM buffers, not ad-hoc
buffers like the ones VFIO uses. multifd_send_zero_page_detect() may not be
needed for arbitrary buffers either. Most of these helpers are still
page-based.
I think it also means we shouldn't call ->send_prepare() when a multifd send
thread notices that it's about to send a VFIO buffer. So it should look
like this:
int type = multifd_payload_type(p->data);

if (type == MULTIFD_PAYLOAD_RAM) {
    multifd_send_state->ops->send_prepare(p, &local_err);
} else {
    // VFIO buffers should belong here
    assert(type == MULTIFD_PAYLOAD_DEVICE_STATE);
    ...
}
It also means it shouldn't contain code like:
nocomp_send_prepare():
    if (p->is_device_state_job) {
        return nocomp_send_prepare_device_state(p, errp);
    } else {
        return nocomp_send_prepare_ram(p, errp);
    }
nocomp should only exist in RAM world, not VFIO's.
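For what it's worth, below is a rough sketch of the kind of layer I mean,
borrowing the payload-type idea from Fabiano's RFC. All the names
(MultiFDSendData, MultiFDDeviceState, multifd_payload_type()) are
placeholders, not a final API:

typedef enum {
    MULTIFD_PAYLOAD_NONE,
    MULTIFD_PAYLOAD_RAM,
    MULTIFD_PAYLOAD_DEVICE_STATE,
} MultiFDPayloadType;

/* Hypothetical device-state payload: an opaque buffer plus its owner */
typedef struct {
    char *idstr;                /* which device this state belongs to */
    uint32_t instance_id;
    uint8_t *buf;
    size_t len;
} MultiFDDeviceState;

typedef struct {
    MultiFDPayloadType type;
    union {
        MultiFDPages_t *ram;              /* the only thing RAM hooks ever see */
        MultiFDDeviceState *device_state;
    } u;
} MultiFDSendData;

static MultiFDPayloadType multifd_payload_type(MultiFDSendData *data)
{
    return data->type;
}

The point is simply that the RAM hooks (MultiFDMethods) only ever touch
u.ram, and the device-state path never goes near them.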
And since it looks like you agree with Fabiano's RFC proposal, please work on
top of that to provide this layer. Please also make sure the new version
keeps the output of "$ git grep device_state migration/multifd.c" to a
minimum. Currently:
$ git grep device_state migration/multifd.c | wc -l
59
The hope is zero, or at least a minimum with good reasons.
2. Frequent mallocs/frees
=========================
Fabiano's series can also help address some of these, but it looks like this
series uses malloc/free for more than just the opaque data buffer. This is
not required for getting things merged, but it would be nice to avoid those
where possible.
3. load_state_buffer() and VFIODeviceStatePacket protocol
=========================================================
VFIODeviceStatePacket is the new protocol you introduced into the multifd
packets, along with the new load_state_buffer() hook for loading such
buffers. My question is whether it's needed at all, or.. whether it can be
more generic (and also easier) to simply allow any device state in the
multifd packets, then load it with vmstate_load().
I mean, vmstate_load() should really work on these buffers, since all VFIO
looks for is: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the first flag (uint64),
the size as the second, then (2) the rest of the buffer, to be loaded into
the VFIO kernel driver. That is exactly what happens during the blackout
window.
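For reference, this is my reading of that framing (a sketch only; the helper
name is made up, but the flag and layout are what hw/vfio/migration.c uses
today):

/*
 * Each chunk vfio_save_state() emits is self-describing, so it shouldn't
 * matter whether it arrives on the main channel or inside a multifd packet.
 */
static void vfio_put_one_chunk(QEMUFile *f, const uint8_t *data, uint64_t size)
{
    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);  /* (1) the flag     */
    qemu_put_be64(f, size);                          /* (2) chunk size   */
    qemu_put_buffer(f, data, size);                  /* (3) opaque state */
}

The destination already parses exactly this layout during the blackout window.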
It's not clear to me why load_state_buffer() is needed.
I also see that you're using exactly the same chunk size for this buffering
(VFIOMigration.data_buffer_size).
I think you have a "reason": VFIODeviceStatePacket and the buffer-data
loading resolve one major issue that wasn't there before but exists now:
multifd allows concurrent arrival of vfio buffers, even though the buffers
*must* be loaded sequentially.
That's a major pain with the current VFIO kernel ioctl design, IMHO. I think
I once asked the NVIDIA people whether the VFIO get_state/set_state interface
could allow indexing or tagging of buffers, but I never got a real response.
IMHO that would be extremely helpful for migration concurrency if it can
happen; with a purely serialized buffer, concurrently saving/loading one VFIO
device could be extremely hard, if not impossible.
Now in your series, IIUC, you resolved that with vfio_load_bufs_thread(),
holding off the load process until sequential buffers have been received. I
think that causes one issue that I'll mention below as a separate topic. But
besides that, my point is: this is not a reason to introduce
VFIODeviceStatePacket, load_state_buffer() and so on.
My understanding is that we do need a way to re-serialize the buffers, but
that doesn't need load_state_buffer(); instead it can call vmstate_load() in
order, properly invoking vfio_load_state() with the right buffers. It would
be nice if VFIO could keep its "load state" logic in one place.
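To make it concrete, here is a rough sketch of what I mean (error handling
omitted; DeviceStateBuf is made up, and vmstate_load()/SaveStateEntry are
internal to savevm.c today, so some plumbing would be needed, but the idea
is to reuse the same entry point):

/*
 * Once the concurrently-received buffers are re-ordered by index, feed each
 * one through the normal vmstate load path by wrapping it in a
 * memory-backed QEMUFile.
 */
static int load_device_bufs_in_order(SaveStateEntry *se,
                                     DeviceStateBuf *bufs, unsigned nbufs)
{
    for (unsigned i = 0; i < nbufs; i++) {
        QIOChannelBuffer *bioc = qio_channel_buffer_new(bufs[i].len);
        QEMUFile *f;
        int ret;

        memcpy(bioc->data, bufs[i].data, bufs[i].len);
        bioc->usage = bufs[i].len;

        f = qemu_file_new_input(QIO_CHANNEL(bioc));
        ret = vmstate_load(f, se);     /* same path the blackout window uses */
        qemu_fclose(f);
        object_unref(OBJECT(bioc));

        if (ret) {
            return ret;
        }
    }
    return 0;
}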
One benefit is that with such a more generic framework, QEMU can easily
extend this infrastructure to other device states, so that logically we can
consider sending non-VFIO device states in the multifd buffers too. With your
current solution, however, there are new structures, new hooks, and a lot of
new code, while it solves fewer problems.. That's not optimal.
4. Risk of OOM on unlimited VFIO buffering
==========================================
This follows from the above bullet, but my real question here is: how does
VFIO guarantee no OOM condition while buffering VFIO state?
I mean, your current proposal uses vfio_load_bufs_thread() as a separate
thread that loads the vfio state only once sequential data has been received,
but is there an upper limit on how much buffering it can do? IOW:
vfio_load_state_buffer():
    if (packet->idx >= migration->load_bufs->len) {
        g_array_set_size(migration->load_bufs, packet->idx + 1);
    }

    lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
    ...
    lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
    lb->len = data_size - sizeof(*packet);
    lb->is_present = true;
What if the GArray keeps growing with lb->data allocated, eventually hitting
the memcg limit of the process (if QEMU runs under one)? Or it simply
depletes host memory and causes an OOM kill.
I think we may need to find a way to throttle max memory usage of such
buffering.
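Not a design proposal, just to illustrate the kind of throttle I mean: a
hypothetical cap on the bytes sitting in load_bufs, making the recv side wait
for the load thread to drain some before buffering more (all the names and
the limit below are made up):

/* Hypothetical throttle on out-of-order device-state buffering */
#define VFIO_LOAD_BUFS_MAX_PENDING (128 * MiB)

typedef struct {
    QemuMutex lock;
    QemuCond drained;
    size_t pending_bytes;       /* bytes buffered but not yet loaded */
} LoadBufThrottle;

/* Called by the multifd recv path before stashing a buffer */
static void load_buf_throttle_acquire(LoadBufThrottle *t, size_t len)
{
    qemu_mutex_lock(&t->lock);
    while (t->pending_bytes + len > VFIO_LOAD_BUFS_MAX_PENDING) {
        qemu_cond_wait(&t->drained, &t->lock);
    }
    t->pending_bytes += len;
    qemu_mutex_unlock(&t->lock);
}

/* Called by vfio_load_bufs_thread() after a buffer has been consumed */
static void load_buf_throttle_release(LoadBufThrottle *t, size_t len)
{
    qemu_mutex_lock(&t->lock);
    t->pending_bytes -= len;
    qemu_cond_signal(&t->drained);
    qemu_mutex_unlock(&t->lock);
}

It would need some care so that the next-in-order buffer can never get stuck
behind the cap, otherwise the recv threads and the load thread deadlock on
each other.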
Admittedly this is more of a problem if it happens during the VFIO iteration
phase, but I still hope a solution can work for both the iteration phase and
the switchover phase, even if you only do this in the switchover phase (and I
don't know why you don't care about the VFIO iteration phase, if you care
enough about how VFIO works with migration now.. that should literally help
VFIO migrate faster on 25G+ networks, with or without a shorter blackout
window).
5. Worker thread model
======================
I'm so far not happy with how this proposal creates the threads, nor with the
two new hooks that exist mostly just to create them..
I know I suggested that.. but that was in comparison to what I read in the
even earlier version, and sorry I wasn't able to suggest something better at
the time because I simply hadn't thought it through as much.
As I mentioned in the other reply elsewhere, I think we should first have
these threads ready to take data at the start of migration, so that it will
also work when someone wants to add vfio iteration support. Then the jobs
(mostly what vfio_save_complete_precopy_async_thread() does now) can be
enqueued into the thread pool.
It's better to have the thread pool owned by migration, rather than threads
owned by VFIO, because that also paves the way for non-VFIO device state
save()s, as I mentioned above regarding the multifd packet header. Maybe we
can have a flag in the packet header saying "this is device xxx's state,
just load it".
I'd start by looking at util/thread-pool.c, removing all the AIO implications
and simply providing a raw thread pool for what thread_pool_submit() does.
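Roughly the shape I have in mind (an untested sketch; all names are
placeholders, and it would live under migration/ or util/ rather than in
VFIO):

/* A raw, AIO-free worker pool: just threads pulling jobs off a queue. */
typedef void (*MigThreadPoolFunc)(void *opaque);

typedef struct MigThreadPoolJob {
    MigThreadPoolFunc func;
    void *opaque;
    QSIMPLEQ_ENTRY(MigThreadPoolJob) next;
} MigThreadPoolJob;

typedef struct MigThreadPool {
    QemuMutex lock;
    QemuCond cond;
    QSIMPLEQ_HEAD(, MigThreadPoolJob) jobs;
    bool stopping;
    /* QemuThread workers[]; created at migration start */
} MigThreadPool;

static void *mig_thread_pool_worker(void *opaque)
{
    MigThreadPool *pool = opaque;

    qemu_mutex_lock(&pool->lock);
    while (!pool->stopping) {
        MigThreadPoolJob *job = QSIMPLEQ_FIRST(&pool->jobs);

        if (!job) {
            qemu_cond_wait(&pool->cond, &pool->lock);
            continue;
        }
        QSIMPLEQ_REMOVE_HEAD(&pool->jobs, next);
        qemu_mutex_unlock(&pool->lock);

        job->func(job->opaque);     /* e.g. one device's save/load job */
        g_free(job);

        qemu_mutex_lock(&pool->lock);
    }
    qemu_mutex_unlock(&pool->lock);
    return NULL;
}

static void mig_thread_pool_submit(MigThreadPool *pool,
                                   MigThreadPoolFunc func, void *opaque)
{
    MigThreadPoolJob *job = g_new0(MigThreadPoolJob, 1);

    job->func = func;
    job->opaque = opaque;

    qemu_mutex_lock(&pool->lock);
    QSIMPLEQ_INSERT_TAIL(&pool->jobs, job, next);
    qemu_cond_signal(&pool->cond);
    qemu_mutex_unlock(&pool->lock);
}

Then what vfio_save_complete_precopy_async_thread() does now becomes just
another job submitted to this pool, and non-VFIO devices can submit theirs
the same way.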
I know this is a lot, but I really think it's the right thing.. but we can
discuss, and you can correct me on my mistakes if there are any.
If you want, I can have a look at this pool model and prepare a patch, so you
can work on the other vfio-relevant stuff and pick that up, if it helps you
reach the goal of landing all of this in 9.1.
But I hope this email explains, more or less, why I think this feature is
more involved than it looks, and not yet mature in its design. And I hope I'm
not purely asking for too much: merging this VFIO series first and then
refactoring on top could mean dropping a lot of unneeded code right after
adding it, not to mention that the protocol would need to be broken again.
It just doesn't sound like the right thing to do.
Thanks,
--
Peter Xu